I often see developers struggling to mimic the Linux find command on Hadoop files, especially when searching by size or size range. No wonder: there was no find command in Hadoop before the 2.7.x releases, and even the latest release, 3.0.0-beta1, leaves a lot to be desired. I have run into this problem on a number of occasions myself and figured out a couple of ways to do it. This short write-up should help anyone who wants to search HDFS for files of a specific size, or in particular for files within a given size range.
Option 1: If you are using the Cloudera distribution then you are in luck, as CDH ships a pre-built tool for the job: HdfsFindTool, a jar that supports almost all the features of the Linux find command. Here is how you would find files larger than 2 KB and smaller than 4 KB:
locate search-mr-job.jar
/usr/bin/cdh/lib/solr/contrib/mr/search-mr-job.jar

findInHdfs='hadoop jar /usr/bin/cdh/lib/solr/contrib/mr/search-mr-job.jar org.apache.solr.hadoop.HdfsFindTool'
${findInHdfs} -find / -type f -size +2048c -size -4096c
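The size tests use the familiar find syntax: +N means larger than N, -N means smaller than N, and the trailing c stands for bytes. Other searches are therefore just a change of arguments; for example (an illustrative variant, with /data standing in for whichever HDFS directory you want to scan):

# illustrative variant: regular files under /data larger than 1 MB (1048576 bytes),
# using the same +N / -N / trailing-c size syntax as above
${findInHdfs} -find /data -type f -size +1048576c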
Option 2: In case you are not using CDH, you still have a way out: a combination of the available hdfs and Unix commands. Here is how you would do it with awk to find all files under the HDFS root with a size between 2 KB and 4 KB:
minSize=2048
maxSize=4096
hdfs dfs -ls -R / | awk '/^-/ { gsub(/[ ,]+/," ") ; if($5 > '$minSize' && $5 < '$maxSize') print $5,$8 }'
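The two thresholds are plain byte counts, so for larger ranges it can be more readable to let the shell do the arithmetic (a cosmetic tweak, nothing more):

# same filter, with the 2 KB and 4 KB limits spelled out as shell arithmetic
minSize=$((2 * 1024))
maxSize=$((4 * 1024))
hdfs dfs -ls -R / | awk '/^-/ { gsub(/[ ,]+/," ") ; if($5 > '$minSize' && $5 < '$maxSize') print $5,$8 }'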
The awk pipeline needs a little explanation:
- hdfs dfs -ls -R / => A familiar command that lists all the files recursively under the HDFS root. The output could look like:
hdfs dfs -ls -R /
-rw-r-----+   1 user group   1891172 2016-11-29 14:38 /file1.csv
drwxr-x--x+   - user group         0 2016-11-29 14:38 /dir1
-rw-r-----+   1 user group       994 2016-11-29 14:38 /file2.csv
drwxr-x--x+   - user group         0 2016-11-29 14:39 /dir2
- /^-/ => The listing is piped into awk, and this first pattern checks the flag character at the start of each input line: regular files start with -, directories with d. Only files are allowed through, leaving directories out:
hdfs dfs -ls -R / | awk '/^-/ {print $0}'
-rw-r-----+   1 user group   1891172 2016-11-29 14:38 /file1.csv
-rw-r-----+   1 user group       994 2016-11-29 14:38 /file2.csv
- gsub(/[ ,]+/," ") => Runs only on lines that matched the pattern above (that is, lines describing files) and globally replaces every run of consecutive spaces with a single space, so a single space can serve as the delimiter between the parts of the line. The output could look like:
hdfs dfs -ls -R / | awk '/^-/ { gsub(/[ ,]+/," ") ; print }'
-rw-r-----+ 1 user group 1891172 2016-11-29 14:38 /file1.csv
-rw-r-----+ 1 user group 994 2016-11-29 14:38 /file2.csv
- The if condition => Our actual filter. The fifth token in the output of the ls command is the file size in bytes, and we check that it lies within the specified limits.
- print $5,$8 => For every line that passes the size check, we print the file size in bytes (5th token) followed by the path/name of the file (8th token). If you want to feed the result into another script (for example to create a HAR file), you can print just the path/name, as shown in the sketch below.
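A minimal sketch of that path-only variant, assuming the same minSize and maxSize variables as before:

# print only the matching paths, e.g. to feed into hadoop archive or another script
hdfs dfs -ls -R / | awk '/^-/ { gsub(/[ ,]+/," ") ; if($5 > '$minSize' && $5 < '$maxSize') print $8 }'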
Option 3: I have also achieved the same thing using the fsimage: first fetch the fsimage (the first command below), then convert it into a human-readable form, say XML, with the OIV (offline image viewer) parser (the second command below), and finally write a custom analyzer over that output to do the job. I have found this approach less suitable for fresh data, though, since there is a gap between the fsimage creation, its verbose translation into human-readable form, and the final analysis. Files created during that interval will not be part of this fsimage (they would of course be picked up the next time). Whether that matters depends on your real-time requirements.
hdfs dfsadmin -fetchImage /
hdfs oiv -p XML -i / -o /
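In the commands above, the bare / arguments are placeholders for the local directory to download the image into, the fetched fsimage file, and the output file respectively. As a side note, if the oiv build in your release supports the Delimited processor, the final analysis step can stay a one-liner rather than a custom XML parser. A rough sketch, with an illustrative fsimage file name and with column positions that should be verified against the header line the tool prints:

# hypothetical sketch: assumes the Delimited processor is available in this oiv build and that
# Path is the 1st and FileSize the 7th tab-separated column (verify against the header line)
hdfs oiv -p Delimited -i fsimage_0000000000000012345 -o fsimage.tsv
awk -F'\t' 'NR > 1 && $7 > 2048 && $7 < 4096 { print $7, $1 }' fsimage.tsv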
Option 4: Hadoop releases from 2.7.x onwards include a native find command. Much of the big-data fraternity is still on pre-2.7 releases, so for them this is not really an option, but if you are on a newer release you can experiment with it. You can check whether it is available using:
hdfs dfs -help
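For what it is worth, the expressions supported by this built-in find are name-based (-name/-iname); there is no size test yet, which is why the workarounds above remain relevant. A minimal example:

# list every .csv file under the HDFS root by name; size-based filtering is not
# supported by the built-in find in these releases
hdfs dfs -find / -name "*.csv" -print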
You can choose and implement any of the options above as per your requirements.
Let me know if you have any queries, and please do share your feedback or comments if this was helpful to you.