Project to help analyse HDFS metadata.
- The first feature is a D3 sunburst visualization showing HDFS space usage and/or number of files
- A snapshot space consumption overhead analyzer (from this discussion) is coming next (stay tuned)
## Options to run
### 1 - Zeppelin notebook
Just import the URL below into your Zeppelin instance and run it step by step:
https://raw.githubusercontent.com/gbraccialli/HdfsUtils/master/zeppelin/hdfs-d3.json
### 2 - Build from source, run from the command line, and use the HTML file

### Building
```
git clone https://github.com/gbraccialli/HdfsUtils.git
cd HdfsUtils
mvn clean package
```
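A quick sanity check that the build produced the shaded jar (the path below is assumed from the usage example that follows):

```
ls -lh target/gbraccialli-hdfs-utils-with-dependencies.jar
```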
### Basic usage
```
java -jar target/gbraccialli-hdfs-utils-with-dependencies.jar \
--path=/ \
--maxLevelThreshold=-1 \
--minSizeThreshold=-1 \
--showFiles=false \
--verbose=true > out.json
```
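The command writes the analysis as JSON to stdout (redirected to out.json above). A quick way to confirm the output is well-formed JSON, assuming Python 3 is available:

```
# Pretty-print the first lines of the generated JSON
python3 -m json.tool out.json | head -n 20
```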
### Visualizing
Open html/hdfs_sunburst.html in your browser and point it to the .json file you created in the previous step, or copy/paste the JSON content using the load options on the right.
PS: note that Chrome has a security constraint that does not allow loading local files; use one of the following options:
- Use the Zeppelin notebook (described above)
- Use Safari
- Enable Chrome local file access: instructions here
- Publish the JSON on a web server and use the full URL (see the sketch after this list)
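For the last option, a minimal sketch using Python's built-in web server (the directory, file name, and port are illustrative, matching the usage example above):

```
# Serve the directory containing out.json over HTTP
cd /path/to/output/dir
python3 -m http.server 8000
# then point the visualization to http://<host>:8000/out.json
```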
### Command line options
#### --confDir=
//path-to-conf-dir
//specifies the directory containing Hadoop config files, defaults to /etc/hadoop/conf
#### --maxLevelThreshold=
//-1 or valid int
//max number of directory levels to drill down. -1 means no limit. For example: maxLevelThreshold=3 means drill down will stop after 3 levels of subdirectories
#### --minSizeThreshold=
//-1 or valid long
//min number of bytes in a directory to continue drill down. -1 means no limit. minSizeThreshold=1000000 means only directories larger than 1000000 bytes will be drilled down
#### --showFiles=
//true or false
//whether to show per-file information. showFiles=false shows only summary information about the files in each directory/subdirectory
#### --exclude=
//path1,path2,...
//directories to exclude from drill down. For example: --exclude=/tmp/,/user/ will not present information about those directories
#### --doAs=
//username (hdfs for example)
//for a non-kerberized cluster, you can set the user that performs HDFS operations; using hdfs you won't have permission issues. If you are using a kerberized cluster, grant read access to the user performing this operation (you can use Ranger for this)
#### --verbose=
//true or false
//when true, prints processing info to System.err (does not apply to Zeppelin)
#### --path=
//path to start the analysis from
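Putting several options together, a hedged example run (the path, thresholds, exclude list, and user below are illustrative values, not project defaults):

```
# Analyze /user, limiting drill-down depth and size, excluding two trees
java -jar target/gbraccialli-hdfs-utils-with-dependencies.jar \
--path=/user \
--maxLevelThreshold=4 \
--minSizeThreshold=1000000 \
--exclude=/tmp/,/user/hive/ \
--doAs=hdfs \
--showFiles=false \
--verbose=true > out.json
```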
## Special thanks to:
- Dave Patton, who first created HDP-Viz, where I got inspired and copied lots of code
- Ali Bajwa, who created the Ambari stack for Dave's project (and helped me get it working)
- David Streever, who created (or forked) hdfs-cli, from which I also copied lots of code