gbraccialli/HdfsUtils

HdfsUtils

A project to help analyse HDFS metadata.

  • First feature is a D3 Sunburst visualization showing HDFS space usage and/or number of files
  • A snapshot space consumption overhead analyzer (from this discussion) is coming next (stay tuned).

##Options to run

###1- Zeppelin notebook

Import the URL below into your Zeppelin instance and run it step by step:
https://raw.githubusercontent.com/gbraccialli/HdfsUtils/master/zeppelin/hdfs-d3.json

###Live Preview here

###2- Build from source, run from the command line and use the html file

###Building

git clone https://github.com/gbraccialli/HdfsUtils.git
cd HdfsUtils
mvn clean package

###Basic usage

java -jar target/gbraccialli-hdfs-utils-with-dependencies.jar \
  --path=/ \
  --maxLevelThreshold=-1  \
  --minSizeThreshold=-1  \
  --showFiles=false   \
  --verbose=true > out.json  
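
Because `--verbose=true` prints its processing info to System.err, redirecting only stdout keeps `out.json` free of log lines. The sketch below keeps the two streams in separate files. `capture_streams` and `run.log` are hypothetical names chosen for illustration; they are not part of HdfsUtils.

```shell
# capture_streams OUT_FILE LOG_FILE COMMAND [ARGS...]
# Runs COMMAND, writing its stdout (the JSON report) to OUT_FILE and its
# stderr (the --verbose progress messages) to LOG_FILE.
# Hypothetical helper, not part of HdfsUtils.
capture_streams() {
  out_file=$1
  log_file=$2
  shift 2
  "$@" > "$out_file" 2> "$log_file"
}

# Example invocation (commented out so the snippet has no side effects):
# capture_streams out.json run.log \
#   java -jar target/gbraccialli-hdfs-utils-with-dependencies.jar \
#   --path=/ --maxLevelThreshold=-1 --minSizeThreshold=-1 \
#   --showFiles=false --verbose=true
```

The function returns the exit status of the wrapped command, so a failed run can still be detected by the caller.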

###Visualizing

Open html/hdfs_sunburst.html in your browser and point it to the .json file you created in the previous step, or copy/paste the json content into the load options on the right.

PS: Chrome has a security constraint that does not allow loading local files. Use one of the following options:

  • Use the Zeppelin notebook (described above)
  • Use Safari
  • Enable Chrome local files access: instructions here
  • Publish json in a webserver and use full URL
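
For the web server option, one possible approach is to copy the report next to the html file and serve that directory locally. This is only a sketch: `publish_json` is a hypothetical helper, port 8000 is an arbitrary choice, and it assumes Python 3 is installed for its built-in `http.server` module.

```shell
# publish_json JSON_FILE WEB_ROOT [BASE_URL]
# Copies JSON_FILE into WEB_ROOT (a directory exposed by a web server) and
# prints the full URL to paste into the visualization page.
# Hypothetical helper, not part of HdfsUtils.
publish_json() {
  json_file=$1
  web_root=$2
  base_url=${3:-http://localhost:8000}
  cp "$json_file" "$web_root/" || return 1
  printf '%s/%s\n' "$base_url" "$(basename "$json_file")"
}

# Example (commented out so the snippet has no side effects):
# publish_json out.json html
# (cd html && python3 -m http.server 8000)
```

With the server running, the page would be reachable at http://localhost:8000/hdfs_sunburst.html and the report at the URL printed by the helper.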

###Command line options:

####--confDir=
//path-to-conf-dir
//directory containing the hadoop config files, defaults to /etc/hadoop/conf

####--maxLevelThreshold=
//-1 or a valid int
//max number of directory levels to drill down. -1 means no limit. For example, maxLevelThreshold=3 stops the drill-down after 3 levels of subdirectories

####--minSizeThreshold=
//-1 or a valid long
//min number of bytes a directory must hold for the drill-down to continue. -1 means no limit. For example, minSizeThreshold=1000000 drills down only into directories larger than 1000000 bytes
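
Since the size threshold is given in plain bytes, a small conversion helper can make larger values easier to write. The sketch below is not part of HdfsUtils: `size_to_bytes` is a hypothetical name, and it assumes binary units (1K = 1024 bytes).

```shell
# size_to_bytes SIZE
# Converts a size such as 512M or 1G into a byte count. A value without a
# suffix is printed unchanged. Assumes binary units (1K = 1024 bytes).
# Hypothetical helper, not part of HdfsUtils.
size_to_bytes() {
  value=${1%[KMGT]}   # strip a single trailing unit letter, if present
  case $1 in
    *K) echo $((value * 1024)) ;;
    *M) echo $((value * 1024 * 1024)) ;;
    *G) echo $((value * 1024 * 1024 * 1024)) ;;
    *T) echo $((value * 1024 * 1024 * 1024 * 1024)) ;;
    *)  echo "$value" ;;
  esac
}
```

It could then be used on the command line as `--minSizeThreshold=$(size_to_bytes 1G)` together with, say, `--maxLevelThreshold=3`.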

####--showFiles=
//true or false
//whether to show information about individual files. showFiles=false shows only summary information about the files in each directory/subdirectory

####--exclude=
//path1,path2,...
//directories to exclude from the drill-down. For example, exclude=/tmp/,/user/ omits information about those directories
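
When the excluded directories come from a list, the comma-separated value can be assembled in the shell. A minimal sketch, where `join_exclude` is a hypothetical helper name and not part of HdfsUtils:

```shell
# join_exclude PATH [PATH...]
# Joins its arguments with commas, the format expected by --exclude=.
# The subshell keeps the IFS change from leaking into the caller.
# Hypothetical helper, not part of HdfsUtils.
join_exclude() {
  (IFS=,; printf '%s\n' "$*")
}
```

For example, `--exclude=$(join_exclude /tmp/ /user/)` expands to `--exclude=/tmp/,/user/`.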

####--doAs=
//username (hdfs for example)
//for a non-kerberized cluster, sets the user that performs the hdfs operations; running as hdfs avoids permission issues. On a kerberized cluster, grant read access to the user performing this operation (you can use Ranger for this)

####--verbose=
//true or false
//when true, prints processing info to System.err (does not apply to zeppelin)

####--path=
//path to start analysis

##Special thanks to:
