HDFS Compatibility
Drake provides HDFS support by allowing you to specify inputs and outputs like `hdfs://my/big_file.txt`.
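For instance, a workflow step can read from and write to HDFS directly. The sketch below is a hypothetical Drakefile rule (file names and the command are illustrative, not from the Drake docs):

```
; Hypothetical step: count the lines of an HDFS file and write the
; result back to HDFS. $INPUT and $OUTPUT are Drake's step variables.
hdfs://my/line_count.txt <- hdfs://my/big_file.txt
  hadoop fs -cat $INPUT | wc -l | hadoop fs -put - $OUTPUT
```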
Drake's default package includes a standard hadoop-core client library. However, there's a fair chance that your Hadoop cluster requires a different version of the Hadoop client. Therefore, to make a best attempt at out-of-the-box HDFS support, the `drake` script automatically looks for your local Hadoop client and prefers to use that.
Run `drake --hadoop-version` to check which Hadoop client version Drake found.
If Drake cannot find a local Hadoop client, it falls back to your `HADOOP_CLASSPATH` environment variable, if set.
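For example, you could point `HADOOP_CLASSPATH` at your cluster's client JARs before running Drake. The paths below are illustrative only; substitute the location of your actual Hadoop installation:

```shell
# Illustrative paths -- adjust to wherever your Hadoop client JARs live.
export HADOOP_CLASSPATH="/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs/*"
echo "$HADOOP_CLASSPATH"
```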
If Drake cannot find your local Hadoop client, your `HADOOP_CLASSPATH` is not set, and the client library that ships with Drake is not compatible with your Hadoop cluster, then Drake will not be able to support HDFS for you. Any attempt to use HDFS in your Drake workflows will then produce errors such as this one:
```
ERROR java.io.IOException: Call to somehost/10.0.0.30:9000 failed on local exception: java.io.EOFException
  at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
  at org.apache.hadoop.ipc.Client.call(Client.java:743)
  at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
  at $Proxy0.getProtocolVersion(Unknown Source)
  at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
  at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
  at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
  at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
  at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:180)
  at drake.fs$hdfs_filesystem.invoke(fs.clj:144)
  at drake.fs.HDFS.exists_QMARK_(fs.clj:153)
```
If you're trying to get your Drake workflows to work with HDFS and you see errors like this, you should:
- Find out why Drake was unable to locate your local Hadoop client. Fixing this should restore Drake's HDFS support for you.
- If the first approach does not yield results, modify `project.clj` to specify the exact Hadoop client library version you need, rather than the default version. You can then make your own build of Drake that should be compatible with your Hadoop cluster.
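As a sketch, the dependency change in `project.clj` might look like the fragment below. The Maven coordinates and version string are illustrative; match them to the client library your cluster actually runs:

```clojure
;; project.clj fragment: swap the bundled Hadoop client for the version
;; your cluster runs (coordinates and version below are illustrative).
:dependencies [[org.apache.hadoop/hadoop-core "1.2.1"]
               ;; ... leave Drake's other dependencies unchanged ...
               ]
```

Drake is a Leiningen project, so after editing `project.clj` you should be able to rebuild it with `lein uberjar` and run the resulting build against your cluster.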