
no lmdbjni in java.library.path exception #7

Closed
melody-rain opened this issue Feb 26, 2016 · 16 comments

Comments

@melody-rain

16/02/26 16:34:34 INFO caffe.DataSource$: Source data layer:0
16/02/26 16:34:34 INFO caffe.LMDB: Batch size:64
Exception in thread "main" java.lang.UnsatisfiedLinkError: no lmdbjni in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
at java.lang.Runtime.loadLibrary0(Runtime.java:870)
at java.lang.System.loadLibrary(System.java:1122)
at com.yahoo.ml.caffe.LMDB$.makeSequence(LMDB.scala:28)
at com.yahoo.ml.caffe.LMDB.makeRDD(LMDB.scala:94)
at com.yahoo.ml.caffe.CaffeOnSpark.train(CaffeOnSpark.scala:113)
at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:44)
at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/02/26 16:34:34 INFO spark.SparkContext: Invoking stop() from shutdown hook

@melody-rain
Author

Sorry, it was solved by:
export LD_LIBRARY_PATH=${CAFFE_ON_SPARK}/caffe-public/distribute/lib:${CAFFE_ON_SPARK}/caffe-distri/distribute/lib

Please close it. Thanks.

@anfeng anfeng closed this as completed Feb 26, 2016
@yilaguan

yilaguan commented Oct 9, 2016

Hi, I have the same problem in my program, and I also used "export LD_LIBRARY_PATH=${CAFFE_ON_SPARK}/caffe-public/distribute/lib:${CAFFE_ON_SPARK}/caffe-distri/distribute/lib", but it did not help.

16/10/09 19:49:00 ERROR ApplicationMaster: User class threw exception: java.lang.UnsatisfiedLinkError: no lmdbjni in java.library.path
java.lang.UnsatisfiedLinkError: no lmdbjni in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
at java.lang.Runtime.loadLibrary0(Runtime.java:870)
at java.lang.System.loadLibrary(System.java:1122)
at com.yahoo.ml.caffe.LmdbRDD$.com$yahoo$ml$caffe$LmdbRDD$$loadLibrary(LmdbRDD.scala:245)
at com.yahoo.ml.caffe.LmdbRDD.com$yahoo$ml$caffe$LmdbRDD$$openDB(LmdbRDD.scala:200)
at com.yahoo.ml.caffe.LmdbRDD.getPartitions(LmdbRDD.scala:47)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
at com.yahoo.ml.caffe.CaffeOnSpark.trainWithValidation(CaffeOnSpark.scala:257)
at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:42)
at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:558)

@junshi15
Collaborator

junshi15 commented Oct 9, 2016

  1. Check that the .so file exists in the specified path.
  2. Depending on how you launch your job: if you are using YARN, you also need to set LD_LIBRARY_PATH for both the executor and the driver as spark-submit options (a sketch follows below).
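For example, a minimal sketch of such a spark-submit invocation (the CaffeOnSpark arguments are omitted here, and the paths assume ${CAFFE_ON_SPARK} points at your CaffeOnSpark checkout; adjust to your environment):

export LD_LIBRARY_PATH=${CAFFE_ON_SPARK}/caffe-public/distribute/lib:${CAFFE_ON_SPARK}/caffe-distri/distribute/lib
spark-submit --master yarn --deploy-mode cluster \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    <CaffeOnSpark arguments>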

@yilaguan

I checked that the .so files exist in "${CAFFE_ON_SPARK}/caffe-distri/distribute/lib"; they are libcaffedistri.so and liblmdbjni.so. I use YARN to run my job, and of course I had set LD_LIBRARY_PATH for both the executor and the driver as spark-submit options.
I compiled CaffeOnSpark against Spark 1.6.2. The Spark cluster has one master and two slaves. But when I run in yarn-cluster mode I see logs like the following:

spark.driver.extraClassPath=/home/spark/caffeOnSpark/CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar

spark.driver.extraJavaOptions=-server -Xmx24g -Xms4g -Djava.security.krb5.conf=/home/work/kerberos5-client/etc/krb5.conf -Djava.library.path=/home/hadoop/hadoop/lib/native -XX:MaxPermSize=512m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

spark.driver.extraLibraryPath=/home/spark/caffeOnSpark/CaffeOnSpark/caffe-public/distribute/lib:/home/spark/caffeOnSpark/CaffeOnSpark/caffe-distri/distribute/lib:/usr/local/cuda-7.0/lib64:/usr/local/mkl/lib/intel64/

It seems LD_LIBRARY_PATH had no effect, because it is not included in -Djava.library.path.

@anfeng
Contributor

anfeng commented Oct 10, 2016

Do you have the CaffeOnSpark directories on the driver node? Please note that, if you are using --deploy-mode=cluster, the driver node is selected by the YARN cluster among its available nodes.

Andy


@yilaguan

I do have the CaffeOnSpark directories on the driver node, and I use "--master yarn --deploy-mode=cluster".

@mumlax

mumlax commented Jul 23, 2017

I'm experiencing a similar error with the missing lmdbjni while trying to run the MNIST example in yarn-cluster mode (like yilaguan) after following the GetStarted_yarn instructions.
My setup consists of three nodes: A, B and C. YARN runs on all of them; A is the YARN and Spark master and contains the CaffeOnSpark directory (the fact that the Spark clients are on B/C should not matter, since YARN is used).
LD_LIBRARY_PATH is set on A, but the error appears nevertheless.

Just for clarification: do I have to put anything from CaffeOnSpark on nodes B/C to eliminate this error? The complete directory? I don't think so, because YARN is used. Or any configs or exports in the .bash_profile's? Or do any of the dependencies like glog or protobuf need to be installed on B/C?
None of this is mentioned in the instructions.

@junshi15
Collaborator

The .so files need to exist on the executors, and the library path needs to be set properly.

You don't need to put those files there manually. We usually create a tar file containing all the required library files, then use "--archives /path/to/CaffeOnSparkLibrary_archive.tgz" to ship it via YARN.
Also, set
--conf spark.driver.extraLibraryPath="./CaffeOnSparkLibrary_archive.tgz:your_other_LD_LIBRARY_PATH"
--conf spark.executorEnv.LD_LIBRARY_PATH="./CaffeOnSparkLibrary_archive.tgz:your_other_LD_LIBRARY_PATH"

so that the executors know where to find them.

@mumlax

mumlax commented Jul 25, 2017

Thanks for your answer.

Okay, that sounds like useful information. So the --archives option is necessary? Then it is missing in the wiki instructions.
Which "CaffeOnSparkLibrary_archive.tgz" do you mean? I can't find one in my ${CAFFE_ON_SPARK} directory. Or do you create this tar file yourself from the whole ${CAFFE_ON_SPARK} directory?
And this archive needs to be added to the paths via --conf, alright, that's also missing in the instructions. I'll try it once I know which tar file you mean.

@junshi15
Collaborator

I did not write the tutorial, so I am not sure what context it was written in. It seems to me that the whole cluster there is in a single box, i.e. everything is local. In that case all the executors should know where the .so files are, and you don't have to ship anything. I am not sure why you were getting the error if you were running in "local" yarn mode.

I was talking about the case where the executors are distributed; that is where --archives is needed. "CaffeOnSparkLibrary_archive.tgz" is just a tar.gz file containing all the required library files; you create it yourself with "tar czf CaffeOnSparkLibrary_archive.tgz all_library_files". Basically, you need to copy the files under caffe-public/distribute/lib and caffe-distri/distribute/lib to a temp directory and tar everything in that temp directory into a tarball, as sketched below.
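For example, a minimal sketch of that procedure (the temp directory and archive name are just examples):

mkdir -p /tmp/cos_libs
cp ${CAFFE_ON_SPARK}/caffe-public/distribute/lib/* /tmp/cos_libs/
cp ${CAFFE_ON_SPARK}/caffe-distri/distribute/lib/* /tmp/cos_libs/
# tar the files themselves, so the .so files end up at the root of the archive
cd /tmp/cos_libs && tar czf ${CAFFE_ON_SPARK}/CaffeOnSparkLibrary_archive.tgz *
# then ship it with: --archives ${CAFFE_ON_SPARK}/CaffeOnSparkLibrary_archive.tgz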

@mumlax

mumlax commented Aug 17, 2017

I think you mean a pseudo cluster. In my case it's a real cluster.

Alright. I added the archive and extended the paths as mentioned, and this error disappeared. But another one appeared (different topic, I mentioned it in the existing #239 (comment)).

Another story, and the real reason why I'm writing here again: in this archive context, a strange error appeared when executing the cifar example of the same tutorial (where I also added the archive).

17/08/17 17:16:20 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 6, HOSTNAME.com): java.lang.UnsatisfiedLinkError: /data/hadoop/yarn/local/usercache/spark/filecache/132/CoS_libArchive.tgz/libcaffedistri.so: libcaffe.so.1.0.0-rc3: cannot open shared object file: No such file or directory
	at java.lang.ClassLoader$NativeLibrary.load(Native Method)
	at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
	at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1857)
	at java.lang.Runtime.loadLibrary0(Runtime.java:870)
	at java.lang.System.loadLibrary(System.java:1122)
	at com.yahoo.ml.jcaffe.BaseObject.<clinit>(BaseObject.java:10)
	at com.yahoo.ml.caffe.CaffeProcessor.<init>(CaffeProcessor.scala:76)
	at com.yahoo.ml.caffe.CaffeProcessor$.instance(CaffeProcessor.scala:23)
	at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$4.apply(CaffeOnSpark.scala:115)
	at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$4.apply(CaffeOnSpark.scala:113)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:934)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:934)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1857)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1857)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I don't even understand the error. How is it possible that the system says "cannot open shared object file: No such file or directory" when it had just unpacked the libcaffedistri.so file from the archive (CoS_libArchive.tgz)?
Can anybody explain this to me and give me a hint how to solve it? ;)

I think the error above also causes the final exception, with which the application fails:

17/08/17 17:16:21 WARN TaskSetManager: Lost task 0.1 in stage 2.0 (TID 9, IVV5BS-ES08.UNI-MUENSTER.DE): java.lang.NoClassDefFoundError: Could not initialize class com.yahoo.ml.jcaffe.CaffeNet
	at com.yahoo.ml.caffe.CaffeProcessor.<init>(CaffeProcessor.scala:76)
	at com.yahoo.ml.caffe.CaffeProcessor$.instance(CaffeProcessor.scala:23)
	at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$4.apply(CaffeOnSpark.scala:115)
	at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$4.apply(CaffeOnSpark.scala:113)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:934)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:934)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1857)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1857)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

@junshi15
Collaborator

The error message is clear: the program cannot load libcaffedistri.so, because one of its dependencies (libcaffe.so.1.0.0-rc3) cannot be found:
/data/hadoop/yarn/local/usercache/spark/filecache/132/CoS_libArchive.tgz/libcaffedistri.so: libcaffe.so.1.0.0-rc3: cannot open shared object file: No such file or directory

Your LD_LIBRARY_PATH was not set properly, it appears.
If you had the following line in your spark-submit command:
--archives your/path/to/CoS_libArchive.tgz \
then you should tell Spark where the .so files are with the following option:
--conf spark.executorEnv.LD_LIBRARY_PATH="./CoS_libArchive.tgz:/your/other/lib/path"
Spark takes the .tgz and unpacks it into a directory with the same name as the file. My assumption is that libcaffedistri.so is at the root of your .tgz file. If you have sub-directories, you need to adjust the option accordingly. For example, if your libcaffedistri.so is under ./lib64 inside the .tgz file, then your option should be:
--conf spark.executorEnv.LD_LIBRARY_PATH="./CoS_libArchive.tgz/lib64:/your/other/lib/path"
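If you want to double-check which shared objects libcaffedistri.so itself depends on (and which of them cannot be resolved), ldd is a quick diagnostic; the path below is just a placeholder for wherever the archive contents live on a node:

cd /path/to/unpacked/CoS_libArchive.tgz
ldd libcaffedistri.so | grep "not found"
# any "not found" entry (such as libcaffe.so.1.0.0-rc3) must also be included in the archive,
# or be reachable via LD_LIBRARY_PATH on the executor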

@mumlax

mumlax commented Aug 17, 2017

Okay.

Just to give some more information, the spark-submit command and the archive look as follows:

spark-submit --master yarn --deploy-mode cluster --num-executors ${SPARK_WORKER_INSTANCES} \
    --files ${CAFFE_ON_SPARK}/data/cifar10_quick_solver.prototxt,${CAFFE_ON_SPARK}/data/cifar10_quick_train_test.prototxt,${CAFFE_ON_SPARK}/data/mean.binaryproto \
    --archives /home/spark/CaffeOnSpark/CoS_libArchive.tgz \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}:./CoS_libArchive.tgz" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:./CoS_libArchive.tgz" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train -features accuracy,loss -label label -conf cifar10_quick_solver.prototxt \
    -devices ${DEVICES} -connection ethernet \
    -model hdfs:///cifar10.model.h5 -output hdfs:///cifar10_features_result

[spark@HOSTNAME data]$ tar -ztvf /home/spark/CaffeOnSpark/CoS_libArchive.tgz 
-rwxr-xr-x XX/YY 14618104 2017-07-27 11:17 libcaffedistri.a
-rwxr-xr-x XX/YY  5957472 2017-07-27 11:17 libcaffedistri.so
-rwxr-xr-x XX/YY  1631024 2017-07-27 11:19 liblmdbjni.so

So as you can see, everything is as it should be.
But I think I know the source of the UnsatisfiedLinkError: I executed the command above with --deploy-mode client (forgot to mention it, completely my fault), because I read that it is good for debugging purposes and gives more information through additional logging. Both variants have the NoClassDefFoundError, but only the client variant has the UnsatisfiedLinkError. I thought this new error was just part of the additional logging output, but it seems it is caused by the way --deploy-mode=client works (different environments and so on), so that the libs aren't available there.
So we can forget that error. Sorry. I am working in cluster mode.

But the NoClassDefFoundError is still there. Do you have a quick idea, or shall I open an additional issue? I don't want to spam this one.

@junshi15
Collaborator

NoClassDefFoundError sounds like the JVM could not find your .jar file, but you did provide caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar, so I am not sure where the problem is.

@mumlax

mumlax commented Oct 2, 2017

I was able to resolve the problem; here is my solution for other users running into something similar.

Dumb as I was, I had only packed the libs from caffe-distri/distribute/lib and forgot the ones in caffe-public/distribute/lib.
Now everything is working fine :)
As a quick overview, these are the required files which must be included in the archive:

[spark@Hostname ~]$ tar -ztvf /home/spark/CaffeOnSpark/CoS_libArchive.tgz 
-rwxr-xr-x spark/hadoop 102357376 2017-08-16 17:57 libcaffe.a
-rwxr-xr-x spark/hadoop  14618104 2017-08-16 17:57 libcaffedistri.a
-rwxr-xr-x spark/hadoop   5957472 2017-08-16 17:57 libcaffedistri.so
-rwxr-xr-x spark/hadoop  27575720 2017-08-16 17:57 libcaffe.so
-rwxr-xr-x spark/hadoop  27575720 2017-08-16 17:57 libcaffe.so.1.0.0-rc3
-rwxr-xr-x spark/hadoop   1631024 2017-08-16 17:57 liblmdbjni.so

Thanks junshi15 for your help!

@junshi15
Collaborator

junshi15 commented Oct 3, 2017

@BlueRayONE

Thanks for your feedback. It will help other users who experienced the same issue.
