This repository has been archived by the owner on Nov 16, 2019. It is now read-only.

AMI updated to new code #41

Closed
rahulbhalerao001 opened this issue Mar 25, 2016 · 35 comments

Comments

@rahulbhalerao001

Is the AMI ami-6373ca10 updated with the latest code? If not, what are the steps to bring it up to the latest development?

@anfeng
Contributor

anfeng commented Mar 25, 2016

We will need to bring it up to date per the instructions given at https://github.com/yahoo/CaffeOnSpark/wiki/Create_AMI

@rahulbhalerao001
Author

Thank you for your quick response. So to confirm: I need to follow steps 6, 7, 8, and 9 from the wiki (without the clone, though) on AMI ami-6373ca10.

Also, do I need to do it only on the master or also on the slaves?

@rahulbhalerao001
Author

Or will you recommend creating a new AMI from scratch following all steps, and then using it for all master and slave machines?

@anfeng
Contributor

anfeng commented Mar 25, 2016

We should launch an instance with the existing image, and apply steps 6, 8, and 9 to build a new image. For step 6, we will do a git pull to get the updated source code.
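
Roughly, on the launched instance (a sketch; /root/CaffeOnSpark is the checkout path already used on this image):

cd /root/CaffeOnSpark
git pull    # step 6, with a git pull in place of the original clone
# then apply steps 8 and 9 from the wiki to rebuild and create the new image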

If I find time this weekend, I will try to create a new image.

@rahulbhalerao001
Author

I started a new g2.8xlarge instance with the latest AMI (ami-6373ca10). I did a git pull and then ran:

pushd CaffeOnSpark/caffe-public/
cp Makefile.config.example Makefile.config
echo "INCLUDE_DIRS += ${JAVA_HOME}/include" >> Makefile.config
pushd ..
export CAFFE_ON_SPARK=/root/CaffeOnSpark
export LD_LIBRARY_PATH="${CAFFE_ON_SPARK}/caffe-public/distribute/lib:${CAFFE_ON_SPARK}/caffe-distri/distribute/lib:/usr/lib64:/lib64:/usr/local/cuda-7.0/lib64"
make build

But I am getting the following error:

[INFO] Compiling 16 source files to /root/CaffeOnSpark/caffe-grid/target/classes at 1458951756956
[ERROR] /root/CaffeOnSpark/caffe-grid/src/main/scala/com/yahoo/ml/caffe/ImageDataFrame.scala:35: error: value hasDataframeFormat is not a member of caffe.Caffe.MemoryDataParameter
[INFO] if (memdatalayer_param.hasDataframeFormat())
[INFO] ^
[ERROR] /root/CaffeOnSpark/caffe-grid/src/main/scala/com/yahoo/ml/caffe/ImageDataFrame.scala:36: error: value getDataframeFormat is not a member of caffe.Caffe.MemoryDataParameter
[INFO] reader = reader.format(memdatalayer_param.getDataframeFormat())
[INFO] ^
[ERROR] two errors found
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] caffe ............................................. SUCCESS [0.002s]
[INFO] caffe-distri ...................................... SUCCESS [4:58.843s]
[INFO] caffe-grid ........................................ FAILURE [38.472s]
[INFO] ------------------------------------------------------------------------

It would be great if you could let me know if I am missing something here.

@anfeng
Contributor

anfeng commented Mar 26, 2016

You need to update the caffe-public submodule:

cd caffe-public
git pull origin master
cd ..
make build


@rahulbhalerao001
Author

It is giving another error now :(. I apologize for the spam; please let me know if this is an issue specific to my instance that I should figure out myself.

Tests run: 17, Failures: 9, Errors: 0, Skipped: 7, Time elapsed: 0.785 sec <<< FAILURE!
setUp(com.yahoo.ml.jcaffe.CaffeNetTest) Time elapsed: 0.322 sec <<< FAILURE!
java.lang.UnsatisfiedLinkError: /root/CaffeOnSpark/caffe-distri/.build_release/lib/libcaffedistri.so: libcudart.so.7.0: cannot open shared object file: No such file or directory
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1965)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1890)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1880)
at java.lang.Runtime.loadLibrary0(Runtime.java:849)
at java.lang.System.loadLibrary(System.java:1088)
at com.yahoo.ml.jcaffe.BaseObject.(BaseObject.java:10)
at com.yahoo.ml.jcaffe.CaffeNetTest.setUp(CaffeNetTest.java:39)

@mriduljain
Contributor

Just add the path to your CUDA libs to LD_LIBRARY_PATH, export it, and compile.
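
For example (a sketch; assuming CUDA 7.0 lives under /usr/local/cuda-7.0, as in your earlier export):

export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/cuda-7.0/lib64"   # add the CUDA runtime libs
make build                                                              # rebuild so the tests can find libcudart.so.7.0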

@rahulbhalerao001
Author

Thank you for your patience.
I was able to build successfully, and then I launched a 3-node cluster using the scripts given in the wiki, replacing the AMI ID with that of my AMI containing the updated code.

I followed the remaining steps and ran the LeNet example from the page, and I am getting the following error. Your help will be greatly appreciated.

Exception in thread "main" org.apache.spark.SparkException: addFile does not support local directories when not running local mode.
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1368)
at com.yahoo.ml.caffe.LmdbRDD.getPartitions(LmdbRDD.scala:43)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at com.yahoo.ml.caffe.CaffeOnSpark.train(CaffeOnSpark.scala:157)
at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:40)
at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

@anfeng
Contributor

anfeng commented Mar 27, 2016

It sounds like we should not use SparkContext.addFile() and SparkFiles.get() when lmdb_path is a local file path. LmdbRDD.scala may need to be revised as below:

  • L43 ... if (!lmdb_path.startsWith(FSUtils.localfsPrefix)) sc.addFile(lmdb_path, true)
  • L168 ... val local_lmdb_folder = if (lmdb_path.startsWith(FSUtils.localfsPrefix)) lmdb_path.substring("file://".length) else SparkFiles.get(folder.getName)

@rahulbhalerao001
Author

OK, that worked, and I am no longer getting that error.
However, as shown below, the MNIST example is taking a long time: it has been stuck in the min step for more than 10 minutes. I remember running this just after CaffeOnSpark was open-sourced, and it used to complete within 5 minutes. Am I missing something here?
[screenshot: Spark UI showing the job stuck in the min step]

@mriduljain
Contributor

Could you check the executor logs, please?


@rahulbhalerao001
Author

I think that instead of copy-pasting, sharing the Spark URL will be a better option: http://ec2-54-194-79-51.eu-west-1.compute.amazonaws.com:4040/jobs/

Please let me know if I should paste the logs here.

@rahulbhalerao001
Author

It's stuck for 35 minutes now. Attaching the logs:
executor_stderr.txt

[screenshot: Spark UI, job still stuck]

@anfeng
Contributor

anfeng commented Mar 27, 2016

What's your CLI command? It looks like the GPUs could not communicate with each other.

I0327 06:06:38.474051 4559 parallel.cpp:392] GPUs pairs 0:1, 2:3, 0:2
I0327 06:06:38.485669 4559 parallel.cpp:234] GPU 1 does not have p2p access to GPU 0
I0327 06:06:38.496879 4559 parallel.cpp:234] GPU 2 does not have p2p access to GPU 0
I0327 06:06:38.508085 4559 parallel.cpp:234] GPU 3 does not have p2p access to GPU 2


@rahulbhalerao001
Author

The command is the same as the one from the wiki:

root@ip-172-31-29-4:~/CaffeOnSpark/data# spark-submit --master spark://$(hostname):7077 \
    --files lenet_memory_train_test.prototxt,lenet_memory_solver.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -features accuracy,loss -label label \
    -conf lenet_memory_solver.prototxt \
    -clusterSize ${SPARK_WORKER_INSTANCES} \
    -devices ${DEVICES} -connection ethernet \
    -model /mnist.model -output /mnist_features_result

@anfeng
Contributor

anfeng commented Mar 27, 2016

What are the values for TOTAL_CORES and DEVICES?

@junshi15
Collaborator

In your setup, the GPUs cannot do P2P access, so communication among the GPUs will be slow. You can check your setup against
https://github.com/BVLC/caffe/blob/master/docs/multigpu.md#hardware-configuration-assumptions

Your job was actually stuck in the code below, which is where the program tries to find the minimal size of the partitions. I suspect one of the executors failed to read the LMDB file for one of various reasons: hard disk failure, Hadoop file system failure, LMDB parser failure, etc. I only see the log file for one executor. You may want to examine all of them.
https://github.com/yahoo/CaffeOnSpark/blob/master/caffe-grid/src/main/scala/com/yahoo/ml/caffe/CaffeOnSpark.scala#L167-L182
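
To scan them all at once, something like this on each worker node may help (a sketch; it assumes Spark standalone's default layout, where each executor writes its stderr under ${SPARK_HOME}/work/<app-id>/<executor-id>/):

# list executor logs that mention errors or exceptions
grep -li -e error -e exception ${SPARK_HOME}/work/app-*/*/stderr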

Did the old AMI work for you? You only had problems after upgrading the AMI?

@rahulbhalerao001
Author

@anfeng:
export CORES_PER_WORKER=32
export DEVICES=4
export SPARK_WORKER_INSTANCES=3
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))

@rahulbhalerao001
Author

@junshi15:
All the log files had similar content.
The first AMI that was launched had worked, but after pulling the latest changes it did not work in any AMI.
I am not doing anything new or different; I just pulled the changes, built, and followed the steps on the wiki.

@rahulbhalerao001
Author

It would be great if someone could try this out and see if it works, because I have tried to run this basic example directly out of the box.

@anfeng
Contributor

anfeng commented Mar 27, 2016

@rahulbhalerao001 I just upgraded the AMI and verified its execution per the guide with g2.8xlarge. Please try out the new AMI ami-790c8b0a in the eu-west-1 region.

@junshi15 @mriduljain Please review PR #42, which fixes the local file issue found by @rahulbhalerao001.

@rahulbhalerao001
Author

I am still getting the same behavior at http://ec2-54-171-180-143.eu-west-1.compute.amazonaws.com:4040/jobs/

Here are my commands:

export AMI_IMAGE=ami-790c8b0a
export EC2_REGION=eu-west-1
export EC2_ZONE=eu-west-1c
export SPARK_WORKER_INSTANCES=3
export EC2_INSTANCE_TYPE=g2.8xlarge
~/spark/ec2/spark-ec2 --key-pair=newtrial --identity-file=newtrial.pem \
    --region=${EC2_REGION} --zone=${EC2_ZONE} \
    --ebs-vol-size=50 \
    --instance-type=${EC2_INSTANCE_TYPE} \
    --master-instance-type=m4.xlarge \
    --ami=${AMI_IMAGE} -s ${SPARK_WORKER_INSTANCES} \
    --copy-aws-credentials \
    --hadoop-major-version=yarn --spark-version 1.6.0 \
    --no-ganglia \
    --user-data ~/CaffeOnSpark/scripts/ec2-cloud-config.txt \
    --ebs-vol-size=200 \
    launch CaffeOnSparkDemo


On the master:

export CORES_PER_WORKER=32
export DEVICES=4
export SPARK_WORKER_INSTANCES=3
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))

source ~/.bashrc
export PATH=${PATH}:${HADOOP_HOME}/bin:${SPARK_HOME}/bin
pushd ${CAFFE_ON_SPARK}/data

hadoop fs -rm -r -f /mnist.model
hadoop fs -rm -r -f /mnist_features_result
spark-submit --master spark://$(hostname):7077 \
    --files lenet_memory_train_test.prototxt,lenet_memory_solver.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -features accuracy,loss -label label \
    -conf lenet_memory_solver.prototxt \
    -clusterSize ${SPARK_WORKER_INSTANCES} \
    -devices ${DEVICES} \
    -connection ethernet \
    -model /mnist.model \
    -output /mnist_features_result

@rahulbhalerao001
Author

Still having the same issue; I am attaching the logs here and terminating the instances.

worker1.txt
worker2.txt
worker3.txt
master.txt

@anfeng
Contributor

anfeng commented Mar 28, 2016

From the logs, the GPU devices could not be synchronized; they are stuck at L180 of CaffeOnSpark.scala. This appears to be a system issue unrelated to our code change.

Here is a response from an NVIDIA engineer:

  • Known issue. You can't enable P2P on a VM because of the VM's protection mechanism.

We may have to set devices=1 for EC2.
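
The change is just the device count (a sketch; everything else in the submit command stays the same):

export DEVICES=1   # one GPU per executor, so no P2P access between GPUs is required
# then re-run the spark-submit command above with -devices ${DEVICES}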

@rahulbhalerao001
Author

@anfeng
Thank you for your response. I will try with devices=1, but I want to point out that with the initial AMI it worked on EC2. I highly doubt it is an issue specific to EC2/NVIDIA, because the code ran perfectly fine with the same configuration previously.

@anfeng
Contributor

anfeng commented Mar 28, 2016

@rahulbhalerao001 We reproduced a hang problem within L180 in-house, and will come up with a solution soon.

@rahulbhalerao001
Author

@anfeng thank you for looking into it. Meanwhile, will it work with 1 device per machine?

@anfeng
Contributor

anfeng commented Mar 28, 2016

AFAIK, the code works fine for 2 machines, even with multiple devices. Somehow, we have a problem with 3 machines or more.

@rahulbhalerao001
Author

@anfeng: I am reframing my question.
The MNIST example produces a directory /mnist_features_result, which has intermediate accuracy and loss logging, e.g.:
{"SampleID":"00000000","accuracy":[1.0],"loss":[0.0019047105],"label":[7.0]}
{"SampleID":"00000001","accuracy":[1.0],"loss":[0.0019047105],"label":[2.0]}
{"SampleID":"00000002","accuracy":[1.0],"loss":[0.0019047105],"label":[1.0]}
{"SampleID":"00000003","accuracy":[1.0],"loss":[0.0019047105],"label":[0.0]}
{"SampleID":"00000004","accuracy":[1.0],"loss":[0.0019047105],"label":[4.0]}
{"SampleID":"00000005","accuracy":[1.0],"loss":[0.0019047105],"label":[1.0]}
{"SampleID":"00000006","accuracy":[1.0],"loss":[0.0019047105],"label":[4.0]}
{"SampleID":"00000007","accuracy":[1.0],"loss":[0.0019047105],"label":[9.0]}
{"SampleID":"00000008","accuracy":[1.0],"loss":[0.0019047105],"label":[5.0]}
{"SampleID":"00000009","accuracy":[1.0],"loss":[0.0019047105],"label":[9.0]}

For CIFAR-10, /cifar10_features_result is a file that has the final loss and accuracy; for example:
loss: 1.367735541228092
accuracy: 0.6595959609205072

Could you provide some info on how to configure which form of result is obtained? For example, how can I view periodic values for CIFAR-10?

@anfeng
Contributor

anfeng commented Mar 29, 2016

@rahulbhalerao001 It would be great if you could verify our new code per PR #43. It has been verified only in our local environment.

@rahulbhalerao001
Author

Thank you for providing a fix. I will do a make build, verify and let you know by end of day.

@rahulbhalerao001
Author

@anfeng: I verified the MNIST and CIFAR-10 examples on a cluster of 3 g2.8xlarge instances. It is working correctly and no longer hangs.

Before closing this issue, it would be great if you could help me figure out my previous question.

@junshi15
Collaborator

@rahulbhalerao001
The accuracy and loss you got for MNIST are per-mini-batch numbers. You got them via "-features accuracy,loss". To get the overall accuracy and loss, replace that with "-test", as you did for CIFAR-10.

Interleaving training and testing, i.e. periodic testing while training, is not available in the current version of CaffeOnSpark, so you will not get any test accuracy/loss during training (though you can get the training accuracy/loss, which may not be as useful). The workaround is to snapshot the model periodically and start another Spark job with "-test" to get the accuracy/loss.
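
For example, your MNIST submit command would change roughly as below (a sketch; /mnist_test_result is just an illustrative output path):

spark-submit --master spark://$(hostname):7077 \
    --files lenet_memory_train_test.prototxt,lenet_memory_solver.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -test \
    -conf lenet_memory_solver.prototxt \
    -clusterSize ${SPARK_WORKER_INSTANCES} \
    -devices ${DEVICES} -connection ethernet \
    -model /mnist.model -output /mnist_test_result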

@rahulbhalerao001
Author

@junshi15: Thank you for the detailed explanation. I appreciate your help.
