This repository has been archived by the owner on Nov 16, 2019. It is now read-only.

AMI updated to new code #41

Closed
rahulbhalerao001 opened this issue Mar 25, 2016 · 35 comments

Comments

@rahulbhalerao001

Is the AMI ami-6373ca10 updated with the latest code? If not, what are the steps to bring it up to the latest development?

@anfeng
Contributor

anfeng commented Mar 25, 2016

We will need to bring it up to date per the instructions given at https://github.com/yahoo/CaffeOnSpark/wiki/Create_AMI

@rahulbhalerao001
Author

Thank you for your quick response. So to confirm: I need to follow steps 6, 7, 8, and 9 from the wiki (without the clone, though) on AMI ami-6373ca10.

Also, do I need to do it only on the master or also on the slaves?

@rahulbhalerao001
Author

Or will you recommend creating a new AMI from scratch following all steps, and then using it for all master and slave machines?

@anfeng
Contributor

anfeng commented Mar 25, 2016

We should launch an instance with the existing image, and apply steps 6, 8, and 9 to build a new image. For step 6, we will do a git pull to get the updated source code.
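
Roughly, on the launched instance (a sketch; /root/CaffeOnSpark is the checkout path already used on this image):

cd /root/CaffeOnSpark
git pull    # step 6, with a git pull in place of the original clone
# then apply steps 8 and 9 from the wiki to rebuild and create the new image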

If I find time this weekend, I will try to create a new image.

@rahulbhalerao001
Author

I started a new g2.8xlarge instance with the latest AMI (ami-6373ca10). I did a git pull and then ran:

pushd CaffeOnSpark/caffe-public/
cp Makefile.config.example Makefile.config
echo "INCLUDE_DIRS += ${JAVA_HOME}/include" >> Makefile.config
pushd ..
export CAFFE_ON_SPARK=/root/CaffeOnSpark
export LD_LIBRARY_PATH="${CAFFE_ON_SPARK}/caffe-public/distribute/lib:${CAFFE_ON_SPARK}/caffe-distri/distribute/lib:/usr/lib64:/lib64:/usr/local/cuda-7.0/lib64"
make build

But I am getting the following error:

[INFO] Compiling 16 source files to /root/CaffeOnSpark/caffe-grid/target/classes at 1458951756956
[ERROR] /root/CaffeOnSpark/caffe-grid/src/main/scala/com/yahoo/ml/caffe/ImageDataFrame.scala:35: error: value hasDataframeFormat is not a member of caffe.Caffe.MemoryDataParameter
[INFO] if (memdatalayer_param.hasDataframeFormat())
[INFO] ^
[ERROR] /root/CaffeOnSpark/caffe-grid/src/main/scala/com/yahoo/ml/caffe/ImageDataFrame.scala:36: error: value getDataframeFormat is not a member of caffe.Caffe.MemoryDataParameter
[INFO] reader = reader.format(memdatalayer_param.getDataframeFormat())
[INFO] ^
[ERROR] two errors found
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] caffe ............................................. SUCCESS [0.002s]
[INFO] caffe-distri ...................................... SUCCESS [4:58.843s]
[INFO] caffe-grid ........................................ FAILURE [38.472s]
[INFO] ------------------------------------------------------------------------

It would be great if you could let me know if I am missing something here.

@anfeng
Contributor

anfeng commented Mar 26, 2016

You need to update the caffe-public submodule:

cd caffe-public
git pull origin master
cd ..
make build


@rahulbhalerao001
Author

It is giving another error now :(. I apologize for the spam; please let me know if this is an issue specific to my instance that I should figure out myself.

Tests run: 17, Failures: 9, Errors: 0, Skipped: 7, Time elapsed: 0.785 sec <<< FAILURE!
setUp(com.yahoo.ml.jcaffe.CaffeNetTest) Time elapsed: 0.322 sec <<< FAILURE!
java.lang.UnsatisfiedLinkError: /root/CaffeOnSpark/caffe-distri/.build_release/lib/libcaffedistri.so: libcudart.so.7.0: cannot open shared object file: No such file or directory
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1965)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1890)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1880)
at java.lang.Runtime.loadLibrary0(Runtime.java:849)
at java.lang.System.loadLibrary(System.java:1088)
at com.yahoo.ml.jcaffe.BaseObject.(BaseObject.java:10)
at com.yahoo.ml.jcaffe.CaffeNetTest.setUp(CaffeNetTest.java:39)

@mriduljain
Contributor

Just add the path to your CUDA libs to LD_LIBRARY_PATH, export it, and compile.
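
For example (a sketch; assuming CUDA 7.0 lives under /usr/local/cuda-7.0, as in your earlier export):

export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/cuda-7.0/lib64"   # add the CUDA runtime libs
make build                                                              # rebuild so the tests can find libcudart.so.7.0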

@rahulbhalerao001
Author

Thank you for your patience.
I was able to build successfully, and then I launched a 3-node cluster using the scripts given in the wiki, replacing the AMI ID with that of my AMI containing the updated code.

I followed the remaining steps and ran the LeNet example from the page, and I am getting the following error. Your help will be greatly appreciated.

Exception in thread "main" org.apache.spark.SparkException: addFile does not support local directories when not running local mode.
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1368)
at com.yahoo.ml.caffe.LmdbRDD.getPartitions(LmdbRDD.scala:43)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at com.yahoo.ml.caffe.CaffeOnSpark.train(CaffeOnSpark.scala:157)
at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:40)
at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

@anfeng
Contributor

anfeng commented Mar 27, 2016

It sounds like we should not use SparkContext.addFile() and SparkFiles.get() when lmdb_path is a local file path. LmdbRDD.scala may need to be revised as below:

  • L43 ... if (!lmdb_path.startsWith(FSUtils.localfsPrefix)) sc.addFile(lmdb_path, true)
  • L168 ... val local_lmdb_folder = if (lmdb_path.startsWith(FSUtils.localfsPrefix)) lmdb_path.substring("file://".length) else SparkFiles.get(folder.getName)

@rahulbhalerao001
Author

OK, that worked, and I am no longer getting that error.
However, as shown below, the MNIST example is taking a long time: it has been stuck in the min step for more than 10 minutes. I remember running this just after CaffeOnSpark was open-sourced, and it used to complete within 5 minutes. Am I missing something here?
[screenshot: Spark UI showing the job stuck in the min step]

@mriduljain
Contributor

Could you check the executor logs, please?


@rahulbhalerao001
Author

I think that instead of copy-pasting, sharing the Spark URL will be a better option: http://ec2-54-194-79-51.eu-west-1.compute.amazonaws.com:4040/jobs/

Please let me know if I should paste the logs here.

@rahulbhalerao001
Author

It's stuck for 35 minutes now. Attaching the logs:
executor_stderr.txt

[screenshot: Spark UI, job still stuck]

@anfeng
Contributor

anfeng commented Mar 27, 2016

What's your CLI command? It looks like the GPUs could not communicate with each other.

I0327 06:06:38.474051 4559 parallel.cpp:392] GPUs pairs 0:1, 2:3, 0:2
I0327 06:06:38.485669 4559 parallel.cpp:234] GPU 1 does not have p2p access to GPU 0
I0327 06:06:38.496879 4559 parallel.cpp:234] GPU 2 does not have p2p access to GPU 0
I0327 06:06:38.508085 4559 parallel.cpp:234] GPU 3 does not have p2p access to GPU 2


@rahulbhalerao001
Author

The command is the same as the one from the wiki:

root@ip-172-31-29-4:~/CaffeOnSpark/data# spark-submit --master spark://$(hostname):7077 \
    --files lenet_memory_train_test.prototxt,lenet_memory_solver.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -features accuracy,loss -label label \
    -conf lenet_memory_solver.prototxt \
    -clusterSize ${SPARK_WORKER_INSTANCES} \
    -devices ${DEVICES} -connection ethernet \
    -model /mnist.model -output /mnist_features_result

@anfeng
Contributor

anfeng commented Mar 27, 2016

What are the values for TOTAL_CORES and DEVICES?

@junshi15
Collaborator

In your setup, the GPUs cannot do P2P access, so communication among the GPUs will be slow. You can check your setup against
https://github.com/BVLC/caffe/blob/master/docs/multigpu.md#hardware-configuration-assumptions

Your job was actually stuck in the code below, which is where the program tries to find the minimal size of the partitions. I suspect one of the executors failed to read the LMDB file for one of various reasons: hard disk failure, Hadoop file system failure, LMDB parser failure, etc. I only see the log file for one executor. You may want to examine all of them.
https://github.com/yahoo/CaffeOnSpark/blob/master/caffe-grid/src/main/scala/com/yahoo/ml/caffe/CaffeOnSpark.scala#L167-L182
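
To scan them all at once, something like this on each worker node may help (a sketch; it assumes Spark standalone's default layout, where each executor writes its stderr under ${SPARK_HOME}/work/<app-id>/<executor-id>/):

# list executor logs that mention errors or exceptions
grep -li -e error -e exception ${SPARK_HOME}/work/app-*/*/stderr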

Did the old AMI work for you? You only had problems after upgrading the AMI?

@rahulbhalerao001
Author

@anfeng:
export CORES_PER_WORKER=32
export DEVICES=4
export SPARK_WORKER_INSTANCES=3
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))

@rahulbhalerao001
Author

@junshi15:
All the log files had similar content.
The first AMI that was launched had worked, but after pulling the latest changes it did not work in any AMI.
I am not doing anything new or different; I just pulled the changes, built, and followed the steps on the wiki.

@rahulbhalerao001
Author

It would be great if someone could try this out and see if it works, because I have tried to run this basic example directly out of the box.

@anfeng
Contributor

anfeng commented Mar 27, 2016

@rahulbhalerao001 I just upgraded the AMI and verified its execution per the guide with g2.8xlarge. Please try out the new AMI ami-790c8b0a in the eu-west-1 region.

@junshi15 @mriduljain Please review PR #42, which fixes the local file issue found by @rahulbhalerao001.

@rahulbhalerao001
Author

I am still getting the same behavior at http://ec2-54-171-180-143.eu-west-1.compute.amazonaws.com:4040/jobs/

Here are my commands:

export AMI_IMAGE=ami-790c8b0a
export EC2_REGION=eu-west-1
export EC2_ZONE=eu-west-1c
export SPARK_WORKER_INSTANCES=3
export EC2_INSTANCE_TYPE=g2.8xlarge
~/spark/ec2/spark-ec2 --key-pair=newtrial --identity-file=newtrial.pem \
    --region=${EC2_REGION} --zone=${EC2_ZONE} \
    --ebs-vol-size=50 \
    --instance-type=${EC2_INSTANCE_TYPE} \
    --master-instance-type=m4.xlarge \
    --ami=${AMI_IMAGE} -s ${SPARK_WORKER_INSTANCES} \
    --copy-aws-credentials \
    --hadoop-major-version=yarn --spark-version 1.6.0 \
    --no-ganglia \
    --user-data ~/CaffeOnSpark/scripts/ec2-cloud-config.txt \
    --ebs-vol-size=200 \
    launch CaffeOnSparkDemo


On the master:

export CORES_PER_WORKER=32
export DEVICES=4
export SPARK_WORKER_INSTANCES=3
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))

source ~/.bashrc
export PATH=${PATH}:${HADOOP_HOME}/bin:${SPARK_HOME}/bin
pushd ${CAFFE_ON_SPARK}/data

hadoop fs -rm -r -f /mnist.model
hadoop fs -rm -r -f /mnist_features_result
spark-submit --master spark://$(hostname):7077 \
    --files lenet_memory_train_test.prototxt,lenet_memory_solver.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -features accuracy,loss -label label \
    -conf lenet_memory_solver.prototxt \
    -clusterSize ${SPARK_WORKER_INSTANCES} \
    -devices ${DEVICES} \
    -connection ethernet \
    -model /mnist.model \
    -output /mnist_features_result

@rahulbhalerao001
Author

Still having the same issue; I am attaching the logs here and terminating the instances.

worker1.txt
worker2.txt
worker3.txt
master.txt

@anfeng
Contributor

anfeng commented Mar 28, 2016

From the logs, the GPU devices could not be synchronized; they are stuck at L180 of CaffeOnSpark.scala. This appears to be a system issue unrelated to our code change.

Here is a response from an NVIDIA engineer:

  • Known issue. You can't enable P2P on a VM because of the VM's protection mechanism.

We may have to set devices=1 for EC2.
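
The change is just the device count (a sketch; everything else in the submit command stays the same):

export DEVICES=1   # one GPU per executor, so no P2P access between GPUs is required
# then re-run the spark-submit command above with -devices ${DEVICES}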

@rahulbhalerao001
Author

@anfeng
Thank you for your response. I will try with devices=1, but I want to point out that with the initial AMI it worked on EC2. I highly doubt it is an issue specific to EC2/NVIDIA, because the code ran perfectly fine with the same configuration previously.

@anfeng
Contributor

anfeng commented Mar 28, 2016

@rahulbhalerao001 We reproduced a hang problem within L180 in-house, and will come up with a solution soon.

@rahulbhalerao001
Author

@anfeng thank you for looking into it. Meanwhile, will it work with 1 device per machine?

@anfeng
Contributor

anfeng commented Mar 28, 2016

AFAIK, the code works fine for 2 machines, even with multiple devices. Somehow, we have a problem with 3 machines or more.

@rahulbhalerao001
Author

@anfeng: I am reframing my question.
The MNIST example produces a directory /mnist_features_result, which has intermediate accuracy and loss logging, e.g.:
{"SampleID":"00000000","accuracy":[1.0],"loss":[0.0019047105],"label":[7.0]}
{"SampleID":"00000001","accuracy":[1.0],"loss":[0.0019047105],"label":[2.0]}
{"SampleID":"00000002","accuracy":[1.0],"loss":[0.0019047105],"label":[1.0]}
{"SampleID":"00000003","accuracy":[1.0],"loss":[0.0019047105],"label":[0.0]}
{"SampleID":"00000004","accuracy":[1.0],"loss":[0.0019047105],"label":[4.0]}
{"SampleID":"00000005","accuracy":[1.0],"loss":[0.0019047105],"label":[1.0]}
{"SampleID":"00000006","accuracy":[1.0],"loss":[0.0019047105],"label":[4.0]}
{"SampleID":"00000007","accuracy":[1.0],"loss":[0.0019047105],"label":[9.0]}
{"SampleID":"00000008","accuracy":[1.0],"loss":[0.0019047105],"label":[5.0]}
{"SampleID":"00000009","accuracy":[1.0],"loss":[0.0019047105],"label":[9.0]}

For CIFAR-10, /cifar10_features_result is a file that has the final loss and accuracy; for example:
loss: 1.367735541228092
accuracy: 0.6595959609205072

Could you provide some info on how to configure which form of result is obtained? For example, how can I view periodic values for CIFAR-10?

@anfeng
Contributor

anfeng commented Mar 29, 2016

@rahulbhalerao001 It would be great if you could verify our new code per PR #43. It has been verified only in our local environment.

@rahulbhalerao001
Author

Thank you for providing a fix. I will do a make build, verify and let you know by end of day.

@rahulbhalerao001
Author

@anfeng: I verified the MNIST and CIFAR-10 examples on a cluster of 3 g2.8xlarge instances. It is working correctly and no longer hangs.

Before closing this issue, it would be great if you could help me figure out my previous question.

@junshi15
Collaborator

@rahulbhalerao001
The accuracy and loss you got for MNIST are per-mini-batch numbers. You got them via "-features accuracy,loss". To get the overall accuracy and loss, replace that with "-test", as you did for CIFAR-10.

Interleaving training and testing, i.e. periodic testing while training, is not available in the current version of CaffeOnSpark, so you will not get any test accuracy/loss during training (though you can get the training accuracy/loss, which may not be as useful). The workaround is to snapshot the model periodically and start another Spark job with "-test" to get the accuracy/loss.
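
For example, your MNIST submit command would change roughly as below (a sketch; /mnist_test_result is just an illustrative output path):

spark-submit --master spark://$(hostname):7077 \
    --files lenet_memory_train_test.prototxt,lenet_memory_solver.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -test \
    -conf lenet_memory_solver.prototxt \
    -clusterSize ${SPARK_WORKER_INSTANCES} \
    -devices ${DEVICES} -connection ethernet \
    -model /mnist.model -output /mnist_test_result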

@rahulbhalerao001
Author

@junshi15: Thank you for the detailed explanation. I appreciate your help.
