AMI updated to new code #41
We will need to bring it up to date per the instructions given at https://github.com/yahoo/CaffeOnSpark/wiki/Create_AMI
Thank you for your quick response. So to confirm, I need to follow steps 6, 7, 8, and 9 from the wiki (not the clone, though) on AMI ami-6373ca10. Also, do I need to do it only on the master, or also on the slaves?
Or would you recommend creating a new AMI from scratch following all the steps, and then using it for all master and slave machines?
We should launch an instance with the existing image and apply steps 6, 8, and 9 to build a new image. For step 6, we will do a git pull to get the updated source code. If I find time this weekend, I will try to create a new image.
I started a new g2.8xlarge instance with the latest AMI, ami-6373ca10. I did a git pull and pushd CaffeOnSpark/caffe-public/, but I am getting a build error. The log shows:
[INFO] Compiling 16 source files to /root/CaffeOnSpark/caffe-grid/target/classes at 1458951756956
It would be great if you could let me know if I am missing something here.
You need to update the caffe-public submodule (cd caffe-public).
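For anyone following along, a minimal sketch of that submodule update, assuming the repository was cloned to ~/CaffeOnSpark as in the wiki (the path is an assumption, adjust it to your setup):

```
# Assumes the clone from the wiki lives at ~/CaffeOnSpark; adjust the path if yours differs.
cd ~/CaffeOnSpark
git pull                        # bring the main CaffeOnSpark repo up to date
git submodule update --init     # sync caffe-public to the commit pinned by the repo
cd caffe-public && git log -1   # optional: confirm which submodule commit is checked out
```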
It is giving another error now :( I apologize for the spam; please let me know if this is an issue specific to my instance and one which I should figure out.
Tests run: 17, Failures: 9, Errors: 0, Skipped: 7, Time elapsed: 0.785 sec <<< FAILURE!
Just specify the path to your CUDA libs in LD_LIBRARY_PATH, export it, and compile.
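A sketch of that step, assuming CUDA is installed at the usual /usr/local/cuda location on the AMI (the exact path may differ on your instance):

```
# Point LD_LIBRARY_PATH at the CUDA libraries before rebuilding; /usr/local/cuda is an assumption.
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
cd ~/CaffeOnSpark
make build    # rebuild so the native tests can load libcudart and friends
```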
Thank you for your patience. I followed the remaining steps and ran the LeNet example on the page, and am getting the following error. Your help will be greatly appreciated.
Exception in thread "main" org.apache.spark.SparkException: addFile does not support local directories when not running local mode.
It sounds like we should not use SparkContext.addFile() and SparkFiles.get() when lmdb_path is a local file path. LmdbRDD.scala may need to be revised so that it skips addFile()/SparkFiles.get() for local paths.
Could you check the executor logs, please?
I think instead of copy-pasting, sharing the Spark URL will be a better option: http://ec2-54-194-79-51.eu-west-1.compute.amazonaws.com:4040/jobs/ Please let me know if I should paste the logs here.
It's been stuck for 35 minutes now. Attaching the logs.
What's your CLI command? It looks like the GPUs could not communicate with each other:
I0327 06:06:38.474051 4559 parallel.cpp:392] GPUs pairs 0:1, 2:3, 0:2
The command is the same as in the wiki:
root@ip-172-31-29-4:~/CaffeOnSpark/data# spark-submit --master spark://$(hostname):7077
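For reference, the full training command on that wiki page looks roughly like the sketch below; the flag names, jar version, and prototxt paths are recalled from the GetStarted guide and may differ in your checkout, so treat this as an approximation rather than the exact command:

```
# Approximate MNIST training command from the CaffeOnSpark wiki (paths/flags may vary by version).
export CAFFE_ON_SPARK=/root/CaffeOnSpark
export DEVICES=4          # GPUs used per executor (g2.8xlarge has 4)
export TOTAL_CORES=96     # example value: CORES_PER_WORKER * number of workers
spark-submit --master spark://$(hostname):7077 \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train -features accuracy,loss -label label \
    -conf lenet_memory_solver.prototxt \
    -devices ${DEVICES} -connection ethernet \
    -model hdfs:///mnist.model \
    -output hdfs:///mnist_features_result
```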
What are the values for total_cores and devices?
In your setup, the GPUs cannot do P2P access, so communication among the GPUs will be slow; you can check your setup accordingly. Your job was really stuck at the point where the program tries to find the minimal size of the partitions. I suspect one of the executors failed to read the LMDB file for various reasons: hard disk failure, Hadoop file system failure, LMDB parser failure, etc. I only see the log file for one executor; you may want to examine all of them. Did the old AMI work for you? Did you only have problems after upgrading the AMI?
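One way to check whether the GPUs on the instance can actually reach each other peer-to-peer (these tools are my suggestion, not something prescribed in this thread): nvidia-smi can print the interconnect topology, and the CUDA samples ship a direct P2P test.

```
# Print the GPU interconnect topology; it shows which GPU pairs are candidates for P2P.
nvidia-smi topo -m

# The CUDA samples include a P2P bandwidth/latency test (path assumes the default samples install).
cd /usr/local/cuda/samples/1_Utilities/p2pBandwidthLatencyTest
make && ./p2pBandwidthLatencyTest
```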
@anfeng : |
@junshi15 : |
It would be great if someone could try this out and see if it works, because I have tried to run this basic example directly out of the box.
@rahulbhalerao001 I just upgraded the AMI, and verified its execution per the guide with g2.8xlarge. Please try out the new AMI ami-790c8b0a in the eu-west-1 region. @junshi15 @mriduljain Please review PR #42, which fixes the local file issue found by @rahulbhalerao001.
I am still getting the same behavior at http://ec2-54-171-180-143.eu-west-1.compute.amazonaws.com:4040/jobs/
Here are my commands:
export AMI_IMAGE=ami-790c8b0a
On the master:
export CORES_PER_WORKER=32
source ~/.bashrc
hadoop fs -rm -r -f /mnist.model
I am still having the same issue; attaching the logs here and terminating the instances.
From the log, the GPU devices could not be synchronized; they are stuck at L180 of CaffeOnSpark. This should be a system issue unrelated to our code change. Here is a response from an nVidia engineer:
We may have to set devices=1 for EC2. |
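If that turns out to be necessary, the change is only to the launch environment; a sketch, reusing the variable from the command above:

```
# Workaround sketch: one GPU per executor avoids the multi-GPU synchronization hang.
export DEVICES=1
# ...then rerun the same spark-submit command, which passes "-devices ${DEVICES}".
```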
@anfeng |
@rahulbhalerao001 We reproduced the hang problem within L180 in-house, and will come up with a solution soon.
@anfeng thank you for looking into it. Meanwhile, will it work with 1 device per machine?
AFAIK, the code works fine for 2 machines, even with multiple devices.
@anfeng : I am reframing my question: for CIFAR-10, /cifar10_features_result is a file which has the final loss and accuracy. Could you provide some info on how to configure what form of result is obtained? For example, how do I view periodic values for CIFAR-10?
@rahulbhalerao001 It would be great if you could verify our new code per PR #43; it has been verified in our local environment only.
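For anyone verifying, one generic way to pull a GitHub PR branch and rebuild (the local branch name is arbitrary, and the build step assumes the usual make build from the wiki):

```
# Fetch the PR head into a local branch, switch to it, and rebuild.
cd ~/CaffeOnSpark
git fetch origin pull/43/head:pr-43
git checkout pr-43
make build
```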
Thank you for providing a fix. I will do a make build, verify and let you know by end of day. |
@anfeng : I verified the MNIST and CIFAR-10 examples on a 3-node g2.8xlarge cluster. They are working correctly and no longer hanging. Before closing this issue, it would be great if you could help me figure out the previous question.
@rahulbhalerao001 Interleaving training and testing, i.e. periodic testing while training, is not available in the current version of CaffeOnSpark, so you will not get any test accuracy/loss during training (though you could get the training accuracy/loss, which may not be as useful). The workaround is to snapshot the model periodically and start another Spark job to "-test" the accuracy/loss.
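A rough sketch of that workaround: enable periodic snapshots via the standard Caffe solver fields, then point a separate "-test" job at a saved model. The flags below mirror the training command sketched earlier and are assumptions about the exact invocation, not a verified recipe:

```
# In the solver prototxt, enable periodic snapshots (standard Caffe solver fields):
#   snapshot: 1000
#   snapshot_prefix: "mnist_snapshot"
# Then score a saved model with a separate CaffeOnSpark job:
spark-submit --master spark://$(hostname):7077 \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -test -conf lenet_memory_solver.prototxt \
    -model hdfs:///mnist.model \
    -output hdfs:///mnist_test_result
```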
@junshi15 : Thank you for the detailed explanation. Appreciate your help. |
Is the AMI ami-6373ca10 updated with the latest code? If not, what are the steps to bring it up to the latest development?