CURAND_STATUS_LAUNCH_FAILURE using new Titan X Pascal #1264
Comments
Hello, did you rebuild Caffe against CUDA 8 after you upgraded your CUDA toolkit? See the Caffe build instructions. Caffe binaries for CUDA 8.0 haven't been released yet, so you have to build from source. |
You may also want to double-check that DIGITS is using the correct instance of Caffe on your system (see the config instructions). |
Yes, I rebuilt it from source. On Sun, Nov 13, 2016 at 7:13 AM, Greg Heinrich notifications@github.com
|
Also, I wanted to rule out that it is my data that is somehow the problem, On Sun, Nov 13, 2016 at 7:53 AM, David Cofer dcofer@neurorobotictech.com
|
Are you sure the right version of Caffe is being used by DIGITS? |
Pretty sure. After some more research I believe I have found something else. I will let you know if this works. Thanks for your help. On Mon, Nov 14, 2016 at 1:17 AM, Greg Heinrich notifications@github.com
|
Well dang! I recompiled and verified in the cmake output that the CUDA_ARCH On Mon, Nov 14, 2016 at 6:55 AM, David Cofer dcofer@neurorobotictech.com
|
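For reference, the Titan X Pascal is a compute capability 6.1 device, so the build needs either native code for the 6.x family (sm_60 or sm_61) or embedded PTX that the driver can JIT-compile for it. Below is a minimal sketch of that check, assuming you have copied the nvcc `-gencode` flags out of your build output as a string. The function name and the sample flag strings are made up for illustration; they are not from Caffe's cmake files.

```python
import re

TITAN_X_PASCAL_CC = 61  # compute capability 6.1


def build_covers_device(gencode_flags, device_cc=TITAN_X_PASCAL_CC):
    """Return True if the nvcc -gencode flags can produce code that runs on device_cc.

    Only the code= targets matter here:
      code=sm_XY      embeds a native binary, usable on the same major
                      version with an equal or higher minor version
      code=compute_XY embeds PTX, which can be JIT-compiled for any
                      device of equal or higher compute capability
    """
    for target in re.findall(r"code=(\S+)", gencode_flags):
        for kind, num in re.findall(r"(sm|compute)_(\d+)", target):
            cc = int(num)
            same_major = cc // 10 == device_cc // 10
            if kind == "sm" and same_major and cc <= device_cc:
                return True
            if kind == "compute" and cc <= device_cc:
                return True
    return False


if __name__ == "__main__":
    # Hypothetical flag string, for illustration only.
    flags = "-gencode arch=compute_52,code=sm_52 -gencode arch=compute_61,code=sm_61"
    print(build_covers_device(flags))
```

If this returns False for the flags your build really used, kernels will fail to launch on the Titan even though the same binary runs fine on an older card.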
You can try cutting DIGITS out of the loop for debugging and just use caffe directly. For an easy test:
    ./data/mnist/get_mnist.sh
    ./examples/mnist/create_mnist.sh
    ./examples/mnist/train_lenet.sh
You can inspect your build and make sure it's actually linking against CUDA 8.0:
    ldd ./build/tools/caffe | grep cuda
|
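If you want to automate the `ldd` check, here is a rough sketch that parses `ldd` output and flags CUDA toolkit libraries whose version suffix is not the one you expect. The helper names and the library list are my own choices for illustration, and the script assumes the usual `libname.so.X.Y => /path (0x...)` line layout. cuDNN is left out on purpose because it is versioned separately from the toolkit.

```python
import re
import subprocess

# Toolkit libraries whose .so suffix normally tracks the CUDA version.
TOOLKIT_LIBS = {"cudart", "cublas", "curand", "cufft", "cusparse", "cusolver"}


def ldd_output(binary):
    """Run ldd on a binary and return its stdout as text."""
    result = subprocess.run(["ldd", binary], capture_output=True, text=True, check=True)
    return result.stdout


def cuda_libs_from_ldd(text):
    """Map CUDA toolkit library names to the version suffix ldd reports."""
    libs = {}
    for line in text.splitlines():
        m = re.match(r"\s*lib(\w+)\.so\.([\d.]+)\s+=>", line)
        if m and m.group(1) in TOOLKIT_LIBS:
            libs[m.group(1)] = m.group(2)
    return libs


def stale_toolkit_libs(libs, expected="8.0"):
    """Return the libraries whose version suffix differs from the expected one."""
    return {name: ver for name, ver in libs.items() if ver != expected}
```

Usage would be something like `stale_toolkit_libs(cuda_libs_from_ldd(ldd_output("./build/tools/caffe")))`. An empty dict means every toolkit library found is linked at 8.0.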
So that worked fine on the Titan. I was able to train the mnist example dcofer@ubudesk:~/caffe$ ldd ./build/tools/caffe | grep cuda This made me realize I had not yet tried anything other than semantic On Tue, Nov 15, 2016 at 11:06 AM, Luke Yeager notifications@github.com
|
You can find examples on the https://github.com/shelhamer/fcn.berkeleyvision.org repository. Are you still using nv-caffe 0.15.9? I recall there were some issues on this version. Can you try to upgrade to 0.15.13? |
I upgraded to 0.15.13 and it still errored out on me. After a lot of trial I1117 06:49:52.260607 11460 net.cpp:220] pool1 needs backward computation. On Wed, Nov 16, 2016 at 5:35 AM, Greg Heinrich notifications@github.com
|
I'm pretty sure this still comes down to a CUDA 8.0 toolkit and driver issue. See NVIDIA/caffe#270 for a summary of issues dealing with this error. |
Success! I was able to get the Titan X working finally. I started out by Thanks so much for everyone's help and suggestions. On Fri, Nov 18, 2016 at 11:39 AM, Luke Yeager notifications@github.com
|
@NeuroRoboticTech I ran into a problem similar to yours. I need to run an experiment from a paper that was published at the end of 2015, so I configured the environment to match the author's setup at that time: Ubuntu 14.04, CUDA 7.0, cuDNN v3, NVIDIA driver NVIDIA-Linux-x86_64-375.66. Caffe compiles successfully, but when I run the training script, an error occurs: math_functions.cu: Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE. |
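For anyone reading the "(201 vs. 0)" part of that message: the first number is the `curandStatus_t` value that the cuRAND call returned, and 0 is `CURAND_STATUS_SUCCESS`. Here is a small sketch that pulls the code out of a Caffe check-failure line and maps it to its name. The status table follows the cuRAND header as I know it; the function name is just for illustration.

```python
import re

# curandStatus_t values from the cuRAND header.
CURAND_STATUS = {
    0: "CURAND_STATUS_SUCCESS",
    100: "CURAND_STATUS_VERSION_MISMATCH",
    101: "CURAND_STATUS_NOT_INITIALIZED",
    102: "CURAND_STATUS_ALLOCATION_FAILED",
    103: "CURAND_STATUS_TYPE_ERROR",
    104: "CURAND_STATUS_OUT_OF_RANGE",
    105: "CURAND_STATUS_LENGTH_NOT_MULTIPLE",
    106: "CURAND_STATUS_DOUBLE_PRECISION_REQUIRED",
    201: "CURAND_STATUS_LAUNCH_FAILURE",
    202: "CURAND_STATUS_PREEXISTING_FAILURE",
    203: "CURAND_STATUS_INITIALIZATION_FAILED",
    204: "CURAND_STATUS_ARCH_MISMATCH",
    999: "CURAND_STATUS_INTERNAL_ERROR",
}


def decode_curand_check(line):
    """Return (code, name) from a 'status == CURAND_STATUS_SUCCESS (N vs. 0)' line, or None."""
    m = re.search(r"status == CURAND_STATUS_SUCCESS \((\d+) vs\. 0\)", line)
    if m is None:
        return None
    code = int(m.group(1))
    return code, CURAND_STATUS.get(code, "unknown status")


if __name__ == "__main__":
    msg = ("math_functions.cu: Check failed: status == CURAND_STATUS_SUCCESS "
           "(201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE.")
    print(decode_curand_check(msg))
```

A launch failure (201) means the random-number kernel itself could not be launched on the device. That is why the thread keeps pointing at the toolkit, driver, and target architecture rather than at the training data.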
I have been using Digits with caffe for a while now with a GeForce GTX 960. I wanted some more power so I shelled out a bunch of money for a new Titan X Pascal GPU. I installed it in the secondary PCI-e slot so I could use the GTX primarily for graphics, and use the Titan for GPU. I can run deviceQuery and see both GPUs. I have also run the bandwidth test on the new Titan and it passed, so the GPU is working. Initially I was using digits version 4.1-dev, and nvidia/caffe 0.15.9, and I was able to figure out how to change the digits config file to set it up so I could use both of them. However, when I clone any of my previous jobs that ran fine on the GTX and try and run them on the Titan I get an error.
This is a very simple test network I am using just to see if things are running correctly. All jobs I have tried to run on the Titan eventually fail this way after churning for a while, but if I clone it and run it again on the GTX it runs perfectly fine. While doing some research I found a link on this issues forum that seemed to imply that I might need CUDA 8. Since I was on CUDA 7.5 I decided to move up to 8. I had to upgrade pretty much everything. I switched over to CUDA 8 and upgraded to cuDNN 5.1, and then I had to rebuild the latest opencv with CUDA 8. I pulled the latest nvidia/caffe and rebuilt and installed it. Since Digits 5.0 was just released with semantic segmentation I pulled that from github and used it instead of the 4.1-dev I was using. However, I have the exact same problem. I can run semantic segmentation tasks on digits 5.0 just fine with my older GTX card, but when I clone that job and try and run it on the new Titan X it fails with a CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE.
I am unsure if this is a digits or caffe error or cuda or what, and I am stumped. Does anyone have any ideas on why this is not running on my new and very expensive GPU card, but runs fine on the older, cheaper one? Does anyone have any suggestions on how I can get some more info on why it is failing? I have attached the caffe log.
caffe_output.log.zip
Thanks
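On getting more information out of the Caffe log: Caffe logs through glog, so each line starts with a severity letter (I, W, E or F), then the date as MMDD, a timestamp, the thread id, and `file:line]`. Here is a rough sketch that filters a log down to the warning, error, and fatal lines, which is usually where the failing source file shows up. The names are hypothetical and the regex assumes the default glog prefix.

```python
import re
from typing import List, NamedTuple, Optional


class GlogLine(NamedTuple):
    severity: str   # one of I, W, E, F
    date: str       # MMDD
    time: str       # HH:MM:SS.uuuuuu
    thread: int
    source: str     # source file name
    line: int       # line number in that source file
    message: str


GLOG_RE = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+(\d+)\s+([^\s:]+):(\d+)\] (.*)$"
)


def parse_glog_line(text: str) -> Optional[GlogLine]:
    """Parse one glog-formatted line; return None if it does not match."""
    m = GLOG_RE.match(text)
    if m is None:
        return None
    sev, date, time, thread, source, line, msg = m.groups()
    return GlogLine(sev, date, time, int(thread), source, int(line), msg)


def problem_lines(log_text: str) -> List[GlogLine]:
    """Return only the W, E and F entries from a whole log."""
    parsed = (parse_glog_line(l) for l in log_text.splitlines())
    return [p for p in parsed if p is not None and p.severity in "WEF"]
```

For example, the line `I1117 06:49:52.260607 11460 net.cpp:220] pool1 needs backward computation.` quoted earlier in the thread parses as severity I from net.cpp line 220, so `problem_lines` would skip it and keep only the entries that report trouble.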