
CURAND_STATUS_LAUNCH_FAILURE using new Titan X Pascal #1264

Closed
NeuroRoboticTech opened this issue Nov 13, 2016 · 14 comments

@NeuroRoboticTech

NeuroRoboticTech commented Nov 13, 2016

I have been using DIGITS with Caffe for a while now with a GeForce GTX 960. I wanted some more power, so I shelled out a bunch of money for a new Titan X Pascal GPU. I installed it in the secondary PCIe slot so I could use the GTX primarily for graphics and the Titan for compute. I can run deviceQuery and see both GPUs. I have also run the bandwidth test on the new Titan and it passed, so the GPU is working. Initially I was using DIGITS 4.1-dev and nvidia/caffe 0.15.9, and I was able to figure out how to change the DIGITS config file so I could use both GPUs. However, when I clone any of my previous jobs that ran fine on the GTX and try to run them on the Titan, I get an error.

I1112 06:19:01.239487  5120 solver.cpp:362] Iteration 0, Testing net (#0)
I1112 06:19:04.057446  5120 blocking_queue.cpp:50] Data layer prefetch queue empty
I1112 06:23:06.287995  5120 solver.cpp:429]     Test net output #0: accuracy = 0.0070898
I1112 06:23:06.288074  5120 solver.cpp:429]     Test net output #1: loss = 3.04448 (* 1 = 3.04448 loss)
F1112 06:23:06.447123  5120 math_functions.cu:396] Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0)  CURAND_STATUS_LAUNCH_FAILURE
*** Check failure stack trace: ***
@     0x7fa1f5014daa  (unknown)
@     0x7fa1f5014ce4  (unknown)
@     0x7fa1f50146e6  (unknown)
@     0x7fa1f5017687  (unknown)
@     0x7fa1f5737cd4  (unknown)
@     0x7fa1f5767f55  (unknown)
@     0x7fa1f56d87d8  (unknown)
@     0x7fa1f56d8b57  (unknown)
@     0x7fa1f57116fc  (unknown)
@     0x7fa1f5711fce  (unknown)
@           0x40af36  (unknown)
@           0x40867c  (unknown)
@     0x7fa1f3b17f45  (unknown)
@           0x408e4d  (unknown)
@              (nil)  (unknown)
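
To pull the fatal check out of a long Caffe log like the one above, a small script can scan for glog lines whose severity letter is F and decode the numeric status. The sketch below is only an illustration: `find_fatal_checks` is a hypothetical helper name, the regular expressions assume the glog prefix layout seen in this log, and the status table follows the `curandStatus` enumeration declared in `curand.h`.

```python
import re

# cuRAND status codes, following the curandStatus enum in curand.h.
CURAND_STATUS = {
    0: "CURAND_STATUS_SUCCESS",
    100: "CURAND_STATUS_VERSION_MISMATCH",
    101: "CURAND_STATUS_NOT_INITIALIZED",
    102: "CURAND_STATUS_ALLOCATION_FAILED",
    103: "CURAND_STATUS_TYPE_ERROR",
    104: "CURAND_STATUS_OUT_OF_RANGE",
    105: "CURAND_STATUS_LENGTH_NOT_MULTIPLE",
    106: "CURAND_STATUS_DOUBLE_PRECISION_REQUIRED",
    201: "CURAND_STATUS_LAUNCH_FAILURE",
    202: "CURAND_STATUS_PREEXISTING_FAILURE",
    203: "CURAND_STATUS_INITIALIZATION_FAILED",
    204: "CURAND_STATUS_ARCH_MISMATCH",
    999: "CURAND_STATUS_INTERNAL_ERROR",
}

# Assumed glog prefix: F<MMDD> <time> <thread id> <file>:<line>] <message>
GLOG_FATAL = re.compile(
    r"^F(?P<date>\d{4}) (?P<time>[\d:.]+)\s+(?P<tid>\d+) "
    r"(?P<file>[^:\]\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)
# glog CHECK_EQ failures print the two compared values as "(got vs. want)".
CHECK_VALUES = re.compile(r"\((?P<got>-?\d+) vs\. (?P<want>-?\d+)\)")


def find_fatal_checks(log_text):
    """Return one dict per fatal (F) glog line found in log_text."""
    results = []
    for raw in log_text.splitlines():
        match = GLOG_FATAL.match(raw.strip())
        if not match:
            continue
        message = match.group("msg")
        entry = {
            "file": match.group("file"),
            "line": int(match.group("line")),
            "message": message,
        }
        values = CHECK_VALUES.search(message)
        # Only decode the number as a cuRAND status if the message mentions cuRAND.
        if values and "CURAND" in message:
            status = int(values.group("got"))
            entry["status"] = status
            entry["status_name"] = CURAND_STATUS.get(status, "unknown")
        results.append(entry)
    return results
```

Run against the log above, this reports a single fatal entry from `math_functions.cu` line 396 with status 201. It only confirms what failed, not why.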

This is a very simple test network I am using just to see if things are running correctly. All jobs I have tried to run on the Titan eventually fail this way after churning for a while, but if I clone the job and run it again on the GTX it runs perfectly fine. While doing some research I found a link on this issues forum that seemed to imply that I might need CUDA 8. Since I was on CUDA 7.5, I decided to move up to 8, which meant upgrading pretty much everything. I switched over to CUDA 8, upgraded to cuDNN 5.1, and then had to rebuild the latest OpenCV with CUDA 8. I pulled the latest nvidia/caffe, then rebuilt and installed it. Since DIGITS 5.0 was just released with semantic segmentation, I pulled that from GitHub and used it instead of the 4.1-dev I was using. However, I have the exact same problem. I can run semantic segmentation tasks on DIGITS 5.0 just fine with my older GTX card, but when I clone that job and try to run it on the new Titan X it fails with: Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE.

I am unsure whether this is a DIGITS, Caffe, or CUDA error, and I am stumped. Does anyone have any ideas on why this is not running on my new, very expensive GPU but runs fine on the older, cheaper one? Does anyone have any suggestions on how I can get more information on why it is failing? I have attached the Caffe log.

caffe_output.log.zip

Thanks

@gheinrich
Contributor

Hello, did you rebuild Caffe against CUDA 8 after you upgraded your CUDA toolkit? See the Caffe build instructions: https://github.com/NVIDIA/DIGITS/blob/digits-5.0/docs/BuildCaffe.md. Caffe binaries for CUDA 8.0 haven't been released yet, so you have to build from source.

@gheinrich
Contributor

You may also want to double check that DIGITS is using the correct instance of Caffe on your system: config instructions.

@NeuroRoboticTech
Author

Yes, I rebuilt it from source.


@NeuroRoboticTech
Author

Also, I wanted to rule out my data as the problem, so I reran the PASCAL-VOC example you provided (https://github.com/gheinrich/DIGITS/blob/c894015f9dbf010329442aedc71faf2c388d948c/examples/semantic-segmentation/README.md). I was able to run that example with the GTX, but got the same error again after several minutes of churning when using the Titan.


@gheinrich
Contributor

Are you sure the right version of Caffe is being used by DIGITS?

@NeuroRoboticTech
Author

Pretty sure. After some more research, though, I believe I have found something else to try. I found some links where others were having similar errors with the new Pascal architecture (BVLC/caffe#4834). They seemed to indicate that to use this card you must compile with the latest sm_61 setting. I looked through the CMake files for CUDA in Caffe, and sm_61 is not one of the setting options; the list stops at 50. So my assumption is that even though I recompiled with CUDA 8, it defaulted to a lower architecture setting than the new card needs. I plan to add the 61 option, rebuild, and test. I am in the middle of a segmentation test run using my old GPU, so as soon as that is finished I will do this and see if it fixes the problem.

I will let you know if this works. Thanks for your help.
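
For reference, the per-architecture setting discussed here corresponds to nvcc's `-gencode` option. A minimal sketch, assuming compute capability 5.2 for the GTX 960 and 6.1 for the Titan X Pascal; `gencode_flags` is a hypothetical helper, not part of Caffe or its build scripts, and it only formats the flag strings:

```python
def gencode_flags(compute_capabilities, include_ptx=True):
    """Format nvcc -gencode flags for capabilities given as strings like "6.1".

    One SASS target (code=sm_XX) is emitted per capability. If include_ptx is
    true, a PTX target for the newest capability is appended so that newer
    GPUs can still JIT-compile the kernels.
    """
    # "6.1" -> "61"; de-duplicate and sort numerically.
    caps = sorted({cc.replace(".", "") for cc in compute_capabilities}, key=int)
    flags = [f"-gencode arch=compute_{cc},code=sm_{cc}" for cc in caps]
    if include_ptx and caps:
        newest = caps[-1]
        flags.append(f"-gencode arch=compute_{newest},code=compute_{newest}")
    return flags


if __name__ == "__main__":
    # Assumed capabilities for the two cards in this thread.
    for flag in gencode_flags(["5.2", "6.1"]):
        print(flag)
```

Where these flags go depends on the build system in use, so treat the output as something to compare against your own build configuration rather than a drop-in setting.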


@NeuroRoboticTech
Author

Well dang! I recompiled and verified in the CMake output that CUDA_ARCH had "GPU arch(s) : sm_61 sm_52", so it is compiling for the correct architecture now. I also explicitly set CAFFE_ROOT to point to my Caffe folder at ~/caffe. Unfortunately, it still fails with the same error. I am out of ideas again. Is there anything else you can think of to try?
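
A quick way to double check a configure log against the cards in the machine is to parse the `GPU arch(s)` line quoted above. This is a sketch under assumptions: the function names are made up for illustration, and the parsing assumes that summary line format. A missing entry is a hint rather than proof of a problem, since embedded PTX can still be JIT-compiled, and a complete list (as in this case) does not rule out other causes.

```python
import re


def compiled_archs(cmake_output):
    """Collect the sm_XX numbers from every 'GPU arch(s)' line in the output."""
    archs = set()
    for line in cmake_output.splitlines():
        if "GPU arch(s)" in line:
            archs.update(re.findall(r"sm_(\d+)", line))
    return sorted(archs, key=int)


def missing_archs(cmake_output, device_capabilities):
    """Return the capabilities (e.g. "6.1") with no matching sm_XX target."""
    built = set(compiled_archs(cmake_output))
    return [cc for cc in device_capabilities if cc.replace(".", "") not in built]


if __name__ == "__main__":
    summary = "GPU arch(s) : sm_61 sm_52"  # line quoted in this thread
    # Assumed capabilities: GTX 960 = 5.2, Titan X Pascal = 6.1.
    print(missing_archs(summary, ["5.2", "6.1"]))
```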


@lukeyeager
Member

You can try cutting DIGITS out of the loop for debugging and just use caffe directly. For an easy test:

./data/mnist/get_mnist.sh
./examples/mnist/create_mnist.sh
./examples/mnist/train_lenet.sh

You can inspect your build and make sure it's actually linking against CUDA 8.0:

ldd ./build/tools/caffe | grep cuda

@NeuroRoboticTech
Author

So that worked fine on the Titan. I was able to train the mnist example
without any issues. I also checked the linking and it verified it is linked
to cuda 8.0:

dcofer@ubudesk:~/caffe$ ldd ./build/tools/caffe | grep cuda
libcudart.so.8.0 => /usr/local/cuda-8.0/lib64/libcudart.so.8.0 (0x00007fecc7d0b000)
libcurand.so.8.0 => /usr/local/cuda-8.0/lib64/libcurand.so.8.0 (0x00007fecc1aad000)
libcublas.so.8.0 => /usr/local/cuda-8.0/lib64/libcublas.so.8.0 (0x00007fecbf0fc000)
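
The same check can be scripted so that a stale library stands out. A minimal sketch, assuming `ldd` output in the format shown above; `cuda_lib_versions` is a hypothetical helper, and it deliberately looks only at cudart, curand, and cublas, because libraries such as cuDNN use their own version numbers.

```python
import re

# Matches lines such as:
#   libcurand.so.8.0 => /usr/local/cuda-8.0/lib64/libcurand.so.8.0 (0x...)
LDD_CUDA_LINE = re.compile(
    r"^\s*lib(?P<name>cudart|curand|cublas)\.so\.(?P<version>\d+(?:\.\d+)*)"
    r"\s+=>\s+(?P<path>\S+)"
)


def cuda_lib_versions(ldd_output):
    """Map library name -> (version, resolved path) for CUDA toolkit libraries."""
    libs = {}
    for line in ldd_output.splitlines():
        match = LDD_CUDA_LINE.match(line)
        if match:
            libs[match.group("name")] = (match.group("version"), match.group("path"))
    return libs


def distinct_versions(ldd_output):
    """Return the sorted set of toolkit versions; more than one suggests a mix."""
    return sorted({version for version, _ in cuda_lib_versions(ldd_output).values()})
```

Feeding it the text printed by the `ldd` command above should give a single version, `8.0`. More than one entry would mean the binary resolves libraries from different toolkits.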

This made me realize I had not yet tried anything other than semantic segmentation projects with the Titan. So I reran the MNIST classification with DIGITS on the Titan, and it worked perfectly fine. It appears that only the semantic segmentation code is failing on the Titan for some reason. Do you know if there is a segmentation example that can be run in Caffe only? This would let me narrow down whether the problem is in Caffe itself or in DIGITS.


@gheinrich
Contributor

You can find examples on the https://github.com/shelhamer/fcn.berkeleyvision.org repository.

Are you still using nv-caffe 0.15.9? I recall there were some issues with this version. Can you try upgrading to 0.15.13 (https://github.com/NVIDIA/caffe/releases/tag/v0.15.13)?

@NeuroRoboticTech
Author

I upgraded to 0.15.13 and it still errored out on me. After a lot of trial and error I was finally able to get voc-fcn-alexnet to run using just Caffe without DIGITS. It failed with the same error as before, so this appears to have nothing to do with DIGITS and is a Caffe problem. Here is the output:

I1117 06:49:52.260607 11460 net.cpp:220] pool1 needs backward computation.
I1117 06:49:52.260612 11460 net.cpp:220] relu1 needs backward computation.
I1117 06:49:52.260617 11460 net.cpp:220] conv1 needs backward computation.
I1117 06:49:52.260622 11460 net.cpp:222] data_data_0_split does not need backward computation.
I1117 06:49:52.260628 11460 net.cpp:222] data does not need backward computation.
I1117 06:49:52.260633 11460 net.cpp:264] This network produces output loss
I1117 06:49:52.260653 11460 net.cpp:284] Network initialization done.
I1117 06:49:52.260722 11460 solver.cpp:60] Solver scaffolding done.
F1117 06:49:52.470127 11460 math_functions.cu:396] Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE
*** Check failure stack trace: ***
Aborted (core dumped)


@lukeyeager
Member

I'm pretty sure this still comes down to a CUDA 8.0 toolkit and driver issue. See NVIDIA/caffe#270 for a summary of issues dealing with this error.

@NeuroRoboticTech
Author

Success! I was finally able to get the Titan X working. I started out by taking the GTX 960 out and swapping the Titan into the primary PCIe x16 slot. I do not think this ultimately had anything to do with it working, but I was trying to get to a base install state. I then started fresh with a new install of Ubuntu 16.04 and installed everything from scratch, from CUDA 8.0 up to DIGITS 5. This time it was able to perform a test segmentation job without any issues. I think there must have been something in my old configuration that was pointing to an older version of CUDA; I am just not sure what that was. I then put the GTX 960 back in. I had to recompile Caffe again to include the older sm_52 architecture, but I was able to get it working with both the GTX 960 and the Titan X.
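
One place a stale CUDA reference can hide is in the shell environment. The sketch below, which is an assumption about where to look rather than a diagnosis of what was wrong here, lists every `cuda-X.Y` directory named in `PATH` and `LD_LIBRARY_PATH`. `cuda_versions_in_env` is a hypothetical helper; it does not follow an unversioned `/usr/local/cuda` symlink.

```python
import os
import re

# Versioned toolkit directories look like /usr/local/cuda-8.0/...
CUDA_DIR = re.compile(r"cuda-(\d+\.\d+)")


def cuda_versions_in_env(env, variables=("PATH", "LD_LIBRARY_PATH")):
    """Map CUDA version -> list of (variable, entry) pairs that mention it."""
    found = {}
    for var in variables:
        for entry in env.get(var, "").split(":"):
            match = CUDA_DIR.search(entry)
            if match:
                found.setdefault(match.group(1), []).append((var, entry))
    return found


if __name__ == "__main__":
    # Inspect the current shell environment.
    for version, entries in sorted(cuda_versions_in_env(dict(os.environ)).items()):
        print(version, entries)
```

If more than one version shows up, the older entries are worth removing or reordering before rebuilding.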

Thanks so much for everyone's help and suggestions.
David


@lzqcode

lzqcode commented May 15, 2017

@NeuroRoboticTech I ran into a problem similar to yours. I need to reproduce an experiment from a paper published at the end of 2015, so I configured the environment to match the author's at that time: Ubuntu 14.04, CUDA 7.0, cuDNN v3, and NVIDIA driver NVIDIA-Linux-x86_64-375.66. Caffe compiles successfully, but when I run the training script an error occurs: math_functions.cu: Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE.
My GPU is a TITAN X Pascal; the author's is a GTX 1080. I also tried Ubuntu 16.04 with CUDA 8.0 and cuDNN 5.1, but then compilation fails with errors in *cudnn.hpp. I think the experiment was probably done with an older version of cuDNN. I don't know how to run this earlier work on a TITAN X Pascal.
