CUDA problem when training the model #2417
Are you on a Maxwell GPU? We've been seeing this occasionally; the cause is not yet known.
@longjon I was hit by this issue recently on Amazon EC2 (g2.2xlarge).
Only 1 of 2 instances has this problem, and I have failed to find a difference between them so far. The troubled instance would sometimes fail with another error as well.
Also, for the record, I tried to reset the GPU with nvidia-smi, to no avail.
More stats: in total, I launched 7 instances (with at most 5 running concurrently) and observed the issue only once. That shows it is not a system-image issue. More likely, the GPU can get into a troubled state that cannot be reset with nvidia-smi or by rebooting.
I have this exact problem. The GPU is a GTX Titan.
I also have this problem. Has anyone solved it?
I just encountered this issue; in my case it seems to be due to the use of rectangular kernels in conv layers (see the sketch below).
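As a concrete illustration (not from the thread itself; the layer and blob names are invented), this is the kind of layer being described: a Caffe convolution whose kernel is non-square, specified via `kernel_h`/`kernel_w`:

```
layer {
  name: "conv_rect"   # hypothetical name, for illustration only
  type: "Convolution"
  bottom: "data"
  top: "conv_rect"
  convolution_param {
    num_output: 64
    kernel_h: 3       # rectangular 3x5 kernel instead of the usual square one
    kernel_w: 5
    pad_h: 1
    pad_w: 2
    stride: 1
  }
}
```

If a layer like this triggers the error, temporarily swapping in a square kernel is one way to test whether the kernel shape is really the culprit.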
I have a similar problem: when I run demo.py with gpu_id = 0, it is OK, but when I set gpu_id = 1, 2, or 3 (I have 4 GPUs), the problem arises.
@catsdogone your solution works for me. Thx
@wangdelp It turns out the problem was that I had deleted the line `cfg.GPU_ID = args.gpu_id` in demo.py for some accidental reason. I corrected my error and everything is OK now.
Just had this problem with my pretty large net on a Titan X. Was using batch size 128; it works fine with batch size 64.
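For anyone unsure where the batch size lives, here is a minimal sketch of a data layer with the reduced batch size; the source path and names are made up:

```
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  data_param {
    source: "train_lmdb"   # hypothetical LMDB path
    backend: LMDB
    batch_size: 64         # reduced from 128, which crashed
  }
}
```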
I got this error when running `make runtest`. Can confirm this same error on CUDA 7.5 / 367 drivers on Ubuntu 16.04 with a GTX 980 and a GTX 1080.
[ RUN ] GPUStochasticPoolingLayerTest/0.TestGradient
Removed the GTX 980. Here's another trace:
[----------] 3 tests from GPUStochasticPoolingLayerTest/1, where TypeParam = double
@jaredstarkey When using CUDA 8, the problem seems to be gone on my side.
@jaredstarkey Have you found a solution to the error? I'm getting exactly the same error, also on Ubuntu 16.04 with CUDA 7.5, but with a GTX 1070.
We ended up building a different system and tried again. Using the install guide, we were able to install with no problem and pass the tests. My honest suspicion is that we botched the prerequisites installation and just needed to clean up our dependencies. We did resolve the issue, but since we were focused on many other things, I can't say what exactly resolved our Caffe problems.
I'm having the same error, with Ubuntu 16.04, CUDA 7.5, and a GTX 1070.
Download CUDA 8.0; it should fix it.
I'm having the same error with Ubuntu 16.04 and a GTX 1070 when running. Is 7.5 used by default or something, and if so, how do you change it? Thanks in advance.
@stig11 I'd check ~/.bashrc. It looks like you might have PATH and LD_LIBRARY_PATH entries pointing to your old CUDA 7.5 installation (which you might want to uninstall, actually). Make sure the paths point to the current installation and none point to the old one, and that you've opened a new terminal after changing .bashrc.
Caffe version is 0.15.14 with DIGITS 5.1: DetectNet training error with CUDA 8.0 on Ubuntu 14.04, backed by a GTX 1080. Please help! See issue #1186.
Having the same issue with an NVIDIA 1070.
@nikAleksandr Have you solved your problem?
I got additional information. I tried to find out the reason for the misaligned address. For the same network, I reduced the batch size, and the thing that differed was the cuDNN workspace size.
In this specific case, with the smaller batch the workspace size changed and the error went away. However, I don't find the same phenomenon in other experiments: sometimes when I reduce the batch, the workspace values were the same or even increased. So reducing the batch sometimes works, sometimes not. I don't know yet how cuDNN calculates the workspace size. Also, note that the workspace size is multiplied by the number of groups, so if you have many groups (e.g. in the channelwise case, sketched below), the workspace may be multiplied accordingly. CMIIW (correct me if I'm wrong). Please share if you know more than this. Thank you.
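For reference, here is a sketch of the channelwise case mentioned above (layer and blob names invented): a grouped convolution where `group` equals the channel count, so any per-group workspace is allocated many times over:

```
layer {
  name: "conv_dw"   # hypothetical depthwise ("channelwise") convolution
  type: "Convolution"
  bottom: "feat"
  top: "conv_dw"
  convolution_param {
    num_output: 256
    kernel_size: 3
    group: 256      # one group per channel; workspace scales with the group count
  }
}
```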
Hello. In my case, the "top" and "bottom" blobs in the "Deconvolution" layers were the same (the same blob, but with a different num_output), and this caused a strange problem: the data was somehow overwritten wrongly, producing the error. As soon as I changed them so that input and output are stored in separate blobs, the problem was solved. My advice: try to store each layer's input and output in separate memory locations; this may solve the problem (see the sketch below). Regards
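To make that concrete, here is a hedged sketch (layer and blob names invented) contrasting the in-place setup described above with the fixed version that writes to its own blob:

```
# Problematic: bottom and top are the same blob, even though num_output
# differs, so the output overwrites the input in place.
layer {
  name: "deconv1"
  type: "Deconvolution"
  bottom: "feat"
  top: "feat"
  convolution_param { num_output: 32 kernel_size: 4 stride: 2 }
}

# Fixed: the output gets its own blob, so nothing is overwritten.
layer {
  name: "deconv1"
  type: "Deconvolution"
  bottom: "feat"
  top: "deconv1"
  convolution_param { num_output: 32 kernel_size: 4 stride: 2 }
}
```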
@nikAleksandr I solved some of my misaligned address problems by using cuDNN 5.1 instead of cuDNN 5. (I use CUDA 8.0 and a Titan X.)
Ran into the same issue. Then I found that one of the LIBRARY_DIRS paths in my Makefile.config pointed to an old CUDA 7.5 installation. Deleting it fixed my problem.
@kalkaneus I also encountered this misaligned address problem. I use CUDA 8.0, cuDNN 5.1, and a Titan X.
@RookieLCode Yeah... compiling Caffe without cuDNN can avoid the problem, but somehow my training becomes too slow.
@kalkaneus As a temporary workaround, the problem can be solved by changing the 'engine' parameter to '1' (CAFFE) in the convolution layer where the error occurs. The other convolution layers can still use the cuDNN engine.
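As a sketch of that workaround (layer and blob names invented), the `engine` field of `convolution_param` is set only on the failing layer; `CAFFE` is the enum name for the value 1 mentioned above, and layers that omit the field keep using cuDNN:

```
layer {
  name: "conv_problem"   # hypothetical: only the layer that triggers the error
  type: "Convolution"
  bottom: "pool1"
  top: "conv_problem"
  convolution_param {
    num_output: 128
    kernel_size: 3
    engine: CAFFE        # i.e. engine = 1; bypasses cuDNN for this layer only
  }
}
```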
Compiling Caffe without cuDNN solved my problem, but training became slow. The solution from @seokhoonboo of changing the 'engine' to '1' (CAFFE) also works, and is faster than building without cuDNN at all, since only the problematic layers skip cuDNN. Then I found NVIDIA's Caffe fork, with which I can use cuDNN for all my layers without error.
I got the misaligned address issue with CUDA 8.0.61, cuDNN 5.1.10, and driver 375.26 on a GTX 1080. Reverting cuDNN to 5.0.5 solved this issue.
I've tried using only gpu_id=0 and I'm using CUDA 8, but it still doesn't work. When I lower the batch size, it runs perfectly, so I guess it's a lack of GPU memory.
@Godricly Hi. You wrote: "I got misaligned address issue with cuda8.0.61 cudnn 5.1.10 driver 375.26 on GTX1080. Reversing cudnn to 5.0.5 solved this issue." Did you mean that you got the error "Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered"?
I don't remember it now. 😞 I think so.
Hello, I have encountered the same problem: Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered. Some information about my machine: Ubuntu 16.04, 2 GPUs; nvidia-smi shows GPU 0 at 29% usage. I don't think the GPUs are insufficient for my Caffe code, so how should I deal with this error? Thanks!
math_functions.cu:28] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0) CUBLAS_STATUS_EXECUTION_FAILED
The above error occurs after installing CUDA 9.0.
I had the same issue with a Titan X. I tried many methods and searched Google, but still couldn't solve it. Finally, I found that GPU 0's memory was full, and things only return to normal when there is enough memory on GPU 0. I think that although you set gpu_id=1, your code still uses GPU 0.
Same problem!!
When I trained a model with GoogLeNet, I always got a strange problem with CUDA. The error is
"F0505 14:15:54.209436 5318 math_functions.cu:28] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0) CUBLAS_STATUS_EXECUTION_FAILED";
it can also be "F1013 09:33:06.971670 4890 math_functions.cpp:91] Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered".
I noticed that in the Google group for Caffe users some people have the same problem, but nobody in the group knows the reason or how to solve it. Can anybody help me? Thanks!