cuda problem when training the model #2417

Closed
TingLee91 opened this issue May 5, 2015 · 41 comments

@TingLee91

When I train a model with GoogLeNet, I always get a strange CUDA error:
"F0505 14:15:54.209436 5318 math_functions.cu:28] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0) CUBLAS_STATUS_EXECUTION_FAILED"
It can also appear as "F1013 09:33:06.971670 4890 math_functions.cpp:91] Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered".
I noticed in the caffe-users Google group that some people hit the same problem, but nobody there knows the cause or how to solve it. Can anybody help me? Thanks!

@jshfeng jshfeng added the JL label May 5, 2015
@longjon
Contributor

longjon commented May 8, 2015

Are you on a Maxwell GPU? We've been seeing this occasionally; the cause is not yet known.

@krasin

krasin commented Jun 1, 2015

@longjon I was hit by this issue recently on Amazon EC2 (g2.2xlarge).

$ lspci | grep NVIDIA                                                                                         
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)

Only 1 of 2 instances has this problem; so far I have not found any difference between them.

The troubled instance also sometimes fails with a different error:

F0601 20:26:31.249778  1690 math_functions.cu:407] Check failed: status == 
CURAND_STATUS_SUCCESS (201 vs. 0)  CURAND_STATUS_LAUNCH_FAILURE
*** Check failure stack trace: ***

Also

F0601 20:30:14.161443  2897 math_functions.cu:81] Check failed: error == 
cudaSuccess (74 vs. 0)  misaligned address

For the record, I also tried to reset the GPU with:

$ sudo nvidia-smi -r -i 0
GPU 0000:00:03.0 was successfully reset.
All done.

@krasin

krasin commented Jun 1, 2015

More stats: in total, I launched 7 instances (with at most 5 running concurrently) and only observed the issue once. This indicates it is not a system-image issue. More likely the GPU can get into a bad state that cannot be cleared with nvidia-smi or by rebooting.

@dmaniry

dmaniry commented Jul 28, 2015

I have this exact problem. The GPU is a GTX Titan.

@jayelm

jayelm commented Dec 1, 2015

Same problem as @krasin on a g2.2xlarge instance; I followed the tutorial in the wiki. Whether or not I get an illegal memory access error seems arbitrary: sometimes it fails instantly, sometimes the net I'm training actually runs for a couple of seconds...

@HeddaZhu

I also have this problem. Has anyone solved it?
CUDA 7.0, GPU is a GTX 980.
Thanks!

@mtamburrano
Contributor

I just encountered this issue; in my case it seems to be due to the use of rectangular kernels in conv layers.
Reshaping the kernels into square ones resolved the issue.
Note that I also used rectangular kernels in the pooling layers, and those work properly.
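
For illustration, here is a minimal pycaffe NetSpec sketch (layer names and sizes are made up, not taken from this report) showing a rectangular conv kernel alongside a square replacement:

# Hypothetical sketch: a conv layer with a rectangular kernel (kernel_h != kernel_w)
# next to a square-kernel variant; names and dimensions are illustrative only.
import caffe
from caffe import layers as L

n = caffe.NetSpec()
n.data = L.DummyData(shape=dict(dim=[1, 3, 224, 224]))
# rectangular kernel of the kind reported to trigger the failure:
n.conv_rect = L.Convolution(n.data, num_output=16, kernel_h=3, kernel_w=7, stride=1)
# square kernel of the kind reported to work:
n.conv_square = L.Convolution(n.data, num_output=16, kernel_size=5, stride=1)
print(n.to_proto())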

@catsdogone

I have a similar problem:
F0309 11:30:48.307298 892 syncedmem.hpp:19] Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered
*** Check failure stack trace: ***
Aborted (core dumped)

When I run demo.py with gpu_id = 0, it is OK. But when I set gpu_id = 1, 2, or 3 (I have 4 GPUs), the problem arises.

@wangdelp

@catsdogone your solution works for me. Thx

@catsdogone

@wangdelp I found the problem: I had deleted the line "cfg.GPU_ID = args.gpu_id" in demo.py for some incidental reason. I restored it and everything is OK now.
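
For reference, a sketch (from memory, not an exact copy of the file) of what the GPU-selection block in a py-faster-rcnn style demo.py looks like; setup_gpu is a hypothetical wrapper, and the point is that cfg.GPU_ID and the Caffe device must be set together:

# Approximate sketch of the GPU selection in a py-faster-rcnn style demo.py
# (not copied verbatim). If cfg.GPU_ID is not kept in sync with the Caffe
# device, later cfg-driven allocations can land on the wrong GPU and
# trigger illegal memory accesses.
import caffe
from fast_rcnn.config import cfg

def setup_gpu(gpu_id):
    caffe.set_mode_gpu()
    caffe.set_device(gpu_id)
    cfg.GPU_ID = gpu_id  # the line that had been accidentally removed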

@ibmua

ibmua commented May 21, 2016

Just had this problem with my fairly large net on a Titan X. I was using a batch size of 128; it works fine with a batch size of 64.
But it still happens from time to time; it's unstable. My auto-trainer, which should run nonstop, crashes along the way because of this. Smaller batches seem to help, though.

@jaredstarkey

I got this error when running make runtest. I can confirm this same error with CUDA 7.5 / 367 drivers on Ubuntu 16.04 with a GTX 980 and a GTX 1080.

[ RUN ] GPUStochasticPoolingLayerTest/0.TestGradient
F0606 15:22:28.586875 3962 math_functions.cu:381] Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE
*** Check failure stack trace: ***
@ 0x7f90b864e5cd google::LogMessage::Fail()
@ 0x7f90b8650433 google::LogMessage::SendToLog()
@ 0x7f90b864e15b google::LogMessage::Flush()
@ 0x7f90b8650e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f90b6264a16 caffe::caffe_gpu_rng_uniform<>()
@ 0x7f90b629d0bb caffe::PoolingLayer<>::Forward_gpu()
@ 0x477a6d caffe::Layer<>::Forward()
@ 0x4ed8f4 caffe::GradientChecker<>::CheckGradientSingle()
@ 0x782fae caffe::GPUStochasticPoolingLayerTest_TestGradient_Test<>::TestBody()
@ 0x90d923 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x906f3a testing::Test::Run()
@ 0x907088 testing::TestInfo::Run()
@ 0x907165 testing::TestCase::Run()
@ 0x90843f testing::internal::UnitTestImpl::RunAllTests()
@ 0x908763 testing::UnitTest::Run()
@ 0x46d04d main
@ 0x7f90b5422830 __libc_start_main
@ 0x474a39 _start
@ (nil) (unknown)
Makefile:525: recipe for target 'runtest' failed
make: *** [runtest] Aborted (core dumped)

@jaredstarkey

Removed the GTX 980. Here's another trace.

[----------] 3 tests from GPUStochasticPoolingLayerTest/1, where TypeParam = double
[ RUN ] GPUStochasticPoolingLayerTest/1.TestStochastic
F0606 15:53:56.730144 26116 math_functions.cu:394] Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE
*** Check failure stack trace: ***
@ 0x7f91781bb5cd google::LogMessage::Fail()
@ 0x7f91781bd433 google::LogMessage::SendToLog()
@ 0x7f91781bb15b google::LogMessage::Flush()
@ 0x7f91781bde1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f9175dd1bc4 caffe::caffe_gpu_rng_uniform<>()
@ 0x7f9175e08c8b caffe::PoolingLayer<>::Forward_gpu()
@ 0x47779d caffe::Layer<>::Forward()
@ 0x781a87 caffe::GPUStochasticPoolingLayerTest_TestStochastic_Test<>::TestBody()
@ 0x90d923 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x906f3a testing::Test::Run()
@ 0x907088 testing::TestInfo::Run()
@ 0x907165 testing::TestCase::Run()
@ 0x90843f testing::internal::UnitTestImpl::RunAllTests()
@ 0x908763 testing::UnitTest::Run()
@ 0x46d04d main
@ 0x7f9174f8f830 __libc_start_main
@ 0x474a39 _start
@ (nil) (unknown)
Makefile:525: recipe for target 'runtest' failed

@Darwin2011

@jaredstarkey with CUDA 8, the problem no longer appears on my side.

@jiong3

jiong3 commented Sep 4, 2016

@jaredstarkey Have you found a solution to the error? I'm getting exactly the same error, also on Ubuntu 16.04 with CUDA 7.5, but with a GTX 1070.

@jaredstarkey

jaredstarkey commented Sep 6, 2016

We ended up building a different system and tried again. Following the install guide, we were able to install without problems and pass the tests. My honest suspicion is that we messed up the prerequisite installation and just needed to clean up our dependencies. We did resolve the issue, but since we were focused on many other things, I can't say exactly what fixed our Caffe problems.

@sbrugman

I'm having the same error, with Ubuntu 16.04, CUDA 7.5 and a GTX 1070.

@Wizardofoddz

Download CUDA 8.0; it should fix it.

@stig11

stig11 commented Sep 19, 2016

I'm having the same error, with Ubuntu 16.04 and a GTX 1070, when running make runtest. I downloaded the CUDA 8 installer from the NVIDIA website and installed it, but when I run nvcc -V I get:

Cuda compilation tools, release 7.5, V7.5.17

Is 7.5 used by default or something, and if so, how do you change it? Thanks in advance.

@ibmua

ibmua commented Sep 19, 2016

@stig11 I'd check ~/.bashrc. It looks like your PATH and LD_LIBRARY_PATH might still point to the old CUDA 7.5 installation (which you might want to uninstall, actually). Make sure they point to the current installation and not the old one, and open a new terminal after changing .bashrc.

@xhuvom

xhuvom commented Oct 20, 2016

Caffe version 0.15.14 with DIGITS 5.1: DetectNet training error with CUDA 8.0 on Ubuntu 14.04, backed by a GTX 1080.
Terminal console output:


2016-10-20 18:40:59 [20161020-184058-da4a] [INFO ] Task subprocess args: "/usr/bin/caffe train --solver=/home/xhuv/digits/digits/jobs/20161020-184058-da4a/solver.prototxt --gpu=0 --weights=/home/xhuv/digits/googlenet.caffemodel"
2016-10-20 18:41:31 [20161020-184058-da4a] [ERROR] Train Caffe Model: Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0)  CURAND_STATUS_LAUNCH_FAILURE
2016-10-20 18:43:26 [20161020-184058-da4a] [ERROR] Train Caffe Model task failed with error code -6

Please help!

see issue #1186

@yfor1008

yfor1008 commented Nov 8, 2016

@catsdogone your solution works for me. Thx

@nikAleksandr

nikAleksandr commented Dec 23, 2016

Having the same issue with an NVIDIA 1070:

F1223 18:31:38.033812 10676 math_functions.cu:79] Check failed: error == cudaSuccess (74 vs. 0) misaligned address *** Check failure stack trace: ***

@zimenglan-sysu-512

@nikAleksandr have you solved your problem?

@aseuteurideu

aseuteurideu commented Jan 17, 2017

I got the misaligned address error (same as @nikAleksandr), and the workaround from @ibmua of reducing the batch size sometimes works. But sometimes the error persists even with a batch size of 1.
I use a Titan X and CUDA 8.0.

Additional information:

I tried to find out what causes the misaligned address. For the same network, I reduced the batch size, and what changes is the "Reallocating workspace storage:" value (visible in the terminal when Caffe initializes training). The smaller batch works while the bigger batch hits the misaligned address error.

The "Reallocating workspace storage:" message is printed by cudnn_conv_layer.cpp:194. The total_max_workspace is the maximum of 3 variables: total_workspace_fwd, total_workspace_bwd_data, and total_workspace_bwd_filter.

  • total_workspace_fwd is taken from cudnnGetConvolutionForwardWorkspaceSize function.
  • total_workspace_bwd_data is taken from cudnnGetConvolutionBackwardDataWorkspaceSize function.
  • total_workspace_bwd_filter is taken from cudnnGetConvolutionBackwardFilterWorkspaceSize function.

In this specific case, with the smaller batch, total_workspace_fwd is reduced, and it is the maximum of the three variables. That is why the error disappears when the batch size is reduced.

However, I don't see the same behavior in other experiments. Sometimes when I reduce the batch size, the values stay the same or even increase.

So, reducing the batch size sometimes works and sometimes doesn't. I don't yet know how cuDNN calculates the workspace size.

Also, note that the workspace size is multiplied by the number of groups. So if you have many groups (e.g. in a channel-wise case), the workspace size is multiplied accordingly and the misaligned address error can appear.
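
A rough sketch in plain Python (paraphrasing the sizing behavior described above, not Caffe's actual C++ code) of how the workspace request is formed:

# Paraphrase of the workspace sizing described above (illustrative, not Caffe's source):
# take the maximum of the fwd / bwd-data / bwd-filter workspace sizes reported by
# cuDNN, then scale by the number of groups.
def total_workspace_bytes(workspace_fwd, workspace_bwd_data, workspace_bwd_filter, group):
    total_max_workspace = max(workspace_fwd, workspace_bwd_data, workspace_bwd_filter)
    return total_max_workspace * group  # many groups => a much larger allocation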

Correct me if I'm wrong, and please share if you know more than this.

Thank you

@deep-unlearn

Hello,

In my case, the "top" and "bottom" blobs of the "Deconvolution" layers were the same (the same variable but with a different num_output), and this somehow caused the data to be overwritten incorrectly, producing the error. As soon as I changed them so that input and output are stored in separate blobs, the problem was solved.

My advice is to store each layer's variables in separate memory locations (as in the sketch below); this may solve the problem.
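
For illustration, a minimal NetSpec sketch (blob names are hypothetical) where the Deconvolution layer writes into its own top blob instead of reusing its bottom:

# Hypothetical sketch: the Deconvolution layer gets a separate output blob
# (n.upsampled) rather than overwriting its input blob (n.feat) in place.
import caffe
from caffe import layers as L

n = caffe.NetSpec()
n.feat = L.DummyData(shape=dict(dim=[1, 64, 32, 32]))
n.upsampled = L.Deconvolution(n.feat,
                              convolution_param=dict(num_output=32,
                                                     kernel_size=4, stride=2))
print(n.to_proto())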

Regards

@aseuteurideu

@nikAleksandr I solved some of my misaligned address problems by using cuDNN 5.1 instead of cuDNN 5.
Even so, it still doesn't work for all of my misaligned address cases.

*I use CUDA 8.0 and a Titan X

@Fang-Haoshu

Ran into the same issue. Then I found that one of the paths in LIBRARY_DIRS in my Makefile.config pointed to an old CUDA 7.5 installation. Deleting it fixed my problem.

@RookieLCode

@kalkaneus I also encountered this misaligned address problem.
It can be reproduced on my machine by solving the following network:

  • layer1: DummyData 1x1x255x255
  • layer2: Convolution num_output:1 kernel_size:1 stride:1 pad:0
  • layer3: Convolution num_output:1 kernel_size:3 stride:1 pad:1
  • layer4: EuclideanLoss (layer1's top and layer3's top)

I use CUDA 8.0, cuDNN 5.1 and a Titan X.
Additionally, compiling Caffe without cuDNN avoids this problem. Changing the size from 255 to an even number also avoids the problem.
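
For convenience, a pycaffe NetSpec sketch (blob names invented) that generates the reproduction network listed above:

# Sketch of the reproduction network described above; blob names are illustrative.
import caffe
from caffe import layers as L

n = caffe.NetSpec()
n.data = L.DummyData(shape=dict(dim=[1, 1, 255, 255]))              # layer1
n.conv1 = L.Convolution(n.data, num_output=1, kernel_size=1,
                        stride=1, pad=0)                            # layer2
n.conv2 = L.Convolution(n.conv1, num_output=1, kernel_size=3,
                        stride=1, pad=1)                            # layer3
n.loss = L.EuclideanLoss(n.conv2, n.data)                           # layer4
with open('repro.prototxt', 'w') as f:
    f.write(str(n.to_proto()))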

@aseuteurideu

@RookieLCode Yeah... compiling Caffe without cuDNN avoids the problem, but somehow my training becomes too slow.
I tried the Caffe version from NVIDIA's GitHub. It is more stable and my problem is solved. I guess this version has been tested by NVIDIA.

@seokhoonboo

@kalkaneus As a temporary workaround, the problem can be solved by changing the 'engine' parameter to '1' (CAFFE) in the convolution layer where the error occurs. The other convolution layers can still use the cuDNN engine.
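
A minimal sketch (pycaffe NetSpec; layer names are hypothetical) of pinning just one convolution layer to the CAFFE engine while the others keep the default (cuDNN); in a hand-written prototxt this corresponds to engine: CAFFE inside convolution_param:

# Hypothetical sketch: only the problematic layer is forced onto the CAFFE engine;
# the other convolution layers keep the default (cuDNN) engine.
import caffe
from caffe import layers as L, params as P

n = caffe.NetSpec()
n.data = L.DummyData(shape=dict(dim=[1, 3, 64, 64]))
n.conv1 = L.Convolution(n.data, num_output=16, kernel_size=3, pad=1)  # default engine
n.conv2 = L.Convolution(n.conv1, num_output=16, kernel_size=3, pad=1,
                        engine=P.Convolution.CAFFE)                   # CAFFE engine
print(n.to_proto())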

@aseuteurideu

Compiling Caffe without cuDNN solved my problem, but training became slow.

The solution from @seokhoonboo of changing the 'engine' to '1' (CAFFE) also works, and is faster than dropping cuDNN entirely, since only the problematic layers skip cuDNN.

Then I found NVIDIA's Caffe, with which I can use cuDNN for all my layers without errors.

@Godricly

Godricly commented May 12, 2017

I got the misaligned address issue with CUDA 8.0.61, cuDNN 5.1.10, driver 375.26 on a GTX 1080.
Reverting cuDNN to 5.0.5 solved the issue.
Just FYI, maybe you can try this.

@HencyChen

I've tried using only gpu_id=0 and I'm using CUDA 8, but it still doesn't work. However, when I lower the batch size, it runs perfectly, so I guess it's a lack of GPU memory.

@shiyuangogogo

@Godricly Hi. You wrote that you got the misaligned address issue with CUDA 8.0.61, cuDNN 5.1.10 and driver 375.26 on a GTX 1080, and that reverting cuDNN to 5.0.5 solved it. Did you mean you hit the error "Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered"?

@Godricly

I don't remember it now. 😞 I think so.

@zhonhel

zhonhel commented Apr 22, 2018

Having the same issue with an NVIDIA 1070.

@nann93

nann93 commented May 18, 2018

Hello, I have encountered the same problem:

Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered

Some information about my machine: Ubuntu 16.04, 2 GPUs.

nvidia-smi shows GPU 0 at 29% usage and GPU 1 at 23% usage.

I don't think the GPUs are insufficient for my Caffe code, so how should I deal with this error?

Thanks!

@asa008

asa008 commented Mar 5, 2019

math_functions.cu:28] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0) CUBLAS_STATUS_EXECUTION_FAILED

The above error can occur with a CUDA 9.0 installation.
Installing Patch 2 (released Mar 5, 2018) solves it.

@jack1yang

I had the same issue with a Titan X. I tried many methods and searched Google, but still couldn't solve it. Finally, I found that GPU 0's memory was full, and things only return to normal when there is enough memory on GPU 0. I think even if you set gpu_id=1, your code still uses GPU 0.

@Shimingyi

Same problem!!
Titan Xp machine. I ran the network on another card which has enough memory, but it always gave me error code 77, an illegal memory access.
Solution: shut down the program running on the first card, because Caffe will try to allocate some memory on the first card first. So weird!!
