cuda problem when training the model #2417

Closed
TingLee91 opened this issue May 5, 2015 · 41 comments

@TingLee91

When I train a model with GoogLeNet, I always get a strange CUDA error:
"F0505 14:15:54.209436 5318 math_functions.cu:28] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0) CUBLAS_STATUS_EXECUTION_FAILED"
It can also appear as "F1013 09:33:06.971670 4890 math_functions.cpp:91] Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered".
I noticed in the caffe-users Google group that some people hit the same problem, but nobody there knows the cause or how to solve it. Can anybody help me? Thanks!

@jshfeng jshfeng added the JL label May 5, 2015
@longjon
Contributor

longjon commented May 8, 2015

Are you on a Maxwell GPU? We've been seeing this occasionally; the cause is not yet known.

@krasin

krasin commented Jun 1, 2015

@longjon I was hit by this issue recently on Amazon EC2 (g2.2xlarge).

$ lspci | grep NVIDIA                                                                                         
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)

Only 1 of 2 instances has this problem; so far I have not found any difference between them.

The troubled instance also sometimes fails with a different error:

F0601 20:26:31.249778  1690 math_functions.cu:407] Check failed: status == 
CURAND_STATUS_SUCCESS (201 vs. 0)  CURAND_STATUS_LAUNCH_FAILURE
*** Check failure stack trace: ***

Also

F0601 20:30:14.161443  2897 math_functions.cu:81] Check failed: error == 
cudaSuccess (74 vs. 0)  misaligned address

For the record, I also tried to reset the GPU with:

$ sudo nvidia-smi -r -i 0
GPU 0000:00:03.0 was successfully reset.
All done.

@krasin

krasin commented Jun 1, 2015

More stats: in total, I launched 7 instances (with at most 5 running concurrently) and only observed the issue once. This indicates it is not a system-image issue. More likely the GPU can get into a bad state that cannot be cleared with nvidia-smi or by rebooting.

@dmaniry

dmaniry commented Jul 28, 2015

I have this exact problem. The GPU is a GTX Titan.

@jayelm

jayelm commented Dec 1, 2015

Same problem as @krasin on a g2.2xlarge instance; I followed the tutorial in the wiki. Whether or not I get an illegal memory access error seems arbitrary: sometimes it fails instantly, sometimes the net I'm training actually runs for a couple of seconds...

@HeddaZhu

I also have this problem. Has anyone solved it?
CUDA 7.0, GPU is a GTX 980.
Thanks!

@mtamburrano
Contributor

I just encountered this issue; in my case it seems to be due to the use of rectangular kernels in conv layers.
Reshaping the kernels into square ones resolved the issue.
Note that I also used rectangular kernels in the pooling layers, and those work properly.
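
For illustration, here is a minimal pycaffe NetSpec sketch (layer names and sizes are made up, not taken from this report) showing a rectangular conv kernel alongside a square replacement:

# Hypothetical sketch: a conv layer with a rectangular kernel (kernel_h != kernel_w)
# next to a square-kernel variant; names and dimensions are illustrative only.
import caffe
from caffe import layers as L

n = caffe.NetSpec()
n.data = L.DummyData(shape=dict(dim=[1, 3, 224, 224]))
# rectangular kernel of the kind reported to trigger the failure:
n.conv_rect = L.Convolution(n.data, num_output=16, kernel_h=3, kernel_w=7, stride=1)
# square kernel of the kind reported to work:
n.conv_square = L.Convolution(n.data, num_output=16, kernel_size=5, stride=1)
print(n.to_proto())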

@catsdogone

I have a similar problem:
F0309 11:30:48.307298 892 syncedmem.hpp:19] Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered
*** Check failure stack trace: ***
Aborted (core dumped)

When I run demo.py with gpu_id = 0, it is OK. But when I set gpu_id = 1, 2, or 3 (I have 4 GPUs), the problem arises.

@wangdelp

@catsdogone your solution works for me. Thx

@catsdogone

@wangdelp I found the problem: I had deleted the line "cfg.GPU_ID = args.gpu_id" in demo.py for some incidental reason. I restored it and everything is OK now.
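
For reference, a sketch (from memory, not an exact copy of the file) of what the GPU-selection block in a py-faster-rcnn style demo.py looks like; setup_gpu is a hypothetical wrapper, and the point is that cfg.GPU_ID and the Caffe device must be set together:

# Approximate sketch of the GPU selection in a py-faster-rcnn style demo.py
# (not copied verbatim). If cfg.GPU_ID is not kept in sync with the Caffe
# device, later cfg-driven allocations can land on the wrong GPU and
# trigger illegal memory accesses.
import caffe
from fast_rcnn.config import cfg

def setup_gpu(gpu_id):
    caffe.set_mode_gpu()
    caffe.set_device(gpu_id)
    cfg.GPU_ID = gpu_id  # the line that had been accidentally removed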

@ibmua

ibmua commented May 21, 2016

Just had this problem with my fairly large net on a Titan X. I was using a batch size of 128; it works fine with a batch size of 64.
But it still happens from time to time; it's unstable. My auto-trainer, which should run nonstop, crashes along the way because of this. Smaller batches seem to help, though.

@jaredstarkey

I got this error when running make runtest. I can confirm this same error with CUDA 7.5 / 367 drivers on Ubuntu 16.04 with a GTX 980 and a GTX 1080.

[ RUN ] GPUStochasticPoolingLayerTest/0.TestGradient
F0606 15:22:28.586875 3962 math_functions.cu:381] Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE
*** Check failure stack trace: ***
@ 0x7f90b864e5cd google::LogMessage::Fail()
@ 0x7f90b8650433 google::LogMessage::SendToLog()
@ 0x7f90b864e15b google::LogMessage::Flush()
@ 0x7f90b8650e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f90b6264a16 caffe::caffe_gpu_rng_uniform<>()
@ 0x7f90b629d0bb caffe::PoolingLayer<>::Forward_gpu()
@ 0x477a6d caffe::Layer<>::Forward()
@ 0x4ed8f4 caffe::GradientChecker<>::CheckGradientSingle()
@ 0x782fae caffe::GPUStochasticPoolingLayerTest_TestGradient_Test<>::TestBody()
@ 0x90d923 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x906f3a testing::Test::Run()
@ 0x907088 testing::TestInfo::Run()
@ 0x907165 testing::TestCase::Run()
@ 0x90843f testing::internal::UnitTestImpl::RunAllTests()
@ 0x908763 testing::UnitTest::Run()
@ 0x46d04d main
@ 0x7f90b5422830 __libc_start_main
@ 0x474a39 _start
@ (nil) (unknown)
Makefile:525: recipe for target 'runtest' failed
make: *** [runtest] Aborted (core dumped)

@jaredstarkey

Removed the GTX 980. Here's another trace.

[----------] 3 tests from GPUStochasticPoolingLayerTest/1, where TypeParam = double
[ RUN ] GPUStochasticPoolingLayerTest/1.TestStochastic
F0606 15:53:56.730144 26116 math_functions.cu:394] Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE
*** Check failure stack trace: ***
@ 0x7f91781bb5cd google::LogMessage::Fail()
@ 0x7f91781bd433 google::LogMessage::SendToLog()
@ 0x7f91781bb15b google::LogMessage::Flush()
@ 0x7f91781bde1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f9175dd1bc4 caffe::caffe_gpu_rng_uniform<>()
@ 0x7f9175e08c8b caffe::PoolingLayer<>::Forward_gpu()
@ 0x47779d caffe::Layer<>::Forward()
@ 0x781a87 caffe::GPUStochasticPoolingLayerTest_TestStochastic_Test<>::TestBody()
@ 0x90d923 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x906f3a testing::Test::Run()
@ 0x907088 testing::TestInfo::Run()
@ 0x907165 testing::TestCase::Run()
@ 0x90843f testing::internal::UnitTestImpl::RunAllTests()
@ 0x908763 testing::UnitTest::Run()
@ 0x46d04d main
@ 0x7f9174f8f830 __libc_start_main
@ 0x474a39 _start
@ (nil) (unknown)
Makefile:525: recipe for target 'runtest' failed

@Darwin2011

@jaredstarkey with CUDA 8, the problem no longer appears on my side.

@jiong3

jiong3 commented Sep 4, 2016

@jaredstarkey Have you found a solution to the error? I'm getting exactly the same error, also on Ubuntu 16.04 with CUDA 7.5, but with a GTX 1070.

@jaredstarkey

jaredstarkey commented Sep 6, 2016

We ended up building a different system and tried again. Following the install guide, we were able to install without problems and pass the tests. My honest suspicion is that we messed up the prerequisite installation and just needed to clean up our dependencies. We did resolve the issue, but since we were focused on many other things, I can't say exactly what fixed our Caffe problems.

@sbrugman

I'm having the same error, with Ubuntu 16.04, CUDA 7.5 and a GTX 1070.

@Wizardofoddz

Download CUDA 8.0; it should fix it.

@stig11

stig11 commented Sep 19, 2016

I'm having the same error, with Ubuntu 16.04 and a GTX 1070, when running make runtest. I downloaded the CUDA 8 installer from the NVIDIA website and installed it, but when I run nvcc -V I get:

Cuda compilation tools, release 7.5, V7.5.17

Is 7.5 used by default or something, and if so, how do you change it? Thanks in advance.

@ibmua

ibmua commented Sep 19, 2016

@stig11 I'd check ~/.bashrc. It looks like your PATH and LD_LIBRARY_PATH might still point to the old CUDA 7.5 installation (which you might want to uninstall, actually). Make sure they point to the current installation and not the old one, and open a new terminal after changing .bashrc.

@xhuvom

xhuvom commented Oct 20, 2016

Caffe version 0.15.14 with DIGITS 5.1: DetectNet training error with CUDA 8.0 on Ubuntu 14.04, backed by a GTX 1080.
Terminal console output:


2016-10-20 18:40:59 [20161020-184058-da4a] [INFO ] Task subprocess args: "/usr/bin/caffe train --solver=/home/xhuv/digits/digits/jobs/20161020-184058-da4a/solver.prototxt --gpu=0 --weights=/home/xhuv/digits/googlenet.caffemodel"
2016-10-20 18:41:31 [20161020-184058-da4a] [ERROR] Train Caffe Model: Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0)  CURAND_STATUS_LAUNCH_FAILURE
2016-10-20 18:43:26 [20161020-184058-da4a] [ERROR] Train Caffe Model task failed with error code -6

Please help!

see issue #1186

@yfor1008

yfor1008 commented Nov 8, 2016

@catsdogone your solution works for me. Thx

@nikAleksandr

nikAleksandr commented Dec 23, 2016

Having the same issue with an NVIDIA 1070:

F1223 18:31:38.033812 10676 math_functions.cu:79] Check failed: error == cudaSuccess (74 vs. 0) misaligned address *** Check failure stack trace: ***

@zimenglan-sysu-512

@nikAleksandr have you solved your problem?

@aseuteurideu

aseuteurideu commented Jan 17, 2017

I got the misaligned address error (same as @nikAleksandr), and the workaround from @ibmua of reducing the batch size sometimes works. But sometimes the error persists even with a batch size of 1.
I use a Titan X and CUDA 8.0.

Additional information:

I tried to find out what causes the misaligned address. For the same network, I reduced the batch size, and what changes is the "Reallocating workspace storage:" value (visible in the terminal when Caffe initializes training). The smaller batch works while the bigger batch hits the misaligned address error.

The "Reallocating workspace storage:" message is printed by cudnn_conv_layer.cpp:194. The total_max_workspace is the maximum of 3 variables: total_workspace_fwd, total_workspace_bwd_data, and total_workspace_bwd_filter.

  • total_workspace_fwd is taken from cudnnGetConvolutionForwardWorkspaceSize function.
  • total_workspace_bwd_data is taken from cudnnGetConvolutionBackwardDataWorkspaceSize function.
  • total_workspace_bwd_filter is taken from cudnnGetConvolutionBackwardFilterWorkspaceSize function.

In this specific case, with the smaller batch, total_workspace_fwd is reduced, and it is the maximum of the three variables. That is why the error disappears when the batch size is reduced.

However, I don't see the same behavior in other experiments. Sometimes when I reduce the batch size, the values stay the same or even increase.

So, reducing the batch size sometimes works and sometimes doesn't. I don't yet know how cuDNN calculates the workspace size.

Also, note that the workspace size is multiplied by the number of groups. So if you have many groups (e.g. in a channel-wise case), the workspace size is multiplied accordingly and the misaligned address error can appear.
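
A rough sketch in plain Python (paraphrasing the sizing behavior described above, not Caffe's actual C++ code) of how the workspace request is formed:

# Paraphrase of the workspace sizing described above (illustrative, not Caffe's source):
# take the maximum of the fwd / bwd-data / bwd-filter workspace sizes reported by
# cuDNN, then scale by the number of groups.
def total_workspace_bytes(workspace_fwd, workspace_bwd_data, workspace_bwd_filter, group):
    total_max_workspace = max(workspace_fwd, workspace_bwd_data, workspace_bwd_filter)
    return total_max_workspace * group  # many groups => a much larger allocation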

Correct me if I'm wrong, and please share if you know more than this.

Thank you

@deep-unlearn

Hello,

In my case, the "top" and "bottom" blobs of the "Deconvolution" layers were the same (the same variable but with a different num_output), and this somehow caused the data to be overwritten incorrectly, producing the error. As soon as I changed them so that input and output are stored in separate blobs, the problem was solved.

My advice is to store each layer's variables in separate memory locations (as in the sketch below); this may solve the problem.
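
For illustration, a minimal NetSpec sketch (blob names are hypothetical) where the Deconvolution layer writes into its own top blob instead of reusing its bottom:

# Hypothetical sketch: the Deconvolution layer gets a separate output blob
# (n.upsampled) rather than overwriting its input blob (n.feat) in place.
import caffe
from caffe import layers as L

n = caffe.NetSpec()
n.feat = L.DummyData(shape=dict(dim=[1, 64, 32, 32]))
n.upsampled = L.Deconvolution(n.feat,
                              convolution_param=dict(num_output=32,
                                                     kernel_size=4, stride=2))
print(n.to_proto())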

Regards

@aseuteurideu

@nikAleksandr I solved some of my misaligned address problems by using cuDNN 5.1 instead of cuDNN 5.
Even so, it still doesn't work for all of my misaligned address cases.

*I use CUDA 8.0 and a Titan X

@Fang-Haoshu

Ran into the same issue. Then I found that one of the paths in LIBRARY_DIRS in my Makefile.config pointed to an old CUDA 7.5 installation. Deleting it fixed my problem.

@RookieLCode

@kalkaneus I also encountered this misaligned address problem.
It can be reproduced on my machine by solving the following network:

  • layer1: DummyData 1x1x255x255
  • layer2: Convolution num_output:1 kernel_size:1 stride:1 pad:0
  • layer3: Convolution num_output:1 kernel_size:3 stride:1 pad:1
  • layer4: EuclideanLoss (layer1's top and layer3's top)

I use CUDA 8.0, cuDNN 5.1 and a Titan X.
Additionally, compiling Caffe without cuDNN avoids this problem. Changing the size from 255 to an even number also avoids the problem.
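
For convenience, a pycaffe NetSpec sketch (blob names invented) that generates the reproduction network listed above:

# Sketch of the reproduction network described above; blob names are illustrative.
import caffe
from caffe import layers as L

n = caffe.NetSpec()
n.data = L.DummyData(shape=dict(dim=[1, 1, 255, 255]))              # layer1
n.conv1 = L.Convolution(n.data, num_output=1, kernel_size=1,
                        stride=1, pad=0)                            # layer2
n.conv2 = L.Convolution(n.conv1, num_output=1, kernel_size=3,
                        stride=1, pad=1)                            # layer3
n.loss = L.EuclideanLoss(n.conv2, n.data)                           # layer4
with open('repro.prototxt', 'w') as f:
    f.write(str(n.to_proto()))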

@aseuteurideu

@RookieLCode Yeah... compiling Caffe without cuDNN avoids the problem, but somehow my training becomes too slow.
I tried the Caffe version from NVIDIA's GitHub. It is more stable and my problem is solved. I guess this version has been tested by NVIDIA.

@seokhoonboo

@kalkaneus As a temporary workaround, the problem can be solved by changing the 'engine' parameter to '1' (CAFFE) in the convolution layer where the error occurs. The other convolution layers can still use the cuDNN engine.
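
A minimal sketch (pycaffe NetSpec; layer names are hypothetical) of pinning just one convolution layer to the CAFFE engine while the others keep the default (cuDNN); in a hand-written prototxt this corresponds to engine: CAFFE inside convolution_param:

# Hypothetical sketch: only the problematic layer is forced onto the CAFFE engine;
# the other convolution layers keep the default (cuDNN) engine.
import caffe
from caffe import layers as L, params as P

n = caffe.NetSpec()
n.data = L.DummyData(shape=dict(dim=[1, 3, 64, 64]))
n.conv1 = L.Convolution(n.data, num_output=16, kernel_size=3, pad=1)  # default engine
n.conv2 = L.Convolution(n.conv1, num_output=16, kernel_size=3, pad=1,
                        engine=P.Convolution.CAFFE)                   # CAFFE engine
print(n.to_proto())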

@aseuteurideu

Compiling Caffe without cuDNN solved my problem, but training became slow.

The solution from @seokhoonboo of changing the 'engine' to '1' (CAFFE) also works, and is faster than dropping cuDNN entirely, since only the problematic layers skip cuDNN.

Then I found NVIDIA's Caffe, with which I can use cuDNN for all my layers without errors.

@Godricly

Godricly commented May 12, 2017

I got the misaligned address issue with CUDA 8.0.61, cuDNN 5.1.10, driver 375.26 on a GTX 1080.
Reverting cuDNN to 5.0.5 solved the issue.
Just FYI, maybe you can try this.

@HencyChen

I've tried using only gpu_id=0 and I'm using CUDA 8, but it still doesn't work. However, when I lower the batch size, it runs perfectly, so I guess it's a lack of GPU memory.

@shiyuangogogo

@Godricly Hi. You wrote that you got the misaligned address issue with CUDA 8.0.61, cuDNN 5.1.10 and driver 375.26 on a GTX 1080, and that reverting cuDNN to 5.0.5 solved it. Did you mean you hit the error "Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered"?

@Godricly

I don't remember it now. 😞 I think so.

@zhonhel

zhonhel commented Apr 22, 2018

Having the same issue with an NVIDIA 1070.

@nann93

nann93 commented May 18, 2018

Hello, I have encountered the same problem:

Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered

Some information about my machine: Ubuntu 16.04, 2 GPUs.

nvidia-smi shows GPU 0 at 29% usage and GPU 1 at 23% usage.

I don't think the GPUs are insufficient for my Caffe code, so how should I deal with this error?

Thanks!

@asa008

asa008 commented Mar 5, 2019

math_functions.cu:28] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0) CUBLAS_STATUS_EXECUTION_FAILED

The above error can occur with a CUDA 9.0 installation.
Installing Patch 2 (released Mar 5, 2018) solves it.

@jack1yang

I had the same issue with a Titan X. I tried many methods and searched Google, but still couldn't solve it. Finally, I found that GPU 0's memory was full, and things only return to normal when there is enough memory on GPU 0. I think even if you set gpu_id=1, your code still uses GPU 0.

@Shimingyi

Same problem!!
Titan Xp machine. I ran the network on another card which has enough memory, but it always gave me error code 77, an illegal memory access.
Solution: shut down the program running on the first card, because Caffe will try to allocate some memory on the first card first. So weird!!
