error Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered #3
Comments
Got the same issue. Still have no clue.
The error seems to be GPU-specific, so I'm having trouble tracking it down. Running on a Titan, the error doesn't happen. What are your configurations?
@my89 I searched the Caffe issues and it seems to be a CUDA or cuDNN problem. I'm using a K40c card and CUDA 7.0 with cuDNN 4.0.7.
Can you point me to the thread you found? Maybe I can do something to work around the issue...
@my89 Here are the top 3 links from Google when I just searched the error message:
If you build the Caffe tests, do you get the same errors in the test cases?
In runtest I'm stuck at `[----------] 2 tests from CuDNNSoftmaxLayerTest/1 (1461 ms total)` and `[----------] 11 tests from AdaDeltaSolverTest/0, where TypeParam = caffe::CPUDevice`. This test also failed: `[ RUN ] HDF5OutputLayerTest/2.TestForward`
I added the files you were missing. My .gitignore was too aggressive. Pull the repository and retry.
@my89 Hi, thanks for updating the code. All tests passed perfectly, but the error when running the demo is still the same: `I1123 14:02:50.392503 8168 net.cpp:283] Network initialization done.`
@my89 I'm using a K40m.
@qioooo @my89 By removing the '--gpu 0' flag from the arguments I'm able to bypass this error and train the network. The reason I did this was that in this post (BVLC/caffe#2417) they got the same error message when using Amazon EC2 instances, but it only happened on one out of 7 instances, so I guessed the hardware might not be getting recognized. I removed the argument, letting Caffe auto-select an available GPU, and it seems to work now. I have finished training 504 epochs, but I don't quite understand what the results mean; I will open a different issue for that.
Hi @shuait, that does not fix it. It simply runs in CPU mode (which the code currently doesn't support, because it's too slow to use in my dev cycle), so it will output nonsense. I updated the code so this is clear. I have tracked down what I think is the source of the bug. Honestly, as far as I can tell it's a compiler bug, because adding a no-op statement in one of my CUDA kernels fixes it. See line 15 of caffe/src/caffe/layers/mil_frame_loss_layer.cu if you are interested in investigating further. I've tested the code on a Tesla K40c, CUDA 8.0, cuDNN 5.1.5, and it now matches the output on a Titan GTX.
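For anyone who wants to dig further, this failure mode is typical of a kernel thread indexing past the end of a device buffer. The sketch below is purely illustrative and is not the actual kernel in mil_frame_loss_layer.cu; the kernel name, sizes, and scaling operation are hypothetical. It shows the kind of per-thread bounds guard that keeps an over-sized launch from touching memory it doesn't own.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: scales n floats in place. Without the `if (i < n)`
// guard, threads in the last (partially filled) block would read and write
// past the end of `data`, which surfaces as an illegal memory access.
__global__ void ScaleKernel(float* data, int n, float alpha) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {  // bounds guard: threads beyond n do nothing
    data[i] *= alpha;
  }
}

int main() {
  const int n = 64 * 504;  // sizes chosen only to mirror the batches/frames in the logs
  float* d_data = nullptr;
  cudaMalloc(&d_data, n * sizeof(float));
  cudaMemset(d_data, 0, n * sizeof(float));

  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;  // rounds up, so the guard matters
  ScaleKernel<<<blocks, threads>>>(d_data, n, 2.0f);

  // Same idea as the post-launch check in the layer: ask the runtime whether
  // the launch produced an error before trusting the results.
  cudaError_t error = cudaGetLastError();
  if (error != cudaSuccess) {
    printf("kernel launch failed: %s\n", cudaGetErrorString(error));
  }
  cudaDeviceSynchronize();
  cudaFree(d_data);
  return 0;
}
```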
Reducing the batch size solved my problem.
When I run the test as described in the readme, I get this error:
I1121 19:14:36.428259 29498 mil_frame_loss_layer.cpp:147] args 1788
I1121 19:14:36.428267 29498 mil_frame_loss_layer.cpp:148] batches 64
I1121 19:14:36.428285 29498 mil_frame_loss_layer.cpp:149] frames 504
I1121 19:14:36.428292 29498 mil_frame_loss_layer.cpp:150] max_value 342
I1121 19:14:36.428297 29498 mil_frame_loss_layer.cpp:151] bbs 1
I1121 19:14:36.428305 29498 mil_frame_loss_layer.cpp:161] DONE ALLOCATION
I1121 19:14:36.428313 29498 mil_frame_loss_layer.cu:340] MIL START FORWARD 64 1 predict only=0 size=3
I1121 19:14:36.428319 29498 mil_frame_loss_layer.cu:341] 504
F1121 19:14:36.568738 29498 mil_frame_loss_layer.cu:348] Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered
*** Check failure stack trace:
@ 0x7fb8533dadaa (unknown)
@ 0x7fb8533dace4 (unknown)
@ 0x7fb8533da6e6 (unknown)
@ 0x7fb8533dd687 (unknown)
@ 0x7fb853bc3681 caffe::MILFrameLossLayer<>::Forward_gpu()
@ 0x7fb853a353a5 caffe::Net<>::ForwardFromTo()
@ 0x7fb853a35717 caffe::Net<>::Forward()
@ 0x7fb853a26217 caffe::Solver<>::Step()
@ 0x7fb853a26ad9 caffe::Solver<>::Solve()
@ 0x40876b train()
@ 0x405b6c main
@ 0x7fb8523e5ec5 (unknown)
@ 0x4063db (unknown)
@ (nil) (unknown)
Aborted (core dumped)
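For reference, CUDA error code 77 is cudaErrorIllegalAddress, i.e. some thread read or wrote a device address it shouldn't have. The check that aborts at mil_frame_loss_layer.cu:348 appears to follow the usual pattern of comparing the status returned after a kernel launch against cudaSuccess. Below is a minimal, self-contained sketch of that pattern; the CHECK_CUDA macro and BadWriteKernel are stand-ins written for illustration, not the repository's code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Stand-in for the glog-style check that produced the failure above: compare
// the returned status against cudaSuccess and abort with the numeric code
// and message if they differ.
#define CHECK_CUDA(expr)                                                      \
  do {                                                                        \
    cudaError_t error = (expr);                                               \
    if (error != cudaSuccess) {                                               \
      fprintf(stderr, "Check failed: error == cudaSuccess (%d vs. 0) %s\n",   \
              static_cast<int>(error), cudaGetErrorString(error));            \
      abort();                                                                \
    }                                                                         \
  } while (0)

// Deliberately broken kernel: writes through a null device pointer, which the
// runtime reports as cudaErrorIllegalAddress (code 77) once the device syncs.
__global__ void BadWriteKernel(float* data) {
  data[threadIdx.x] = 1.0f;
}

int main() {
  BadWriteKernel<<<1, 32>>>(nullptr);
  CHECK_CUDA(cudaGetLastError());       // catches launch-configuration errors
  CHECK_CUDA(cudaDeviceSynchronize());  // surfaces the illegal access: (77 vs. 0)
  return 0;
}
```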