error Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered #3

Closed
qioooo opened this issue Nov 21, 2016 · 14 comments

Comments

@qioooo

qioooo commented Nov 21, 2016

When I run the test as described in the README, I get this error:

I1121 19:14:36.428259 29498 mil_frame_loss_layer.cpp:147] args 1788
I1121 19:14:36.428267 29498 mil_frame_loss_layer.cpp:148] batches 64
I1121 19:14:36.428285 29498 mil_frame_loss_layer.cpp:149] frames 504
I1121 19:14:36.428292 29498 mil_frame_loss_layer.cpp:150] max_value 342
I1121 19:14:36.428297 29498 mil_frame_loss_layer.cpp:151] bbs 1
I1121 19:14:36.428305 29498 mil_frame_loss_layer.cpp:161] DONE ALLOCATION
I1121 19:14:36.428313 29498 mil_frame_loss_layer.cu:340] MIL START FORWARD 64 1 predict only=0 size=3
I1121 19:14:36.428319 29498 mil_frame_loss_layer.cu:341] 504
F1121 19:14:36.568738 29498 mil_frame_loss_layer.cu:348] Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered
*** Check failure stack trace: ***
@ 0x7fb8533dadaa (unknown)
@ 0x7fb8533dace4 (unknown)
@ 0x7fb8533da6e6 (unknown)
@ 0x7fb8533dd687 (unknown)
@ 0x7fb853bc3681 caffe::MILFrameLossLayer<>::Forward_gpu()
@ 0x7fb853a353a5 caffe::Net<>::ForwardFromTo()
@ 0x7fb853a35717 caffe::Net<>::Forward()
@ 0x7fb853a26217 caffe::Solver<>::Step()
@ 0x7fb853a26ad9 caffe::Solver<>::Solve()
@ 0x40876b train()
@ 0x405b6c main
@ 0x7fb8523e5ec5 (unknown)
@ 0x4063db (unknown)
@ (nil) (unknown)
Aborted (core dumped)
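
For anyone reading logs like the one above: the `(77 vs. 0)` in the fatal line is the numeric CUDA error code compared against `cudaSuccess` (0), and the message text is the one the CUDA runtime reports for `cudaErrorIllegalAddress`. Because kernel launches are asynchronous, the check that reports the error is not necessarily next to the kernel that caused it. Below is a minimal Python sketch for pulling the code and message out of such a line. The function name `parse_cuda_check_failure` and the regular expression are illustrative assumptions based only on the log format shown in this thread; they are not part of Caffe.

```python
import re

# Pattern for the glog fatal line shown in this thread.
# Assumption: the layout is "F<mmdd> <time> <pid> <file>:<line>] Check failed: ...".
_CUDA_CHECK_RE = re.compile(
    r"^F\d{4} \S+\s+\d+ (?P<file>[^:\]]+):(?P<line>\d+)\] "
    r"Check failed: error == cudaSuccess "
    r"\((?P<code>\d+) vs\. 0\)\s+(?P<msg>.*)$"
)


def parse_cuda_check_failure(log_line):
    """Return file, line, code and msg as a dict, or None if the line does not match."""
    m = _CUDA_CHECK_RE.match(log_line.strip())
    if m is None:
        return None
    return {
        "file": m.group("file"),
        "line": int(m.group("line")),
        "code": int(m.group("code")),
        "msg": m.group("msg"),
    }
```

Feeding it the `F1121 ...` line from the log would give the source file, the line number of the failing check, and the error code, which makes it easier to compare reports from different machines.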

@shuait

shuait commented Nov 23, 2016

Got the same issue. Still have no clue.

@my89
Owner

my89 commented Nov 23, 2016

The error seems to be GPU-specific, so I'm having trouble tracking it down. It doesn't happen when running on a Titan. What are your configurations?

@shuait

shuait commented Nov 23, 2016

@my89 I searched the Caffe issues and it seems to be a CUDA or cuDNN problem. I'm using a K40c card and CUDA 7.0 with cuDNN 4.0.7.

@my89
Owner

my89 commented Nov 23, 2016

Can you point me to the thread you found? Maybe I can do something to work around the issue.

@shuait

shuait commented Nov 23, 2016

@my89 Here are the top 3 links from Google when I searched for the error message:
BVLC/caffe#4169
NVIDIA/DIGITS#598
BVLC/caffe#2417

@my89
Owner

my89 commented Nov 23, 2016

If you build the Caffe tests, do you get the same errors in the test cases?
In the caffe/build directory:
make -j8 test
make -j8 runtest

@shuait

shuait commented Nov 23, 2016

In runtest I'm stuck at:

`[----------] 2 tests from CuDNNSoftmaxLayerTest/1 (1461 ms total)

[----------] 11 tests from AdaDeltaSolverTest/0, where TypeParam = caffe::CPUDevice
[ RUN ] AdaDeltaSolverTest/0.TestLeastSquaresUpdateWithEverythingAccumShare
HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 47623243356672:
#000: ../../../src/H5F.c line 1586 in H5Fopen(): unable to open file
major: File accessibilty
minor: Unable to open file
#1: ../../../src/H5F.c line 1275 in H5F_open(): unable to open file: time = Wed Nov 23 12:56:46 2016
, name = 'src/caffe/test/test_data/solver_data.h5', tent_flags = 0
major: File accessibilty
minor: Unable to open file
#2: ../../../src/H5FD.c line 987 in H5FD_open(): open failed
major: Virtual File Layer
minor: Unable to initialize object
#3: ../../../src/H5FDsec2.c line 343 in H5FD_sec2_open(): unable to open file: name = 'src/caffe/test/test_data/solver_data.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0
major: File accessibilty
minor: Unable to open file
F1123 12:56:46.817138 6491 hdf5_data_layer.cpp:31] Failed opening HDF5 file: src/caffe/test/test_data/solver_data.h5
*** Check failure stack trace: ***
@ 0x2b50279a0daa (unknown)
@ 0x2b50279a0ce4 (unknown)
@ 0x2b50279a06e6 (unknown)
@ 0x2b50279a3687 (unknown)
@ 0x2b5026c4fd04 caffe::HDF5DataLayer<>::LoadHDF5FileData()
@ 0x2b5026c4e53b caffe::HDF5DataLayer<>::LayerSetUp()
@ 0x2b5026b03f8b caffe::Net<>::Init()
@ 0x2b5026b051b8 caffe::Net<>::Net()
@ 0x2b5026b1f67a caffe::Solver<>::InitTrainNet()
@ 0x2b5026b20bf2 caffe::Solver<>::Init()
@ 0x2b5026b20f4a caffe::Solver<>::Solver()
@ 0xabb168 caffe::AdaDeltaSolverTest<>::InitSolver()
@ 0xab8a03 caffe::GradientBasedSolverTest<>::InitSolverFromProtoString()
@ 0xa79e67 caffe::GradientBasedSolverTest<>::RunLeastSquaresSolver()
@ 0xab68d0 caffe::GradientBasedSolverTest<>::CheckAccumulation()
@ 0xda3fd3 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0xd9b99a testing::Test::Run()
@ 0xd9bae8 testing::TestInfo::Run()
@ 0xd9bbc5 testing::TestCase::Run()
@ 0xd9ce48 testing::internal::UnitTestImpl::RunAllTests()
@ 0xd9d123 testing::UnitTest::Run()
@ 0x89077f main
@ 0x2b502d188f45 (unknown)
@ 0x8921a2 (unknown)
@ (nil) (unknown)
Aborted (core dumped)
make[3]: *** [src/caffe/test/CMakeFiles/runtest] Error 134
make[2]: *** [src/caffe/test/CMakeFiles/runtest.dir/all] Error 2
make[1]: *** [src/caffe/test/CMakeFiles/runtest.dir/rule] Error 2
make: *** [runtest] Error 2
`

This test also failed:

'[ RUN ] HDF5OutputLayerTest/2.TestForward
HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 47623243356672:
#000: ../../../src/H5F.c line 1586 in H5Fopen(): unable to open file
major: File accessibilty
minor: Unable to open file
#1: ../../../src/H5F.c line 1275 in H5F_open(): unable to open file: time = Wed Nov 23 12:56:01 2016
, name = 'src/caffe/test/test_data/sample_data.h5', tent_flags = 0
major: File accessibilty
minor: Unable to open file
#2: ../../../src/H5FD.c line 987 in H5FD_open(): open failed
major: Virtual File Layer
minor: Unable to initialize object
#3: ../../../src/H5FDsec2.c line 343 in H5FD_sec2_open(): unable to open file: name = 'src/caffe/test/test_data/sample_data.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0
major: File accessibilty
minor: Unable to open file
/media/storage2/shuait/Git/SituationCrf/caffe/src/caffe/test/test_hdf5_output_layer.cpp:78: Failure
Expected: (file_id) >= (0), actual: -1 vs 0
Failed to open HDF5 filesrc/caffe/test/test_data/sample_data.h5
[ FAILED ] HDF5OutputLayerTest/2.TestForward, where TypeParam = caffe::GPUDevice (0 ms)'

@my89
Owner

my89 commented Nov 23, 2016

I added the files you are missing. My gitignore was too aggressive. Pull the repository and retry.
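
A quick way to confirm the files arrived after pulling is to check for the two paths named in the HDF5 errors above. This is only a sketch: the helper name `missing_test_files` is hypothetical, the list covers only the two files quoted in this thread (there may be others), and the paths are relative to the directory the tests were run from.

```python
import os

# Paths copied from the HDF5 error messages quoted in this thread.
# Assumption: this list is not exhaustive.
REQUIRED_TEST_FILES = [
    "src/caffe/test/test_data/solver_data.h5",
    "src/caffe/test/test_data/sample_data.h5",
]


def missing_test_files(root=".", required=REQUIRED_TEST_FILES):
    """Return the entries of `required` that do not exist as files under `root`."""
    return [p for p in required if not os.path.isfile(os.path.join(root, p))]
```

Calling `missing_test_files()` from the directory where `make runtest` was started should return an empty list once the data files are present.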

@shuait

shuait commented Nov 23, 2016

@my89 Hi, thanks for updating the code. All tests passed perfectly, but the error when running the demo is still the same:

I1123 14:02:50.392503 8168 net.cpp:283] Network initialization done.
I1123 14:02:50.397264 8168 hdf5.cpp:32] Datatype class: H5T_FLOAT
I1123 14:02:50.739897 8168 caffe.cpp:285] Running for 504 iterations.
I1123 14:02:50.741216 8168 multilabel_data_layer.cpp:213] 250
I1123 14:02:51.635288 8168 roi_pooling_layer.cpp:43] 50 , 512 , 14 , 14
I1123 14:02:51.785593 8168 mil_frame_loss_layer.cpp:123] 3 0
I1123 14:02:51.785616 8168 mil_frame_loss_layer.cpp:147] args 1788
I1123 14:02:51.785622 8168 mil_frame_loss_layer.cpp:148] batches 50
I1123 14:02:51.785629 8168 mil_frame_loss_layer.cpp:149] frames 504
I1123 14:02:51.785634 8168 mil_frame_loss_layer.cpp:150] max_value 342
I1123 14:02:51.785640 8168 mil_frame_loss_layer.cpp:151] bbs 1
I1123 14:02:51.785647 8168 mil_frame_loss_layer.cpp:161] DONE ALLOCATION
I1123 14:02:51.785655 8168 mil_frame_loss_layer.cu:340] MIL START FORWARD 50 1 predict only=0 size=3
I1123 14:02:51.785662 8168 mil_frame_loss_layer.cu:341] 504
F1123 14:02:52.286033 8168 mil_frame_loss_layer.cu:348] Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered
*** Check failure stack trace: ***
@ 0x7f7774a93daa (unknown)
@ 0x7f7774a93ce4 (unknown)
@ 0x7f7774a936e6 (unknown)
@ 0x7f7774a96687 (unknown)
@ 0x7f77751eee0e caffe::MILFrameLossLayer<>::Forward_gpu()
@ 0x7f7775000472 caffe::Net<>::ForwardFromTo()
@ 0x7f7775000587 caffe::Net<>::Forward()
@ 0x40a3a9 test()
@ 0x408333 main
@ 0x7f7773286f45 (unknown)
@ 0x408cb1 (unknown)
@ (nil) (unknown)
Aborted (core dumped)

@qioooo
Author

qioooo commented Nov 24, 2016

@my89 I have added the files, but I still have the same problem, same as @shuait.

@qioooo
Author

qioooo commented Nov 24, 2016

@my89 I’m using a K40m.

@shuait

shuait commented Nov 30, 2016

@qioooo @my89 By removing the '--gpu 0' flag from the arguments I was able to bypass this error and train the network. I tried this because in this post (BVLC/caffe#2417) they got the same error message on Amazon EC2 instances, but it only happened once out of 7 instances, so I guess the hardware might not be recognized. I removed the argument to let Caffe auto-select an available GPU. It seems to work now: I have finished training 504 epochs, but I don't quite understand what the results mean. I will open a separate issue for that.

@my89
Owner

my89 commented Dec 18, 2016

Hi,

@shuait that does not fix it. It simply runs in CPU mode (which the code currently doesn't support because it's too slow to use in my dev cycle), so it will output nonsense. I updated the code so this is clear.
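
To avoid falling into CPU mode by accident, a launcher script can refuse to start when no GPU flag is present. The sketch below is not the change made in the repository; the function name `requests_gpu` and the set of accepted flag spellings are assumptions for illustration only.

```python
def requests_gpu(argv):
    """Return True if the argument list contains a GPU selection flag.

    Accepts '-gpu N', '--gpu N', '-gpu=N' and '--gpu=N'.
    Assumption: these are the only spellings that matter here.
    """
    for arg in argv:
        # Strip an optional '=value' suffix before comparing the flag name.
        name = arg.split("=", 1)[0]
        if name in ("-gpu", "--gpu"):
            return True
    return False
```

A wrapper could call `requests_gpu(sys.argv[1:])` and exit with an error message when it returns `False`, rather than silently producing output from an unsupported code path.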

I have tracked down what I think is the source of the bug. Honestly, as far as I can tell it's a compiler bug, because adding a no-op statement in one of my CUDA kernels fixes it. See line 15 of caffe/src/caffe/layers/mil_frame_loss_layer.cu if you are interested in investigating further. I've tested the code on a Tesla K40c with CUDA 8.0 and cuDNN 5.1.5, and it now matches the output on a Titan GTX.
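
For anyone who wants to repeat the cross-GPU comparison on their own hardware, here is a small sketch for checking that two dumps of output values agree within a tolerance. It is not the check used above; the function name `outputs_match` and the default tolerances are assumptions, and how the values are dumped from each run is left open.

```python
import math


def outputs_match(a, b, rel_tol=1e-4, abs_tol=1e-6):
    """Return True if two sequences of floats have equal length and agree within tolerance.

    NaN never compares close to anything, so a NaN in either run counts as a mismatch.
    """
    if len(a) != len(b):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )
```

Exact equality is usually too strict for floating-point results from different GPUs, which is why the comparison is tolerance-based.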

my89 closed this as completed Dec 18, 2016
@Abhijeet-Bhilare

Reducing the batch size solved my problem.
