Resume training fails #786 (Closed)

dutran opened this issue Jul 25, 2014 · 17 comments

dutran commented Jul 25, 2014

Hi all,

I was trying to resume training (from iteration 25,000) and got the message below. Does anyone have ideas or hints? Please help me out.

Many thanks,
Du

I0725 01:06:17.916695 10039 solver.cpp:66] Restoring previous solver status from convnet_iter_25000.solverstate
I0725 01:06:18.531533 10039 solver.cpp:312] SGDSolver: restoring history
I0725 01:06:18.621152 10039 solver.cpp:106] Iteration 25000, Testing net
I0725 01:08:52.277266 10039 solver.cpp:147] Test score #0: 0.3901
I0725 01:08:52.277325 10039 solver.cpp:147] Test score #1: 3.1283
F0725 01:08:55.576004 10039 syncedmem.cpp:55] Check failed: error == cudaSuccess (4 vs. 0) unspecified launch failure
*** Check failure stack trace: ***
@ 0x7f1c4da37b4d google::LogMessage::Fail()
@ 0x7f1c4da3bb67 google::LogMessage::SendToLog()
@ 0x7f1c4da399e9 google::LogMessage::Flush()
@ 0x7f1c4da39ced google::LogMessageFatal::~LogMessageFatal()
@ 0x4709f3 caffe::SyncedMemory::to_gpu()
@ 0x470579 caffe::SyncedMemory::mutable_gpu_data()
@ 0x45aadd caffe::Blob<>::mutable_gpu_data()
@ 0x4465dc caffe::SGDSolver<>::ComputeUpdateValue()
@ 0x44776e caffe::Solver<>::Solve()
@ 0x41af86 main
@ 0x7f1c4ad09cdd __libc_start_main
@ 0x41abe9 (unknown)
Aborted

@Yangqing (Member)

Might be a CUDA error rather than a Caffe one? Check whether all the Caffe tests pass. It might also be an out-of-memory issue if your model/batch is too big.
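
For reference, Caffe's unit tests can be run with make runtest, and nvidia-smi shows live GPU memory use. Whether the resumed run really is close to the memory limit can also be checked directly from CUDA; below is a minimal standalone sketch (not part of Caffe) using cudaMemGetInfo:

// check_mem.cu -- standalone sketch; build with: nvcc check_mem.cu -o check_mem
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  size_t free_bytes = 0, total_bytes = 0;
  // Query how much device memory is currently free vs. the total installed.
  cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
  if (err != cudaSuccess) {
    std::printf("cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  std::printf("GPU memory: %.0f MB free of %.0f MB total\n",
              free_bytes / (1024.0 * 1024.0), total_bytes / (1024.0 * 1024.0));
  return 0;
}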

@OpenHero

Hi @Yangqing & @dutran, FYI: #707 #727

dutran (Author) commented Jul 25, 2014

I think it is very likely out of memory, as my model is quite big. Maybe the extra memory overhead when resuming pushes it over the limit, since training without resuming works fine (GPU memory is around 12 GB; training without resuming uses ~11 GB).

dutran (Author) commented Aug 13, 2014

I tried this with a smaller model and still got the same issue. I think the problem is not about memory, but something in CUDA or Caffe.
Please help if you have any hints on this.
Thanks a lot!

dutran (Author) commented Aug 13, 2014

Resuming on CPU works OK, but on GPU it does not. Maybe a CUDA problem? The error happens at the line marked ==> in SyncedMemory::to_gpu() below.

Thanks a lot!

inline void SyncedMemory::to_gpu() {
  switch (head_) {
  case UNINITIALIZED:
    CUDA_CHECK(cudaMalloc(&gpu_ptr_, size_));
    CUDA_CHECK(cudaMemset(gpu_ptr_, 0, size_));
    head_ = HEAD_AT_GPU;
    break;
  case HEAD_AT_CPU:
    if (gpu_ptr_ == NULL) {
      CUDA_CHECK(cudaMalloc(&gpu_ptr_, size_));
    }
==> CUDA_CHECK(cudaMemcpy(gpu_ptr_, cpu_ptr_, size_, cudaMemcpyHostToDevice));
    head_ = SYNCED;
    break;
  case HEAD_AT_GPU:
  case SYNCED:
    break;
  }
}
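
A general CUDA note, not specific to Caffe: "unspecified launch failure" is a sticky error left over from an earlier asynchronous kernel launch, so the CUDA_CHECK that reports it (here the cudaMemcpy) is often just the first synchronizing call after the real fault. A hypothetical helper, sketched below and not part of Caffe, could be dropped in after suspect launches to surface the error closer to its source; setting the environment variable CUDA_LAUNCH_BLOCKING=1 has a similar effect:

#include <cuda_runtime.h>
#include <glog/logging.h>

// Hypothetical debugging helper: call immediately after a suspect kernel
// launch so asynchronous errors are reported where they actually happen.
inline void CheckLastKernel(const char* where) {
  cudaError_t err = cudaGetLastError();   // error from the launch itself
  CHECK_EQ(err, cudaSuccess) << where << ": " << cudaGetErrorString(err);
  err = cudaDeviceSynchronize();          // error raised while the kernel ran
  CHECK_EQ(err, cudaSuccess) << where << ": " << cudaGetErrorString(err);
}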

dutran (Author) commented Aug 13, 2014

Thank you all for your help, solved my problem!

Cheers,
Du

@chocolate9624

Hi @dutran, I am running into the same problem. Could you share the solution? Thanks!

dutran (Author) commented Nov 25, 2014

@chocolate9624: I was under-allocating memory on the CPU side. I guess cudaMemcpy hits the problem because cpu_ptr_ points to a buffer smaller than size_.
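
To make that failure mode concrete, here is a hypothetical sketch (names invented for illustration, not actual Caffe code) of the mismatch described above: the host buffer is smaller than the number of bytes the copy transfers, so the host-to-device transfer reads past the end of the allocation, which is undefined behavior and can crash at the cudaMemcpy or corrupt memory and fail later:

#include <cuda_runtime.h>
#include <cstdlib>

int main() {
  const size_t size_ = 1024 * 1024 * sizeof(float);  // bytes the copy will transfer
  // BUG: host buffer holds only 1024 floats, far less than size_ bytes.
  float* cpu_ptr_ = static_cast<float*>(std::malloc(1024 * sizeof(float)));
  void* gpu_ptr_ = NULL;
  cudaMalloc(&gpu_ptr_, size_);
  // Reads size_ bytes from an undersized host buffer: undefined behavior.
  cudaMemcpy(gpu_ptr_, cpu_ptr_, size_, cudaMemcpyHostToDevice);
  cudaFree(gpu_ptr_);
  std::free(cpu_ptr_);
  return 0;
}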

@chocolate9624

@dutran Do you mean your CPU memory is not enough for running Caffe in GPU mode, even though CPU mode is OK? Thanks!

@chocolate9624

I found the problem; it was an issue with my data. Thanks!

kuixu commented Aug 9, 2016

My problem was out of memory, thank you! @Yangqing

mrgloom commented Sep 3, 2016

Same error: Check failed: error == cudaSuccess (4 vs. 0) unspecified launch failure

I0904 01:09:19.388075 11377 sgd_solver.cpp:106] Iteration 475, lr = 0.01
F0904 01:09:24.065701 11377 cudnn_conv_layer.cu:139] Check failed: error == cudaSuccess (4 vs. 0)  unspecified launch failure
*** Check failure stack trace: ***
@     0x7f3a442bddaa  (unknown)
@     0x7f3a442bdce4  (unknown)
@     0x7f3a442bd6e6  (unknown)
@     0x7f3a442c0687  (unknown)
@     0x7f3a449fa907  caffe::CuDNNConvolutionLayer<>::Backward_gpu()
@     0x7f3a4489d568  caffe::Net<>::BackwardFromTo()
@     0x7f3a4489da11  caffe::Net<>::Backward()
@     0x7f3a4488c1f7  caffe::Solver<>::Step()
@     0x7f3a4488cabe  caffe::Solver<>::Solve()
@           0x40af86  train()
@           0x4086cc  main
@     0x7f3a42dbdf45  (unknown)
@           0x408e9d  (unknown)
@              (nil)  (unknown)

I'm using the AlexNet model with 256x256 images. I have a GTX 1070 with 8 GB of memory and 8 GB of host memory; during training, memory usage stayed below 4 GB, so I don't think this is a memory issue.

I'm using the NVIDIA branch 0.15, with CUDA 8.0 and cuDNN 5.1.

sayadyaghoobi commented Apr 24, 2017

I have the exact same issue; if anyone has an idea, please share. I think this is a CUDA problem rather than Caffe:
I0424 16:51:36.866385 4297 caffe.cpp:218] Using GPUs 0
I0424 16:51:36.868625 4297 caffe.cpp:223] GPU 0: �7�~�
F0424 16:51:36.868661 4297 common.cpp:152] Check failed: error == cudaSuccess (30 vs. 0) unknown error
*** Check failure stack trace: ***
@ 0x7f76f21485cd google::LogMessage::Fail()
@ 0x7f76f214a433 google::LogMessage::SendToLog()
@ 0x7f76f214815b google::LogMessage::Flush()
@ 0x7f76f214ae1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f76f2911012 caffe::Caffe::SetDevice()
@ 0x40b018 train()
@ 0x4072f0 main
@ 0x7f76f10b9830 __libc_start_main
@ 0x407b19 _start
@ (nil) (unknown)
Aborted (core dumped)
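
The garbled device name in the "GPU 0:" line together with error 30 (cudaErrorUnknown) from SetDevice usually points at the CUDA driver/runtime setup rather than at Caffe; Caffe's own device_query tool (caffe device_query -gpu 0) is a quick sanity check. Below is a small standalone sketch (not Caffe code) that verifies the runtime can enumerate the device and print its real name:

// device_query.cu -- standalone sketch; build with: nvcc device_query.cu -o device_query
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int count = 0;
  cudaError_t err = cudaGetDeviceCount(&count);
  if (err != cudaSuccess) {
    std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  for (int i = 0; i < count; ++i) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    // A readable name here suggests the driver/runtime pair is healthy.
    std::printf("GPU %d: %s (compute capability %d.%d)\n",
                i, prop.name, prop.major, prop.minor);
  }
  return 0;
}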

mrgloom commented Nov 4, 2017

One more crash, on a fresh master branch:

ERROR: Check failed: error == cudaSuccess (4 vs. 0) unspecified launch failure

Ignoring source layer train-data
Restarting data prefetching from start.
Test net output #0: accuracy = 0.139648
Test net output #1: loss = 3.7423 (* 1 = 3.7423 loss)
Iteration 3200 (0.794172 iter/s, 20.1468s/16 iters), loss = 2.61571
Train net output #0: loss = 2.61571 (* 1 = 2.61571 loss)
Iteration 3200, lr = 0.0001
Iteration 3216 (3.13362 iter/s, 5.10591s/16 iters), loss = 2.96576
Train net output #0: loss = 2.96576 (* 1 = 2.96576 loss)
Iteration 3216, lr = 0.0001
Iteration 3232 (2.92405 iter/s, 5.47187s/16 iters), loss = 3.12505
Train net output #0: loss = 3.12505 (* 1 = 3.12505 loss)
Iteration 3232, lr = 0.0001
Iteration 3248 (2.80573 iter/s, 5.70261s/16 iters), loss = 2.75908
Train net output #0: loss = 2.75908 (* 1 = 2.75908 loss)
Iteration 3248, lr = 0.0001
Iteration 3264 (2.87587 iter/s, 5.56354s/16 iters), loss = 3.07124
Train net output #0: loss = 3.07124 (* 1 = 3.07124 loss)
Iteration 3264, lr = 0.0001
Check failed: error == cudaSuccess (4 vs. 0)  unspecified launch failure

BTW: I have successfully run the AlexNet model with batch sizes 256 and 128, but with batch size 64 it crashed somewhere in the middle of training.

dong-x16 commented Nov 6, 2017

[screenshot of the error attached, 2017-11-05]

During training I hit this error. I don't know what's wrong with it; does anyone have ideas? Many thanks.

@sebastiangonsal

@chocolate9624 what was your problem?

@shaibagon (Member)

Please use the caffe-users list for usage, installation, or modeling questions, or other requests for help.
You may also post questions on Stack Overflow; make sure you tag them with the caffe tag.
There is also caffe.help documenting the different layers of caffe.
Do not post such requests to Issues. Doing so interferes with the development of Caffe.

Please read the guidelines for contributing before submitting this issue.
