
Training stops at iteration 0 with no error message or probable cause? #41

Closed
alfredox10 opened this issue Oct 10, 2015 · 5 comments

@alfredox10

Here is the output of the training command: http://pastebin.com/ATPCBjQd

Here is the net I'm using to train it: http://pastebin.com/H1gLW8Lv
I recently upgraded from an older version of Caffe and had to change some of the parameter names; here is the previous net: http://pastebin.com/43Utkkpe

You will notice that transform_param had to be removed from the HDF5 data layer, because apparently that layer doesn't support it in 0.13. Can you check whether you see any issues? Caffe passes all the make runtests fine, and it performs image detections fine with previously trained caffemodel files. I did notice that my video card stays at 0% usage even after training kicks off. I'm not sure whether that means the device isn't being selected, but I know the GPU works, because the runtests used it and the make config file is set to use the GPU.
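
Since the HDF5 layer won't take transform_param, I assume any scaling and mean subtraction now has to happen when the HDF5 files are written. A minimal sketch of that idea (the shapes, dataset names and scale value are placeholders, not my actual data):

import h5py
import numpy as np

# Placeholder data; the real script would load actual images and labels.
X = np.random.randint(0, 256, size=(100, 3, 64, 64)).astype(np.float32)
y = np.random.randint(0, 2, size=(100, 1)).astype(np.float32)

# Do what transform_param used to do, before writing the file:
X *= 0.00390625                               # placeholder scale (1/256)
X -= X.mean(axis=(0, 2, 3), keepdims=True)    # per-channel mean subtraction

with h5py.File('train.h5', 'w') as f:
    f.create_dataset('data', data=X)   # names must match the layer's top blobs
    f.create_dataset('label', data=y)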

Software:
Caffe v0.13
CUDA 7.5
NVIDIA 352 driver
CuDNN v3

@lukeyeager lukeyeager added the bug label Oct 10, 2015
@lukeyeager
Member

Looks like this could be the same problem @antran89 reported in NVIDIA/DIGITS#347?

@alfredox10
Author

I looked through it, but it doesn't seem to be the same issue. For starters, I'm not using DIGITS; this is all running from Python scripts. I also don't get the message about:
I1009 10:16:44.213986 14899 blocking_queue.cpp:50] Waiting for data

On another note, I noticed that while it's stuck on iteration 0, Caffe is still running: it's using 2.4 GB of RAM, 12% CPU, and 1-6% GPU. So I'm not sure it's using the GPU correctly, because otherwise training would go a lot faster than this. However, I did see in the Caffe log that it's set to use the GPU:
solver_mode: GPU
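
To rule out a device-selection problem on my side, I'll also double-check that the script selects the GPU explicitly. A minimal pycaffe sketch of what I mean (the device id and solver path are placeholders):

import caffe

# Select the GPU explicitly instead of relying only on solver_mode in the prototxt.
caffe.set_device(0)          # placeholder device id
caffe.set_mode_gpu()

solver = caffe.SGDSolver('solver.prototxt')   # placeholder path
solver.solve()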

@alfredox10
Author

OK, this time I let the training run for a while at iteration 0 to see whether it had really failed or not. It sat on iteration 0 for over an hour, but it does eventually make progress and produce output. However, it now seems even more evident that for some reason the code is not running on the GPU. This is the output so far:

I1009 20:43:18.229598 25324 net.cpp:274] Network initialization done.
I1009 20:43:18.229603 25324 net.cpp:275] Memory required for data: 211200264
I1009 20:43:18.229746 25324 solver.cpp:45] Solver scaffolding done.
I1009 20:43:18.229771 25324 caffe.cpp:179] Starting Optimization
I1009 20:43:18.229779 25324 solver.cpp:269] Solving LogisticRegressionNet
I1009 20:43:18.229782 25324 solver.cpp:270] Learning Rate Policy: step
I1009 20:43:18.229792 25324 solver.cpp:314] Iteration 0, Testing net (#0)
I1009 22:10:26.562768 25324 solver.cpp:363] Test net output #0: accuracy1 = 0.496854
I1009 22:10:26.562873 25324 solver.cpp:363] Test net output #1: loss1 = 0.692814 (* 1 = 0.692814 loss)
I1009 22:11:01.647912 25324 solver.cpp:217] Iteration 0, loss = 0.692216
I1009 22:11:01.648032 25324 solver.cpp:234] Train net output #0: loss1 = 0.692216 (* 1 = 0.692216 loss)
I1009 22:11:01.648051 25324 solver.cpp:511] Iteration 0 (0/s), lr = 0.001
I1009 23:12:13.552716 25324 solver.cpp:217] Iteration 100, loss = 0.68043
I1009 23:12:13.552844 25324 solver.cpp:234] Train net output #0: loss1 = 0.68043 (* 1 = 0.68043 loss)
I1009 23:12:13.552857 25324 solver.cpp:511] Iteration 100 (0.0272338/s), lr = 0.001
I1010 00:13:41.020321 25324 solver.cpp:217] Iteration 200, loss = 0.642775
I1010 00:13:41.020479 25324 solver.cpp:234] Train net output #0: loss1 = 0.642775 (* 1 = 0.642775 loss)
I1010 00:13:41.020496 25324 solver.cpp:511] Iteration 200 (0.0271189/s), lr = 0.001
I1010 01:18:33.475559 25324 solver.cpp:217] Iteration 300, loss = 0.623129
I1010 01:18:33.475679 25324 solver.cpp:234] Train net output #0: loss1 = 0.623129 (* 1 = 0.623129 loss)
I1010 01:18:33.475693 25324 solver.cpp:511] Iteration 300 (0.0256907/s), lr = 0.001
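
As a quick sanity check on the reported rate, the gap between the iteration-100 and iteration-200 timestamps above works out to the same ~0.027 iterations/s that the solver prints:

from datetime import datetime

# Timestamps copied from the log lines for iterations 100 and 200.
t100 = datetime.strptime("2015-10-09 23:12:13", "%Y-%m-%d %H:%M:%S")
t200 = datetime.strptime("2015-10-10 00:13:41", "%Y-%m-%d %H:%M:%S")

elapsed = (t200 - t100).total_seconds()   # ~3688 s for 100 iterations
print(100 / elapsed)                      # ~0.027 iterations/s, matching solver.cpp:511

That's roughly an hour per 100 iterations, which is why it looked stuck at iteration 0.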

@lukeyeager
Member

@thatguymike you said this was a tricky multi-GPU race condition? Any progress on that?

Also, I've got two guys complaining about system reboots:

"it will restart my workstation" (NVIDIA/DIGITS#347 (comment))

"the ubuntu system shutted down" (NVIDIA/DIGITS#347 (comment))

That thread seems possibly related to this one. Is there any way this could cause a reboot?

@lukeyeager
Member

@alfredox10 can you try using v0.14 to see if that solves your issue?
