
Training stops at iteration 0 with no error message or probable cause? #41

Closed
alfredox10 opened this issue Oct 10, 2015 · 5 comments

@alfredox10

Here is the output of the training command: http://pastebin.com/ATPCBjQd

Here is the net I'm using to train it: http://pastebin.com/H1gLW8Lv
I recently upgraded from an older version of Caffe and had to change some of the parameter names; here is the previous net: http://pastebin.com/43Utkkpe

You will notice that transform_param had to be removed from the HDF5 data layer, because apparently that layer doesn't support it in 0.13. Can you check whether you see any issues? Caffe passes all the make runtests fine, and it performs image detections fine with previously trained caffemodel files. I did notice that my video card stays at 0% usage even after training kicks off. I'm not sure whether that means the device isn't being selected, but I know the GPU works, because the runtests used it and the make config file is set to use the GPU.
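
Since the HDF5 layer won't take transform_param, I assume any scaling and mean subtraction now has to happen when the HDF5 files are written. A minimal sketch of that idea (the shapes, dataset names and scale value are placeholders, not my actual data):

import h5py
import numpy as np

# Placeholder data; the real script would load actual images and labels.
X = np.random.randint(0, 256, size=(100, 3, 64, 64)).astype(np.float32)
y = np.random.randint(0, 2, size=(100, 1)).astype(np.float32)

# Do what transform_param used to do, before writing the file:
X *= 0.00390625                               # placeholder scale (1/256)
X -= X.mean(axis=(0, 2, 3), keepdims=True)    # per-channel mean subtraction

with h5py.File('train.h5', 'w') as f:
    f.create_dataset('data', data=X)   # names must match the layer's top blobs
    f.create_dataset('label', data=y)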

Software:
Caffe v0.13
CUDA 7.5
NVIDIA 352 driver
CuDNN v3

@lukeyeager lukeyeager added the bug label Oct 10, 2015
@lukeyeager
Member

Looks like this could be the same problem @antran89 reported in NVIDIA/DIGITS#347?

@alfredox10
Author

I looked through it, but it doesn't seem to be the same issue. For starters, I'm not using DIGITS; this is all running from Python scripts. I also don't get the message about:
I1009 10:16:44.213986 14899 blocking_queue.cpp:50] Waiting for data

On another note, I noticed that while it's stuck on iteration 0, Caffe is still running: it's using 2.4 GB of RAM, 12% CPU, and 1-6% GPU. So I'm not sure it's using the GPU correctly, because otherwise training would go a lot faster than this. However, I did see in the Caffe log that it's set to use the GPU:
solver_mode: GPU
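
To rule out a device-selection problem on my side, I'll also double-check that the script selects the GPU explicitly. A minimal pycaffe sketch of what I mean (the device id and solver path are placeholders):

import caffe

# Select the GPU explicitly instead of relying only on solver_mode in the prototxt.
caffe.set_device(0)          # placeholder device id
caffe.set_mode_gpu()

solver = caffe.SGDSolver('solver.prototxt')   # placeholder path
solver.solve()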

@alfredox10
Author

OK, this time I let the training run for a while at iteration 0 to see whether it had really failed or not. It sat on iteration 0 for over an hour, but it does eventually make progress and produce output. However, it now seems even more evident that for some reason the code is not running on the GPU. This is the output so far:

I1009 20:43:18.229598 25324 net.cpp:274] Network initialization done.
I1009 20:43:18.229603 25324 net.cpp:275] Memory required for data: 211200264
I1009 20:43:18.229746 25324 solver.cpp:45] Solver scaffolding done.
I1009 20:43:18.229771 25324 caffe.cpp:179] Starting Optimization
I1009 20:43:18.229779 25324 solver.cpp:269] Solving LogisticRegressionNet
I1009 20:43:18.229782 25324 solver.cpp:270] Learning Rate Policy: step
I1009 20:43:18.229792 25324 solver.cpp:314] Iteration 0, Testing net (#0)
I1009 22:10:26.562768 25324 solver.cpp:363] Test net output #0: accuracy1 = 0.496854
I1009 22:10:26.562873 25324 solver.cpp:363] Test net output #1: loss1 = 0.692814 (* 1 = 0.692814 loss)
I1009 22:11:01.647912 25324 solver.cpp:217] Iteration 0, loss = 0.692216
I1009 22:11:01.648032 25324 solver.cpp:234] Train net output #0: loss1 = 0.692216 (* 1 = 0.692216 loss)
I1009 22:11:01.648051 25324 solver.cpp:511] Iteration 0 (0/s), lr = 0.001
I1009 23:12:13.552716 25324 solver.cpp:217] Iteration 100, loss = 0.68043
I1009 23:12:13.552844 25324 solver.cpp:234] Train net output #0: loss1 = 0.68043 (* 1 = 0.68043 loss)
I1009 23:12:13.552857 25324 solver.cpp:511] Iteration 100 (0.0272338/s), lr = 0.001
I1010 00:13:41.020321 25324 solver.cpp:217] Iteration 200, loss = 0.642775
I1010 00:13:41.020479 25324 solver.cpp:234] Train net output #0: loss1 = 0.642775 (* 1 = 0.642775 loss)
I1010 00:13:41.020496 25324 solver.cpp:511] Iteration 200 (0.0271189/s), lr = 0.001
I1010 01:18:33.475559 25324 solver.cpp:217] Iteration 300, loss = 0.623129
I1010 01:18:33.475679 25324 solver.cpp:234] Train net output #0: loss1 = 0.623129 (* 1 = 0.623129 loss)
I1010 01:18:33.475693 25324 solver.cpp:511] Iteration 300 (0.0256907/s), lr = 0.001
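
As a quick sanity check on the reported rate, the gap between the iteration-100 and iteration-200 timestamps above works out to the same ~0.027 iterations/s that the solver prints:

from datetime import datetime

# Timestamps copied from the log lines for iterations 100 and 200.
t100 = datetime.strptime("2015-10-09 23:12:13", "%Y-%m-%d %H:%M:%S")
t200 = datetime.strptime("2015-10-10 00:13:41", "%Y-%m-%d %H:%M:%S")

elapsed = (t200 - t100).total_seconds()   # ~3688 s for 100 iterations
print(100 / elapsed)                      # ~0.027 iterations/s, matching solver.cpp:511

That's roughly an hour per 100 iterations, which is why it looked stuck at iteration 0.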

@lukeyeager
Member

@thatguymike you said this was a tricky multi-GPU race condition? Any progress on that?

Also, I've got two guys complaining about system reboots:

"it will restart my workstation" (NVIDIA/DIGITS#347 (comment))

"the ubuntu system shutted down" (NVIDIA/DIGITS#347 (comment))

That thread seems possibly related to this one. Is there any way this could cause a reboot?

@lukeyeager
Member

@alfredox10 can you try using v0.14 to see if that solves your issue?
