Training stops at iteration 0 with no error message or probable cause? #41
Comments
Looks like this could be the same problem that @antran89 reported in NVIDIA/DIGITS#347?
I looked through it, but it doesn't seem to be the same. For starters, I'm not using DIGITS; this is all running from py scripts. Also, I don't get the msg about
On another note, I noticed that caffe is running while stuck on iteration 0: it's using 2.4GB of RAM, 12% CPU, and 1-6% GPU. So I'm not sure it's using the GPU correctly, because otherwise it would go a lot faster than this. However, I did see in the caffe log that it's using the GPU.
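One way to confirm the 1-6% GPU figure over time is to poll `nvidia-smi` while training runs. The following is only a sketch: it assumes `nvidia-smi` is on the PATH and reports plain integers for each field, and the helper names (`parse_gpu_stats`, `read_gpu_stats`, `watch`) are made up for illustration.

```python
import subprocess
import time

# Standard nvidia-smi query interface; nounits makes the values plain numbers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(text):
    """Parse 'index, utilization, memory' CSV rows into a list of dicts.

    Assumes every field is an integer; a card that reports a non-numeric
    placeholder for a field would raise ValueError here.
    """
    stats = []
    for line in text.strip().splitlines():
        index, util, mem = [field.strip() for field in line.split(",")]
        stats.append(
            {"index": int(index), "util_pct": int(util), "mem_mib": int(mem)}
        )
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed per-GPU stats."""
    out = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_gpu_stats(out)


def watch(samples=5, interval=2.0):
    """Print a few samples so utilization can be compared across iterations."""
    for _ in range(samples):
        print(read_gpu_stats())
        time.sleep(interval)
```

Calling `watch()` in a second terminal while the training script is running shows whether utilization ever rises above the idle level on the device Caffe claims to be using.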
OK, this time I let the training run for a while on iteration 0 to see whether it had really failed. It's been running for over an hour and is still on iteration 0, but at least it looks like it is running and producing some output. However, it now seems more evident that the code, for some reason, is not running on the GPU. This is the output so far:
I1009 20:43:18.229598 25324 net.cpp:274] Network initialization done.
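To tell "stuck" from "slow", it can help to pull the last reported iteration and its timestamp out of the log. The sketch below parses glog-style lines like the one quoted above; it assumes the solver writes messages containing `Iteration <n>`, and the function names are hypothetical.

```python
import re

# glog prefix: level letter, MMDD, time, thread id, source file:line, message.
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+(?P<thread>\d+) "
    r"(?P<source>[^:\s]+):(?P<line>\d+)\] (?P<message>.*)$"
)

# Assumption: solver progress messages contain "Iteration <n>".
ITERATION = re.compile(r"Iteration (\d+)")


def parse_glog_line(line):
    """Return the fields of one glog line as a dict, or None if it does not match."""
    match = GLOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def last_iteration(lines):
    """Return (iteration, time string) for the last progress line, or None."""
    last = None
    for line in lines:
        record = parse_glog_line(line)
        if record is None:
            continue
        hit = ITERATION.search(record["message"])
        if hit:
            last = (int(hit.group(1)), record["time"])
    return last
```

Running `last_iteration(open(path))` on the training log a few minutes apart shows whether the iteration counter or its timestamp is moving at all.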
@thatguymike you said this was a tricky multi-GPU race condition? Any progress on that? Also, I've got two guys complaining about system reboots:
That thread seems possibly related to this one. Is there any way this could cause a reboot?
@alfredox10 can you try using v0.14 to see if that solves your issue?
Here is the output of the training command: http://pastebin.com/ATPCBjQd
Here is the net I'm using to train it: http://pastebin.com/H1gLW8Lv
I recently upgraded from an older version of caffe and had to change some of the parameter names; here is the previous net: http://pastebin.com/43Utkkpe
You will notice that transform_param had to be removed from the HDF5 data layer, because apparently that layer doesn't support it in 0.13. Can you guys check to see if you find any issues? Caffe passes all of the make runtest tests, and it performs image detections fine with previously trained caffemodel files. I did notice that my video card remains at 0% usage even after the training kicks off. I am not sure if it has to do with the video card not being set, but I know Caffe can use it because the tests ran on the GPU, and the make config file is set to use the GPU.
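Since the HDF5 data layer takes no transform_param, any scaling or mean subtraction has to be baked into the HDF5 file itself. The sketch below shows one way to do that. The contents of the removed transform_param are not shown in this thread, so the `mean` and `scale` values are placeholders, the function names are hypothetical, and the dataset names `data` and `label` assume those are the layer's top blob names.

```python
import numpy as np


def preprocess(images, mean=0.0, scale=1.0):
    """Apply (x - mean) * scale, the order Caffe's data transformer uses.

    `mean` may be a scalar or an array that broadcasts against `images`
    (for example, per-channel means with shape (C, 1, 1)).
    """
    images = np.asarray(images, dtype=np.float32)
    mean = np.asarray(mean, dtype=np.float32)
    return (images - mean) * np.float32(scale)


def write_hdf5(path, images, labels, mean=0.0, scale=1.0):
    """Write preprocessed images and labels to an HDF5 file.

    h5py is imported here so that preprocess() stays usable without it.
    The HDF5 layer's `source` is a text file listing paths such as `path`.
    """
    import h5py

    with h5py.File(path, "w") as handle:
        handle.create_dataset("data", data=preprocess(images, mean, scale))
        handle.create_dataset("label", data=np.asarray(labels, dtype=np.float32))
```

For example, `preprocess(batch, mean=127.5, scale=1 / 127.5)` maps 8-bit pixel values into the range [-1, 1] before they are written.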
Software:
Caffe v0.13
CUDA 7.5
NVIDIA 352 driver
cuDNN v3