Training imagenet: loss does not decrease #401
Similar discussions were conducted in #59.
Thanks for the pointer! The last comment by huangjunshi suggests that his problem has not been fully resolved, but that he found a setup that worked for him. I wonder why training works well on the boost-eigen branch but gets stuck on the dev branch. Possible reasons mentioned are:
The training gets stuck right at the beginning, so maybe it is enough to just get past this initial threshold? Since dropout also slows down training, maybe it would help to wait for the first 10k iterations before enabling dropout? Is this possible with Caffe?
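For reference, Caffe has no built-in per-iteration schedule for dropout, so "delayed dropout" would have to be approximated by hand. A sketch only (not something tested in this thread): train the first ~10k iterations with a copy of the net prototxt whose dropout layers are effectively disabled, snapshot, and then continue from that snapshot with the usual prototxt. In the 2014-era layers syntax the relevant piece looks like this:

```
# Sketch only: a dropout layer in the 2014-era prototxt syntax. Training first
# with dropout_ratio: 0 (dropout effectively off), snapshotting at ~10k
# iterations, and then resuming with the standard dropout_ratio: 0.5 net would
# approximate delayed dropout.
layers {
  name: "drop6"
  type: DROPOUT
  bottom: "fc6"
  top: "fc6"
  dropout_param {
    dropout_ratio: 0    # the reference imagenet model uses 0.5 here
  }
}
```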
That driver is old! Upgrade the driver and everything should work. We have trained ImageNet without trouble with the dev branch.
Evan Shelhamer
Concerning my current setup: I used random shuffling for the generation of the leveldb files (parameter set to 1). Moreover, caffe-dev passed runtest without error several times (401 tests passed), and the MNIST demo also works. Additionally, I checked what would happen if I resumed ImageNet training with a network "pre-trained" for the first 10k iterations (train 10k iterations on the boost-eigen branch, convert the binaryproto, and continue training on the dev branch). I stopped training at 80k iterations, by which point the accuracy had risen from 27 to 41. So judging from the tests I performed, with my current setup caffe-dev is not completely broken; only the very early stage of ImageNet training seems to be affected. I will look into upgrading the display drivers and report the outcome. Do you have any intuition why a newer display driver would change the behavior of Caffe during the first couple of thousand iterations of ImageNet training?
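For context, the snapshotting that makes this branch-switching experiment possible is configured in the solver prototxt; the values below match the imagenet example of that era and are shown only as an illustration, not quoted from this thread:

```
# Solver snapshot settings (imagenet example values). Every `snapshot`
# iterations Caffe writes the network weights as a binaryproto plus a
# .solverstate file with the given prefix; the 10k-iteration snapshot is what
# was carried over from the boost-eigen branch to the dev branch here.
snapshot: 10000
snapshot_prefix: "caffe_imagenet_train"
```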
Problems in the initial iterations are likely due to problems in the weight initialization. @jeffdonahue in #297 and myself in #335 discovered some problems with the random number generators in Caffe related to Boost. @jeffdonahue fixed them in #336.
I have run into the same problem. Changing the bias initialization from 1.0 to some smaller value, say 0.7 or 0.5, in the bias_filler of convolutional layers 2, 4, and 5 has consistently solved it for me. A bias equal to 1.0 seems to be too big to get training started when the weights are initialized with std = 0.01.
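For concreteness, here is a sketch of that change in the 2014-era imagenet_train.prototxt layer syntax (only conv2 shown, with the learning-rate and weight-decay fields omitted; conv4 and conv5 would be edited the same way):

```
layers {
  name: "conv2"
  type: CONVOLUTION
  bottom: "norm1"
  top: "conv2"
  convolution_param {
    num_output: 256
    pad: 2
    kernel_size: 5
    group: 2
    weight_filler {
      type: "gaussian"
      std: 0.01       # weights stay at the reference initialization
    }
    bias_filler {
      type: "constant"
      value: 0.5      # was 1.0 in the reference model; 0.7 was also reported to work
    }
  }
}
```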
@to3i @gpapan ImageNet training works in the latest master and dev without altering the model prototxt when run on a GTX 780 with CUDA 6.
Nice, I think there was some problem with the random weight initialization.
Sergio
@to3i did you resolve this? Was the issue the old driver?
Sorry for not getting back to you earlier. I am currently running experiments with different libraries, and since there is no immediate need to update, I have postponed that so far. I have tried the workaround by @gpapan, changing the bias initialization of conv layers 2, 4, and 5 to 0.7 or 0.5, without success yet. But maybe I was a bit too impatient and should have given training a couple of thousand more iterations.
So I got my setup to work. I thought I had shuffled the input data, but I hadn't: I created the leveldb file a while back and just assumed it was shuffled without actually checking.
@shelhamer Hi Evan, I'm Cuong, working for J. Weinman at Grinnell. Is the loss value calculated using only the data from each batch or using the whole training set?
Hey Cuong,
The training iteration loss is over the minibatches, not the whole training set. The loss over the whole validation set is computed once in a while, at the testing intervals. If your loss is oscillating too wildly or diverging, try smaller weight initializations. Good luck!
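For reference, the reporting behavior described above maps onto a few fields of the imagenet solver prototxt (values as in the example configuration of the time, shown here only as an illustration):

```
# What the log lines correspond to:
#  - every `display` iterations the loss of the current training minibatch is printed;
#  - every `test_interval` iterations the test net is run over `test_iter`
#    batches of the validation set and the averaged score is reported.
display: 20
test_interval: 1000
test_iter: 1000
```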
Thanks @nguyentu1602 for clarifying; that was my guess from reading the logs, but reading the code in solver.cpp made me think it was the loss over the whole data. My sense is that to get a good feel for the correct learning values (learning rate, decay, momentum) for a particular problem, one needs to see the total training loss (even if it is just the stochastic sum over the batch losses). @shelhamer, is there a straightforward way to report the loss for the whole training set, perhaps in analogy to SGDSolver::Test()? The hack I could think of is making the validation set be the training set so as to use the built-in code, but I'm wondering if one can have one's cake and eat it too.
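A sketch of the hack mentioned above (file names and batch size below are assumptions for illustration, not part of the official example): point the solver's test net at a copy of the validation prototxt whose data layer reads the training leveldb, and make test_iter large enough to cover the whole training set.

```
# Hypothetical solver snippet for reporting loss over the full training set by
# treating it as a "test" net. This is slow: with a 50-image test batch,
# covering ~1.28M training images needs a test_iter of roughly 25600.
train_net: "imagenet_train.prototxt"
test_net: "imagenet_val_on_train.prototxt"  # copy of the val net reading the training leveldb
test_iter: 25600
test_interval: 10000
```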
I reproduced the problem with ImageNet training on a different system with Ubuntu 14.04, a GTX 770, driver version 331.38, and CUDA 5.5. Thus I can confirm tdonham's findings that this problem is not primarily related to a particular driver version. I was able to solve the training issues reliably by changing the initial bias fillers to the same value (0.1) as provided in the alexnet config.
I am afraid I have to correct my earlier comment. After training for about 8k iterations, things suddenly fall apart (see the output below, at iteration 7900). I have not witnessed this before, and it would be great if someone could give me a pointer as to why the optimization can break down like this. My only clue is that maybe something is wrong with the training data, but things have been working fine with the older Caffe version.
I have never seen something like this before in any of my trainings. So either something is wrong with your data, your GPU starts making too many errors, or the random number generators are off. Can you give more details about your data? The way you pre-processed it, and whether you shuffled it?
Sergio
I use the training dataset from ILSVRC 2012, resize/warp everything down to 256x256, and then run the create_imagenet shell script with shuffling enabled. To investigate the possibility of a faulty dataset, I repeated the training with the same setup, and the error showed up again after 45k iterations.
I also recreated the leveldb by applying create_imagenet.sh (shuffling enabled again); using the same setup, training broke down again after 26k iterations.
@sguada With regard to the three sources of error you mentioned, I would now assume there is an issue with the GPU or with the random number generators. What exactly do you mean by the GPU starting to make too many errors? Is this a driver issue or a hardware issue? It was mentioned earlier in this discussion that there have been some changes to the random number generators. I would like to revert these changes in my local copy of caffe-dev and try training again to see if this helps. Do you know which files would have to be changed? I am still new to the git repository world, so I wonder if there is a way to search through the changes of the last months to accomplish this?
@to3i at this point my guess would be that there is something going on with your GPU; it could be the drivers or the card itself. Maybe when it gets hot it starts behaving erratically. The loss is around 6.9 when the network is doing random guessing (roughly ln(1000), the softmax loss for a uniform guess over the 1,000 ImageNet classes), which probably means that all the weights got corrupted or just became zero. You can try to use an older version of Caffe; look at the releases at https://github.com/BVLC/caffe/releases, maybe v0.9 argentine.
@sguada I will check whether switching between different NVIDIA drivers resolves the problem. Thanks for your help!
Closing since ImageNet training has been replicated elsewhere with Caffe. If you keep having problems, please follow up with a comment in this thread.
I'm running this; the loss was stuck at 6.9 for 8k iterations before going down. But now it keeps decreasing, so that's good.
I haven't tested ImageNet myself yet. I re-implemented a network from a paper, and its loss does not go down, even though it works when I use another deep learning framework (so my understanding of the paper and the dataset preparation is correct). If I get correct results on the MNIST dataset, can I conclude that my CUDA and Caffe installation is fine?
Hi @to3i |
Hi all,
Hi, first of all thank you for sharing Caffe, it really looks great! I ran into an issue training the ImageNet model: the training loss is not decreasing even after 40k iterations. I am using the ImageNet configuration provided in the examples folder. The model is trained with Caffe from the dev branch (since master has not been merged with boost-eigen yet).
The train loss starts out around 7.1, decreases to close to 6.9 in the first 150 iterations, and then remains around that value for 40k iterations and likely beyond. I ran the same configuration on the boost-eigen branch and had no trouble with the training: the error started decreasing after about 1k iterations and everything seemed to be working fine.
Am I missing something? Can you reproduce this behavior? I am running Caffe on Ubuntu 12.04 with CUDA 5.5 (GTX 770).