How to train imagenet with reduced memory and batch size? #430
@research2010 Did you change the batch_size in validation.prototxt as well? That would also help you reduce the memory usage. batch_size=64 for training should be okay; base_lr is linked to the batch_size, but it allows some variability. Originally base_lr = 0.01 with batch_size=128; we have also used it with batch_size=256 and it still works. In theory, when you reduce the batch_size by a factor of X you should adjust base_lr accordingly. Pay attention to the loss: if it doesn't go below 6.9 (which is basically random guessing) after 10k-20k iterations, then your training is not learning anything.
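The sentence above describing the exact scaling rule is cut off in the original comment. One common heuristic, shown here as an assumption rather than as what sguada necessarily meant, is linear scaling: keep the learning rate proportional to the batch size so the average per-epoch update magnitude stays roughly constant.

```python
def scale_base_lr(base_lr, base_batch, new_batch):
    """Linear-scaling heuristic (an assumption; the thread's exact rule is truncated).

    Scales the learning rate in proportion to the batch size, e.g. the
    reference base_lr = 0.01 at batch_size = 128 from the comment above.
    """
    return base_lr * new_batch / base_batch

# Halving the batch from 128 to 64 under this heuristic halves base_lr.
print(scale_base_lr(0.01, 128, 64))   # 0.005
print(scale_base_lr(0.01, 128, 256))  # 0.02
```

As the comment notes, there is some slack in practice: the same base_lr = 0.01 reportedly worked at both batch_size 128 and 256.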
@sguada , thank you very much for your kind comments and suggestions. I used "git clone https://github.com/BVLC/caffe.git" to check out the latest version. Recently I have been using the GPU card to run other experiments, so I couldn't report the results in time. I'll give feedback as soon as the experiments on the ImageNet data set restart.
@sguada @kloudkl , thank you very much for replying! I have been running the imagenet example again. And some results are as follows:
But when I use caffenet, something goes wrong. Training with alexnet works fine; I'm not sure what the problem with training caffenet is.
Try setting the bias to 0.1 in all the layers.

Sergio (2014-07-11 17:21 GMT-07:00)
@sguada , OK, thank you! I will try that after the training of the alexnet model is done.
@sguada , I'm sorry, I made a typo in your name ("sergeyk") and have corrected it.
@sguada , oh, I just forgot that we could resume the training procedure. That's very convenient!
It looks good to me. Given your reduced batch you will need to train for more iterations.

Sergio
@sguada , thanks for your kind comments. I've been running the training of caffenet for about one week, and the results below are similar to, but a little different from, those you presented in #33. With the reduced batch it indeed needs more iterations, as you said. For this run I set max_iter to 900000 for 90 epochs. It does need more parameter adjustment; "To train these models is more of an art than a science," as Matthew Zeiler put it in http://www.wired.com/2014/07/clarifai/. Thank you very much for sharing your valuable experience and parameter-tuning results.
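Since Caffe's solver counts iterations rather than epochs, a quick way to sanity-check a max_iter value is to convert a target epoch count at a given batch size. The training-set size below is the standard ILSVRC-2012 figure; treat the exact numbers as illustrative, not as the poster's actual settings.

```python
import math

def iters_for_epochs(num_images, batch_size, epochs):
    """Convert a target epoch count into a Caffe-style max_iter.

    One epoch is one full pass over the training set, so the iteration
    count is ceil(num_images / batch_size) * epochs.
    """
    iters_per_epoch = math.ceil(num_images / batch_size)
    return iters_per_epoch * epochs

# ILSVRC-2012 training set: 1,281,167 images.
print(iters_for_epochs(1281167, 64, 1))   # 20019 iterations per epoch at batch 64
print(iters_for_epochs(1281167, 64, 90))  # 1801710 iterations for 90 epochs
```

Note that the epochs actually covered by a fixed max_iter shrink as the batch size shrinks, which is why a reduced batch needs more iterations for the same number of passes over the data.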
Finally, the training behaves similarly to that in #33, and the test accuracy is ~56%: ~1% lower than in #33 and ~3.9% lower than in Alex's 2012 paper. The configuration is:
Good to hear you got it working with the proper tuning!
@shelhamer , thanks for your comments!
@research2010 Hello, I see the accuracy curve you plotted has a "second increase phase" at iteration 200000.
Sorry to chime in so late on a closed issue, but I'm trying to understand the same thing that WoooHaa commented about. What causes the "bottlenecks", and how are they overcome? It seems dangerously easy to wait that long and think training has converged to an optimal value when it hasn't yet.
thats the "step" a change in the learning rate. So when there is a failure it changes the weights with a stronger effect. When u would start with that higher learning rate from the beginning, your program would start to bounce and would never get better so you have to start with a lower learning rate and increase it when your system reaches saturation. In the plots you can see that he set his step value to 200 000 because you see these changes at 200 000, 400 000 and 600 000. |
Thank you for the response! Just to clarify: I usually start with a higher learning rate and decrease it over time. But are you saying to actually increase the learning rate later on during training?
#430 (comment)
Ah, gotcha, it all makes sense now. Thank you!
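For reference, the behavior discussed above corresponds to Caffe's "step" learning-rate policy, where the rate is multiplied by gamma every stepsize iterations. A minimal solver.prototxt sketch is below; the specific values are assumptions inferred from this thread (stepsize matching the jumps at 200k/400k/600k), not the poster's actual file.

```protobuf
# solver.prototxt sketch -- values are assumptions from the discussion above
base_lr: 0.01        # starting learning rate
lr_policy: "step"    # rescale the rate every `stepsize` iterations
gamma: 0.1           # multiplier applied to the rate at each step
stepsize: 200000     # matches the visible changes at 200k, 400k, 600k
max_iter: 900000     # total training iterations mentioned in the thread
momentum: 0.9
weight_decay: 0.0005
```

With gamma below 1 the effective rate drops at each step, which produces the sudden accuracy jumps visible in the plots.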
Hi, thank you very much for this valuable library!
The hardware and software environments are as follows:
When using the default train configuration file for the imagenet data set, train_net.bin fails with "out of memory". So I changed the batch_size to 64 (128 also did not fit). Then it works!
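The batch_size lives in the data layer of the train prototxt. A sketch of the relevant fragment is below; the LMDB source path and layer name are placeholders, not the poster's actual configuration.

```protobuf
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  data_param {
    source: "examples/imagenet/ilsvrc12_train_lmdb"  # placeholder path
    batch_size: 64   # reduced from the default to fit in GPU memory
    backend: LMDB
  }
}
```

The validation net has its own data layer with its own batch_size, which is why sguada suggests reducing it there as well.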
The following is the output of train_net.bin:
And the results are as follows after 2000 iterations:
It seems the testing scores do not change. As indicated in #218, @sguada said that the batch_size and the learning rate are linked. I have set the batch_size to 64, so maybe the learning rate should also be modified. Could anyone give advice on this, please?