Learning stops early with reduced batch size #1557

Closed
guzh870423 opened this issue Dec 11, 2014 · 5 comments

Comments

@guzh870423

Hello,

I am using a Tesla C2050, which has compute capability 2.0. It reports an error if I train ImageNet with the default setting batch_size = 256, as in #629.

So I reduced batch_size to 64 and correspondingly changed base_lr from 0.01 to 0.01414, stepsize from 100000 to 400000, and max_iter from 450000 to 1800000. I also changed the bias value from 1 to 0.1 for some layers in models/bvlc_reference_caffenet/train_val.prototxt, as suggested in #430; otherwise it does not learn anything.
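
For reference, the stepsize and max_iter changes amount to keeping the total number of training images seen constant; a minimal sketch of that arithmetic (the base_lr value above is the poster's own heuristic, not derived here):

```python
# Scale iteration-based solver parameters so the total number of training
# images seen stays constant after shrinking the batch size from 256 to 64.
old_batch, new_batch = 256, 64
scale = old_batch // new_batch            # 4

stepsize, max_iter = 100000, 450000       # original solver values
print("stepsize:", stepsize * scale)      # 400000
print("max_iter:", max_iter * scale)      # 1800000
```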

[Figure: plot of training loss vs. iteration]

But my result is not as good as #430: the loss stops decreasing and just oscillates after 20,000 iterations. I tried the AlexNet model with the same parameter changes, except base_lr = 0.02. The result was similar, if not worse.

Any idea what may cause this? Thanks.

@danielorf

This page may be of help to you: #430
Also, how did you gather the training-loss/iteration data shown in your plot?

@guzh870423
Author

@danielorf
Thank you for answering my post. Actually, #430 is exactly what I am asking about; my result is different from what is reported there.
Now I am using another GPU, a K20, and I think it is doing fine now.

Honestly, I had the same question; I was wondering how they made those plots.
I just wrote a script to extract iteration vs. loss from the output file. I suspect there is a built-in tool that does something similar. Maybe you can find the answer and tell me.
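
For reference, a minimal sketch of such a log-scraping script (not the poster's actual script), assuming Caffe's default glog-style lines of the form `Iteration N, loss = X`; the exact log format may vary between Caffe versions:

```python
import re
import sys

# Matches glog-style Caffe output lines such as:
#   I1211 10:00:00.000000 1234 solver.cpp:189] Iteration 100, loss = 6.90
LOSS_RE = re.compile(r"Iteration (\d+), loss = ([0-9.eE+-]+)")

def parse_loss(log_path):
    """Return (iteration, loss) pairs scraped from a Caffe training log."""
    points = []
    with open(log_path) as f:
        for line in f:
            m = LOSS_RE.search(line)
            if m:
                points.append((int(m.group(1)), float(m.group(2))))
    return points

if __name__ == "__main__":
    # Usage: python parse_loss.py /path/to/caffe.log > loss.txt
    for it, loss in parse_loss(sys.argv[1]):
        print(it, loss)
```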

@sguada
Contributor

sguada commented Dec 21, 2014

The tool is in tools/extra/parse_log.sh

Sergio
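
parse_log.sh lives under tools/extra/ in the Caffe source tree and splits a training log into train/test summaries. As a hedged companion sketch, the (iteration, loss) pairs produced by the script above (or columns extracted from parse_log.sh's output, whose exact layout may vary by Caffe version) can be plotted with matplotlib:

```python
import sys
import matplotlib.pyplot as plt

def load_pairs(path):
    """Load whitespace-separated 'iteration loss' rows, skipping comments."""
    iters, losses = [], []
    with open(path) as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.split()
            iters.append(float(fields[0]))
            losses.append(float(fields[1]))
    return iters, losses

if __name__ == "__main__":
    # Usage: python plot_loss.py loss.txt
    iters, losses = load_pairs(sys.argv[1])
    plt.plot(iters, losses)
    plt.xlabel("Iteration")
    plt.ylabel("Training loss")
    plt.savefig("training_loss.png")
```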

@danielorf

Sorry, I clearly didn't read your question properly, my mistake.

@danielorf

Where might one find this "caffe.log" file?

Edit: Found it. It's located in /tmp/ with the name "caffe.[pc name].[username].log.INFO.date".
