Why doesn't the training loss decrease at the very beginning? #2051
This is my train_val.prototxt (the layer definitions were collapsed in the original post; only a series of repeated `layers {` openings survives):
And this is my solver.prototxt (its contents were likewise collapsed in the original post):
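The actual solver settings did not survive the copy. For orientation only, a generic Caffe solver.prototxt for this kind of run looks roughly like the sketch below; every value is illustrative and not taken from the original post, except that base_lr: 0.01 and display: 20 are consistent with the log shown later in this issue.

```
# Generic example only -- NOT the poster's actual file.
net: "train_val.prototxt"
base_lr: 0.01            # matches the lr = 0.01 seen in the log below
lr_policy: "step"
gamma: 0.1
stepsize: 100000
momentum: 0.9
weight_decay: 0.0005
display: 20              # matches the 20-iteration display interval in the log
max_iter: 450000
snapshot: 10000
snapshot_prefix: "snapshots/vgg16"
solver_mode: GPU
```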
Let it run 10-20k iterations and see.
The first few iterations are seeing different batches. It is very possible that the learning on batch 1 won't impact the loss on batch 2 that soon. One suggestion is to wait longer and plot the training loss curve to verify the effectiveness of your network (for example with a script like the one sketched below).
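One minimal way to do that plot (my own sketch, not part of the original thread; the log file name is a placeholder) is to pull the "Iteration N, loss = X" lines out of the Caffe training log, which has the format shown later in this issue:

```python
# Sketch: parse "Iteration N, loss = X" lines from a Caffe training log
# and plot the training loss curve. The log path is a placeholder.
import re
import matplotlib.pyplot as plt

pattern = re.compile(r"Iteration (\d+), loss = ([\d.]+)")
iters, losses = [], []

with open("caffe_train.log") as f:   # hypothetical log file name
    for line in f:
        m = pattern.search(line)
        if m:
            iters.append(int(m.group(1)))
            losses.append(float(m.group(2)))

plt.plot(iters, losses)
plt.xlabel("iteration")
plt.ylabel("training loss")
plt.title("Training loss curve")
plt.savefig("loss_curve.png")
```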
Hi @ZHUANGBOHAN, nice to find you here. I am Hongyang Li.
Where are my comments???
@ghost I also have the same problem. Could you tell me how to solve it? Thank you!
It seems VGG-16 needs a larger batch size.
@mrgloom Is this required? My EC2 GPU instance can't seem to handle a batch size higher than 16.
Not sure, but it helped me. It also seems the learning rate (and maybe other parameters) depends on the batch size.
If you run into out-of-memory problems you can use batch accumulation; it can be set via the iter_size field in solver.prototxt (see the sketch below).
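For example (illustrative numbers only, not from this thread): with batch_size: 16 in train_val.prototxt, setting iter_size: 4 in solver.prototxt accumulates gradients over 4 forward/backward passes, giving an effective batch size of 64 without the extra memory cost.

```
# solver.prototxt fragment (illustrative values):
# gradients are accumulated over iter_size forward/backward passes,
# so effective batch = iter_size * batch_size (here 4 * 16 = 64).
iter_size: 4
```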
@mrgloom This was my latest VGG-16 training attempt. The training log: https://gist.github.com/ProGamerGov/21f7bf21105bbea0010bab69a2761386

Category 1 has 3177 training images and 353 validation images. My "create_imagenet.sh" and "make_imagenet_mean.sh" scripts can be found here: https://gist.github.com/ProGamerGov/16eddea12aee2b49e1fce9eb506a9648

It seems as though my loss values go down a bit before increasing by a significant amount (going from 4-8 to 30-56), and then they decrease again. Would the batch size be a reason for my lack of success with training? Do you have any other suggestions about how I can be more successful in training? I have been using these settings while randomly tweaking the learning rate for months, without much success.
I tried several different network structures, and also tried different learning rates, bias initializations, and weight decays, but none of these trials worked at all! I inspected my network and am sure it's correct! Here is part of my training log:
I0306 19:06:09.863409 38597 solver.cpp:420] Iteration 3840, lr = 0.01
I0306 19:07:02.010265 38597 solver.cpp:196] Iteration 3860, loss = 8.88737
I0306 19:07:02.010437 38597 solver.cpp:211] Train net output #0: accuracy = 0
I0306 19:07:02.010562 38597 solver.cpp:211] Train net output #2: loss = 8.88737 (* 1 = 8.88737 loss)
I0306 19:07:02.010738 38597 solver.cpp:420] Iteration 3860, lr = 0.01
I0306 19:08:09.907526 38597 solver.cpp:196] Iteration 3880, loss = 8.85471
I0306 19:08:09.907752 38597 solver.cpp:211] Train net output #0: accuracy = 0.0078125
I0306 19:08:09.907881 38597 solver.cpp:211] Train net output #2: loss = 8.85471 (* 1 = 8.85471 loss)
I0306 19:08:09.908064 38597 solver.cpp:420] Iteration 3880, lr = 0.01
I0306 19:09:30.316922 38597 solver.cpp:196] Iteration 3900, loss = 8.92846
I0306 19:09:30.317157 38597 solver.cpp:211] Train net output #0: accuracy = 0.0078125
I0306 19:09:30.317291 38597 solver.cpp:211] Train net output #2: loss = 8.92846 (* 1 = 8.92846 loss)
I0306 19:09:30.317464 38597 solver.cpp:420] Iteration 3900, lr = 0.01
I0306 19:10:32.443650 38597 solver.cpp:196] Iteration 3920, loss = 8.88563
I0306 19:10:32.443862 38597 solver.cpp:211] Train net output #0: accuracy = 0
I0306 19:10:32.443994 38597 solver.cpp:211] Train net output #2: loss = 8.88563 (* 1 = 8.88563 loss)
I0306 19:10:32.444160 38597 solver.cpp:420] Iteration 3920, lr = 0.01
I0306 19:11:34.688952 38597 solver.cpp:196] Iteration 3940, loss = 8.87257
I0306 19:11:34.689177 38597 solver.cpp:211] Train net output #0: accuracy = 0
I0306 19:11:34.689304 38597 solver.cpp:211] Train net output #2: loss = 8.87257 (* 1 = 8.87257 loss)
I0306 19:11:34.689477 38597 solver.cpp:420] Iteration 3940, lr = 0.01
I0306 19:12:34.818542 38597 solver.cpp:196] Iteration 3960, loss = 8.88778
I0306 19:12:34.818768 38597 solver.cpp:211] Train net output #0: accuracy = 0
I0306 19:12:34.818907 38597 solver.cpp:211] Train net output #2: loss = 8.88778 (* 1 = 8.88778 loss)
I0306 19:12:34.819080 38597 solver.cpp:420] Iteration 3960, lr = 0.01
I0306 19:13:33.571383 38597 solver.cpp:196] Iteration 3980, loss = 8.87781
I0306 19:13:35.596309 38597 solver.cpp:211] Train net output #0: accuracy = 0.0078125
I0306 19:13:35.596472 38597 solver.cpp:211] Train net output #2: loss = 8.87781 (* 1 = 8.87781 loss)
I0306 19:13:35.597108 38597 solver.cpp:420] Iteration 3980, lr = 0.01
I0306 19:14:30.326297 38597 solver.cpp:196] Iteration 4000, loss = 8.83755
I0306 19:14:30.326486 38597 solver.cpp:211] Train net output #0: accuracy = 0
I0306 19:14:30.326617 38597 solver.cpp:211] Train net output #2: loss = 8.83755 (* 1 = 8.83755 loss)
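One thing worth checking against a flat log like the one above (my own note, not from the thread): a softmax loss that sits at a constant value is often stuck at the chance level -ln(1/num_classes), meaning the network is still predicting an essentially uniform distribution over classes. With the real number of classes plugged in (the value below is only a placeholder), the chance level can be compared against the flat ~8.88 loss above:

```python
import math

num_classes = 7000                        # placeholder: use the real label count
chance_loss = -math.log(1.0 / num_classes)
print(chance_loss)                        # ~8.85 for 7000 classes; compare with the flat ~8.88 above
```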