
Training imagenet: loss does not decrease #401

Closed
to3i opened this issue May 8, 2014 · 26 comments

@to3i
Contributor

to3i commented May 8, 2014

Hi, first of all thank you for sharing caffe, it really looks great! I ran into an issue training the imagenet model: the training error is not decreasing even after 40k iterations. I am using the imagenet configuration provided in the examples folder. The model is trained with caffe from the dev branch (since master has not been merged with boost-eigen yet).

The train loss starts out around 7.1, decreases to close to 6.9 in the first 150 iterations, and then remains above that value for 40k iterations and likely beyond. I ran the same configuration on the boost-eigen branch and had no trouble with training: the error started decreasing after about 1k iterations and everything seems to be working fine.

Am I missing something? Can you reproduce this behavior? I am running caffe on Ubuntu 12.04 with Cuda 5.5 (GTX 770).

@niuzhiheng
Contributor

Similar discussions were conducted in #59.

@to3i
Contributor Author

to3i commented May 9, 2014

Thanks for the pointer! The last comment by huangjunshi suggests that his problem has not been resolved yet, but he found a setup that worked for him. I wonder why training works well on the boost-eigen branch but gets stuck on the dev branch. Possible reasons mentioned are

  1. random initialization (any modifications to random number generation from the boost-eigen branch to the dev branch?!)
  2. nvidia drivers (I am still using NVIDIA-Linux-x86_64-319.82.run)
  3. convert_imageset.cpp (I used the boost-eigen converter to generate the imagenet leveldb)

The training gets stuck right at the beginning, so maybe it is enough to just get past this initial threshold? Since dropout also slows down training, maybe it would help to wait for the first 10k iterations before enabling dropout? Is this possible with Caffe?

@shelhamer
Member

NVIDIA-Linux-x86_64-319.82.run

That driver is old! Upgrade the driver and everything should work.

We have trained ImageNet without trouble with the dev branch, so that is not the problem. convert_imageset isn't different between the two either. Try shuffling the data when you create the leveldb (look at the convert_imageset args) if you haven't already.


@to3i
Contributor Author

to3i commented May 12, 2014

Concerning my current setup, I used random shuffling when generating the leveldb files (parameter set to 1). Moreover, caffe-dev passed runtest without error several times (401 tests passed), and the mnist demo also works. Additionally, I checked what would happen if I resumed imagenet training from a network "pre-trained" for the first 10k iterations (train 10k on the boost-eigen branch, convert the binaryproto, and continue training on the dev branch). I stopped training at 80k iterations, by which point the accuracy had risen from 27 to 41.

Based on the tests I performed, caffe-dev is not completely broken with my current setup. Only the very early stage of training imagenet seems to be affected.

I will look into upgrading the display drivers and report the outcome. Do you have any intuition why a newer display driver would change caffe's behavior during the first couple thousand iterations of imagenet training?

@sguada
Contributor

sguada commented May 12, 2014

Problems in the initial iterations are likely due to problems in the weight initialization. @jeffdonahue in #297 and myself in #335 discovered some problems with the random number generators in caffe related to boost. @jeffdonahue fixed them in #336 for dev, so that could be a possible explanation.

@gpapan

gpapan commented May 13, 2014

I have run into the same problem. Changing the bias initialization from 1.0 to some smaller value, say 0.7 or 0.5, in the bias_filler of convolutional layers 2, 4, and 5 has consistently solved it. A bias of 1.0 seems to be too large to get training started when the weights are initialized with std = 0.01.
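
For reference, a rough sketch of what that change looks like in the model prototxt. This uses the newer layer syntax (the 2014-era prototxt nests the fields differently), and the conv2 parameters shown are the usual reference-CaffeNet values, so treat the block as illustrative rather than a drop-in replacement:

layer {
  name: "conv2"
  type: "Convolution"
  bottom: "norm1"
  top: "conv2"
  convolution_param {
    num_output: 256
    pad: 2
    kernel_size: 5
    group: 2
    weight_filler { type: "gaussian" std: 0.01 }
    # bias_filler used value: 1.0; the workaround above lowers it to 0.5 or 0.7
    bias_filler { type: "constant" value: 0.5 }
  }
}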

@shelhamer
Member

@to3i @gpapan ImageNet training works in the latest master and dev without altering the model prototxt when run on a GTX 780 with CUDA 6.

I0522 01:19:07.953080 31791 solver.cpp:112] Iteration 20, loss = 7.18449
I0522 01:19:41.311635 31791 solver.cpp:112] Iteration 40, loss = 6.95364
I0522 01:20:14.668757 31791 solver.cpp:112] Iteration 60, loss = 6.91191
I0522 01:20:48.015966 31791 solver.cpp:112] Iteration 80, loss = 6.91892
I0522 01:21:21.400423 31791 solver.cpp:112] Iteration 100, loss = 6.92729
[...]
I0522 01:46:22.802777 31791 solver.cpp:112] Iteration 1000, loss = 6.91176
I0522 01:46:22.802798 31791 solver.cpp:139] Iteration 1000, Testing net (#0)
I0522 01:48:32.178345 31791 solver.cpp:177] Test score #0: 0.001
I0522 01:48:32.178388 31791 solver.cpp:177] Test score #1: 6.90807
I0522 01:49:05.532601 31791 solver.cpp:112] Iteration 1020, loss = 6.90832
I0522 01:49:38.898416 31791 solver.cpp:112] Iteration 1040, loss = 6.90591
I0522 01:50:12.254842 31791 solver.cpp:112] Iteration 1060, loss = 6.90701
I0522 01:50:45.617554 31791 solver.cpp:112] Iteration 1080, loss = 6.90649
I0522 01:51:18.989614 31791 solver.cpp:112] Iteration 1100, loss = 6.90821
[...]
I0522 02:16:20.569469 31791 solver.cpp:112] Iteration 2000, loss = 6.84969
I0522 02:16:20.569491 31791 solver.cpp:139] Iteration 2000, Testing net (#0)
I0522 02:18:29.979239 31791 solver.cpp:177] Test score #0: 0.00192
I0522 02:18:29.979284 31791 solver.cpp:177] Test score #1: 6.83365
I0522 02:19:03.333910 31791 solver.cpp:112] Iteration 2020, loss = 6.84268
I0522 02:19:36.720262 31791 solver.cpp:112] Iteration 2040, loss = 6.84313
I0522 02:20:10.097378 31791 solver.cpp:112] Iteration 2060, loss = 6.81291
I0522 02:20:43.488298 31791 solver.cpp:112] Iteration 2080, loss = 6.84587
[...]
I0522 06:15:57.166162 31791 solver.cpp:139] Iteration 10000, Testing net (#0)
I0522 06:18:06.673429 31791 solver.cpp:177] Test score #0: 0.2147
I0522 06:18:06.673459 31791 solver.cpp:177] Test score #1: 3.89709
I0522 06:18:07.017776 31791 solver.cpp:194] Snapshotting to caffe_imagenet_train_iter_10000
I0522 06:18:07.853206 31791 solver.cpp:201] Snapshotting solver state to caffe_imagenet_train_iter_10000.solverstate
I0522 06:18:41.734508 31791 solver.cpp:112] Iteration 10020, loss = 3.6477
I0522 06:19:15.055836 31791 solver.cpp:112] Iteration 10040, loss = 4.27069
I0522 06:19:48.405401 31791 solver.cpp:112] Iteration 10060, loss = 4.09493
I0522 06:20:21.775976 31791 solver.cpp:112] Iteration 10080, loss = 3.96134
I0522 06:20:55.130786 31791 solver.cpp:112] Iteration 10100, loss = 4.02522

@sguada
Contributor

sguada commented May 22, 2014

Nice, I think there was some problem with the random weight initialization.

Sergio


@tdomhan
Contributor

tdomhan commented May 26, 2014

@to3i did you resolve this? Was the issue the old driver?

@to3i
Contributor Author

to3i commented May 28, 2014

Sorry for not getting back to you earlier. I am currently running experiments with different libraries, and since there is no immediate need to update, I have postponed that so far.

I have tried the workaround by @gpapan, changing the bias initialization of conv layers 2, 4, and 5 to 0.7 or 0.5, but without success yet. Maybe I was a bit too impatient and should have given training a couple thousand more iterations.

@tdomhan
Contributor

tdomhan commented May 28, 2014

So I got my setup to work. I thought I had shuffled the input data, but I hadn't; the problem was that I created the leveldb file a while back and just assumed this without actually checking.
@to3i as you mentioned, you did shuffle yours, so the problem probably lies somewhere else. I also wanted to note that I'm using driver version 319, so I don't think that's the problem either.

@nguyentu1602

@shelhamer Hi Evan, I'm Cuong, working with J. Weinman at Grinnell.
I'm trying to understand why our training loss value keeps oscillating and how the training loss is calculated. In your output above:

I0522 01:19:07.953080 31791 solver.cpp:112] Iteration 20, loss = 7.18449

Is the loss value calculated using only data from each batch or using the whole training data?
Thank you!

@shelhamer
Member

Hey Cuong,

The training iteration loss is over the minibatches, not the whole training set. Oscillation is expected, not only because the batches differ but because the optimization is stochastic.

The loss over the whole validation set is computed once in a while according to the test interval in the solver settings. The validation loss is the actual measure to track in your training.
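
For reference, the relevant solver.prototxt fields look roughly like this (the values are illustrative, not recommendations):

display: 20          # print the minibatch training loss every 20 iterations
test_interval: 1000  # run the test/validation net every 1000 training iterations
test_iter: 1000      # number of validation batches averaged in each test pass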

If your loss is oscillating too wildly or diverging, try smaller weight initializations or learning rates.

Good luck!


@weinman
Contributor

weinman commented Jun 4, 2014

Thanks @nguyentu1602 for clarifying; that was my guess from reading the logs, but reading the code from solver.cpp made me think it was the loss over the whole dataset.

My sense is that to choose good learning values (learning rate, decay, momentum) for a particular problem, one needs to see the total train loss (even if it's just the stochastic sum over the batch losses).

@shelhamer, is there a straightforward way to report the loss for the whole training set, perhaps analogous to SGDSolver::Test()? The hack I could think of is making the validation set be the train set so as to use the built-in code, but I'm wondering if one can have one's cake and eat it, too.
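
A rough sketch of that hack, assuming the split train/val prototxts of the time (the file names below are made up for illustration): copy the val prototxt, point its data layer's source at the training leveldb, and reference the copy as the solver's test net.

# solver.prototxt (sketch)
train_net: "imagenet_train.prototxt"
test_net: "imagenet_trainloss.prototxt"  # copy of the val net whose data layer
                                         # source is the training leveldb
test_iter: 1000                          # training batches averaged per test pass
test_interval: 1000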

@to3i
Contributor Author

to3i commented Jun 20, 2014

I reproduced the problem with training imagenet on a different system with Ubuntu 14.04, GTX 770, driver version 331.38, and Cuda 5.5. Thus I can confirm @tdomhan's finding that this problem is not primarily related to a particular driver version.

I was able to solve the training issues reliably by changing the initial bias fillers to the same value (0.1) as provided in the alexnet config.

I0620 11:35:25.746945 26838 solver.cpp:86] Solving CaffeNet
I0620 11:35:25.746959 26838 solver.cpp:139] Iteration 0, Testing net (#0)
I0620 11:37:05.869024 26838 solver.cpp:177] Test score #0: 0.00134
I0620 11:37:05.869077 26838 solver.cpp:177] Test score #1: 6.91029
I0620 11:37:34.930341 26838 solver.cpp:272] Iteration 20, lr = 0.01
I0620 11:37:34.936677 26838 solver.cpp:112] Iteration 20, loss = 6.91209
I0620 11:38:04.014205 26838 solver.cpp:272] Iteration 40, lr = 0.01
I0620 11:38:04.020514 26838 solver.cpp:112] Iteration 40, loss = 6.92166
I0620 11:38:33.146855 26838 solver.cpp:272] Iteration 60, lr = 0.01
I0620 11:38:33.153205 26838 solver.cpp:112] Iteration 60, loss = 6.90463
I0620 11:39:02.294443 26838 solver.cpp:272] Iteration 80, lr = 0.01
I0620 11:39:02.300791 26838 solver.cpp:112] Iteration 80, loss = 6.91619
I0620 11:39:31.436177 26838 solver.cpp:272] Iteration 100, lr = 0.01
I0620 11:39:31.442518 26838 solver.cpp:112] Iteration 100, loss = 6.9243
I0620 11:40:00.782153 26838 solver.cpp:272] Iteration 120, lr = 0.01
I0620 11:40:00.788450 26838 solver.cpp:112] Iteration 120, loss = 6.92774
I0620 11:40:29.929620 26838 solver.cpp:272] Iteration 140, lr = 0.01
I0620 11:40:29.935956 26838 solver.cpp:112] Iteration 140, loss = 6.90268
I0620 11:40:59.050607 26838 solver.cpp:272] Iteration 160, lr = 0.01
I0620 11:40:59.056922 26838 solver.cpp:112] Iteration 160, loss = 6.9208
I0620 11:41:28.196375 26838 solver.cpp:272] Iteration 180, lr = 0.01
I0620 11:41:28.202695 26838 solver.cpp:112] Iteration 180, loss = 6.91881
I0620 11:41:57.302649 26838 solver.cpp:272] Iteration 200, lr = 0.01
I0620 11:41:57.308981 26838 solver.cpp:112] Iteration 200, loss = 6.8856
I0620 11:42:26.424986 26838 solver.cpp:272] Iteration 220, lr = 0.01
I0620 11:42:26.431308 26838 solver.cpp:112] Iteration 220, loss = 6.89744
I0620 11:42:55.537070 26838 solver.cpp:272] Iteration 240, lr = 0.01
I0620 11:42:55.543407 26838 solver.cpp:112] Iteration 240, loss = 6.82595
I0620 11:43:24.652662 26838 solver.cpp:272] Iteration 260, lr = 0.01
I0620 11:43:24.658993 26838 solver.cpp:112] Iteration 260, loss = 6.86962
I0620 11:43:53.768822 26838 solver.cpp:272] Iteration 280, lr = 0.01
I0620 11:43:53.775164 26838 solver.cpp:112] Iteration 280, loss = 6.8444
I0620 11:44:23.205131 26838 solver.cpp:272] Iteration 300, lr = 0.01
I0620 11:44:23.211465 26838 solver.cpp:112] Iteration 300, loss = 6.79811
I0620 11:44:52.397264 26838 solver.cpp:272] Iteration 320, lr = 0.01
I0620 11:44:52.403578 26838 solver.cpp:112] Iteration 320, loss = 6.80282
I0620 11:45:21.647888 26838 solver.cpp:272] Iteration 340, lr = 0.01
I0620 11:45:21.654194 26838 solver.cpp:112] Iteration 340, loss = 6.77742
I0620 11:45:50.868847 26838 solver.cpp:272] Iteration 360, lr = 0.01
I0620 11:45:50.875157 26838 solver.cpp:112] Iteration 360, loss = 6.78845
I0620 11:46:20.113735 26838 solver.cpp:272] Iteration 380, lr = 0.01
I0620 11:46:20.120142 26838 solver.cpp:112] Iteration 380, loss = 6.79837
I0620 11:46:49.407412 26838 solver.cpp:272] Iteration 400, lr = 0.01
I0620 11:46:49.413812 26838 solver.cpp:112] Iteration 400, loss = 6.78369
I0620 11:47:18.692961 26838 solver.cpp:272] Iteration 420, lr = 0.01
I0620 11:47:18.699283 26838 solver.cpp:112] Iteration 420, loss = 6.72305
I0620 11:47:47.971900 26838 solver.cpp:272] Iteration 440, lr = 0.01
I0620 11:47:47.978301 26838 solver.cpp:112] Iteration 440, loss = 6.68158
I0620 11:48:17.336835 26838 solver.cpp:272] Iteration 460, lr = 0.01
I0620 11:48:17.343233 26838 solver.cpp:112] Iteration 460, loss = 6.76245
I0620 11:48:46.673154 26838 solver.cpp:272] Iteration 480, lr = 0.01
I0620 11:48:46.679589 26838 solver.cpp:112] Iteration 480, loss = 6.72955
I0620 11:49:16.035604 26838 solver.cpp:272] Iteration 500, lr = 0.01
I0620 11:49:16.042004 26838 solver.cpp:112] Iteration 500, loss = 6.71371
I0620 11:49:45.359704 26838 solver.cpp:272] Iteration 520, lr = 0.01
I0620 11:49:45.366116 26838 solver.cpp:112] Iteration 520, loss = 6.73382
I0620 11:50:14.717906 26838 solver.cpp:272] Iteration 540, lr = 0.01
I0620 11:50:14.724305 26838 solver.cpp:112] Iteration 540, loss = 6.71847
I0620 11:50:44.084612 26838 solver.cpp:272] Iteration 560, lr = 0.01
I0620 11:50:44.091033 26838 solver.cpp:112] Iteration 560, loss = 6.68393
I0620 11:51:13.508644 26838 solver.cpp:272] Iteration 580, lr = 0.01
I0620 11:51:13.515055 26838 solver.cpp:112] Iteration 580, loss = 6.65607
I0620 11:51:42.890516 26838 solver.cpp:272] Iteration 600, lr = 0.01
I0620 11:51:42.896893 26838 solver.cpp:112] Iteration 600, loss = 6.64652
I0620 11:52:12.316871 26838 solver.cpp:272] Iteration 620, lr = 0.01
I0620 11:52:12.323268 26838 solver.cpp:112] Iteration 620, loss = 6.60725
I0620 11:52:41.705210 26838 solver.cpp:272] Iteration 640, lr = 0.01
I0620 11:52:41.711623 26838 solver.cpp:112] Iteration 640, loss = 6.49469
I0620 11:53:11.126482 26838 solver.cpp:272] Iteration 660, lr = 0.01
I0620 11:53:11.132851 26838 solver.cpp:112] Iteration 660, loss = 6.47581
I0620 11:53:40.556813 26838 solver.cpp:272] Iteration 680, lr = 0.01
I0620 11:53:40.563215 26838 solver.cpp:112] Iteration 680, loss = 6.51747
I0620 11:54:09.961843 26838 solver.cpp:272] Iteration 700, lr = 0.01
I0620 11:54:09.968148 26838 solver.cpp:112] Iteration 700, loss = 6.59504
I0620 11:54:39.407412 26838 solver.cpp:272] Iteration 720, lr = 0.01
I0620 11:54:39.413851 26838 solver.cpp:112] Iteration 720, loss = 6.56846
I0620 11:55:08.796895 26838 solver.cpp:272] Iteration 740, lr = 0.01
I0620 11:55:08.803326 26838 solver.cpp:112] Iteration 740, loss = 6.51677
I0620 11:55:38.196754 26838 solver.cpp:272] Iteration 760, lr = 0.01
I0620 11:55:38.203130 26838 solver.cpp:112] Iteration 760, loss = 6.46721
I0620 11:56:07.612365 26838 solver.cpp:272] Iteration 780, lr = 0.01
I0620 11:56:07.618757 26838 solver.cpp:112] Iteration 780, loss = 6.51402
I0620 11:56:36.993381 26838 solver.cpp:272] Iteration 800, lr = 0.01
I0620 11:56:36.999761 26838 solver.cpp:112] Iteration 800, loss = 6.31996

@to3i
Contributor Author

to3i commented Jun 20, 2014

I am afraid I have to correct my earlier comment. After about 8k iterations of training, things suddenly fall apart (see the output below around iteration 7900). I have not witnessed this before, and it would be great if someone could give me a pointer as to why the optimization can break down like this. My only clue is that maybe something is wrong with the training data, but things had been working fine with the older caffe version...

I0620 14:39:44.536012 26838 solver.cpp:177] Test score #0: 0.18864
I0620 14:39:44.542511 26838 solver.cpp:177] Test score #1: 4.13298
I0620 14:40:13.815701 26838 solver.cpp:272] Iteration 7020, lr = 0.01
I0620 14:40:13.822108 26838 solver.cpp:112] Iteration 7020, loss = 4.04092
I0620 14:40:43.172262 26838 solver.cpp:272] Iteration 7040, lr = 0.01
I0620 14:40:43.178668 26838 solver.cpp:112] Iteration 7040, loss = 4.29253
I0620 14:41:12.563097 26838 solver.cpp:272] Iteration 7060, lr = 0.01
I0620 14:41:12.569507 26838 solver.cpp:112] Iteration 7060, loss = 4.41015
I0620 14:41:41.932728 26838 solver.cpp:272] Iteration 7080, lr = 0.01
I0620 14:41:41.939139 26838 solver.cpp:112] Iteration 7080, loss = 4.38662
I0620 14:42:11.350173 26838 solver.cpp:272] Iteration 7100, lr = 0.01
I0620 14:42:11.356637 26838 solver.cpp:112] Iteration 7100, loss = 4.10524
I0620 14:42:40.728377 26838 solver.cpp:272] Iteration 7120, lr = 0.01
I0620 14:42:40.734822 26838 solver.cpp:112] Iteration 7120, loss = 4.06197
I0620 14:43:10.170919 26838 solver.cpp:272] Iteration 7140, lr = 0.01
I0620 14:43:10.177328 26838 solver.cpp:112] Iteration 7140, loss = 4.0084
I0620 14:43:39.577159 26838 solver.cpp:272] Iteration 7160, lr = 0.01
I0620 14:43:39.583540 26838 solver.cpp:112] Iteration 7160, loss = 4.01986
I0620 14:44:08.962162 26838 solver.cpp:272] Iteration 7180, lr = 0.01
I0620 14:44:08.968629 26838 solver.cpp:112] Iteration 7180, loss = 4.16073
I0620 14:44:38.433136 26838 solver.cpp:272] Iteration 7200, lr = 0.01
I0620 14:44:38.439510 26838 solver.cpp:112] Iteration 7200, loss = 4.18245
I0620 14:45:07.810672 26838 solver.cpp:272] Iteration 7220, lr = 0.01
I0620 14:45:07.817034 26838 solver.cpp:112] Iteration 7220, loss = 4.18903
I0620 14:45:37.202044 26838 solver.cpp:272] Iteration 7240, lr = 0.01
I0620 14:45:37.208451 26838 solver.cpp:112] Iteration 7240, loss = 4.02316
I0620 14:46:06.610736 26838 solver.cpp:272] Iteration 7260, lr = 0.01
I0620 14:46:06.617142 26838 solver.cpp:112] Iteration 7260, loss = 4.13984
I0620 14:46:36.020895 26838 solver.cpp:272] Iteration 7280, lr = 0.01
I0620 14:46:36.027274 26838 solver.cpp:112] Iteration 7280, loss = 4.23084
I0620 14:47:05.448349 26838 solver.cpp:272] Iteration 7300, lr = 0.01
I0620 14:47:05.454779 26838 solver.cpp:112] Iteration 7300, loss = 4.11645
I0620 14:47:34.837702 26838 solver.cpp:272] Iteration 7320, lr = 0.01
I0620 14:47:34.844108 26838 solver.cpp:112] Iteration 7320, loss = 4.07949
I0620 14:48:04.255254 26838 solver.cpp:272] Iteration 7340, lr = 0.01
I0620 14:48:04.261632 26838 solver.cpp:112] Iteration 7340, loss = 4.26324
I0620 14:48:33.703403 26838 solver.cpp:272] Iteration 7360, lr = 0.01
I0620 14:48:33.709803 26838 solver.cpp:112] Iteration 7360, loss = 4.23769
I0620 14:49:03.120488 26838 solver.cpp:272] Iteration 7380, lr = 0.01
I0620 14:49:03.126855 26838 solver.cpp:112] Iteration 7380, loss = 3.95633
I0620 14:49:32.534546 26838 solver.cpp:272] Iteration 7400, lr = 0.01
I0620 14:49:32.540951 26838 solver.cpp:112] Iteration 7400, loss = 4.04365
I0620 14:50:01.962718 26838 solver.cpp:272] Iteration 7420, lr = 0.01
I0620 14:50:01.969089 26838 solver.cpp:112] Iteration 7420, loss = 4.11874
I0620 14:50:31.364446 26838 solver.cpp:272] Iteration 7440, lr = 0.01
I0620 14:50:31.370842 26838 solver.cpp:112] Iteration 7440, loss = 4.17533
I0620 14:51:00.805496 26838 solver.cpp:272] Iteration 7460, lr = 0.01
I0620 14:51:00.811980 26838 solver.cpp:112] Iteration 7460, loss = 4.25556
I0620 14:51:30.238541 26838 solver.cpp:272] Iteration 7480, lr = 0.01
I0620 14:51:30.244962 26838 solver.cpp:112] Iteration 7480, loss = 4.37627
I0620 14:51:59.654762 26838 solver.cpp:272] Iteration 7500, lr = 0.01
I0620 14:51:59.661149 26838 solver.cpp:112] Iteration 7500, loss = 4.23419
I0620 14:52:29.117156 26838 solver.cpp:272] Iteration 7520, lr = 0.01
I0620 14:52:29.123595 26838 solver.cpp:112] Iteration 7520, loss = 3.99898
I0620 14:52:58.528599 26838 solver.cpp:272] Iteration 7540, lr = 0.01
I0620 14:52:58.535665 26838 solver.cpp:112] Iteration 7540, loss = 4.28032
I0620 14:53:27.920863 26838 solver.cpp:272] Iteration 7560, lr = 0.01
I0620 14:53:27.927288 26838 solver.cpp:112] Iteration 7560, loss = 3.93051
I0620 14:53:57.349653 26838 solver.cpp:272] Iteration 7580, lr = 0.01
I0620 14:53:57.356073 26838 solver.cpp:112] Iteration 7580, loss = 4.14223
I0620 14:54:26.789634 26838 solver.cpp:272] Iteration 7600, lr = 0.01
I0620 14:54:26.796046 26838 solver.cpp:112] Iteration 7600, loss = 4.29514
I0620 14:54:56.201872 26838 solver.cpp:272] Iteration 7620, lr = 0.01
I0620 14:54:56.208299 26838 solver.cpp:112] Iteration 7620, loss = 3.99356
I0620 14:55:25.640661 26838 solver.cpp:272] Iteration 7640, lr = 0.01
I0620 14:55:25.647030 26838 solver.cpp:112] Iteration 7640, loss = 4.1392
I0620 14:55:55.103888 26838 solver.cpp:272] Iteration 7660, lr = 0.01
I0620 14:55:55.110354 26838 solver.cpp:112] Iteration 7660, loss = 4.09701
I0620 14:56:24.493177 26838 solver.cpp:272] Iteration 7680, lr = 0.01
I0620 14:56:24.499610 26838 solver.cpp:112] Iteration 7680, loss = 4.28187
I0620 14:56:53.926969 26838 solver.cpp:272] Iteration 7700, lr = 0.01
I0620 14:56:53.933403 26838 solver.cpp:112] Iteration 7700, loss = 4.35199
I0620 14:57:23.319170 26838 solver.cpp:272] Iteration 7720, lr = 0.01
I0620 14:57:23.325537 26838 solver.cpp:112] Iteration 7720, loss = 4.08275
I0620 14:57:52.738777 26838 solver.cpp:272] Iteration 7740, lr = 0.01
I0620 14:57:52.745141 26838 solver.cpp:112] Iteration 7740, loss = 4.02336
I0620 14:58:22.212821 26838 solver.cpp:272] Iteration 7760, lr = 0.01
I0620 14:58:22.219264 26838 solver.cpp:112] Iteration 7760, loss = 4.16953
I0620 14:58:51.640851 26838 solver.cpp:272] Iteration 7780, lr = 0.01
I0620 14:58:51.647227 26838 solver.cpp:112] Iteration 7780, loss = 4.15586
I0620 14:59:21.081740 26838 solver.cpp:272] Iteration 7800, lr = 0.01
I0620 14:59:21.088212 26838 solver.cpp:112] Iteration 7800, loss = 4.09304
I0620 14:59:50.520324 26838 solver.cpp:272] Iteration 7820, lr = 0.01
I0620 14:59:50.526731 26838 solver.cpp:112] Iteration 7820, loss = 4.26121
I0620 15:00:20.001034 26838 solver.cpp:272] Iteration 7840, lr = 0.01
I0620 15:00:20.007499 26838 solver.cpp:112] Iteration 7840, loss = 4.27981
I0620 15:00:49.439652 26838 solver.cpp:272] Iteration 7860, lr = 0.01
I0620 15:00:49.446085 26838 solver.cpp:112] Iteration 7860, loss = **4.18486**
I0620 15:01:18.885591 26838 solver.cpp:272] Iteration 7880, lr = 0.01
I0620 15:01:18.892004 26838 solver.cpp:112] Iteration 7880, loss = **4.31062**
I0620 15:01:48.169371 26838 solver.cpp:272] Iteration 7900, lr = 0.01
I0620 15:01:48.175708 26838 solver.cpp:112] Iteration 7900, loss = **6.90793**
I0620 15:02:17.264022 26838 solver.cpp:272] Iteration 7920, lr = 0.01
I0620 15:02:17.270377 26838 solver.cpp:112] Iteration 7920, loss = **6.91296**
I0620 15:02:46.340982 26838 solver.cpp:272] Iteration 7940, lr = 0.01
I0620 15:02:46.348013 26838 solver.cpp:112] Iteration 7940, loss = 6.90914
I0620 15:03:15.433696 26838 solver.cpp:272] Iteration 7960, lr = 0.01
I0620 15:03:15.440011 26838 solver.cpp:112] Iteration 7960, loss = 6.90873
I0620 15:03:44.498239 26838 solver.cpp:272] Iteration 7980, lr = 0.01
I0620 15:03:44.504591 26838 solver.cpp:112] Iteration 7980, loss = 6.90801
I0620 15:04:13.588421 26838 solver.cpp:272] Iteration 8000, lr = 0.01
I0620 15:04:13.594765 26838 solver.cpp:112] Iteration 8000, loss = 6.91343
I0620 15:04:13.594774 26838 solver.cpp:139] Iteration 8000, Testing net (#0)
I0620 15:05:53.512861 26838 solver.cpp:177] Test score #0: 0.001
I0620 15:05:53.512897 26838 solver.cpp:177] Test score #1: 6.9084

@sguada
Contributor

sguada commented Jun 20, 2014

I have never seen something like this before in any of my trainings. So either there is something weird in your data (maybe a batch with only examples from one class), something strange happened with the random number generator, or your GPU started making too many errors.

Can you give more details about your data? How you pre-processed it and whether you shuffled it or not.

Sergio


@to3i
Contributor Author

to3i commented Jun 22, 2014

I use the training dataset from ILSVRC 2012, resize/warp everything down to 256x256, and then run the create_imagenet shell script with shuffle enabled.

In order to investigate the possibility of a faulty dataset, I repeated the training with the same setup and the error showed up after 45k iterations:

I0622 07:51:30.941565 16298 solver.cpp:139] Iteration 45000, Testing net (#0)
I0622 07:53:10.907624 16298 solver.cpp:177] Test score #0: 0.36516
I0622 07:53:10.907657 16298 solver.cpp:177] Test score #1: 2.98764
I0622 07:53:40.102326 16298 solver.cpp:272] Iteration 45020, lr = 0.01
I0622 07:53:40.108786 16298 solver.cpp:112] Iteration 45020, loss = 2.99365
[...]
I0622 07:58:04.682878 16298 solver.cpp:272] Iteration 45200, lr = 0.01
I0622 07:58:04.689251 16298 solver.cpp:112] Iteration 45200, loss = 3.03256
I0622 07:58:33.922636 16298 solver.cpp:272] Iteration 45220, lr = 0.01
I0622 07:58:33.928966 16298 solver.cpp:112] Iteration 45220, loss = 3.20652
I0622 07:59:03.161181 16298 solver.cpp:272] Iteration 45240, lr = 0.01
I0622 07:59:03.167497 16298 solver.cpp:112] Iteration 45240, loss = 3.14731
I0622 07:59:32.409312 16298 solver.cpp:272] Iteration 45260, lr = 0.01
I0622 07:59:32.415729 16298 solver.cpp:112] Iteration 45260, loss = 3.10704
I0622 08:00:01.672199 16298 solver.cpp:272] Iteration 45280, lr = 0.01
I0622 08:00:01.678561 16298 solver.cpp:112] Iteration 45280, loss = 3.21404
I0622 08:00:30.904839 16298 solver.cpp:272] Iteration 45300, lr = 0.01
I0622 08:00:30.911262 16298 solver.cpp:112] Iteration 45300, loss = 2.68094
I0622 08:00:59.988724 16298 solver.cpp:272] Iteration 45320, lr = 0.01
I0622 08:00:59.995066 16298 solver.cpp:112] Iteration 45320, loss = 6.90475
I0622 08:01:29.033206 16298 solver.cpp:272] Iteration 45340, lr = 0.01
I0622 08:01:29.039542 16298 solver.cpp:112] Iteration 45340, loss = 6.90794
I0622 08:01:58.082298 16298 solver.cpp:272] Iteration 45360, lr = 0.01
I0622 08:01:58.088645 16298 solver.cpp:112] Iteration 45360, loss = 6.90909
I0622 08:02:27.132822 16298 solver.cpp:272] Iteration 45380, lr = 0.01
I0622 08:02:27.139181 16298 solver.cpp:112] Iteration 45380, loss = 6.91634
[...]

I also recreated the leveldb by running create_imagenet.sh again (shuffle enabled); with the same setup it broke down again after 26k iterations:

I0621 05:03:08.178104 23607 solver.cpp:177] Test score #0: 0.33546
I0621 05:03:08.178150 23607 solver.cpp:177] Test score #1: 3.14943
I0621 05:03:37.240947 23607 solver.cpp:272] Iteration 26020, lr = 0.01
I0621 05:03:37.247289 23607 solver.cpp:112] Iteration 26020, loss = 3.38507
[...]
I0621 05:22:33.639921 23607 solver.cpp:272] Iteration 26800, lr = 0.01
I0621 05:22:33.646255 23607 solver.cpp:112] Iteration 26800, loss = 2.78992
I0621 05:23:02.751633 23607 solver.cpp:272] Iteration 26820, lr = 0.01
I0621 05:23:02.757951 23607 solver.cpp:112] Iteration 26820, loss = 3.47056
I0621 05:23:31.884439 23607 solver.cpp:272] Iteration 26840, lr = 0.01
I0621 05:23:31.890794 23607 solver.cpp:112] Iteration 26840, loss = 3.39292
I0621 05:24:01.001684 23607 solver.cpp:272] Iteration 26860, lr = 0.01
I0621 05:24:01.008003 23607 solver.cpp:112] Iteration 26860, loss = 3.09044
I0621 05:24:30.105005 23607 solver.cpp:272] Iteration 26880, lr = 0.01
I0621 05:24:30.111356 23607 solver.cpp:112] Iteration 26880, loss = 7.01918
I0621 05:24:59.120360 23607 solver.cpp:272] Iteration 26900, lr = 0.01
I0621 05:24:59.126708 23607 solver.cpp:112] Iteration 26900, loss = 7.01855
I0621 05:25:28.150028 23607 solver.cpp:272] Iteration 26920, lr = 0.01
I0621 05:25:28.156348 23607 solver.cpp:112] Iteration 26920, loss = 6.95872
I0621 05:25:57.167618 23607 solver.cpp:272] Iteration 26940, lr = 0.01
I0621 05:25:57.173929 23607 solver.cpp:112] Iteration 26940, loss = 6.96198
I0621 05:26:26.210989 23607 solver.cpp:272] Iteration 26960, lr = 0.01
I0621 05:26:26.217345 23607 solver.cpp:112] Iteration 26960, loss = 6.92597
I0621 05:26:55.229804 23607 solver.cpp:272] Iteration 26980, lr = 0.01
I0621 05:26:55.236147 23607 solver.cpp:112] Iteration 26980, loss = 6.93402
I0621 05:27:24.253763 23607 solver.cpp:272] Iteration 27000, lr = 0.01
I0621 05:27:24.260089 23607 solver.cpp:112] Iteration 27000, loss = 6.9353
I0621 05:27:24.260102 23607 solver.cpp:139] Iteration 27000, Testing net (#0)
I0621 05:29:03.283285 23607 solver.cpp:177] Test score #0: 0.001
I0621 05:29:03.283324 23607 solver.cpp:177] Test score #1: 6.93316

@sguada With regard to the three sources of error you mentioned, I would now assume there is an issue with the GPU or the random number generators. What exactly do you mean by the GPU starting to make too many errors? Is this a driver or a hardware issue?

It was mentioned before in this discussion that there have been some changes to the random number generators. I would like to revert these changes in my local copy of caffe-dev and try training again to see if that helps. Do you know which files have to be changed? I am still new to the git world, so I wonder if there is a way to search through the changes of the last few months to accomplish this.

@sguada
Contributor

sguada commented Jun 23, 2014

@to3i at this point my guess would be that there is something going on with your GPU; it could be the drivers or the card itself. Maybe when it gets hot it starts behaving erratically.

The loss is around 6.9 when the network is doing random guessing (with 1000 ImageNet classes, chance-level cross-entropy is ln 1000 ≈ 6.91), which probably means that all the weights got corrupted or just became zero.

You can try using an older version of Caffe. Look at the releases (https://github.com/BVLC/caffe/releases), maybe v0.9 argentine.

@to3i
Contributor Author

to3i commented Jun 24, 2014

@sguada I will check if switching between different nvidia drivers will resolve the problem. Thanks for your help!

@shelhamer
Member

Closing since ImageNet training has been replicated elsewhere with Caffe. If you keep having problems, please follow up with a comment in this thread.

@una-dinosauria

I have run the imagenet tutorial around 8 times now and seen it stall 3/8 times. As mentioned by @sguada in #59, if there is no improvement after 5000 iterations, that's probably a good sign that the net won't converge, and you should restart it and hope for the best.

@lugiavn

lugiavn commented Feb 27, 2015

I'm running this; the loss was stuck at 6.9 for 8k iterations before going down. But now it keeps going down, so that's good.

@zizhaozhang

I haven't tested imagenet myself yet. I re-implemented a network from a paper and the loss does not go down, but it works when I use another deep learning framework (so my understanding of the paper and the dataset preparation is correct). If I get correct results on the mnist dataset, can I say that my cuda and caffe versions are not the problem?

@anurikadisha

Hi @to3i,
I am also facing the same problem: the loss first decreases and then suddenly shoots up. How did you solve this problem?

@Dror370

Dror370 commented Mar 26, 2018

Hi all,
I suggest you check the SoftmaxWithLoss layer and make sure you have defined it correctly. If you define it with phase: TRAIN it does not work properly; the layer definition should not include any phase.
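
For illustration, a loss layer defined without any phase restriction (the blob names here follow the reference ImageNet model and may differ in your net):

layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "fc8"
  bottom: "label"
  top: "loss"
  # note: no include { phase: ... } block
}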
