On custom data training diverges (loss = NaN) #409
Comments
Sergio: Try reducing the base learning rate.
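For reference, the base learning rate is a solver-definition field. A minimal sketch of the change, assuming a LeNet-style solver (the training log below shows lr = 0.00992565 at iteration 100, which is exactly base_lr: 0.01 under the inv policy with gamma: 0.0001 and power: 0.75, so the other fields mirror that example):

```
# solver.prototxt sketch -- drop base_lr by 10x from the example's 0.01
base_lr: 0.001       # the first knob to turn when the loss goes to NaN
lr_policy: "inv"     # remaining fields as in the LeNet example solver
gamma: 0.0001
power: 0.75
momentum: 0.9
weight_decay: 0.0005
```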
smiley19: I tried learning rates from 0.0001 to 0.001, and there are still two possible outcomes: the loss goes to NaN, or it doesn't decrease. Is there another way to solve it?
Sergio: Try different initializations, for instance with the bias set to 0.1.
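In the net definition this means changing the fillers on the learned layers; a sketch of just those fields, since the enclosing layer syntax varies across Caffe versions:

```
# inside the conv1 / conv2 / ip1 / ip2 layer definitions (sketch)
weight_filler {
  type: "xavier"      # or "gaussian" with a small std such as 0.01
}
bias_filler {
  type: "constant"
  value: 0.1          # bias set to 0.1, as suggested above
}
```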
Yangqing: For a sanity check, try running with a learning rate of 0 to see if any NaN appears.
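Concretely, that check is the solver sketch above with the rate zeroed out; with no updates applied, any NaN would have to come from the data or the forward pass:

```
base_lr: 0    # no parameter updates; the loss should sit at its initial value
```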
smiley19: I set the learning rate to 0, and the training loss doesn't change at all. Does that mean my data and initialization are okay? I also tried a learning rate of 0.00001 with the bias set to 0.1, and it can still turn to NaN.
Sorry, we cannot train and tune your model for you. Consult references on deep learning and tutorials such as Marc'aurelio Ranzato's CVPR '12 tutorial slides on tips and tricks.
I met the same problem as you; could you tell me how you solved it? Thanks.
smiley19 (original post): I am trying to train on my own dataset (4 hand gestures). I didn't change the overall structure of the example networks (e.g. MNIST and ImageNet); the only thing I modified is the input dataset. But no matter how I adjust the relevant parameters (e.g. learning rate, weight decay), either the weights and the loss diverge to NaN, or the loss doesn't decrease at all.
I used convert_imageset.bin to convert the input images to a LevelDB:
```
Creating leveldb...
E0513 10:04:46.223364 19650 convert_imageset.cpp:96] Processed 1000 files.
E0513 10:04:53.909580 19650 convert_imageset.cpp:96] Processed 2000 files.
E0513 10:04:59.556373 19650 convert_imageset.cpp:96] Processed 3000 files.
E0513 10:05:05.393556 19650 convert_imageset.cpp:96] Processed 4000 files.
E0513 10:05:11.244086 19650 convert_imageset.cpp:96] Processed 5000 files.
E0513 10:05:16.990255 19650 convert_imageset.cpp:96] Processed 6000 files.
E0513 10:05:25.553741 19650 convert_imageset.cpp:96] Processed 7000 files.
E0513 10:05:31.347475 19650 convert_imageset.cpp:96] Processed 8000 files.
E0513 10:05:36.977419 19650 convert_imageset.cpp:96] Processed 9000 files.
E0513 10:05:43.507733 19650 convert_imageset.cpp:96] Processed 10000 files.
E0513 10:05:51.023560 19650 convert_imageset.cpp:96] Processed 11000 files.
E0513 10:05:56.628383 19650 convert_imageset.cpp:96] Processed 12000 files.
E0513 10:06:02.121335 19650 convert_imageset.cpp:104] Processed 12800 files.
E0513 10:06:06.624284 19994 convert_imageset.cpp:96] Processed 1000 files.
E0513 10:06:08.399435 19994 convert_imageset.cpp:104] Processed 1871 files.
Done.
```
And the training log looks like this:
```
I0513 11:15:46.741041 30505 train_net.cpp:26] Starting Optimization
I0513 11:15:46.741169 30505 solver.cpp:41] Creating training net.
I0513 11:15:46.741538 30505 net.cpp:75] Creating Layer hand
I0513 11:15:46.741564 30505 net.cpp:111] hand -> data
I0513 11:15:46.741597 30505 net.cpp:111] hand -> label
I0513 11:15:46.741647 30505 data_layer.cpp:145] Opening leveldb hand-train-leveldb
I0513 11:15:46.876204 30505 data_layer.cpp:185] output data size: 128,3,50,50
I0513 11:15:47.147683 30505 net.cpp:126] Top shape: 128 3 50 50 (960000)
I0513 11:15:47.147743 30505 net.cpp:126] Top shape: 128 1 1 1 (128)
I0513 11:15:47.147759 30505 net.cpp:157] hand does not need backward computation.
I0513 11:15:47.147783 30505 net.cpp:75] Creating Layer conv1
I0513 11:15:47.147797 30505 net.cpp:85] conv1 <- data
I0513 11:15:47.147817 30505 net.cpp:111] conv1 -> conv1
I0513 11:15:47.147897 30505 net.cpp:126] Top shape: 128 20 45 45 (5184000)
I0513 11:15:47.147917 30505 net.cpp:152] conv1 needs backward computation.
I0513 11:15:47.147933 30505 net.cpp:75] Creating Layer pool1
I0513 11:15:47.147945 30505 net.cpp:85] pool1 <- conv1
I0513 11:15:47.147958 30505 net.cpp:111] pool1 -> pool1
I0513 11:15:47.147977 30505 net.cpp:126] Top shape: 128 20 15 15 (576000)
I0513 11:15:47.147997 30505 net.cpp:152] pool1 needs backward computation.
I0513 11:15:47.148012 30505 net.cpp:75] Creating Layer conv2
I0513 11:15:47.148025 30505 net.cpp:85] conv2 <- pool1
I0513 11:15:47.148036 30505 net.cpp:111] conv2 -> conv2
I0513 11:15:47.148380 30505 net.cpp:126] Top shape: 128 50 10 10 (640000)
I0513 11:15:47.148401 30505 net.cpp:152] conv2 needs backward computation.
I0513 11:15:47.148416 30505 net.cpp:75] Creating Layer pool2
I0513 11:15:47.148429 30505 net.cpp:85] pool2 <- conv2
I0513 11:15:47.148442 30505 net.cpp:111] pool2 -> pool2
I0513 11:15:47.148458 30505 net.cpp:126] Top shape: 128 50 5 5 (160000)
I0513 11:15:47.148470 30505 net.cpp:152] pool2 needs backward computation.
I0513 11:15:47.148485 30505 net.cpp:75] Creating Layer ip1
I0513 11:15:47.148497 30505 net.cpp:85] ip1 <- pool2
I0513 11:15:47.148510 30505 net.cpp:111] ip1 -> ip1
I0513 11:15:47.154276 30505 net.cpp:126] Top shape: 128 500 1 1 (64000)
I0513 11:15:47.154330 30505 net.cpp:152] ip1 needs backward computation.
I0513 11:15:47.154347 30505 net.cpp:75] Creating Layer relu1
I0513 11:15:47.154361 30505 net.cpp:85] relu1 <- ip1
I0513 11:15:47.154376 30505 net.cpp:99] relu1 -> ip1 (in-place)
I0513 11:15:47.154392 30505 net.cpp:126] Top shape: 128 500 1 1 (64000)
I0513 11:15:47.154404 30505 net.cpp:152] relu1 needs backward computation.
I0513 11:15:47.154420 30505 net.cpp:75] Creating Layer ip2
I0513 11:15:47.154431 30505 net.cpp:85] ip2 <- ip1
I0513 11:15:47.154443 30505 net.cpp:111] ip2 -> ip2
I0513 11:15:47.154484 30505 net.cpp:126] Top shape: 128 4 1 1 (512)
I0513 11:15:47.154500 30505 net.cpp:152] ip2 needs backward computation.
I0513 11:15:47.154520 30505 net.cpp:75] Creating Layer loss
I0513 11:15:47.154532 30505 net.cpp:85] loss <- ip2
I0513 11:15:47.154546 30505 net.cpp:85] loss <- label
I0513 11:15:47.154562 30505 net.cpp:152] loss needs backward computation.
I0513 11:15:47.154587 30505 net.cpp:180] Collecting Learning Rate and Weight Decay.
I0513 11:15:47.154608 30505 net.cpp:173] Network initialization done.
I0513 11:15:47.154623 30505 net.cpp:174] Memory required for Data 30338560
I0513 11:15:47.154680 30505 solver.cpp:44] Creating testing net.
I0513 11:15:47.155036 30505 net.cpp:75] Creating Layer hand
I0513 11:15:47.155061 30505 net.cpp:111] hand -> data
I0513 11:15:47.155079 30505 net.cpp:111] hand -> label
I0513 11:15:47.155096 30505 data_layer.cpp:145] Opening leveldb hand-test-leveldb
I0513 11:15:47.268432 30505 data_layer.cpp:185] output data size: 1871,3,50,50
I0513 11:15:47.285804 30505 net.cpp:126] Top shape: 1871 3 50 50 (14032500)
I0513 11:15:47.285868 30505 net.cpp:126] Top shape: 1871 1 1 1 (1871)
I0513 11:15:47.285884 30505 net.cpp:157] hand does not need backward computation.
I0513 11:15:47.285908 30505 net.cpp:75] Creating Layer conv1
I0513 11:15:47.285922 30505 net.cpp:85] conv1 <- data
I0513 11:15:47.285936 30505 net.cpp:111] conv1 -> conv1
I0513 11:15:47.286005 30505 net.cpp:126] Top shape: 1871 20 45 45 (75775500)
I0513 11:15:47.286023 30505 net.cpp:152] conv1 needs backward computation.
I0513 11:15:47.286039 30505 net.cpp:75] Creating Layer pool1
I0513 11:15:47.286052 30505 net.cpp:85] pool1 <- conv1
I0513 11:15:47.286066 30505 net.cpp:111] pool1 -> pool1
I0513 11:15:47.286079 30505 net.cpp:126] Top shape: 1871 20 15 15 (8419500)
I0513 11:15:47.286092 30505 net.cpp:152] pool1 needs backward computation.
I0513 11:15:47.286108 30505 net.cpp:75] Creating Layer conv2
I0513 11:15:47.286119 30505 net.cpp:85] conv2 <- pool1
I0513 11:15:47.286133 30505 net.cpp:111] conv2 -> conv2
I0513 11:15:47.286484 30505 net.cpp:126] Top shape: 1871 50 10 10 (9355000)
I0513 11:15:47.286504 30505 net.cpp:152] conv2 needs backward computation.
I0513 11:15:47.286522 30505 net.cpp:75] Creating Layer pool2
I0513 11:15:47.286535 30505 net.cpp:85] pool2 <- conv2
I0513 11:15:47.286548 30505 net.cpp:111] pool2 -> pool2
I0513 11:15:47.286561 30505 net.cpp:126] Top shape: 1871 50 5 5 (2338750)
I0513 11:15:47.286574 30505 net.cpp:152] pool2 needs backward computation.
I0513 11:15:47.286591 30505 net.cpp:75] Creating Layer ip1
I0513 11:15:47.286602 30505 net.cpp:85] ip1 <- pool2
I0513 11:15:47.286615 30505 net.cpp:111] ip1 -> ip1
I0513 11:15:47.292402 30505 net.cpp:126] Top shape: 1871 500 1 1 (935500)
I0513 11:15:47.292474 30505 net.cpp:152] ip1 needs backward computation.
I0513 11:15:47.292493 30505 net.cpp:75] Creating Layer relu1
I0513 11:15:47.292506 30505 net.cpp:85] relu1 <- ip1
I0513 11:15:47.292522 30505 net.cpp:99] relu1 -> ip1 (in-place)
I0513 11:15:47.292536 30505 net.cpp:126] Top shape: 1871 500 1 1 (935500)
I0513 11:15:47.292548 30505 net.cpp:152] relu1 needs backward computation.
I0513 11:15:47.292564 30505 net.cpp:75] Creating Layer ip2
I0513 11:15:47.292577 30505 net.cpp:85] ip2 <- ip1
I0513 11:15:47.292588 30505 net.cpp:111] ip2 -> ip2
I0513 11:15:47.292644 30505 net.cpp:126] Top shape: 1871 4 1 1 (7484)
I0513 11:15:47.292659 30505 net.cpp:152] ip2 needs backward computation.
I0513 11:15:47.292680 30505 net.cpp:75] Creating Layer prob
I0513 11:15:47.292692 30505 net.cpp:85] prob <- ip2
I0513 11:15:47.292706 30505 net.cpp:111] prob -> prob
I0513 11:15:47.292722 30505 net.cpp:126] Top shape: 1871 4 1 1 (7484)
I0513 11:15:47.292736 30505 net.cpp:152] prob needs backward computation.
I0513 11:15:47.292748 30505 net.cpp:75] Creating Layer accuracy
I0513 11:15:47.292764 30505 net.cpp:85] accuracy <- prob
I0513 11:15:47.292776 30505 net.cpp:85] accuracy <- label
I0513 11:15:47.292790 30505 net.cpp:111] accuracy -> accuracy
I0513 11:15:47.292806 30505 net.cpp:126] Top shape: 1 2 1 1 (2)
I0513 11:15:47.292819 30505 net.cpp:152] accuracy needs backward computation.
I0513 11:15:47.292831 30505 net.cpp:163] This network produces output accuracy
I0513 11:15:47.292848 30505 net.cpp:180] Collecting Learning Rate and Weight Decay.
I0513 11:15:47.292865 30505 net.cpp:173] Network initialization done.
I0513 11:15:47.292877 30505 net.cpp:174] Memory required for Data 443494364
I0513 11:15:47.292930 30505 solver.cpp:49] Solver scaffolding done.
I0513 11:15:47.292948 30505 solver.cpp:60] Solving Hand
I0513 11:15:47.292968 30505 solver.cpp:105] Iteration 0, Testing net
I0513 11:15:47.720150 30505 solver.cpp:141] Test score #0: 0.207376
I0513 11:15:47.720239 30505 solver.cpp:141] Test score #1: 1.38612
I0513 11:15:56.640269 30505 solver.cpp:236] Iteration 100, lr = 0.00992565
I0513 11:15:56.640511 30505 solver.cpp:86] Iteration 100, loss = 2.70168
I0513 11:16:05.554852 30505 solver.cpp:236] Iteration 200, lr = 0.00985258
I0513 11:16:05.555102 30505 solver.cpp:86] Iteration 200, loss = nan
I0513 11:16:14.469753 30505 solver.cpp:236] Iteration 300, lr = 0.00978075
I0513 11:16:14.470007 30505 solver.cpp:86] Iteration 300, loss = nan
I0513 11:16:23.383903 30505 solver.cpp:236] Iteration 400, lr = 0.00971013
I0513 11:16:23.384151 30505 solver.cpp:86] Iteration 400, loss = nan
I0513 11:16:23.384174 30505 solver.cpp:105] Iteration 400, Testing net
I0513 11:16:23.702975 30505 solver.cpp:141] Test score #0: 0
I0513 11:16:23.703033 30505 solver.cpp:141] Test score #1: nan
```
Also, I want to check the contents of the input LevelDB, but I have no idea how to do that.
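One way to peek inside the LevelDB is a short Python script. This is a minimal sketch, assuming the py-leveldb bindings are installed and Caffe's compiled protobuf module (caffe_pb2) is importable; the database path and the 3x50x50 shape follow the logs above:

```python
import leveldb                      # py-leveldb bindings (assumed installed)
import numpy as np
from caffe.proto import caffe_pb2   # requires the compiled Caffe protos

db = leveldb.LevelDB('hand-train-leveldb')  # path from the conversion step

datum = caffe_pb2.Datum()
for i, (key, value) in enumerate(db.RangeIter()):
    datum.ParseFromString(bytes(value))
    # raw pixel bytes -> (channels, height, width), 3x50x50 per the log
    img = np.frombuffer(datum.data, dtype=np.uint8)
    img = img.reshape(datum.channels, datum.height, datum.width)
    print(key, 'label:', datum.label, 'shape:', img.shape,
          'min/max:', img.min(), img.max())
    if i >= 4:                      # inspect just the first few entries
        break
```

If the labels or pixel ranges look wrong here, the divergence is more likely a data problem than a learning-rate one.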