
Training error loss = NAN in multitask net #3398

Closed
artiit opened this issue Nov 30, 2015 · 3 comments

Comments

artiit commented Nov 30, 2015

Hi everyone,

I use Caffe to train a multitask net which does both classification and regression.
My loss layers are:
layer {
  name: "loss1"
  type: "SoftmaxWithLoss"
  bottom: "fc8_1"
  bottom: "label1"   # size 100, for classification
  top: "loss1"
  loss_weight: 2.0
}
layer {
  name: "loss2"
  type: "EuclideanLoss"
  bottom: "fc8_2"
  bottom: "label2"   # size 4, for regression
  top: "loss2"
}
I fine-tune it from Caffenet.caffemodel, and the solver.prototxt is:
net: "/_/train_vol.prototxt"
test_iter: 100
test_interval: 1000
base_lr: 0.001
lr_policy: "step"
gamma: 0.1
stepsize: 20000
display: 20
max_iter: 100000
momentum: 0.9
weight_decay: 0.0005
snapshot: 10000
snapshot_prefix: "_******"
solver_mode: GPU

The error is:
Iteration 20, loss = nan
Train net output #0: loss1 = 2.58978 (* 4 = 10.3591 loss)
Train net output #1: loss2 = nan (* 1 = nan loss)
Only iteration 0 is without nan:
Iteration 0, loss = 14.9363
Train net output #0: loss1 = 3.10484 (* 4 = 12.4193 loss)
Train net output #1: loss2 = 2.51693 (* 1 = 2.51693 loss)
The other iterations are always with nan loss.
I followed #409, but it was not helpful for me.
When I set base_lr to 0 the nan is gone, but even when base_lr is as small as 0.0001 the nan appears again.

Any advice would be appreciated!
Thanks!
artiit.

@artiit artiit changed the title multitask CNN training error loss = nan Training error loss = NAN in multitask net Nov 30, 2015
@artiit artiit closed this as completed Dec 2, 2015
xyxxyx commented Jul 23, 2016

@artiit Hi, I've met the same problem as you: multitask CNN training error loss = nan. How did you solve it? Can you give me some advice? Thanks a lot.

rkakamilan commented:

I met the same problem, but I solved it by reducing the learning rate from 1e-3 to 1e-7 or below. With that change, more iterations (>200000) were needed for the model to converge.
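
For reference, a minimal sketch of what the adjusted solver.prototxt could look like under that advice, reusing the settings from the original post; the exact base_lr and max_iter values here are only illustrative, not a verified configuration:

net: "/_/train_vol.prototxt"
test_iter: 100
test_interval: 1000
base_lr: 1e-7          # reduced from 0.001, as suggested above
lr_policy: "step"
gamma: 0.1
stepsize: 20000
display: 20
max_iter: 300000       # leave room for the >200000 iterations mentioned above
momentum: 0.9
weight_decay: 0.0005
snapshot: 10000
snapshot_prefix: "_******"
solver_mode: GPU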

@artiit artiit reopened this Aug 2, 2016
artiit commented Aug 2, 2016

Hi @xyxxyx. Just like @rkakamilan did, the best way is to reduce your learning rate and have a little patience. If it still doesn't work or the learning result is too bad, you should consider whether your model is correct, and start with a simple net.
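
One way to follow that last suggestion (just an illustrative sketch, not something prescribed in this thread) is to silence one task at a time by setting its loss_weight to 0, so you can see which branch produces the nan; for example, with only the regression loss disabled:

layer {
  name: "loss2"
  type: "EuclideanLoss"
  bottom: "fc8_2"
  bottom: "label2"
  top: "loss2"
  loss_weight: 0   # contributes nothing to the objective; restore once loss1 trains cleanly
}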

@artiit artiit closed this as completed Aug 2, 2016