
Training error loss = NAN in multitask net #3398

Closed
artiit opened this issue Nov 30, 2015 · 3 comments

Comments

artiit commented Nov 30, 2015

Hi everyone,

I use Caffe to train a multitask net which does both classification and regression.
My loss layers are:
layer {
  name: "loss1"
  type: "SoftmaxWithLoss"
  bottom: "fc8_1"
  bottom: "label1"   # size 100, for classification
  top: "loss1"
  loss_weight: 2.0
}
layer {
  name: "loss2"
  type: "EuclideanLoss"
  bottom: "fc8_2"
  bottom: "label2"   # size 4, for regression
  top: "loss2"
}
I fine-tune it from Caffenet.caffemodel, and the solver.prototxt is:
net: "/_/train_vol.prototxt"
test_iter: 100
test_interval: 1000
base_lr: 0.001
lr_policy: "step"
gamma: 0.1
stepsize: 20000
display: 20
max_iter: 100000
momentum: 0.9
weight_decay: 0.0005
snapshot: 10000
snapshot_prefix: "_******"
solver_mode: GPU

The error is:
Iteration 20, loss = nan
Train net output #0: loss1 = 2.58978 (* 4 = 10.3591 loss)
Train net output #1: loss2 = nan (* 1 = nan loss)
Only iteration 0 is without nan:
Iteration 0, loss = 14.9363
Train net output #0: loss1 = 3.10484 (* 4 = 12.4193 loss)
Train net output #1: loss2 = 2.51693 (* 1 = 2.51693 loss)
The other iterations are always with nan loss.
I followed #409, but it was not helpful for me.
When I set base_lr to 0 the nan is gone, but even when base_lr is as small as 0.0001 the nan appears again.

Any advice would be appreciated!
Thanks!
artiit.

@artiit artiit changed the title multitask CNN training error loss = nan Training error loss = NAN in multitask net Nov 30, 2015
@artiit artiit closed this as completed Dec 2, 2015
xyxxyx commented Jul 23, 2016

@artiit Hi, I've met the same problem as you: multitask CNN training error loss = nan. How did you solve it? Can you give me some advice? Thanks a lot.

rkakamilan commented:

I met the same problem, but I solved it by reducing the learning rate from 1e-3 to 1e-7 or below. With that change, more iterations (>200000) were needed for the model to converge.
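
For reference, a minimal sketch of what the adjusted solver.prototxt could look like under that advice, reusing the settings from the original post; the exact base_lr and max_iter values here are only illustrative, not a verified configuration:

net: "/_/train_vol.prototxt"
test_iter: 100
test_interval: 1000
base_lr: 1e-7          # reduced from 0.001, as suggested above
lr_policy: "step"
gamma: 0.1
stepsize: 20000
display: 20
max_iter: 300000       # leave room for the >200000 iterations mentioned above
momentum: 0.9
weight_decay: 0.0005
snapshot: 10000
snapshot_prefix: "_******"
solver_mode: GPU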

@artiit artiit reopened this Aug 2, 2016
artiit commented Aug 2, 2016

Hi @xyxxyx. Just like @rkakamilan did, the best way is to reduce your learning rate and have a little patience. If it still doesn't work or the learning result is too bad, you should consider whether your model is correct, and start with a simple net.
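
One way to follow that last suggestion (just an illustrative sketch, not something prescribed in this thread) is to silence one task at a time by setting its loss_weight to 0, so you can see which branch produces the nan; for example, with only the regression loss disabled:

layer {
  name: "loss2"
  type: "EuclideanLoss"
  bottom: "fc8_2"
  bottom: "label2"
  top: "loss2"
  loss_weight: 0   # contributes nothing to the objective; restore once loss1 trains cleanly
}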

@artiit artiit closed this as completed Aug 2, 2016