Incorrect gradient from a SoftmaxWithLossLayer with loss_weight 0 #2895

Closed
Nanne opened this issue Aug 10, 2015 · 6 comments · Fixed by #6202


Nanne commented Aug 10, 2015

I was debugging a network with two loss layers and wanted to disable one of them (a SoftmaxWithLossLayer), so I set its loss_weight to 0. However, this does not do what I expected at all. The clearest way to explain it is probably with an example of how to reproduce it.

To reproduce, take examples/mnist/lenet_train_test.prototxt and add a second loss layer with weight 0:

layer {
  name: "bad_loss"
  type: "SoftmaxWithLoss"
  bottom: "ip2"
  bottom: "label"
  top: "bad_loss"
  loss_weight: 0
}

and then run this Python script:

caffe_root = '/roaming/nanne/caffe/'  # update this to your Caffe root
import sys
sys.path.insert(0, caffe_root + 'python')
import os
os.chdir(caffe_root)
import caffe
import numpy as np

caffe.set_mode_gpu()
solver = caffe.SGDSolver(caffe_root + 'examples/mnist/lenet_solver.prototxt')

# One forward/backward/update pass.
solver.step(1)

# Diffs of the two split outputs of ip2: one feeds the normal loss,
# the other feeds bad_loss (loss_weight: 0).
print(solver.net.blobs['ip2_ip2_0_split_0'].diff.squeeze()[5:7, :])
print(solver.net.blobs['ip2_ip2_0_split_1'].diff.squeeze()[5:7, :])

# Combined diff that ip2 receives (the sum of the two splits).
print(solver.net.blobs['ip2'].diff.squeeze()[5:7, :])

The diff for the split belonging to the SoftmaxWithLoss with loss_weight 0 will contain 64 (the batch size) values equal to the loss (NOT the gradient) for that input, and all the other elements will be 0. The other split will correctly contain all the diff values (64*10) for the loss with weight 1.

However, these two splits still get combined, creating the diff for 'ip2', for which the first 64 values are not comparable to the last 576. Am I wrong in how I tried to use the loss_weight, or is this a bug? (It doesn't seem to be specific to SoftmaxWithLoss, though it's most clear for this layer.)
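
A quick numerical check of this symptom (a hedged sketch: which of the two split outputs feeds bad_loss depends on layer order, and split_1 is assumed here):

bad = solver.net.blobs['ip2_ip2_0_split_1'].diff.squeeze()  # assumed: the split feeding bad_loss
print((bad != 0).sum())   # with the bug present: ~64 (one scratch value per example), not 64*10
print(bad[bad != 0][:5])  # these nonzero entries look like positive per-example losses, not signed gradients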

shelhamer added the JD label Aug 10, 2015

longjon commented Aug 14, 2015

This looks like a bug to me... try setting force_backward: true in your prototxt or setting the loss weight to a small nonzero value and see if the behavior changes. Here's what I think is happening: SoftmaxWithLossLayer uses its diff memory for temporary storage in forward, since backward will simply overwrite it with correct values (in this case, zeros). However, Net prunes the backward computation of this branch since the loss weight is set to zero, so the correct diff values never get set.

The easy solution (at the cost of some memory) is to outlaw writing to diff in forward.
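
For reference, the first experiment amounts to adding this line at the top level of the net prototxt (force_backward is a real NetParameter field; whether it changes the behavior is exactly what the experiment checks):

# at the top level of examples/mnist/lenet_train_test.prototxt
force_backward: true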

longjon added the bug label Aug 14, 2015
longjon changed the title from "Unexpected behaviour from a SoftmaxWithLossLayer with loss_weight 0" to "Incorrect gradient from a SoftmaxWithLossLayer with loss_weight 0" Aug 14, 2015

Nanne commented Aug 18, 2015

Any non-zero loss weight seems to work fine. Additionally, HingeLoss also uses its diff in the forward pass. I'd be happy with that solution, as it seems several other layers already use a diff_ blob in their forward pass to store calculations for the backward pass.

seanbell commented

Another possible solution: if a backward step is skipped, and diff is allocated (this is important to check), then diff is set to 0.
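
A pycaffe-level sketch of the same idea, for anyone driving the net manually rather than through solver.step() (the split blob name is assumed from the repro above, and this skips the solver's parameter update; it illustrates the idea, it is not the proposed Net change itself):

net = solver.net
net.forward()
# The backward step of the zero-weight branch is pruned, so clear any scratch
# values its forward pass may have left in the split blob's diff.
net.blobs['ip2_ip2_0_split_1'].diff[...] = 0
net.backward()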

cvondrick commented

Just got hit by this bug, unfortunately. I think an intermediate measure would be to add a CHECK failure when the loss_weight is 0 for this layer. Otherwise, people will get incorrect results and be optimizing a different objective than the one they write in a paper (!).
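
Until such a CHECK lands, a hedged Python-side guard along these lines could flag the configuration before training (this reuses caffe_root from the repro script above; matching layer types by the 'Loss' suffix is a heuristic, not part of the Caffe API):

from caffe.proto import caffe_pb2
from google.protobuf import text_format

def warn_zero_weight_losses(prototxt_path):
    # Warn about loss layers declared with an explicit loss_weight of 0.
    net_param = caffe_pb2.NetParameter()
    with open(prototxt_path) as f:
        text_format.Merge(f.read(), net_param)
    for layer in net_param.layer:
        if layer.type.endswith('Loss') and 0.0 in list(layer.loss_weight):
            print('WARNING: layer %s has loss_weight 0; its diff may hold '
                  'stale forward scratch values (see #2895)' % layer.name)

warn_zero_weight_losses(caffe_root + 'examples/mnist/lenet_train_test.prototxt')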


BlGene commented Nov 26, 2015

This bug also occurs when using Python loss layers. In my case, the presence of the second loss layer (with loss_weight: 0) causes the first few diff values of the first loss layer (with loss_weight: 1) to be overwritten, resulting in failed training.


Cysu commented Mar 22, 2016

Trapped by this bug, too. I've opened PR #3868 based on Sean's solution.
