
Better normalization options for SoftmaxWithLoss layer #3296

Merged: 1 commit merged into BVLC:master from the normalize_batch branch on Nov 23, 2015

Conversation

@cdoersch (Contributor) commented Nov 6, 2015

The current SoftmaxWithLoss layer has two options for normalizing the output: either divide by the number of 'valid' samples (those without the ignore label), or divide by the batch size. Notably missing is any way to turn normalization off completely. This is needed when batches are not all the same size, and hence batches with more examples should carry more weight. One might expect that normalize = false would do this, but confusingly it still divides by the batch size.

This PR replaces the existing 'normalize' boolean parameter in caffe.proto with an enum that has four different options, with more informative names. Two of the options (VALID and BATCH_SIZE) mimic existing behavior, and the current boolean parameter is still supported for backwards compatibility (but is deprecated). The NONE option allows you to turn normalization off completely, and the FULL option allows you to normalize by the full shape of the output map, i.e., like VALID but locations with the 'ignore' label are still included in the count for normalization.
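
For reference, selecting one of these modes from a net prototxt would look roughly like the sketch below (it assumes the new enum is exposed through loss_param as a field named normalization; the blob names and the ignore_label value are placeholders):

layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "score"
  bottom: "label"
  top: "loss"
  loss_param {
    ignore_label: 255     # optional: locations with this label are skipped
    normalization: VALID  # or FULL, BATCH_SIZE, NONE
  }
}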

Note that there's still a bit of a mess here, since it seems that SoftmaxWithLoss is the only layer that actually reads LossParameter's normalization options. That remains unchanged in this PR.

@jeffdonahue (Contributor)

Thanks @cdoersch -- this makes the normalization options clearer and looks correct to me. Ideally there would be a test for each option, but maybe that's not needed given that the current set of tests passes and the choice of normalization constant is factored into a separate method that's called in both Forward and Backward. Maybe @longjon would like to take a look before merge, since he added the normalize option in #1654 and this deprecates it?

@seanbell

It makes sense to address the divide-by-zero problem in this PR as well, since that's an issue with normalization. If there are 0 valid items (which can happen in many applications for some batches), then I think the loss should be 0 and the gradient also 0.

Simple ways to address this: (a) make the minimum allowed denominator 1 (since it's always an integer) or (b) special-case the denominator code to not divide if 0.

// before this PR (the else branch divided by outer_num_, i.e. the batch size):
} else {
  top[0]->mutable_cpu_data()[0] = loss / outer_num_;
}

// after this PR (normalization factored into a helper):
top[0]->mutable_cpu_data()[0] = loss / get_normalizer(normalization_, count);

Potential divide-by-zero here (and other similar locations).

Yes, the old code had the same problem, but it might as well get fixed here, since a separate PR fixing the divide-by-zero would conflict with this one.

@cdoersch (Contributor, Author)

@seanbell this is a good point. I initially didn't worry about this because it indicates that you have a batch with zero examples; it seems to me like you would want to error out if this happens. However, on second thought, I realize that in some cases datasets may accidentally contain examples that are totally unlabeled, which means that once in a great while a randomly-selected batch will contain no labels at all, leading to heisenbug behavior.

Do you think we should log a warning if we correct the denominator away from zero?

@seanbell

I don't think a warning is always necessary. With multi-task setups, you can have auxiliary classification tasks that only have valid labels once in a while, so the log would get flooded with warnings in those cases. I'd be happy with a warning that is on by default but easy to disable. More debug info is better than less.

@cdoersch (Contributor, Author)

But would you really want normalization turned on at all in that use case? That is, if labeled examples are so rare that many batches don't even contain one, do people really want an example to get half the weight just because it happens to occur in the same batch as another example whose label is defined?

I guess I don't really have a strong opinion about this, so if you have a use case, then I'm fine with doing it your way.

@jeffdonahue (Contributor)

My personal opinion here is that scaling down the loss by the number of non-ignored labels is the wrong thing to do to begin with, and this normalization should not be the default; but if it is used, it shouldn't have a special case for 0 non-ignored labels, as the NaN/inf problem that occurs with 0 non-ignored labels is just the most extreme and illustrative case of the general problem with this normalization strategy. For example, say you're using a batch size of 128: relative to a batch with 128 valid labels, in a batch with 64 valid labels each valid instance's error gradient is scaled up by a factor of 2; in a batch with 16 valid labels, by a factor of 8; and in a batch with 1 valid label, by a factor of 128. Naturally, with no valid labels, the error gradient is scaled up by infinity, which is of course bad, but only exactly as bad as it should be. With 1 valid label things are already pretty bad: scaling that one instance's gradient up by a factor of 128 is likely to lead to quite unstable and high-variance learning if you chose your learning rate expecting each update to reflect the gradient averaged over 128 independent instances.
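
To put the arithmetic in one line: if a batch of size B contains V valid (non-ignored) labels, VALID normalization scales each valid instance's gradient by a factor of B / V relative to a fully-labeled batch, i.e. 128/64 = 2, 128/16 = 8, 128/1 = 128, and 128/0 = infinity.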

That said, I think we should change the default to BATCH_SIZE normalization, but that would change existing behavior so it might deserve a separate PR with more discussion later.

@seanbell

with no valid labels, the error gradient is scaled up by infinity. Which is of course bad, but only exactly as bad as it should be.

In my use case, I have a very large model, a small batch size (4), and use gradient accumulation to fit it on the GPU. With this setup, a single iteration with no valid labels is enough to totally kill training.

The moment you have a NaN, everything is dead, so you might as well crash with a debug message in the normalizer code. What I am suggesting is that you may instead want to ignore the iteration and avoid the crash entirely. I can see that it might not be the best default, but I think it's an option worth considering.

@jeffdonahue (Contributor)

I'm not sure about a check, as it's possible you might not actually be backpropagating the error and just want to compute the softmax loss for debugging/display purposes (e.g. you're using loss_weight: 0), which is actually the one case where I'd agree with using the VALID normalization :). I'd prefer such cases to be caught by something more generic like #1349.

@seanbell

That's a valid use case too (though the loss_weight: 0 case may need fixing, #2895).

I was just saying that I have a use case for avoiding divide-by-zero (noisy, partially labeled datasets with small batch sizes, and certain multitask setups), and I thought others might have one as well. It could be a configurable option, with the default being to divide by zero and output NaN. But if nobody else runs into this problem, it could be left out for now.

Edit: Another use case. I have a multitask setup where some batches have semantic segmentation labels and others do not. For those that do have labels, I use "valid" normalization. For those that don't, the normalizing constant is 0, and I set loss = 0 instead of dividing by zero.

@seanbell

My personal opinion here is that scaling down the loss by the number of non-ignored labels is the wrong thing to do to begin with

Since weight decay is the same each iteration, I think it does make sense to scale by the number of valid labels. Otherwise, the ratio of regularizer-to-data changes from batch to batch.

Sorry if I'm side-tracking this PR. I'm arguing for something (avoiding divide-by-zero) which could be discussed in another PR.

@jeffdonahue (Contributor)

Since weight decay is the same each iteration, I think it does make sense to scale by the number of valid labels. Otherwise, the ratio of regularizer-to-data changes from batch to batch.

The ratio does change from minibatch to minibatch, but it's still an unbiased estimator of the objective if minibatches are sampled uniformly. But this statement made me go write down some math, and I realized I was wrong -- the VALID normalization (with your proposed special case for 0 valid labels) also gives an unbiased estimator of an objective, just one with a different weighting of the data term from the BATCH normalization (amounting to a different weighting of the regularization term). Note that this assumes minibatches are drawn randomly, however; if you're stepping through the dataset sequentially the VALID normalization is biased in that some examples -- ones that appear in batches with more ignored instances -- are actually always weighted higher than others, whereas even with sequential training I think the BATCH strategy remains an unbiased estimator of the objective at each iteration, assuming the dataset was shuffled to begin with.

Anyway though, I retract my earlier hardline and probably mathematically unfounded position and would support the special case you suggested, @seanbell (it doesn't need to be in this PR, but it's also fine with me if it is). Thanks for the discussion.

@cdoersch (Contributor, Author)

Considering that we have one 'don't care', one 'weak support', and one 'support' from someone who actually has a use case, I think it makes sense to add this to the PR. I'll implement it as a one-line change at the end of the get_normalizer function, where I return max(1.0, normalizer).
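
Roughly, the resulting helper would look like the sketch below (not necessarily the exact merged code; inner_num_ is assumed here to be the number of spatial locations per example, and the enum constants follow protoc's generated names for the LossParameter modes described above):

// Sketch of the normalizer helper; outer_num_ is the batch size as in the
// snippet above, inner_num_ (assumed) is the spatial size per example.
template <typename Dtype>
Dtype SoftmaxWithLossLayer<Dtype>::get_normalizer(
    LossParameter_NormalizationMode normalization_mode, int valid_count) {
  Dtype normalizer;
  switch (normalization_mode) {
    case LossParameter_NormalizationMode_FULL:
      normalizer = Dtype(outer_num_ * inner_num_);  // count ignored labels too
      break;
    case LossParameter_NormalizationMode_VALID:
      normalizer = Dtype(valid_count);              // non-ignored labels only
      break;
    case LossParameter_NormalizationMode_BATCH_SIZE:
      normalizer = Dtype(outer_num_);
      break;
    case LossParameter_NormalizationMode_NONE:
      normalizer = Dtype(1);
      break;
    default:
      LOG(FATAL) << "Unknown normalization mode: " << normalization_mode;
  }
  // The one-line fix discussed above: never divide by less than 1, so a batch
  // with zero valid labels yields loss 0 and gradient 0 instead of NaN.
  return std::max(Dtype(1.0), normalizer);  // std::max from <algorithm>
}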

@cdoersch force-pushed the normalize_batch branch 2 times, most recently from 9963079 to d5a78e1, on November 11, 2015
@jeffdonahue (Contributor)

LGTM, thanks for the additional normalization options and clarifications of existing ones @cdoersch!

jeffdonahue added a commit that referenced this pull request Nov 23, 2015
Better normalization options for SoftmaxWithLoss layer
@jeffdonahue merged commit 8e8d97d into BVLC:master on Nov 23, 2015
@seanbell

Thanks! This is very helpful.

shelhamer added commits to shelhamer/caffe that referenced this pull request Nov 17, 2016

sig-ce loss handles all the same normalizations as the softmax loss;
refer to BVLC#3296 for more detail.

this preserves the default normalization for sig-ce loss: batch size.

(Earlier revisions of this commit either changed the default to normalize by
the total number of outputs/targets, or kept the batch-size default while
noting that valid normalization might be a better idea.)

ZhangXinNan (Dec 23, 2016), acmiyaguchi (Nov 13, 2017), volgy for Fazecast (Jan 17, 2018), and oscarriddle (Mar 18, 2018) pushed commits to their forks that referenced this pull request with the same message, preserving the batch-size default for sig-ce loss.