BatchNorm numerical instability #3963

Open
classner opened this issue Apr 8, 2016 · 6 comments

classner commented Apr 8, 2016

I just tracked down an issue with unexpected BatchNorm behavior in the current Caffe version. The effect can be condensed to the following scenario: for an input batch with a channel of constant values, it produces results that are (depending on the magnitude of the values) quite far from zero. The expected normalization would be zeros in this channel for all values (mean normalized to zero, no variance).

To reproduce the effect, I created a test with a 1x1x3x3 input blob where all entries are 100 (it also works for smaller values). In this case, the computed mean is 100.000015 and the result of the normalization is ~ -0.00483, which is far from 0. The normalization factor amplifies the numerical deviation. This just highlights how badly conditioned the operation is numerically and how important it is to get the mean right (the variance is then computed as E((X-E(X))^2)). One approach to improve this could be the numerically stable online mean and variance computation by Knuth, applied per region with a reduction.
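To illustrate the idea, here is a minimal standalone sketch (plain C++, not Caffe code) of the Knuth/Welford online update for a single channel; the 1x1x3x3 constant input mirrors the test above:

    #include <cstdio>
    #include <vector>

    // Welford/Knuth online mean and variance for one channel.
    // For a constant channel the mean stays exact and m2 stays 0,
    // so the normalized output is exactly 0.
    int main() {
      std::vector<float> channel(9, 100.0f);  // 1x1x3x3 blob, all entries 100
      float mean = 0.0f, m2 = 0.0f;
      int n = 0;
      for (float x : channel) {
        ++n;
        float delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
      }
      float variance = m2 / n;  // biased (population) variance, as BN uses
      std::printf("mean = %f, variance = %f\n", mean, variance);
      return 0;
    }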

Interestingly, chainer and mxnet do not seem to suffer from this effect. Chainer uses a reduction kernel for the mean computation, and mxnet does likewise. Probably the new cudnn versions handle this similarly.
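As a rough sketch of the reduction idea (plain C++, not the actual chainer or mxnet kernels), a pairwise sum keeps the rounding error growth at O(log n) instead of O(n) for a naive running sum:

    #include <cstdio>
    #include <vector>

    // Pairwise (tree) reduction: recursively sums halves before dividing by n.
    float pairwise_sum(const float* x, int n) {
      if (n <= 4) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i) s += x[i];
        return s;
      }
      int half = n / 2;
      return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
    }

    int main() {
      std::vector<float> channel(9, 100.0f);
      const int n = static_cast<int>(channel.size());
      float mean = pairwise_sum(channel.data(), n) / n;
      std::printf("mean = %f\n", mean);  // exactly 100 for the constant channel
      return 0;
    }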

This raises two issues:

  1. this should be improved in general, and
  2. it might lead to inconsistencies depending on whether the cudnn implementation or the caffe implementation is used.

For reference:

shelhamer added the focus label Apr 8, 2016
@borisgin

I rewrote the BN layer (see https://github.com/drnikolaev/caffe/tree/caffe-0.15 or https://github.com/borisgin/caffe/tree/caffe-0.15).
Do you still observe this instability in the new BN implementation?


wlike commented Apr 21, 2016

@borisgin the implementation at https://github.com/borisgin/caffe/tree/caffe-0.15 still has the problem described by @classner


wlike commented Apr 21, 2016

@borisgin sorry, the above comment is wrong. The instability problem does NOT exist in your implementation, but there are some tiny mistakes.


wlike commented Apr 21, 2016

@classner I wrote a unit test of BatchNormLayer following your description. According to the debug info, the computed mean is correct. The deviation happens in the second caffe_cpu_gemm following the "subtract mean" comment, and it happens only in single precision, which is weird.
[attached screenshots: unit test code and log]
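A sketch of roughly what such a test might look like, written against Caffe's existing BatchNormLayerTest fixture (src/caffe/test/test_batch_norm_layer.cpp); the test name and tolerance are assumptions, not read from the screenshots:

    // Constant 1x1x3x3 input: a constant channel should normalize to ~0.
    TYPED_TEST(BatchNormLayerTest, TestForwardConstantChannel) {
      typedef typename TypeParam::Dtype Dtype;
      Blob<Dtype> bottom(1, 1, 3, 3);
      caffe_set(bottom.count(), Dtype(100), bottom.mutable_cpu_data());
      Blob<Dtype> top;
      vector<Blob<Dtype>*> bottom_vec(1, &bottom);
      vector<Blob<Dtype>*> top_vec(1, &top);

      LayerParameter layer_param;
      BatchNormLayer<Dtype> layer(layer_param);
      layer.SetUp(bottom_vec, top_vec);
      layer.Forward(bottom_vec, top_vec);

      for (int i = 0; i < top.count(); ++i) {
        EXPECT_NEAR(Dtype(0), top.cpu_data()[i], 1e-4);  // tolerance is a guess
      }
    }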

@classner (author)

Thank you, @wlike, for looking into this. I wanted to post the code, but I was quite busy last week and could not catch up. I had set up a near-identical test case.

For me, the mean was already off. It also depends on where the normalization factor is multiplied in during the mean calculation (in the first gemv or the second). The result remains approximately the same, though.
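As a standalone illustration of that point (plain C++, not the actual Caffe gemv calls): applying the 1/N factor before accumulating versus after gives different results in single precision; the exact deviation depends on the accumulation order inside the BLAS call:

    #include <cstdio>

    int main() {
      const int n = 9;
      const float value = 100.0f;  // constant 1x1x3x3 channel

      // Variant A: scale each element by 1/n first, then accumulate.
      // 100/9 is not exactly representable in float, so the rounding
      // error is accumulated n times.
      float mean_a = 0.0f;
      for (int i = 0; i < n; ++i) mean_a += value * (1.0f / n);

      // Variant B: accumulate first, scale once at the end (exact here,
      // since 900 and 100 are exactly representable).
      float sum = 0.0f;
      for (int i = 0; i < n; ++i) sum += value;
      float mean_b = sum / n;

      std::printf("scale-then-sum: %.6f\nsum-then-scale: %.6f\n", mean_a, mean_b);
      return 0;
    }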


borisgin commented Apr 26, 2016

I pushed a new version of BN: https://github.com/borisgin/caffe/tree/caffe-0.15-bn
The new code:

  • fixes the "... if (bottom[0]->num_axes() > 1) ..." check in Reshape()
  • changes the initialization of the global mean and variance
  • adds "variance clipping"

The most significant piece is "variance clipping". I added it to improve the stability of the original BN algorithm, where a very high or very low variance can result in two types of problems:

    1. large gradients, and as a result training accuracy --> 0
    2. corruption of the global mean / variance.

"Gradient clipping" can address the first problem, but not the second. When the global mean and variance get corrupted, the training accuracy is OK, but the test accuracy is severely damaged. To prevent both issues I clip the variance both from below and from above. For example, in the code below I use the global mean and variance for clipping:
    // clip variance
    if (iter_ > BN_VARIANCE_CLIP_START) {
      // temp_C_[c] = average_var + global_var[c]
      Dtype y = caffe_cpu_asum(C, this->blobs_[3]->cpu_data());
      caffe_cpu_scale(C, Dtype(y/C), ones_C_.cpu_data(), temp_C_.mutable_cpu_data());
      caffe_cpu_axpby(C, Dtype(1.0), this->blobs_[3]->cpu_data(),
          Dtype(1.0), temp_C_.mutable_cpu_data());
      // clip from above
      caffe_cpu_eltwise_min(C, Dtype(BN_VARIANCE_CLIP_CONST), temp_C_.cpu_data(),
          Dtype(1.0), variance_.mutable_cpu_data());
      // clip from below
      caffe_cpu_eltwise_max(C, Dtype((1.)/BN_VARIANCE_CLIP_CONST),
          this->blobs_[3]->cpu_data(), Dtype(1.0), variance_.mutable_cpu_data());
    }
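For readability, here is a scalar paraphrase of what the calls above compute per channel. It assumes eltwise_min / eltwise_max clip variance_ element-wise against BN_VARIANCE_CLIP_CONST * temp_C_[c] from above and blobs_[3][c] / BN_VARIANCE_CLIP_CONST from below (inferred from the comments, not from the fork's helper definitions):

    #include <algorithm>

    // Scalar sketch of the per-channel variance clipping, under the
    // assumptions stated above. global_var corresponds to blobs_[3],
    // clip to BN_VARIANCE_CLIP_CONST.
    void clip_variance(int C, const float* global_var, float clip, float* variance) {
      float average_var = 0.0f;
      for (int c = 0; c < C; ++c) average_var += global_var[c];
      average_var /= C;
      for (int c = 0; c < C; ++c) {
        float upper = clip * (average_var + global_var[c]);  // clip from above
        float lower = global_var[c] / clip;                  // clip from below
        variance[c] = std::max(lower, std::min(variance[c], upper));
      }
    }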
