BatchNorm numerical instability #3963

Open
classner opened this issue Apr 8, 2016 · 6 comments

classner commented Apr 8, 2016

I just tracked down an issue with unexpected BatchNorm behavior in the current Caffe version. The effect can be condensed to the following scenario: for an input batch with a channel of constant values, it produces results that are (depending on the magnitude of the values) quite far from zero. The expected normalization would be zeros in this channel for all values (mean normalized to zero, no variance).

To reproduce the effect, I created a test with a 1x1x3x3 input blob where all entries are 100 (it also works for smaller values). In this case, the computed mean is 100.000015 and the result of the normalization is ~ -0.00483, which is far from 0. The normalization factor amplifies the numerical deviation. This just highlights how badly conditioned the operation is numerically and how important it is to get the mean right (the variance is then computed as E((X-E(X))^2)). One approach to improve this could be the numerically stable online mean and variance computation by Knuth, applied per region with a reduction.
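To illustrate the idea, here is a minimal standalone sketch (plain C++, not Caffe code) of the Knuth/Welford online update for a single channel; the 1x1x3x3 constant input mirrors the test above:

    #include <cstdio>
    #include <vector>

    // Welford/Knuth online mean and variance for one channel.
    // For a constant channel the mean stays exact and m2 stays 0,
    // so the normalized output is exactly 0.
    int main() {
      std::vector<float> channel(9, 100.0f);  // 1x1x3x3 blob, all entries 100
      float mean = 0.0f, m2 = 0.0f;
      int n = 0;
      for (float x : channel) {
        ++n;
        float delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
      }
      float variance = m2 / n;  // biased (population) variance, as BN uses
      std::printf("mean = %f, variance = %f\n", mean, variance);
      return 0;
    }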

Interestingly, chainer and mxnet do not seem to suffer from this effect. Chainer uses a reduction kernel for the mean computation, and mxnet does likewise. Probably the new cudnn versions handle this similarly.
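As a rough sketch of the reduction idea (plain C++, not the actual chainer or mxnet kernels), a pairwise sum keeps the rounding error growth at O(log n) instead of O(n) for a naive running sum:

    #include <cstdio>
    #include <vector>

    // Pairwise (tree) reduction: recursively sums halves before dividing by n.
    float pairwise_sum(const float* x, int n) {
      if (n <= 4) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i) s += x[i];
        return s;
      }
      int half = n / 2;
      return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
    }

    int main() {
      std::vector<float> channel(9, 100.0f);
      const int n = static_cast<int>(channel.size());
      float mean = pairwise_sum(channel.data(), n) / n;
      std::printf("mean = %f\n", mean);  // exactly 100 for the constant channel
      return 0;
    }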

This raises two issues:

  1. this should be improved in general, and
  2. it might lead to inconsistencies depending on whether the cudnn implementation or the caffe implementation is used.

For reference:

shelhamer added the focus label Apr 8, 2016
@borisgin

I rewrote the BN layer (see https://github.com/drnikolaev/caffe/tree/caffe-0.15 or https://github.com/borisgin/caffe/tree/caffe-0.15).
Do you still observe this instability in the new BN implementation?


wlike commented Apr 21, 2016

@borisgin the implementation at https://github.com/borisgin/caffe/tree/caffe-0.15 still has the problem described by @classner


wlike commented Apr 21, 2016

@borisgin sorry, the above comment is wrong. The instability problem does NOT exist in your implementation, but there are some tiny mistakes.


wlike commented Apr 21, 2016

@classner I wrote a unit test of BatchNormLayer following your description. According to the debug info, the computed mean is correct. The deviation happens in the second caffe_cpu_gemm following the "subtract mean" comment, and it happens only in single precision, which is weird.
[attached screenshots: unit test code and log]
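A sketch of roughly what such a test might look like, written against Caffe's existing BatchNormLayerTest fixture (src/caffe/test/test_batch_norm_layer.cpp); the test name and tolerance are assumptions, not read from the screenshots:

    // Constant 1x1x3x3 input: a constant channel should normalize to ~0.
    TYPED_TEST(BatchNormLayerTest, TestForwardConstantChannel) {
      typedef typename TypeParam::Dtype Dtype;
      Blob<Dtype> bottom(1, 1, 3, 3);
      caffe_set(bottom.count(), Dtype(100), bottom.mutable_cpu_data());
      Blob<Dtype> top;
      vector<Blob<Dtype>*> bottom_vec(1, &bottom);
      vector<Blob<Dtype>*> top_vec(1, &top);

      LayerParameter layer_param;
      BatchNormLayer<Dtype> layer(layer_param);
      layer.SetUp(bottom_vec, top_vec);
      layer.Forward(bottom_vec, top_vec);

      for (int i = 0; i < top.count(); ++i) {
        EXPECT_NEAR(Dtype(0), top.cpu_data()[i], 1e-4);  // tolerance is a guess
      }
    }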

@classner (author)

Thank you, @wlike, for looking into this. I wanted to post the code, but I was quite busy last week and could not catch up. I had set up a near-identical test case.

For me, the mean was already off. It also depends on where the normalization factor is multiplied in during the mean calculation (in the first gemv or the second). The result remains approximately the same, though.
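As a standalone illustration of that point (plain C++, not the actual Caffe gemv calls): applying the 1/N factor before accumulating versus after gives different results in single precision; the exact deviation depends on the accumulation order inside the BLAS call:

    #include <cstdio>

    int main() {
      const int n = 9;
      const float value = 100.0f;  // constant 1x1x3x3 channel

      // Variant A: scale each element by 1/n first, then accumulate.
      // 100/9 is not exactly representable in float, so the rounding
      // error is accumulated n times.
      float mean_a = 0.0f;
      for (int i = 0; i < n; ++i) mean_a += value * (1.0f / n);

      // Variant B: accumulate first, scale once at the end (exact here,
      // since 900 and 100 are exactly representable).
      float sum = 0.0f;
      for (int i = 0; i < n; ++i) sum += value;
      float mean_b = sum / n;

      std::printf("scale-then-sum: %.6f\nsum-then-scale: %.6f\n", mean_a, mean_b);
      return 0;
    }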


borisgin commented Apr 26, 2016

I pushed a new version of BN: https://github.com/borisgin/caffe/tree/caffe-0.15-bn
The new code:

  • fixes the "... if (bottom[0]->num_axes() > 1) ..." check in Reshape()
  • changes the initialization of the global mean and variance
  • adds "variance clipping"

The most significant piece is "variance clipping". I added it to improve the stability of the original BN algorithm, where a very high or very low variance can result in two types of problems:

    1. large gradients, and as a result training accuracy --> 0
    2. corruption of the global mean / variance.

"Gradient clipping" can address the first problem, but not the second. When the global mean and variance get corrupted, the training accuracy is OK, but the test accuracy is severely damaged. To prevent both issues I clip the variance both from below and from above. For example, in the code below I use the global mean and variance for clipping:
    // clip variance
    if (iter_ > BN_VARIANCE_CLIP_START) {
      // temp_C_[c] = average_var + global_var[c]
      Dtype y = caffe_cpu_asum(C, this->blobs_[3]->cpu_data());
      caffe_cpu_scale(C, Dtype(y/C), ones_C_.cpu_data(), temp_C_.mutable_cpu_data());
      caffe_cpu_axpby(C, Dtype(1.0), this->blobs_[3]->cpu_data(),
          Dtype(1.0), temp_C_.mutable_cpu_data());
      // clip from above
      caffe_cpu_eltwise_min(C, Dtype(BN_VARIANCE_CLIP_CONST), temp_C_.cpu_data(),
          Dtype(1.0), variance_.mutable_cpu_data());
      // clip from below
      caffe_cpu_eltwise_max(C, Dtype((1.)/BN_VARIANCE_CLIP_CONST),
          this->blobs_[3]->cpu_data(), Dtype(1.0), variance_.mutable_cpu_data());
    }
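For readability, here is a scalar paraphrase of what the calls above compute per channel. It assumes eltwise_min / eltwise_max clip variance_ element-wise against BN_VARIANCE_CLIP_CONST * temp_C_[c] from above and blobs_[3][c] / BN_VARIANCE_CLIP_CONST from below (inferred from the comments, not from the fork's helper definitions):

    #include <algorithm>

    // Scalar sketch of the per-channel variance clipping, under the
    // assumptions stated above. global_var corresponds to blobs_[3],
    // clip to BN_VARIANCE_CLIP_CONST.
    void clip_variance(int C, const float* global_var, float clip, float* variance) {
      float average_var = 0.0f;
      for (int c = 0; c < C; ++c) average_var += global_var[c];
      average_var /= C;
      for (int c = 0; c < C; ++c) {
        float upper = clip * (average_var + global_var[c]);  // clip from above
        float lower = global_var[c] / clip;                  // clip from below
        variance[c] = std::max(lower, std::min(variance[c], upper));
      }
    }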
