BatchNorm does not converge in DIGITS #629
Comments
Hello, do you have more details to share about this?
Hi, most of the details were logged in the Caffe Google Groups forum; the first page is copied below. My model, based on 3 convolutions, has no problems converging when using dropout and LRN. After I substituted both layers out for a batch norm layer, it no longer converges. I believe I'm using it correctly, but it doesn't converge no matter how small a learning rate I set. Can anyone here shed some light on why? Original Model:
.... repeat x3
New Model:
.... repeat x3
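The original prototxt blocks are not reproduced above, but the change being described is roughly the following. This is a hypothetical sketch of one conv block before and after; layer names and parameters are illustrative, not the poster's actual model:

```
# Before (converges): conv block followed by LRN, plus Dropout later in the net
layer { name: "conv1" type: "Convolution" bottom: "data" top: "conv1"
        convolution_param { num_output: 64 kernel_size: 3 weight_filler { type: "xavier" } } }
layer { name: "relu1" type: "ReLU" bottom: "conv1" top: "conv1" }
layer { name: "norm1" type: "LRN" bottom: "conv1" top: "norm1"
        lrn_param { local_size: 5 alpha: 0.0001 beta: 0.75 } }

# After (does not converge): LRN/Dropout replaced by a BatchNorm layer
layer { name: "bn1" type: "BatchNorm" bottom: "conv1" top: "bn1" }
layer { name: "relu1" type: "ReLU" bottom: "bn1" top: "bn1" }
```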
Try adding the following to the batch_norm_param fields:
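The exact snippet is not shown above; a sketch of the commonly suggested batch_norm_param additions in nv-caffe 0.14, with illustrative values:

```
batch_norm_param {
  scale_filler { type: "constant" value: 1 }
  bias_filler { type: "constant" value: 0 }
}
```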
Are you using the latest version of CuDNN (v4.0.5) and nv-caffe 0.14? In particular, there were changes in that CuDNN release which fixed some issues around batch normalization that were present in the release candidates (v4.0.4 and earlier). From the release notes:
Yes, it was nv-caffe 0.14, retrieved around Feb 2016. I only installed the latest caffe after caffe-nv was failing, and I did not upgrade CuDNN, so I am not sure which version of CuDNN I currently have (will report later). This leads me to believe that the caffe-nv version was the cause. I'll try slayton's advice and report back.
Actually it looks like we're at 4.0.7 now, sorry:
I've just come across a slightly different batch normalization setup that is working for me in DIGITS. There is a prototxt file posted in the DIGITS user group: Digits3 GoogLeNet Batch Normalization? After changing some old names ("BN" to "BatchNorm" and "shift_filler" to "bias_filler"), this is what the DIGITS batch normalization layer followed by its activation looks like:
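A sketch of that layer pair, reconstructed from the description in this thread (names and filler values are illustrative): a BatchNorm layer with two learnable parameters, followed by its ReLU:

```
layer {
  name: "conv1/bn"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1/bn"
  # two learnable parameters: the scale and shift (gamma and beta)
  param { lr_mult: 1 decay_mult: 0 }
  param { lr_mult: 1 decay_mult: 0 }
  batch_norm_param {
    scale_filler { type: "constant" value: 1 }
    bias_filler { type: "constant" value: 0 }
  }
}
layer {
  name: "conv1/relu"
  type: "ReLU"
  bottom: "conv1/bn"
  top: "conv1/bn"
}
```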
Now, a few observations (hopefully to get my head around it too). First, there is no mention of use_global_stats and no separate TRAIN and TEST definitions. Both train_val and deploy proto have the same layer definition. If I got it right, this is because Caffe automatically infers the correct state for the use_global_stats variable, as mentioned in BVLC/caffe#3347
Second, this approach is different. There are only two lr_mult parameters here, and they are not all set to zero as suggested elsewhere. I'm guessing that these two correspond to the gamma and beta parameters mentioned in the paper (Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift). I'm not sure what exactly the three lr_mult entries in the other approach mean, except that they relate to some global parameters which the solver isn't supposed to update. I got that from here: is Batch Normalization supported by Caffe?
Finally, there is a difference between the current BVLC and NVIDIA proto. BVLC's doesn't have scale_filler and bias_filler in message BatchNormParameter, while NVIDIA's does. https://github.com/NVIDIA/caffe/blob/caffe-0.14/src/caffe/proto/caffe.proto#L465-L483
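For comparison, the BVLC-style usage pairs a BatchNorm layer, whose three stored blobs (running mean, running variance, and the moving-average counter) are not updated by the solver, hence the three lr_mult: 0 entries, with a separate Scale layer that learns gamma and beta. A sketch with illustrative layer names:

```
layer {
  name: "conv1/bn"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1/bn"
  # the three stored blobs (mean, variance, counter) are updated by the layer
  # itself, not by the solver
  param { lr_mult: 0 }
  param { lr_mult: 0 }
  param { lr_mult: 0 }
}
layer {
  name: "conv1/scale"
  type: "Scale"
  bottom: "conv1/bn"
  top: "conv1/bn"
  scale_param { bias_term: true }  # learns gamma (scale) and beta (bias)
}
```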
@mfernezir you may be interested in BVLC/caffe#3919. It seems like you've got a good grasp of the situation, so please comment on that PR if you have any suggestions!
@lukeyeager I've added a comment about batch normalization layer usage. I'm not sure if there are some remaining issues there; hope this helps!
@mfernezir I'm so confused by a lot of batch norm implementations!
With this setup the train loss goes down while the accuracy stays at zero. If I use Classify One to see the batch norm effect, I get wrong (very high) mean and variance on the data and weights.
I get an accuracy that keeps moving up and down, somewhat converging but not in a uniform way.
@engharat I've had some (limited) success using the following notation:
@dgschwend I'll try your implementation in the next few minutes! Meanwhile, could you please confirm that you get mean 0 / var 1.0 out of the batchnorm layers when visualizing weights with DIGITS Classify One?
@engharat
and
in the two BN layers, using three arbitrary test images. I'd say that's not too far from the ideal mean 0 / var 1. However, the need for lowering the learning rate is very suspicious.
@engharat Note that in the setup I found in the DIGITS user group, BN comes after each convolutional layer and just before the activations (ReLU). This may or may not have some importance regarding your issues. I haven't tried using BN directly after Data layers like you are doing here. Also, there might be some differences between your Caffe fork and the current NVIDIA one; I'm using 0.14.2. Here are my GoogleNet v3 training and deploy files generated with DIGITS. Some notes: this is modified for a 151-class problem, xavier initialization is changed to msra, and most importantly, this is an older format generated with DIGITS 3.2. There is a change in DIGITS 3.3 which requires slightly different syntax to determine test and train layers (I haven't moved to that yet). This network had severe overfitting issues for the problem at hand, but no convergence issues. In the end I used different setups and smaller images, including one net with BN layers, with good results (81% accuracy for my problem). However, I can't actually attribute the effect to the BN layers, since another version without them ended up practically the same (with differences in training procedure and intermediate results). I didn't specify the engine parameter manually like @dgschwend did, but hopefully Caffe inferred it correctly with the engine parameter set to "DEFAULT". I'm using CUDNN4.
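The DIGITS 3.2 vs. 3.3 syntax difference mentioned above concerns how layers are assigned to the train/val/deploy variants in a single prototxt. A rough sketch; the stage names are an assumption based on DIGITS' all-in-one network convention, so check them against your DIGITS version:

```
# DIGITS 3.2 era: variant selected by phase
layer {
  name: "loss" type: "SoftmaxWithLoss"
  bottom: "score" bottom: "label" top: "loss"
  include { phase: TRAIN }
}

# DIGITS 3.3 all-in-one style: variant selected by named stages
layer {
  name: "loss" type: "SoftmaxWithLoss"
  bottom: "score" bottom: "label" top: "loss"
  exclude { stage: "deploy" }  # keep the loss out of the deploy network
}
```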
@sodeypunk you can put BatchNorm before ReLU and give that a try.
@mfernezir Thanks for the network you provided! It doesn't seem to be a v3, though.
Ah yes, it's just v1 with LRN replaced by BN and with BN added after the convolutional layers. I haven't really used that net besides one quick run; I just wanted to try BN layers and then I used them on another net. In any case, I hope you've got your network running.
I've also been meaning to comment again since I've tried a few more batch normalization networks in the last couple of days. I have also observed similar strange effects: low training loss and low validation accuracy at high learning rates, and then sudden validation accuracy jumps at low learning rates. There is an interesting comment and code change in the BVLC version which should solve problems with validation accuracy calculations:
You could try using another method to compute accuracy: in DIGITS 3.3, on the model page after training is finished, select the model corresponding to epoch 2. Then in "Test a list of images", upload the list of validation images.
The issue is that batch normalization networks have to calculate global mean and variance statistics from batch statistics and then store these values inside the actual model, to be used for inference. If these calculations are wrong in some extreme cases like the one mentioned in the above BVLC comment, then no matter how we try to calculate validation or test accuracy, the results will be wrong. Still, I've just run some inference tests with DIGITS 3.3.0. I have a couple of different nets trained on the same dataset, some with BN layers and some without. The problem is that I've run into another bug. All of my BN networks use a scaling parameter in the Data layer, for example:
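A sketch of that kind of Data layer, with a hypothetical 1/256 scale value for illustration:

```
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  transform_param {
    scale: 0.00390625  # multiply pixel values by 1/256
  }
  data_param {
    source: "train_db"
    batch_size: 64
    backend: LMDB
  }
}
```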
However, this parameter is not written to the deploy file. This causes much lower accuracy when Classify Many is used, compared to the validation accuracy reported by the graphs. This is also the case for another network that doesn't have BN layers but does have scaling: for that network, on 50 random validation images, the accuracy reported by Classify Many is zero percent while it should be around 25%. Models without scaling parameters (and without batch normalization) have similar Classify Many and graph validation accuracies, but even this can't be said decisively since my validation set is around 100k images. Maybe @engharat could confirm whether graph accuracies and Classify Many accuracies coincide, but this is likely not relevant for the underlying Caffe issue.
Indeed, the scale parameter is not carried over into the deploy file. Thanks a lot for your detailed analysis of the BN issue. cc @drnikolaev - can you suggest anything to make progress on this?
Yep, I could use a Power layer instead, and I'll do that for convenience in future networks.
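For reference, a minimal sketch of doing the scaling inside the network with a Power layer (which computes (shift + scale * x) ^ power), so that it survives into the deploy prototxt; the value is illustrative:

```
layer {
  name: "data_scaled"
  type: "Power"
  bottom: "data"
  top: "data_scaled"
  power_param {
    power: 1
    scale: 0.00390625  # same 1/256 scaling as the transform_param above
    shift: 0
  }
}
```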
It seems the problem is that cudnnBatchNormalizationForwardTraining produces a wrong running mean or variance. I've tried replacing cudnnBatchNormalizationForwardInference with standard BN code for blobs 3 and 4; the result was exactly the same. Another small problem is that epsilon_ is not initialized, but fixing that doesn't help.
I was struggling with the convergence issue, but finally the following worked for me. Specifying the engine as CAFFE is important. CUDNN BatchNorm doesn't converge for me.
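The working snippet is not reproduced above; a sketch of a BatchNorm layer pinned to the CAFFE engine, assuming nv-caffe 0.14's batch_norm_param fields (names and values illustrative):

```
layer {
  name: "conv1/bn"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1/bn"
  param { lr_mult: 1 decay_mult: 0 }
  param { lr_mult: 1 decay_mult: 0 }
  batch_norm_param {
    engine: CAFFE   # force the CAFFE implementation instead of CUDNN
    scale_filler { type: "constant" value: 1 }
    bias_filler { type: "constant" value: 0 }
  }
}
```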
@mathmanu I just verified that "engine: CAFFE" is what makes it train vs not train. @lukeyeager - any idea why engine: CUDNN does not converge? Looks like there's almost a 10x slowdown when using engine: CAFFE (but at least it's working) |
Top and bottom blobs need to be different for engine: CUDNN BatchNorm. This constraint is not there for engine: CAFFE BatchNorm. This is the reason for the non-convergence. See the following thread for a working prototxt with CUDNN BatchNorm: BVLC/caffe#3919 @borisgin @lukeyeager - since others are also facing the same issue and wondering about the convergence problem, why not put a check/exit in the CUDNN BatchNorm Reshape function if the top and bottom blobs are the same? This would save a lot of headache.
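In other words, the in-place form that works with engine: CAFFE is what silently breaks the CUDNN path; giving the layer a distinct top avoids it. A sketch:

```
# Breaks with engine: CUDNN (in-place, top == bottom):
layer {
  name: "bn1" type: "BatchNorm"
  bottom: "conv1" top: "conv1"
  batch_norm_param { engine: CUDNN }
}

# Works: give the CUDNN BatchNorm a distinct top blob
layer {
  name: "bn1" type: "BatchNorm"
  bottom: "conv1" top: "bn1"
  batch_norm_param { engine: CUDNN }
}
```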
Thanks, that cleared things up a lot. Sad to see such divergence between BVLC/Caffe and NVIDIA/Caffe. I can't keep all the differences straight. Hope there will be more similarity in the future so that networks can be ported from one to the other without much effort. |
And the saddest thing is that you cannot test/use any BVLC pretrained batchnorm networks on nv-caffe, because the two batchnorm formats are not compatible. When working on a prototxt you can still translate between the two formats by hand, but in order to use pretrained Inception/residual networks you have to install BVLC caffe.
After reading the link posted by mathmanu, I see that BVLC caffe is converging toward the CuDNN batchnorm, with all the benefits that brings. Still, I don't fully understand whether, with that CuDNN implementation, we will be able to seamlessly switch caffemodel networks between BVLC caffe and nv-caffe, or whether there would still be some implementation difference that would cause a mismatch.
Forget about compatibility with BVLC/caffe; there is no compatibility between engine: CAFFE and engine: CUDNN BatchNorm in NVIDIA/caffe itself. I hope someone from NVIDIA will clarify what the plan is to fix these inconsistencies.
My suggestions to fix these issues are the following:
(4a). In BatchNormScale, if you change the order of the blobs to global_mean, global_variance, scale, bias, global_counter, then I don't have to specify 4 param fields for lr_mult and decay_mult, but only 2. (4b). If the definition of the scale and bias fields in BatchNormParameter is changed to:
I was having problems with the BatchNorm layer not converging when using the NVIDIA version of caffe. This problem did not occur when using the BVLC caffe branch.