Torch version affects the network's training performance #8
Comments
Hi, do you know if this is still an issue in PyTorch 1.8? Thank you!
Based on my experiment, yes, this is still an issue in PyTorch 1.8. If you can only use PyTorch 1.8 due to hardware restrictions (e.g. CUDA version requirements), you can replace all BatchNorm layers with InstanceNorm, which should avoid the problem.
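A minimal sketch of such a swap (a hypothetical helper, not part of this repo; it simply replaces every `nn.BatchNorm2d` with an affine `nn.InstanceNorm2d`):

```python
import torch.nn as nn

def replace_bn_with_in(module: nn.Module) -> None:
    """Recursively swap every BatchNorm2d for an affine InstanceNorm2d.

    Hypothetical helper for illustration; not code from this repository.
    """
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, nn.InstanceNorm2d(child.num_features, affine=True))
        else:
            replace_bn_with_in(child)
```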
Will it work with PyTorch 1.6?
Hi @EhrazImam, it looks like the answer is no. Please find the implementation of BN from 1.5.1 (which is the one I was using) here and BN from 1.6.0 here. You will see the change in the function signature I mentioned above.
Hi, thank you for bringing this issue up front.
So I am in a dilemma: use torch 1.5.1 and run out of memory, or use an A6000 with enough memory that cannot run torch 1.5.1. For your information, the new generation of GPUs like the RTX 3090, A6000, etc. will run on torch 1.10.0 with CUDA 11.2 or later (which supports sm_86). I understand that it is almost impossible to support every version of PyTorch, but how about selectively supporting at least one version compatible with the "future" generation of GPUs, such as PyTorch 1.10 with CUDA 11.2 or later? What do you think? Thanks a lot for your help in advance!
Hi @mli0603 @ynjiun, I found a way to resolve this problem. Following pytorch/pytorch#37823 (comment) and https://discuss.pytorch.org/t/performance-highly-degraded-when-eval-is-activated-in-the-test-phase/3323/66, I modified the code in `_disable_batchnorm_tracking`, setting the running mean and variance buffers of the batch norm layers to None, which resolves the problem:

```python
def _disable_batchnorm_tracking(self):
    """
    Disable BatchNorm tracking stats to reduce dependency on the dataset
    (this acts as InstanceNorm with affine parameters when batch size is 1).
    """
    for m in self.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.track_running_stats = False
            m.running_mean = None
            m.running_var = None
```
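As a quick sanity check of why this works, here is a small standalone snippet (assuming a recent PyTorch version; not part of the patch itself): once the running buffers are `None`, a `BatchNorm2d` layer normalizes with the current batch's statistics even in eval mode, so train- and eval-mode outputs on the same input should match.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(4)
bn.track_running_stats = False
bn.running_mean = None
bn.running_var = None

x = torch.randn(1, 4, 8, 8)
bn.train()
y_train = bn(x)
bn.eval()
y_eval = bn(x)

# With the running buffers removed, eval mode falls back to batch statistics,
# so both outputs should be numerically identical.
print(torch.allclose(y_train, y_eval, atol=1e-6))
```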
Oh nice! Thank you very much for this patch. Let me test it on my end too.
fix BN issue due to torch version #8
I am opening this issue because, depending on which version of PyTorch you are using, the training results will differ. Here are the 3px error evaluation curves from a minimal example of overfitting the network on a single image for 300 epochs:

The purple line is trained with PyTorch 1.7.0 and the orange line is trained with PyTorch 1.5.1. As you can see, with version 1.7.0 the error rate stays flat at 100%, while with version 1.5.1 the error rate drops. The reason is that the BatchNorm implementation changed between version 1.5.1 and version 1.7.0. In version 1.5.1, if I disable `track_running_stats` here, both evaluation and training use batch stats. However, in PyTorch 1.7.0 the layer is forced to use `running_mean` and `running_var` in evaluation mode, while batch stats are used in training. With `track_running_stats` disabled, `running_mean` is 0 and `running_var` is 1, which is clearly different from the batch stats.

Therefore, instead of trying to work against torch's implementation, I recommend using PyTorch 1.5.1 if you want to retrain from scratch. Otherwise, if you want to use another PyTorch version, you can replace all BatchNorm layers with InstanceNorm and port the learnt values from BatchNorm (i.e. weight and bias). This is a `wontfix` problem because it is quite hard to accommodate all torch versions.