
Regression between Lightning 1.1.3 and 1.1.5 #5656

Closed
SeanNaren opened this issue Jan 26, 2021 · 11 comments
Assignees
Labels
bug Something isn't working distributed Generic distributed-related topic help wanted Open to be worked on
Milestone

Comments

@SeanNaren
Contributor

SeanNaren commented Jan 26, 2021

🐛 Bug

Posted originally by @okuchaiev:

Has anyone observed a model performance degradation when switching from 1.1.3 to 1.1.4 and 1.1.5? On the plot below you can see exactly the same model/hyperparams trained using 1.1.3 (runs named enes3) and 1.1.5 (runs named enes5). You can see that 1.1.3 outperforms 1.1.5 consistently.

[Plot: the same model/hyperparameters trained with 1.1.3 (runs named enes3) and 1.1.5 (runs named enes5); the 1.1.3 runs consistently outperform the 1.1.5 runs.]

Please reproduce using the BoringModel

Currently not reproduced.

cc @ericharper

@SeanNaren SeanNaren added bug Something isn't working help wanted Open to be worked on labels Jan 26, 2021
@SeanNaren
Contributor Author

I ran the simple image classifier with fp32/fp16 using pytorch lightning 1.1.3 vs 1.1.5 and was unable to reproduce this bug.

My hunch, given past history, is that this is similar to the GradScaler bug we had with the LightningOptimizer a while back.

There are only a handful of commits between 1.1.3 and 1.1.5 that I can see causing this issue:

Optimizer:

Logging:

I'll sync with the NVIDIA team to see if we can get a reproducible example on our end to debug per commit.

@ericharper
Contributor

We're now able to get the accuracy back by forcing find_unused_parameters to True in DDP (the default was True in 1.1.3 and before):

    def configure_ddp(self, model: LightningModule, device_ids: List[int]) -> DistributedDataParallel:
        logging.info('overriding ddp to test find_unused_parameters flag')
        # if unset, `find_unused_parameters` defaults to `False`
        model = LightningDistributedDataParallel(model, device_ids=device_ids, find_unused_parameters=True)
        return model

    def setup(self, stage):
        # Update the PTL trainer to use our configure_ddp
        self._trainer.accelerator_backend.ddp_plugin.configure_ddp = self.configure_ddp
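(Aside: this override patches the DDP plugin's configure_ddp from the model's setup hook to restore the pre-1.1.4 default; the DDPPlugin argument shown further down the thread achieves the same result without touching plugin internals.)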

@awaelchli
Contributor

Sorry about that. We changed the default, and we mentioned it in the changelog, but I did not know it had an influence on model performance.
The reason for changing it to False was to increase DDP performance and also to adopt the same default as PyTorch.
We should have warned about it more prominently in the changelog.
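For reference, a minimal sketch of what this flag controls at the raw PyTorch level (the single-process gloo group and toy linear model below are placeholders so the snippet runs standalone):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    # Single-process gloo group, only so the sketch runs standalone.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = torch.nn.Linear(10, 10)

    # PyTorch's own default is find_unused_parameters=False. Setting it to True makes
    # DDP walk the autograd graph each iteration to mark parameters that did not
    # contribute to the loss as ready for reduction, at some extra cost per step.
    ddp_model = DistributedDataParallel(model, find_unused_parameters=False)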

@SeanNaren
Contributor Author

SeanNaren commented Jan 28, 2021

I think it was mentioned in the changelog, but we definitely should've made it clearer. Thanks for the digging @ericharper!

We need some way to be able to turn this on again without having to change the code (currently we have to create a ddp plugin and pass this through). There are definitely cases where we'd need to enable this flag. Any suggestions @awaelchli?

EDIT: just had a look at the Slack channel, and the discussion is happening there; the gist is that exposing all DDP parameters on the Trainer would be quite ugly (it would mean adding all arguments across all plugins).

@Tiiiger

Tiiiger commented Jan 29, 2021

So is this fixed in 1.1.6?

@edenlightning
Contributor

@awaelchli anything left TODO here?

@edenlightning edenlightning assigned awaelchli and unassigned SeanNaren and tchaton Feb 9, 2021
@edenlightning edenlightning added the distributed Generic distributed-related topic label Feb 9, 2021
@awaelchli
Contributor

awaelchli commented Feb 10, 2021

@Tiiiger "Fixed" is maybe not the right term here. We changed the value of find_unused_parameters from True to False, as this is also the default in PyTorch. This change impacts users in different ways. In some models it leads to better performance in terms of iterations/s. Others may see an impact in accuracy, as reported by @ericharper.
It means that for your DDP model you may need to try both options. You can currently change it like so:

from pytorch_lightning.plugins import DDPPlugin
trainer = Trainer(..., plugins=[DDPPlugin(find_unused_parameters=True)])  # or False

@edenlightning as far as the regression problem goes, this seems solved. We found the reason for it.
We can further discuss either here or in a separate issue whether the access to this flag needs to be simplified.
A mention in the docs for find_unused_parameters should also be considered.
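As a concrete (hypothetical) illustration of when the choice matters: if only part of a network contributes to the loss on a given step, the idle parameters are exactly what find_unused_parameters=True exists to detect. A toy sketch (the TwoHeadModel below is made up for illustration):

    import torch
    from torch import nn
    import pytorch_lightning as pl

    class TwoHeadModel(pl.LightningModule):
        """Toy model: only one head contributes to the loss on any given step,
        so the other head's parameters are unused from DDP's point of view."""

        def __init__(self):
            super().__init__()
            self.backbone = nn.Linear(32, 16)
            self.head_a = nn.Linear(16, 4)
            self.head_b = nn.Linear(16, 4)

        def training_step(self, batch, batch_idx):
            x, y = batch
            features = self.backbone(x)
            # Alternate heads: the idle head's parameters get no gradient this step.
            head = self.head_a if batch_idx % 2 == 0 else self.head_b
            return nn.functional.cross_entropy(head(features), y)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)

With find_unused_parameters=False, DDP expects a gradient for every parameter each step and will typically raise the RuntimeError quoted further down in this thread; DDPPlugin(find_unused_parameters=True) as shown above tells it to detect the idle parameters instead.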

@SeanNaren
Contributor Author

There still seems to be the open question of why this causes an accuracy difference...

@edenlightning
Contributor

@awaelchli any idea on the accuracy difference?

What is your recommendation for exposing this flag? Is there a way we can automate it? I wouldn't recommend exposing ddp flags as trainer args...

@edenlightning edenlightning added this to the 1.2.x milestone Feb 16, 2021
@awaelchli
Contributor

awaelchli commented Feb 18, 2021

@edenlightning unfortunately no :(

What is your recommendation for exposing this flag? Is there a way we can automate it? I wouldn't recommend exposing ddp flags as trainer args...

I recommend keeping it as a plugin argument for now (as shown in my example above).

Automation: that's tricky, because the only way to know is to observe the gradient computation on the loss. This would mean we would already have to be training, but the flag needs to be set before training starts. Sometimes we see this error message when find_unused_parameters=False but it possibly should be True:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss.

However, as you can see, the error message itself mentions in case (2) that it could simply be a mistake by the user in not computing the loss correctly. To me this is a very strong hint that we cannot automate this choice.

So in conclusion, given the above observation, I recommend keeping it as is.
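To make the "we would have to be already training" point concrete, here is a hedged sketch in plain PyTorch (report_unused_parameters is a hypothetical helper) showing that unused parameters can only be discovered after a backward pass, i.e. after DDP has already been constructed with the flag fixed:

    import torch
    from torch import nn

    def report_unused_parameters(model: nn.Module, loss: torch.Tensor):
        """After loss.backward(), any parameter whose .grad is still None
        never participated in producing the loss."""
        loss.backward()
        return [name for name, p in model.named_parameters() if p.grad is None]

    model = nn.ModuleDict({"used": nn.Linear(8, 1), "idle": nn.Linear(8, 1)})
    x = torch.randn(4, 8)
    loss = model["used"](x).mean()  # the "idle" branch never enters the graph

    print(report_unused_parameters(model, loss))  # ['idle.weight', 'idle.bias']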

@edenlightning
Contributor

Closing this for now!
