
Regression between Lightning 1.1.3 and 1.1.5 #5656

Closed
SeanNaren opened this issue Jan 26, 2021 · 11 comments
Assignees
Labels
bug Something isn't working distributed Generic distributed-related topic help wanted Open to be worked on
Milestone

Comments

@SeanNaren
Contributor

SeanNaren commented Jan 26, 2021

🐛 Bug

Posted originally by @okuchaiev:

Has anyone observed a model performance degradation when switching from 1.1.3 to 1.1.4 and 1.1.5? On the plot below you can see exactly the same model/hyperparams trained using 1.1.3 (runs named enes3) and 1.1.5 (runs named enes5). You can see that 1.1.3 outperforms 1.1.5 consistently.

[Plot: the same model/hyperparameters trained with 1.1.3 (runs named enes3) and 1.1.5 (runs named enes5); the 1.1.3 runs consistently outperform the 1.1.5 runs.]

Please reproduce using the BoringModel

Currently not reproduced.

cc @ericharper

@SeanNaren SeanNaren added bug Something isn't working help wanted Open to be worked on labels Jan 26, 2021
@SeanNaren
Contributor Author

I ran the simple image classifier with fp32/fp16 using pytorch lightning 1.1.3 vs 1.1.5 and was unable to reproduce this bug.

My hunch, given past history, is that this is similar to the GradScaler bug we had with the LightningOptimizer a while back.

There are only a handful of commits between 1.1.3 and 1.1.5 that I can see causing this issue:

Optimizer:

Logging:

I'll sync with the NVIDIA team to see if we can get a reproducible example on our end to debug per commit.

@ericharper
Contributor

We're now able to get the accuracy back by forcing find_unused_parameters to True in DDP (the default was True in 1.1.3 and before):

    def configure_ddp(self, model: LightningModule, device_ids: List[int]) -> DistributedDataParallel:
        logging.info('overriding ddp to test find_unused_parameters flag')
        # if unset, `find_unused_parameters` defaults to `False`
        model = LightningDistributedDataParallel(model, device_ids=device_ids, find_unused_parameters=True)
        return model

    def setup(self, stage):
        # Update the PTL trainer to use our configure_ddp
        self._trainer.accelerator_backend.ddp_plugin.configure_ddp = self.configure_ddp
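(Aside: this override patches the DDP plugin's configure_ddp from the model's setup hook to restore the pre-1.1.4 default; the DDPPlugin argument shown further down the thread achieves the same result without touching plugin internals.)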

@awaelchli
Contributor

Sorry about that. We changed the default, and we mentioned it in the changelog, but I did not know it had an influence on model performance.
The reason for changing it to False was to increase DDP performance and also to adopt the same default as PyTorch.
We should have warned about it more prominently in the changelog.
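For reference, a minimal sketch of what this flag controls at the raw PyTorch level (the single-process gloo group and toy linear model below are placeholders so the snippet runs standalone):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    # Single-process gloo group, only so the sketch runs standalone.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = torch.nn.Linear(10, 10)

    # PyTorch's own default is find_unused_parameters=False. Setting it to True makes
    # DDP walk the autograd graph each iteration to mark parameters that did not
    # contribute to the loss as ready for reduction, at some extra cost per step.
    ddp_model = DistributedDataParallel(model, find_unused_parameters=False)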

@SeanNaren
Contributor Author

SeanNaren commented Jan 28, 2021

I think it was mentioned in the changelog, but we definitely should've made it clearer. Thanks for the digging @ericharper!

We need some way to be able to turn this on again without having to change the code (currently we have to create a ddp plugin and pass this through). There are definitely cases where we'd need to enable this flag. Any suggestions @awaelchli?

EDIT: just had a look at the Slack channel, and the discussion is happening there; the gist is that exposing all DDP parameters on the Trainer would be quite ugly (it would mean adding all arguments across all plugins).

@Tiiiger

Tiiiger commented Jan 29, 2021

So is this fixed in 1.1.6?

@edenlightning
Contributor

@awaelchli anything left TODO here?

@edenlightning edenlightning assigned awaelchli and unassigned SeanNaren and tchaton Feb 9, 2021
@edenlightning edenlightning added the distributed Generic distributed-related topic label Feb 9, 2021
@awaelchli
Contributor

awaelchli commented Feb 10, 2021

@Tiiiger "Fixed" is maybe not the right term here. We changed the value of find_unused_parameters from True to False, as this is also the default in PyTorch. This change impacts users in different ways. In some models it leads to better performance in terms of iterations/s. Others may see an impact in accuracy, as reported by @ericharper.
It means that for your DDP model you may need to try both options. You can currently change it like so:

from pytorch_lightning.plugins import DDPPlugin
trainer = Trainer(..., plugins=[DDPPlugin(find_unused_parameters=True)])  # or False

@edenlightning as far as the regression problem goes, this seems solved. We found the reason for it.
We can further discuss either here or in a separate issue whether the access to this flag needs to be simplified.
A mention in the docs for find_unused_parameters should also be considered.
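As a concrete (hypothetical) illustration of when the choice matters: if only part of a network contributes to the loss on a given step, the idle parameters are exactly what find_unused_parameters=True exists to detect. A toy sketch (the TwoHeadModel below is made up for illustration):

    import torch
    from torch import nn
    import pytorch_lightning as pl

    class TwoHeadModel(pl.LightningModule):
        """Toy model: only one head contributes to the loss on any given step,
        so the other head's parameters are unused from DDP's point of view."""

        def __init__(self):
            super().__init__()
            self.backbone = nn.Linear(32, 16)
            self.head_a = nn.Linear(16, 4)
            self.head_b = nn.Linear(16, 4)

        def training_step(self, batch, batch_idx):
            x, y = batch
            features = self.backbone(x)
            # Alternate heads: the idle head's parameters get no gradient this step.
            head = self.head_a if batch_idx % 2 == 0 else self.head_b
            return nn.functional.cross_entropy(head(features), y)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)

With find_unused_parameters=False, DDP expects a gradient for every parameter each step and will typically raise the RuntimeError quoted further down in this thread; DDPPlugin(find_unused_parameters=True) as shown above tells it to detect the idle parameters instead.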

@SeanNaren
Contributor Author

There still seems to be the open question of why this causes an accuracy difference...

@edenlightning
Contributor

@awaelchli any idea on the accuracy difference?

What is your recommendation for exposing this flag? Is there a way we can automate it? I wouldn't recommend exposing ddp flags as trainer args...

@edenlightning edenlightning added this to the 1.2.x milestone Feb 16, 2021
@awaelchli
Contributor

awaelchli commented Feb 18, 2021

@edenlightning unfortunately no :(

What is your recommendation for exposing this flag? Is there a way we can automate it? I wouldn't recommend exposing ddp flags as trainer args...

I recommend keeping it as a plugin argument for now (as shown in my example above).

Automation: that's tricky, because the only way to know is to observe the gradient computation on the loss. This would mean we would already have to be training, but the flag needs to be set before training starts. Sometimes we see this error message when find_unused_parameters=False but it possibly should be True:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss.

However, as you can see, the error message itself mentions in case (2) that it could simply be a mistake by the user in not computing the loss correctly. To me this is a very strong hint that we cannot automate this choice.

So in conclusion, given the above observation, I recommend keeping it as is.
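To make the "we would have to be already training" point concrete, here is a hedged sketch in plain PyTorch (report_unused_parameters is a hypothetical helper) showing that unused parameters can only be discovered after a backward pass, i.e. after DDP has already been constructed with the flag fixed:

    import torch
    from torch import nn

    def report_unused_parameters(model: nn.Module, loss: torch.Tensor):
        """After loss.backward(), any parameter whose .grad is still None
        never participated in producing the loss."""
        loss.backward()
        return [name for name, p in model.named_parameters() if p.grad is None]

    model = nn.ModuleDict({"used": nn.Linear(8, 1), "idle": nn.Linear(8, 1)})
    x = torch.randn(4, 8)
    loss = model["used"](x).mean()  # the "idle" branch never enters the graph

    print(report_unused_parameters(model, loss))  # ['idle.weight', 'idle.bias']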

@edenlightning
Contributor

Closing this for now!
