Keeping DDP override in sync with upstream torch #4630

Closed
edenlightning opened this issue Nov 11, 2020 · 6 comments · Fixed by #5185
Labels: discussion (In a discussion stage), distributed (Generic distributed-related topic), refactor
Milestone: 1.2
edenlightning (Contributor) commented Nov 11, 2020

From @ananthsub:
How should Lightning keep its DDP override in sync with the upstream torch DistributedDataParallel? The two implementations have now diverged, and I think this leads to performance degradation with Lightning + gradient accumulation, since the require_backward_grad_sync attribute isn't checked before the backward pass.
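
For context, here is a minimal sketch (not Lightning code) of how upstream DDP expects that flag to be used during gradient accumulation; the function name and the toy loss are illustrative only:

```python
import contextlib

import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP


def train_with_accumulation(ddp_model: DDP, batches, optimizer, accumulate_grad_batches: int = 4):
    for i, (x, y) in enumerate(batches):
        is_sync_step = (i + 1) % accumulate_grad_batches == 0
        # no_sync() flips require_backward_grad_sync to False for this block,
        # so DDP.forward tells the reducer to skip the gradient all-reduce.
        # A forward override that never consults the flag all-reduces on every
        # accumulation step instead of only the last one.
        ctx = contextlib.nullcontext() if is_sync_step else ddp_model.no_sync()
        with ctx:
            loss = F.mse_loss(ddp_model(x), y)
            loss.backward()
        if is_sync_step:
            optimizer.step()
            optimizer.zero_grad()
```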

edenlightning added the distributed and discussion labels Nov 11, 2020
edenlightning (Contributor, Author) commented:

@awaelchli @tchaton

ananthsub (Contributor) commented Nov 12, 2020

@pritamdamania87 suggested this workaround: Instead of overriding DDP, we can wrap the LightningModule in another nn.Module. This wrapper module will define its forward function to call the LightningModule's *_step function, depending on the training/testing flags.

Then, when we wrap this module in DDP, we can rely on the wrapper's forward to do the right thing, and we don't have to worry about keeping the overrides in sync (see the sketch below).
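
A minimal sketch of that proposal, assuming a hypothetical wrapper class and flag names (illustrative only, not Lightning's or torch's actual API):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class _LightningModuleWrapper(nn.Module):
    """Hypothetical wrapper: routes forward() to the appropriate *_step method."""

    def __init__(self, pl_module):
        super().__init__()
        self.module = pl_module

    def forward(self, *args, **kwargs):
        # Dispatch based on the module's state; the `testing` flag is an
        # assumed attribute used here only for illustration.
        if self.module.training:
            return self.module.training_step(*args, **kwargs)
        if getattr(self.module, "testing", False):
            return self.module.test_step(*args, **kwargs)
        return self.module.validation_step(*args, **kwargs)


# Plain torch DDP then wraps the wrapper, so DDP.forward (including its
# require_backward_grad_sync handling) runs unmodified:
# ddp_model = DDP(_LightningModuleWrapper(lightning_module), device_ids=[local_rank])
```

With this arrangement the DDP instance's own forward always runs, so the reducer bookkeeping stays entirely in upstream torch.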

pritamdamania87 commented Nov 12, 2020

As @ananthsub mentioned, I'd suggest always calling DDP.forward and not relying on the internals of DDP. The current implementation in LightningDistributedDataParallel could break if we make changes to DDP and the reducer. As an example, mmcv was doing something similar and broke when we refactored some code related to DDP and the reducer: open-mmlab/mmcv#636.

awaelchli (Contributor) commented Nov 18, 2020

@ananthsub I think this could work, nice idea! And I guess we should unwrap the model when passing it to the callbacks etc., so the user never sees the wrapper (roughly as in the sketch below).
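
A minimal sketch of such an unwrapping step, building on the hypothetical wrapper class above (the helper name and checks are assumptions, not Lightning's actual API):

```python
from torch.nn.parallel import DistributedDataParallel as DDP


def unwrap_lightning_module(wrapped_model):
    # Hypothetical helper: peel off DDP and the forward-dispatch wrapper so
    # callbacks, logging, and checkpointing see the bare LightningModule.
    model = wrapped_model
    if isinstance(model, DDP):
        model = model.module      # DDP -> wrapper module
    if isinstance(model, _LightningModuleWrapper):
        model = model.module      # wrapper module -> LightningModule
    return model
```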

stale bot commented Dec 18, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label Dec 18, 2020
awaelchli removed the won't fix label Dec 18, 2020
awaelchli added this to the 1.2 milestone Dec 18, 2020
awaelchli self-assigned this Dec 18, 2020
awaelchli (Contributor) commented:

Solved by the linked PR (#5185). I will open a new issue for a similar refactor of DP.
