Keeping DDP override in sync with upstream torch #4630

Closed
edenlightning opened this issue Nov 11, 2020 · 6 comments · Fixed by #5185
Labels: discussion (In a discussion stage), distributed (Generic distributed-related topic), refactor
Milestone: 1.2
edenlightning (Contributor) commented Nov 11, 2020

From @ananthsub:
How should Lightning keep its DDP override in sync with the upstream torch DistributedDataParallel? The two implementations have now diverged, and I think this leads to performance degradation with Lightning + gradient accumulation, since the require_backward_grad_sync attribute isn't checked before the backward pass.
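
For context, here is a minimal sketch (not Lightning code) of how upstream DDP expects that flag to be used during gradient accumulation; the function name and the toy loss are illustrative only:

```python
import contextlib

import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP


def train_with_accumulation(ddp_model: DDP, batches, optimizer, accumulate_grad_batches: int = 4):
    for i, (x, y) in enumerate(batches):
        is_sync_step = (i + 1) % accumulate_grad_batches == 0
        # no_sync() flips require_backward_grad_sync to False for this block,
        # so DDP.forward tells the reducer to skip the gradient all-reduce.
        # A forward override that never consults the flag all-reduces on every
        # accumulation step instead of only the last one.
        ctx = contextlib.nullcontext() if is_sync_step else ddp_model.no_sync()
        with ctx:
            loss = F.mse_loss(ddp_model(x), y)
            loss.backward()
        if is_sync_step:
            optimizer.step()
            optimizer.zero_grad()
```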

edenlightning added the distributed and discussion labels Nov 11, 2020
edenlightning (Contributor, Author) commented:

@awaelchli @tchaton

ananthsub (Contributor) commented Nov 12, 2020

@pritamdamania87 suggested this workaround: Instead of overriding DDP, we can wrap the LightningModule in another nn.Module. This wrapper module will define its forward function to call the LightningModule's *_step function, depending on the training/testing flags.

Then, when we wrap this module in DDP, we can rely on the wrapper's forward to do the right thing, and we don't have to worry about keeping the overrides in sync (see the sketch below).
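
A minimal sketch of that proposal, assuming a hypothetical wrapper class and flag names (illustrative only, not Lightning's or torch's actual API):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class _LightningModuleWrapper(nn.Module):
    """Hypothetical wrapper: routes forward() to the appropriate *_step method."""

    def __init__(self, pl_module):
        super().__init__()
        self.module = pl_module

    def forward(self, *args, **kwargs):
        # Dispatch based on the module's state; the `testing` flag is an
        # assumed attribute used here only for illustration.
        if self.module.training:
            return self.module.training_step(*args, **kwargs)
        if getattr(self.module, "testing", False):
            return self.module.test_step(*args, **kwargs)
        return self.module.validation_step(*args, **kwargs)


# Plain torch DDP then wraps the wrapper, so DDP.forward (including its
# require_backward_grad_sync handling) runs unmodified:
# ddp_model = DDP(_LightningModuleWrapper(lightning_module), device_ids=[local_rank])
```

With this arrangement the DDP instance's own forward always runs, so the reducer bookkeeping stays entirely in upstream torch.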

pritamdamania87 commented Nov 12, 2020

As @ananthsub mentioned, I'd suggest always calling DDP.forward and not relying on the internals of DDP. The current implementation in LightningDistributedDataParallel could break if we make changes to DDP and the reducer. As an example, mmcv was doing something similar and broke when we refactored some code related to DDP and the reducer: open-mmlab/mmcv#636.

awaelchli (Contributor) commented Nov 18, 2020

@ananthsub I think this could work, nice idea! And I guess we should unwrap the model when passing it to the callbacks etc., so the user never sees the wrapper (roughly as in the sketch below).
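
A minimal sketch of such an unwrapping step, building on the hypothetical wrapper class above (the helper name and checks are assumptions, not Lightning's actual API):

```python
from torch.nn.parallel import DistributedDataParallel as DDP


def unwrap_lightning_module(wrapped_model):
    # Hypothetical helper: peel off DDP and the forward-dispatch wrapper so
    # callbacks, logging, and checkpointing see the bare LightningModule.
    model = wrapped_model
    if isinstance(model, DDP):
        model = model.module      # DDP -> wrapper module
    if isinstance(model, _LightningModuleWrapper):
        model = model.module      # wrapper module -> LightningModule
    return model
```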

stale bot commented Dec 18, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label Dec 18, 2020
awaelchli removed the won't fix label Dec 18, 2020
awaelchli added this to the 1.2 milestone Dec 18, 2020
awaelchli self-assigned this Dec 18, 2020
awaelchli (Contributor) commented:

Solved by the linked PR (#5185). I will open a new issue for a similar refactor of DP.
