Avoid wrapping LightningModule in *DataParallel overrides when not fitting #8632

ninginthecloud · 2021-07-29T22:50:38Z

What does this PR do?

Avoid wrapping LightningModule in *DataParallel overrides when not fitting
Specifically,
- Update the configure_ddp function in DDPPlugin, DDPSpawnPlugin, DDPShardedPlugin and DDPSpawnShardedPlugin to check the state of the LightningModule and to avoid wrapping the LightningModule as *DataParallel when the state is not TrainerFn.FITTING.
- Update the validation_step function in DDPPlugin and DDPSpawnPlugin to use the LightningModule's validation_step function if self.model is not a DistributedDataParallel instance.
- Update the test_step and predict_step functions in DDPPlugin and DDPSpawnPlugin to use the LightningModule's *_step functions directly.
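
The first bullet can be illustrated with a minimal sketch. It does not reproduce the actual plugin code: `SketchPlugin`, `FakeLightningModule` and `FakeDDPWrapper` are hypothetical stand-ins, and the `TrainerFn` enum below is a local copy that only mirrors the member names used in this description, so the example runs without PyTorch installed.

```python
from enum import Enum


class TrainerFn(Enum):
    # Local stand-in for the trainer state enum; values are illustrative.
    FITTING = "fit"
    VALIDATING = "validate"
    TESTING = "test"
    PREDICTING = "predict"


class FakeLightningModule:
    """Hypothetical stand-in for a LightningModule."""


class FakeDDPWrapper:
    """Hypothetical stand-in for a *DataParallel wrapper."""

    def __init__(self, module):
        self.module = module


class SketchPlugin:
    """Hypothetical plugin used only for illustration, not the real DDPPlugin."""

    def __init__(self, model, trainer_fn):
        self.model = model
        self.trainer_fn = trainer_fn

    def configure_ddp(self):
        # Wrap only when fitting; in every other state keep the bare module.
        if self.trainer_fn is not TrainerFn.FITTING:
            return
        self.model = FakeDDPWrapper(self.model)


fit_plugin = SketchPlugin(FakeLightningModule(), TrainerFn.FITTING)
fit_plugin.configure_ddp()

test_plugin = SketchPlugin(FakeLightningModule(), TrainerFn.TESTING)
test_plugin.configure_ddp()
```

After `configure_ddp()`, `fit_plugin.model` is the wrapper while `test_plugin.model` is still the unwrapped module, which is the behavior this PR aims for.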

Fixes #6977

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

pep8speaks · 2021-07-29T22:50:42Z

Hello @ninginthecloud! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-08-12 18:20:15 UTC

codecov · 2021-07-29T22:52:40Z

Codecov Report

Merging #8632 (4bab65f) into master (938a191) will decrease coverage by 0%.
The diff coverage is 42%.

@@          Coverage Diff           @@
##           master   #8632   +/-   ##
======================================
- Coverage      89%     89%   -0%     
======================================
  Files         176     176           
  Lines       14268   14291   +23     
======================================
+ Hits        12679   12687    +8     
- Misses       1589    1604   +15

pytorch_lightning/plugins/training_type/ddp.py

pytorch_lightning/plugins/training_type/ddp_spawn.py

pytorch_lightning/plugins/training_type/sharded.py

pytorch_lightning/plugins/training_type/sharded_spawn.py

justusschock · 2021-07-30T05:54:33Z

@ananthsub @SeanNaren @ninginthecloud
A general question here:
Depending on how the processes are spawned and the weights are loaded (which could happen only on master), wouldn't it be better to still wrap for weight syncing and use the no_sync context manager instead to disable gradient syncs?

ananthsub

Note:
With this change, self.model may not be a DistributedDataParallel model. Therefore, in L400-410 of the DDP plugin, for training_step, validation_step, test_step, and predict_step, we must check isinstance(self.model, DistributedDataParallel).
If it is, we should call self.model(*args, **kwargs).
This works because self.model's forward function internally routes to the LightningModule's *_step function: https://github.com/PyTorchLightning/pytorch-lightning/blob/1f01db8b303e647b102f48c36e91ddb17784414f/pytorch_lightning/overrides/base.py#L77-L99

However, if isinstance(self.model, LightningModule), then we should simply call self.model.training_step / validation_step / test_step / predict_step directly.
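
The dispatch described in this note could look roughly like the sketch below. The classes are hypothetical stand-ins rather than the real DistributedDataParallel and LightningModule, and `run_validation_step` is an illustrative helper name, not a function from the plugin. The wrapper's `__call__` here always routes to `validation_step` to keep the example short, whereas the real override picks the step based on the trainer state.

```python
class FakeLightningModule:
    """Hypothetical stand-in for a LightningModule."""

    def validation_step(self, batch):
        return ("validation_step", batch)


class FakeDDPWrapper:
    """Hypothetical stand-in for a DistributedDataParallel-wrapped module."""

    def __init__(self, module):
        self.module = module

    def __call__(self, *args, **kwargs):
        # Simplified: the forward call routes to the wrapped module's step.
        return ("wrapped", self.module.validation_step(*args, **kwargs))


def run_validation_step(model, *args, **kwargs):
    # If the model is wrapped, go through the wrapper's forward call.
    if isinstance(model, FakeDDPWrapper):
        return model(*args, **kwargs)
    # Otherwise call the module's own step function directly.
    return model.validation_step(*args, **kwargs)


bare = FakeLightningModule()
direct_result = run_validation_step(bare, "batch-0")
wrapped_result = run_validation_step(FakeDDPWrapper(bare), "batch-0")
```

The same pattern would apply to training_step, test_step and predict_step, with the branch chosen per call so that a plugin can hold either kind of model.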

More context on this design can be found here: #4630

@awaelchli @justusschock to double check this approach

pytorch_lightning/plugins/training_type/ddp.py

ananthsub · 2021-07-31T01:32:18Z

depending on how the processes are spawned and the weights are loaded (could happen only on master)

@justusschock could you describe why the way the processes are spawned would affect this? Do you mean loading weights on rank 0 only, broadcasting the weights via DDP's synchronization, and then running evaluation?

I think the no_sync wrapper should be applied regardless for validation/test/predict, and that might resolve the uneven end of data in a simpler way. What's the clearest way to add no_sync for these? Is that handled by the plugin or by the loops? And is it already accounted for by the no_grad context manager applied at the start of those loops?

pytorch_lightning/plugins/training_type/ddp.py

CHANGELOG.md

pytorch_lightning/plugins/training_type/ddp.py

awaelchli · 2021-08-11T18:34:38Z

@ninginthecloud It's because the tests are failing and if they do, the CI job won't submit coverage. Coverage is merged from all different jobs so if say the GPU code path fails coverage for that will be missing.

If the tests pass, the codecov bot will also update its message on this PR.


ananthsub · 2021-08-25T23:21:30Z

Closing this out in favor of #9096

ananthsub reviewed Jul 29, 2021


ninginthecloud force-pushed the fix_issue6977 branch 2 times, most recently from 3caea11 to 4a15a61 Compare July 31, 2021 00:35

ananthsub reviewed Jul 31, 2021


ninginthecloud force-pushed the fix_issue6977 branch from 4a15a61 to 5f081bc Compare August 2, 2021 17:18

ananthsub reviewed Aug 2, 2021


ninginthecloud force-pushed the fix_issue6977 branch from 5f081bc to 3fc1e16 Compare August 3, 2021 07:35

ananthsub reviewed Aug 3, 2021


ninginthecloud force-pushed the fix_issue6977 branch 2 times, most recently from 6fe7e55 to 705d663 Compare August 4, 2021 20:53

ninginthecloud marked this pull request as ready for review August 4, 2021 22:27

ninginthecloud requested review from awaelchli, Borda, carmocca, justusschock, kaushikb11, SeanNaren, tchaton and williamFalcon as code owners August 4, 2021 22:27

awaelchli reviewed Aug 4, 2021


awaelchli changed the title ~~Fix iss6977: Avoid wrapping LightningModule in *DataParallel overrides when not fitting~~ Avoid wrapping LightningModule in *DataParallel overrides when not fitting Aug 4, 2021

awaelchli added the feature Is an improvement or enhancement label Aug 4, 2021

awaelchli added this to the v1.5 milestone Aug 4, 2021

awaelchli added the distributed Generic distributed-related topic label Aug 4, 2021

ananthsub reviewed Aug 4, 2021


ninginthecloud force-pushed the fix_issue6977 branch from 705d663 to 98fe2b0 Compare August 9, 2021 20:22

ninginthecloud force-pushed the fix_issue6977 branch from 3869780 to a8e1fe6 Compare August 12, 2021 18:20

ananthsub mentioned this pull request Aug 14, 2021

Support fit with DDP then test without DDP #8375

Closed

ninginthecloud and others added 19 commits August 16, 2021 13:40

fix issue 6977 by checking the lightning module status

e4ee4d4

[pre-commit.ci] auto fixes from pre-commit.com hooks

35b4b66

for more information, see https://pre-commit.ci

fix merge conflict

5477110

skip configure_ddp if it's not training stage

ef9ccc3

update model type

0eba5ca

update formatting

d96360d

update non training steps in ddp and ddp_spawn

77607d3

update *_step functions in ddp and ddp_spawn

1bf171e

remove unused comments

b123d4e

update predict_step in ddp and ddp_spawn

9de2587

update CHANGELOG and fix bugs in ddp and ddp_spawn

d7d9e10

move _wrap_optimizers() before updating _model

b5924c5

update changelog to make it simpler.

0e0b43e

add test to check module wrapper with ddp plugins

8da76be

update test by checking test and prediction stage

4d485f4

fix lint errors

2a13086

remove unused comments

fedb043

remove unnecessary comments :-)

88d1582

move tests to each individual plugin

4bab65f

ninginthecloud force-pushed the fix_issue6977 branch from a8e1fe6 to 4bab65f Compare August 16, 2021 21:43

ninginthecloud mentioned this pull request Aug 17, 2021

Deprecate prepare_data_per_node flag on Trainer and set it as a property for DataHooks #8958

Merged

12 tasks

mergify bot added the has conflicts label Aug 23, 2021

four4fish mentioned this pull request Aug 24, 2021

Avoid wrapping LightningModule in DDP plugins when not fitting #9096

Merged

12 tasks

awaelchli changed the title ~~Avoid wrapping LightningModule in *DataParallel overrides when not fitting~~ Avoid wrapping LightningModule in *DataParallel overrides when not fitting [2/2] Aug 25, 2021

awaelchli changed the title ~~Avoid wrapping LightningModule in *DataParallel overrides when not fitting [2/2]~~ Avoid wrapping LightningModule in *DataParallel overrides when not fitting Aug 25, 2021

ananthsub closed this Aug 25, 2021
