Connect the model to the training type plugin at the start of run #8536

carmocca · 2021-07-23T16:55:36Z

What does this PR do?

With this PR, the trainer.lightning_module reference will be set at the very beginning of _run. Concretely, in the self.accelerator.connect(model) call.

We want this because we need it available during all hooks, even the earliest ones like setup.

Also, if the trainer was reused with a different model, the trainer.lightning_module reference would be stale.
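
A minimal sketch of the idea, not the actual Lightning code: ToyPlugin, ToyAccelerator, ToyTrainer and hook_log are hypothetical stand-ins. It only illustrates that connecting the model as the first step of _run makes the reference visible to the earliest hook, and refreshes it when the trainer is reused with another model.

```python
class ToyPlugin:
    """Hypothetical stand-in for a training type plugin."""

    def __init__(self):
        self.model = None

    def connect(self, model):
        self.model = model


class ToyAccelerator:
    """Hypothetical stand-in that forwards connect() to its plugin."""

    def __init__(self, plugin):
        self.training_type_plugin = plugin

    def connect(self, model):
        self.training_type_plugin.connect(model)


class ToyTrainer:
    """Hypothetical trainer; only the ordering inside _run matters here."""

    def __init__(self):
        self.accelerator = ToyAccelerator(ToyPlugin())
        self.hook_log = []

    @property
    def lightning_module(self):
        # The reference is resolved through the plugin, so it is only
        # as fresh as the last connect() call.
        return self.accelerator.training_type_plugin.model

    def _call_hook(self, name):
        # Record which model each hook would see.
        self.hook_log.append((name, self.lightning_module))

    def _run(self, model):
        # Connect first, so that even the earliest hook sees the model.
        self.accelerator.connect(model)
        self._call_hook("setup")


trainer = ToyTrainer()
first, second = object(), object()
trainer._run(first)
trainer._run(second)  # reusing the trainer with a different model
```

With this ordering, the second run's setup entry in hook_log refers to the second model rather than the one left over from the first run.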

Part of #8498

Does your PR introduce any breaking changes? If yes, please list them.

(BETA TrainingTypePlugin API):
The setup hook no longer takes a model argument.
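
For a custom plugin, the migration might look like the following sketch. ToyTrainingTypePlugin and CountingPlugin are hypothetical classes written for illustration; they only mimic the shape of the change (setup reads the model connected earlier instead of receiving it as an argument) and are not the library's real base class.

```python
class ToyTrainingTypePlugin:
    """Hypothetical base class mimicking the post-PR hook signatures."""

    def __init__(self):
        self.model = None

    def connect(self, model):
        # The model is attached here, before any setup hook runs.
        self.model = model

    def setup(self):
        pass


class CountingPlugin(ToyTrainingTypePlugin):
    # Pre-PR style, shown for comparison only:
    #     def setup(self, model):
    #         self.num_layers = len(model)
    def setup(self):
        # Post-PR style: rely on the model connected earlier.
        self.num_layers = len(self.model)


plugin = CountingPlugin()
plugin.connect(["linear", "relu", "linear"])  # a list stands in for a model
plugin.setup()
```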

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
[n/a] Did you make sure to update the documentation with your changes? (if necessary)
[n/a] Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)
Did you list all the breaking changes introduced by this pull request?

PR review

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

codecov · 2021-07-23T16:57:48Z

Codecov Report

Merging #8536 (65ede1b) into master (a1be621) will increase coverage by 0%.
The diff coverage is 98%.

@@          Coverage Diff           @@
##           master   #8536   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         167     169    +2     
  Lines       14037   14072   +35     
======================================
+ Hits        13008   13043   +35     
  Misses       1029    1029

tchaton

LGTM !

tchaton · 2021-07-26T07:26:47Z

pytorch_lightning/accelerators/accelerator.py

        """
-        self.setup_training_type_plugin(model)
+        self.setup_training_type_plugin()


small note: This could be an issue if we ever decide to expose the Accelerator API.

Why an issue? There's the connect hook to connect the model already.

self.model should be available for the plugin when setup is called

@carmocca is this an invariant? Should the accelerator assert that the model is available before calling setup_training_type_plugin?

@ananthsub We could
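
One way such a check could look, as a sketch only: GuardedPlugin and GuardedAccelerator are hypothetical names, and raising RuntimeError instead of using a bare assert is a choice made here, not something taken from the PR.

```python
class GuardedPlugin:
    """Hypothetical plugin that records whether setup() has run."""

    def __init__(self):
        self.model = None
        self.is_set_up = False

    def connect(self, model):
        self.model = model

    def setup(self):
        self.is_set_up = True


class GuardedAccelerator:
    """Hypothetical accelerator enforcing connect-before-setup."""

    def __init__(self, plugin):
        self.training_type_plugin = plugin

    def connect(self, model):
        self.training_type_plugin.connect(model)

    def setup_training_type_plugin(self):
        # Enforce the invariant discussed above: a model must be connected.
        if self.training_type_plugin.model is None:
            raise RuntimeError("connect(model) must be called before setup")
        self.training_type_plugin.setup()


accelerator = GuardedAccelerator(GuardedPlugin())
try:
    accelerator.setup_training_type_plugin()
    failed_early = False
except RuntimeError:
    failed_early = True

accelerator.connect(object())
accelerator.setup_training_type_plugin()
```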

carmocca · 2021-08-04T14:30:37Z

pytorch_lightning/plugins/training_type/fully_sharded.py

+    def setup_environment(self) -> None:
+        super().setup_environment()
+        model_call_configure_sharded_model_hook = getattr(
+            self.lightning_module, "call_configure_sharded_model_hook", False
+        )


Documenting the explanation for this change:

Without this, the test tests/plugins/test_ddp_fully_sharded_with_full_state_dict.py::test_fully_sharded_plugin_checkpoint fails.

That test calls model.setup() inside the on_load_checkpoint hook. Its setup implementation manually modifies the value of self.call_configure_sharded_model_hook:

    def setup(self, stage: str) -> None:
        self.call_configure_sharded_model_hook = False

    ...

    def on_load_checkpoint(self, checkpoint: Dict[str, Any]) -> None:
        self.setup("fit")

This means that, without this change, the call_configure_sharded_model_hook value would diverge between the model and the training type plugin, because the plugin checked it in the connect hook (see the diff).

Some pseudocode to illustrate:

Order before this PR

    load_checkpoint_weights
        model.on_load_checkpoint
            (in the test) model.setup -> this sets model.call_configure_sharded
    acc.connect -> this checks model.call_configure_sharded
    acc.setup_env
    model.setup
    configure_sharded_model -> this sets model.call_configure_sharded
    acc.setup

Order after this PR without this change:

    acc.connect -> this checks model.call_configure_sharded
    load_checkpoint_weights
        model.on_load_checkpoint
            (in the test) model.setup -> this sets model.call_configure_sharded **BUG** as we already checked it
    acc.setup_env
    model.setup
    configure_sharded_model -> this sets model.call_configure_sharded
    acc.setup

Order after this PR with this change:

    acc.connect
    load_checkpoint_weights
        model.on_load_checkpoint
            (in the test) model.setup -> this sets model.call_configure_sharded
    acc.setup_env -> this checks model.call_configure_sharded
    model.setup
    configure_sharded_model -> this sets model.call_configure_sharded
    acc.setup
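
The difference between the last two orderings can be reproduced with a small simulation. Everything here is a hypothetical stand-in (ToyModel, EagerPlugin, LazyPlugin, run_after_pr), and the initial flag value of True is an arbitrary assumption chosen to make the divergence visible. The sketch only shows why reading the flag in connect leaves the plugin with a stale value, while reading it in setup_environment does not.

```python
class ToyModel:
    """Hypothetical model mirroring the test's setup/on_load_checkpoint."""

    def __init__(self):
        # Assumed starting value, chosen so that a change is observable.
        self.call_configure_sharded_model_hook = True

    def setup(self, stage):
        self.call_configure_sharded_model_hook = False

    def on_load_checkpoint(self, checkpoint):
        self.setup("fit")


class EagerPlugin:
    """Reads the flag in connect (the ordering that exposed the bug)."""

    def connect(self, model):
        self.model = model
        self.flag = getattr(model, "call_configure_sharded_model_hook", False)

    def setup_environment(self):
        pass


class LazyPlugin:
    """Reads the flag in setup_environment (the fix described above)."""

    def connect(self, model):
        self.model = model

    def setup_environment(self):
        self.flag = getattr(
            self.model, "call_configure_sharded_model_hook", False
        )


def run_after_pr(plugin):
    """Replay the post-PR order: connect, load checkpoint, setup_env."""
    model = ToyModel()
    plugin.connect(model)
    model.on_load_checkpoint({})  # the test calls model.setup in here
    plugin.setup_environment()
    return plugin.flag, model.call_configure_sharded_model_hook


eager_result = run_after_pr(EagerPlugin())
lazy_result = run_after_pr(LazyPlugin())
```

In this toy setup the eager plugin ends up disagreeing with the model, while the lazy plugin sees the value the model holds once the checkpoint hooks have run.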

cc @SeanNaren @ananthsub

ananthsub · 2021-07-28T21:43:06Z

pytorch_lightning/accelerators/accelerator.py

        """
-        self.setup_training_type_plugin(model)
+        self.setup_training_type_plugin()


@carmocca is this an invariant? Should the accelerator assert that the model is available before calling setup_training_type_plugin?

carmocca added 2 commits July 23, 2021 18:45

Refactor training type plugin model connect

c02621e

Fix tests

0b3f14b

carmocca added the bug and refactor labels Jul 23, 2021

carmocca added this to the v1.4.x milestone Jul 23, 2021

carmocca self-assigned this Jul 23, 2021

carmocca requested review from awaelchli, Borda, justusschock, kaushikb11, SeanNaren, tchaton and williamFalcon as code owners July 23, 2021 16:55

[pre-commit.ci] auto fixes from pre-commit.com hooks

942b836

for more information, see https://pre-commit.ci

carmocca requested a review from edenlightning as a code owner July 23, 2021 16:57

carmocca added 2 commits July 23, 2021 19:00

Update CHANGELOG

5beb5c9

mypy

0adb3a7

carmocca mentioned this pull request Jul 23, 2021

Always use trainer.call_hook #8498

Merged

11 tasks

Unwanted changes

89eb948

tchaton approved these changes Jul 26, 2021

Fix tests

c42c084

mergify bot added the has conflicts label Jul 26, 2021

Merge branch 'master' into refactor/ttp-model-connect

014a03c

mergify bot removed the has conflicts label Jul 26, 2021

[pre-commit.ci] auto fixes from pre-commit.com hooks

7b51b8a

for more information, see https://pre-commit.ci

mergify bot added the has conflicts label Jul 27, 2021

Merge branch 'master' into refactor/ttp-model-connect

1a2c581

mergify bot added has conflicts and removed has conflicts labels Jul 27, 2021

mergify bot added the has conflicts label Jul 30, 2021

carmocca added 2 commits July 30, 2021 14:32

Merge branch 'master' into refactor/ttp-model-connect

a1149ef

State functions

75fb2f4

mergify bot added has conflicts and removed has conflicts labels Jul 30, 2021

Merge branch 'master' into refactor/ttp-model-connect

04b2a3b

mergify bot removed the has conflicts label Aug 2, 2021

carmocca modified the milestones: v1.4.x, v1.5 Aug 2, 2021

carmocca enabled auto-merge (squash) August 2, 2021 14:04

whitespace

6dad76d

mergify bot added the has conflicts label Aug 2, 2021

Merge branch 'master' into refactor/ttp-model-connect

b5ab1b5

mergify bot removed the has conflicts label Aug 2, 2021

[pre-commit.ci] auto fixes from pre-commit.com hooks

589d101

for more information, see https://pre-commit.ci

mergify bot added the has conflicts label Aug 2, 2021

Merge branch 'master' into refactor/ttp-model-connect

48d5718

mergify bot removed the has conflicts label Aug 2, 2021

Merge branch 'master' into refactor/ttp-model-connect

cd4c738

Borda approved these changes Aug 3, 2021

carmocca marked this pull request as draft August 4, 2021 11:44

auto-merge was automatically disabled August 4, 2021 11:44
Pull request was converted to draft

SeanNaren approved these changes Aug 4, 2021

Fix fully sharded call_configure_sharded_hook flag check

65ede1b

carmocca commented Aug 4, 2021

carmocca marked this pull request as ready for review August 4, 2021 14:33

carmocca merged commit ed13040 into master Aug 4, 2021

carmocca deleted the refactor/ttp-model-connect branch August 4, 2021 15:43

ananthsub reviewed Aug 9, 2021

ananthsub mentioned this pull request Aug 9, 2021

Use trainer.lightning_module in CallbackConnector._attach_callbacks #8805

Merged

12 tasks
