[feat] Add `{,load_}state_dict` to `ResultCollection` 1/n #7948

tchaton · 2021-06-11T20:19:08Z

What does this PR do?

Tracking Issue: #7898

This PR adds a mechanism to reload from state_dict and restore logged values.

Fixes #<issue_number>

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

codecov · 2021-06-11T20:20:34Z

Codecov Report

Merging #7948 (5c8ce17) into master (b71aa55) will decrease coverage by 0%.
The diff coverage is 97%.

@@          Coverage Diff           @@
##           master   #7948   +/-   ##
======================================
- Coverage      92%     91%   -0%     
======================================
  Files         207     207           
  Lines       13375   13464   +89     
======================================
+ Hits        12245   12295   +50     
- Misses       1130    1169   +39

ananthsub

> This PR adds a mechanism to reload from state_dict and restore logged values.

Could you describe more what is the problem we're trying to solve? Is this the only option for restoring logged values?

Relying on self.log to checkpoint metric states is adding even more responsibilities when I think we should be going the other direction and moving responsibilities out of self.log. #7183 (comment)

tchaton · 2021-06-12T09:05:02Z

> > This PR adds a mechanism to reload from state_dict and restore logged values.
>
> Could you describe more what is the problem we're trying to solve? Is this the only option for restoring logged values?
>
> Relying on self.log to checkpoint metric states is adding even more responsibilities when I think we should be going the other direction and moving responsibilities out of self.log. #7183 (comment)

Hey @ananthsub,

Currently, the ResultCollection isn't fault tolerant.

Problems:

On failure, the collection is lost.
On restoration, each ResultMetric loses its reference to the underlying Metric.

In order to resolve this, this PR adds the following:

A state_dict and a load_from_state_dict function on the ResultCollection, so that it can be restored after a failure.
A new field in the metadata object called attribute_name. It is used to automatically re-create the references to the Metric on reload by inspecting the LightningModule object. If the Metrics aren't attributes of the LightningModule (not recommended), the user should still be able to restore them by providing them manually.
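
As a rough illustration of the two additions above, here is a minimal sketch in plain Python. Every class in it (ToyMetric, ToyResult, ToyResultCollection, ToyModule) is a hypothetical stand-in rather than Lightning's actual implementation, and it assumes the metric's own state is restored together with the module that owns it.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


class ToyMetric:
    """Hypothetical stand-in for a stateful metric (not torchmetrics)."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def update(self, value: float) -> None:
        self.total += value
        self.count += 1


@dataclass
class ToyResult:
    """One logged entry: a plain value plus an optional link to a metric."""

    value: float = 0.0
    attribute_name: Optional[str] = None  # where the metric lives on the module
    metric: Optional[ToyMetric] = None  # live reference, never serialized


class ToyResultCollection(dict):
    """Hypothetical, simplified collection keyed by the logged name."""

    def state_dict(self) -> Dict[str, Any]:
        # Only plain data is saved; the live metric reference is dropped.
        return {
            key: {"value": r.value, "attribute_name": r.attribute_name}
            for key, r in self.items()
        }

    def load_state_dict(self, state: Dict[str, Any], module: Any = None) -> None:
        self.clear()
        for key, item in state.items():
            result = ToyResult(item["value"], item["attribute_name"])
            if module is not None and result.attribute_name is not None:
                # Re-create the reference by inspecting the owning module.
                result.metric = getattr(module, result.attribute_name, None)
            self[key] = result


class ToyModule:
    """Hypothetical owner of the metric, in the role of the LightningModule."""

    def __init__(self) -> None:
        self.train_acc = ToyMetric()


module = ToyModule()
module.train_acc.update(0.5)

results = ToyResultCollection()
results["loss"] = ToyResult(value=0.25)
results["acc"] = ToyResult(attribute_name="train_acc", metric=module.train_acc)

saved = results.state_dict()

restored = ToyResultCollection()
restored.load_state_dict(saved, module=module)
```

After the round trip, the plain value is back and the "acc" entry points at the same metric object held by the module, which is the reference that would otherwise be lost.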

The restoration isn't implemented yet; it will be done in PRs 2/n and 3/n.

The state dumping / restoration will be added to CheckpointConnector. There will be a restore_result_collections function.

PR 2/n: Add ResultCollection state_dict of all loops to model checkpoint inside CheckpointConnector dump_checkpoint function

PR 3/n: Add ResultCollection restore within restoring calls.
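
The dump and restore plan could look roughly like the sketch below. Only the names dump_checkpoint and restore_result_collections come from the plan above; their signatures, the "result_collections" checkpoint key, and ToyCollection are assumptions made for illustration, not the CheckpointConnector's real API.

```python
from typing import Any, Dict


class ToyCollection(dict):
    """Hypothetical stand-in for the result collection owned by one loop."""

    def state_dict(self) -> Dict[str, Any]:
        return dict(self)

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self.clear()
        self.update(state)


def dump_checkpoint(
    model_state: Dict[str, Any], collections: Dict[str, ToyCollection]
) -> Dict[str, Any]:
    # Store each loop's collection state next to the model state.
    # The "result_collections" key is an assumed name.
    return {
        "state_dict": model_state,
        "result_collections": {
            name: collection.state_dict() for name, collection in collections.items()
        },
    }


def restore_result_collections(
    checkpoint: Dict[str, Any], collections: Dict[str, ToyCollection]
) -> None:
    # Older checkpoints may not contain the key, so restoring is a no-op then.
    saved = checkpoint.get("result_collections", {})
    for name, collection in collections.items():
        if name in saved:
            collection.load_state_dict(saved[name])


live = {"fit": ToyCollection(loss=0.25), "validate": ToyCollection(val_loss=0.5)}
checkpoint = dump_checkpoint({"weight": 1.0}, live)

fresh = {"fit": ToyCollection(), "validate": ToyCollection()}
restore_result_collections(checkpoint, fresh)
```

Keeping the collections under their own key leaves the model's state_dict untouched, so checkpoints without logged results still load.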

Note: self.log doesn't checkpoint metric states. The self.log function of the LightningModule is only responsible for populating the Metadata (by extracting the attribute_name) and propagating the logged values to the current ResultCollection.
Checkpointing itself will be the responsibility of the CheckpointConnector.

Best,
T.C

tchaton · 2021-06-12T16:33:22Z

Hey Ananth,

Actually, it might be possible to get rid of the attribute name. I will try this on Monday; it would reduce the responsibility of self.log.

Confirmed: I was able to remove it.

pytorch_lightning/trainer/connectors/logger_connector/result.py

awaelchli

apart from the pickling concern LGTM
confidence low, not yet so familiar with new logger connector etc.

tests/core/test_metric_result_integration.py

SeanNaren

Again low confidence but the plan makes sense and test for restoration looks good

tchaton added 2 commits June 11, 2021 20:59

add metric reload

4cb7e89

add tests

4176447

tchaton added the logging Related to the `LoggerConnector` and `log()` label Jun 11, 2021

tchaton added this to the v1.4 milestone Jun 11, 2021

tchaton self-assigned this Jun 11, 2021

tchaton requested review from awaelchli, Borda, carmocca, justusschock, kaushikb11, SeanNaren and williamFalcon as code owners June 11, 2021 20:19

update changelog

9594653

tchaton added 2 commits June 11, 2021 21:21

udpate

0fa64ed

remove print

9828e72

tchaton changed the title ~~[feat] Add load_from_state_dict to ResultCollection~~ [feat] Add load_from_state_dict to ResultCollection 1/n Jun 11, 2021

ananthsub reviewed Jun 11, 2021

tchaton added 4 commits June 14, 2021 08:05

remove attribute_name

f85d590

update

31d390d

update

e7644de

resolve test

659a25a

awaelchli reviewed Jun 14, 2021

pytorch_lightning/trainer/connectors/logger_connector/result.py (outdated, resolved)

tchaton added 2 commits June 14, 2021 10:38

update on comments

c453994

bypass typing bug

c102176

tchaton mentioned this pull request Jun 14, 2021

[feat] Add Logging Restoration on Failure 2/2 #7966

Merged

awaelchli reviewed Jun 14, 2021

pytorch_lightning/trainer/connectors/logger_connector/result.py (outdated, resolved)

awaelchli approved these changes Jun 14, 2021

tests/core/test_metric_result_integration.py (outdated, resolved)

update on comments

2783a4a

ethanwharris approved these changes Jun 14, 2021

SeanNaren approved these changes Jun 14, 2021

tchaton enabled auto-merge (squash) June 14, 2021 15:25

justusschock approved these changes Jun 15, 2021

Merge branch 'master' into fault_tolerant_log

511f55f

carmocca changed the title ~~[feat] Add load_from_state_dict to ResultCollection 1/n~~ [feat] Add {,load}_state_dict} to ResultCollection` 1/n Jun 16, 2021

carmocca changed the title ~~[feat] Add {,load}_state_dict} to ResultCollection` 1/n~~ [feat] Add {,load}_state_dict} to ResultCollection 1/n Jun 16, 2021

carmocca added 4 commits June 16, 2021 04:56

Update CHANGELOG

a8e3d9d

Update tests

e93ac05

Update code

4725cd1

Check if TODO persists

68dac4a

carmocca force-pushed the fault_tolerant_log branch from 2fccea4 to 68dac4a Compare June 16, 2021 02:57

carmocca added 2 commits June 16, 2021 04:59

Remove unrelated changes

b5560c1

Fixes

45f6ce7

carmocca changed the title ~~[feat] Add {,load}_state_dict} to ResultCollection 1/n~~ [feat] Add {,load_}state_dict to ResultCollection 1/n Jun 16, 2021

carmocca added 2 commits June 16, 2021 05:31

Revert "Check if TODO persists"

0854667

This reverts commit 68dac4a.

Merge branch 'master' into fault_tolerant_log

f427315

carmocca reviewed Jun 16, 2021

pytorch_lightning/trainer/connectors/logger_connector/result.py (outdated, resolved)

tchaton commented Jun 16, 2021

pytorch_lightning/trainer/connectors/logger_connector/result.py (outdated, resolved)

carmocca added 4 commits June 17, 2021 01:58

Do not serialize dataclasses

5e087e2

Avoid recostructing meta twice

ef1251d

Keep previous sync_fn

5332ed2

Move to device and map_location

e1c9893

carmocca force-pushed the fault_tolerant_log branch from 7b95db5 to e1c9893 Compare June 17, 2021 00:40

Fix bug

5c8ce17

carmocca approved these changes Jun 17, 2021

carmocca disabled auto-merge June 17, 2021 00:54

tchaton merged commit 3fece17 into master Jun 17, 2021

tchaton deleted the fault_tolerant_log branch June 17, 2021 07:08
