Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feat] Add {,load_}state_dict to ResultCollection 1/n #7948

Merged
merged 26 commits into from
Jun 17, 2021

Conversation

tchaton
Copy link
Contributor

@tchaton tchaton commented Jun 11, 2021

What does this PR do?

Tracking Issue: #7898

This PR adds a mechanism to reload from state_dict and restore logged values.

Fixes #<issue_number>

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@tchaton tchaton added the logging Related to the `LoggerConnector` and `log()` label Jun 11, 2021
@tchaton tchaton added this to the v1.4 milestone Jun 11, 2021
@tchaton tchaton self-assigned this Jun 11, 2021
@codecov
Copy link

codecov bot commented Jun 11, 2021

Codecov Report

Merging #7948 (5c8ce17) into master (b71aa55) will decrease coverage by 0%.
The diff coverage is 97%.

@@          Coverage Diff           @@
##           master   #7948   +/-   ##
======================================
- Coverage      92%     91%   -0%     
======================================
  Files         207     207           
  Lines       13375   13464   +89     
======================================
+ Hits        12245   12295   +50     
- Misses       1130    1169   +39     

@tchaton tchaton changed the title [feat] Add load_from_state_dict to ResultCollection [feat] Add load_from_state_dict to ResultCollection 1/n Jun 11, 2021
Copy link
Contributor

@ananthsub ananthsub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR adds a mechanism to reload from state_dict and restore logged values.

Could you describe more what is the problem we're trying to solve? Is this the only option for restoring logged values?

Relying on self.log to checkpoint metric states is adding even more responsibilities when I think we should be going the other direction and moving responsibilities out of self.log. #7183 (comment)

@tchaton
Copy link
Contributor Author

tchaton commented Jun 12, 2021

This PR adds a mechanism to reload from state_dict and restore logged values.

Could you describe more what is the problem we're trying to solve? Is this the only option for restoring logged values?

Relying on self.log to checkpoint metric states is adding even more responsibilities when I think we should be going the other direction and moving responsibilities out of self.log. #7183 (comment)

Hey @ananthsub,

Currently, the ResultCollection isn't fault tolerant.

Problems:

  • On failure, the collection is being lost.
  • On restoration, the ResultMetric lost reference to the Metrics.

In order to resolve this, this PR adds the following:

  • A state_dict and load_from_state_dict function to the ResultCollection to restore in case of failure.
  • A new field in the metadata object called attribute_name. This is used to automatically re-create references to the Metric on reload by auto-inspecting the LightningModule object. If the Metric aren't part of the LightningModule (not recommended), the user should still be able to restore by providing them manually.

The restoration isn't implemented yet, it would be done in 2/n and 3/n PRs.

The state dumping / restoration will be added to CheckpointConnector. There will be a restore_result_collections function.

PR 2/n: Add ResultCollection state_dict of all loops to model checkpoint inside CheckpointConnector dump_checkpoint function

PR 3/n: Add ResultCollection restore within restoring calls.

Note: The self.log doesn't checkpoint metric states. the self.log function of the LightningModule is responsible to properly populate the Metadata by extracting the attribute_name and propagate the logged values to the current ResultCollection.
The CheckpointConnector will be responsible for this.

Best,
T.C

@tchaton
Copy link
Contributor Author

tchaton commented Jun 12, 2021

Hey Ananth,

Actually it might be possible to get rid of the attritube name. I will try this on Monday, which will reduce self.log responsability.

Confirm I could remove it.

Copy link
Contributor

@awaelchli awaelchli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apart from the pickling concern LGTM
confidence low, not yet so familiar with new logger connector etc.

tests/core/test_metric_result_integration.py Outdated Show resolved Hide resolved
Copy link
Contributor

@SeanNaren SeanNaren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again low confidence but the plan makes sense and test for restoration looks good

@tchaton tchaton enabled auto-merge (squash) June 14, 2021 15:25
@carmocca carmocca changed the title [feat] Add load_from_state_dict to ResultCollection 1/n [feat] Add {,load}_state_dict} to ResultCollection` 1/n Jun 16, 2021
@carmocca carmocca changed the title [feat] Add {,load}_state_dict} to ResultCollection` 1/n [feat] Add {,load}_state_dict} to ResultCollection 1/n Jun 16, 2021
@carmocca carmocca force-pushed the fault_tolerant_log branch from 2fccea4 to 68dac4a Compare June 16, 2021 02:57
@carmocca carmocca changed the title [feat] Add {,load}_state_dict} to ResultCollection 1/n [feat] Add {,load_}state_dict to ResultCollection 1/n Jun 16, 2021
@carmocca carmocca force-pushed the fault_tolerant_log branch from 7b95db5 to e1c9893 Compare June 17, 2021 00:40
@carmocca carmocca disabled auto-merge June 17, 2021 00:54
@tchaton tchaton merged commit 3fece17 into master Jun 17, 2021
@tchaton tchaton deleted the fault_tolerant_log branch June 17, 2021 07:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logging Related to the `LoggerConnector` and `log()`
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants