Add resuming from specific checkpoint #515

Closed
dreamgonfly opened this issue Nov 15, 2019 · 8 comments · Fixed by #516
Labels
feature (Is an improvement or enhancement), help wanted (Open to be worked on)

Comments

@dreamgonfly
Contributor

dreamgonfly commented Nov 15, 2019

Is your feature request related to a problem? Please describe.
In the current version, there is no way to resume training from a specific checkpoint (as opposed to the last checkpoint).
Sometimes (very often in my case), one needs to experiment with different hyperparameters (e.g. dropout rate, augmentation) while training from a specific checkpoint.

Describe the solution you'd like
Add a resume_from_checkpoint argument to the Trainer class.
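A minimal sketch of how the proposed argument could be used (the argument name follows #516; the checkpoint path and the model are illustrative):

        from pytorch_lightning import Trainer

        # model = MyModel(dropout=0.5)  # e.g. train the new run with different hyperparameters

        trainer = Trainer(
            # path to the specific checkpoint to resume from (illustrative)
            resume_from_checkpoint='lightning_logs/version_2/checkpoints/_ckpt_epoch_7.ckpt',
        )
        trainer.fit(model)  # weights, optimizer state, and epoch are restored before training continues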

Describe alternatives you've considered
I tried to use restore from the Trainer class, but was not successful because it is meant to be used only when .fit() is called.

Additional context
FYI, I made a PR to demonstrate my idea: #516

dreamgonfly added the feature and help wanted labels Nov 15, 2019
dreamgonfly added commits to dreamgonfly/pytorch-lightning that referenced this issue Nov 18, 2019
williamFalcon pushed a commit that referenced this issue Nov 30, 2019
* Add resume_from_checkpoint

* Fix variable name

* #515 Remove did_restore

* #515 Simplify code

* #515 Update doc for resume_from_checkpoint

* #515 Add on_gpu
@shijianjian
Contributor

shijianjian commented Jan 22, 2020

Thanks for this. But may I ask why we have two different mechanisms for restoring?

According to the docs, the snippet below should be the proper way to restore the whole session:

        logger = TestTubeLogger(
            save_dir='./lightning_logs',
            version=restore_version
        )

Also, there is another one documented for the test method:

LightningModule.load_from_checkpoint()

But now we have another function inside Trainer to restore the weights. A natural question is: which one wins in the end? Will the final weights come from the TestTubeLogger or from the Trainer? Or does the documentation wrongly describe the TestTubeLogger as restoring the hyperparameter settings but not the weights? This confusion forced me to read through the implementation.

To be clearer, wouldn't it make more sense to put the weight restoring in one place, like below:

        # Proposed parameters:
        #   restore_version: int
        #   restore_checkpoint: int or Path-like
        #   mode: one of 'session-only', 'weights-only', 'session-weights'
        logger = TestTubeLogger(
            save_dir='./lightning_logs',
            version=restore_version,
            checkpoint=restore_checkpoint,
            mode=mode
        )

@Borda
Member

Borda commented Jan 22, 2020

the Test-Tube logger is not mandatory for a Lightning project; the test-tube restore is just an alternative option...

@shijianjian
Contributor

@Borda Thanks for clarifying.
So for now, we have at least:

  1. TestTube
  2. LightningModule.load_from_checkpoint
  3. resume_from_checkpoint

As a user, I thought it was recommended to use TestTube, since that is what was documented here. Also, I did not find similar documentation in 6.0 explaining how we should restore from a checkpoint "properly".

Apart from the docs, I hope the API can be refactored and consolidated for simplicity. If TestTube is optional, I personally think we should move all checkpoint-restoring functionality into load_from_checkpoint.
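For clarity, a rough side-by-side of the three as I currently understand them (names taken from the docs and #516; MyModel and the paths are placeholders, and import paths may differ by version):

        from pytorch_lightning import Trainer
        from pytorch_lightning.loggers import TestTubeLogger  # pytorch_lightning.logging on older versions

        # 1. TestTubeLogger: point the logger at an existing experiment version
        logger = TestTubeLogger(save_dir='./lightning_logs', version=3)

        # 2. LightningModule.load_from_checkpoint: rebuild a module (weights + hparams) from a file
        model = MyModel.load_from_checkpoint('path/to/checkpoint.ckpt')

        # 3. Trainer(resume_from_checkpoint=...): restore weights plus optimizer/epoch state, then keep training
        trainer = Trainer(logger=logger, resume_from_checkpoint='path/to/checkpoint.ckpt')
        trainer.fit(model)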

@stevenguh

I have been trying to figure out the right API for resuming from a checkpoint for half a day, and it's not in any of the examples. The current method of resuming is not well documented and not simple enough to use, imo.

@williamFalcon
Contributor

@stevenguh sorry, we just migrated the docs.

https://pytorch-lightning.readthedocs.io/en/latest/lightning-module.html

load_from_checkpoint
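
A minimal usage sketch (MyLightningModule stands in for your own module class; the path is illustrative):

        # load_from_checkpoint is a classmethod on your LightningModule subclass;
        # it rebuilds the module and loads the saved weights and hyperparameters.
        model = MyLightningModule.load_from_checkpoint('path/to/checkpoint.ckpt')
        model.eval()
        model.freeze()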

@stevenguh

Thank you for replying. I am still confused by the differences between load_from_checkpoint in LightningModule and the resume_from_checkpoint parameter in Trainer. It would be nice if there were a working example of resuming. Also, the project advertises auto-restore, but I have no idea how to activate it.

@Vichoko
Contributor

Vichoko commented Aug 23, 2020

I can't even resume training from the last checkpoint with the new Trainer. Is this feature broken?
I also can't find in the docs the correct way to set up auto-resume/auto-load from the last best checkpoint.

TestTubeLogger did the job before, but since I updated the libraries, every time I train it starts from the beginning.
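
For reference, this is roughly what I expected to work (argument names as of the version I'm on; paths are illustrative):

        from pytorch_lightning import Trainer
        from pytorch_lightning.callbacks import ModelCheckpoint

        checkpoint_callback = ModelCheckpoint(filepath='./checkpoints', save_top_k=1)
        trainer = Trainer(
            checkpoint_callback=checkpoint_callback,
            # point at the last/best checkpoint written by the previous run
            resume_from_checkpoint='./checkpoints/epoch=12.ckpt',
        )
        trainer.fit(model)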

@williamFalcon
Contributor

williamFalcon commented Aug 23, 2020
