Add resuming from specific checkpoint #515

Closed
dreamgonfly opened this issue Nov 15, 2019 · 8 comments · Fixed by #516
Labels
feature (Is an improvement or enhancement), help wanted (Open to be worked on)

Comments

@dreamgonfly
Contributor

dreamgonfly commented Nov 15, 2019

Is your feature request related to a problem? Please describe.
In the current version, there is no way to resume training from a specific checkpoint (as opposed to the last checkpoint).
Sometimes (very often in my case), one needs to experiment with different hyperparameters (e.g. dropout rate, augmentation) while training from a specific checkpoint.

Describe the solution you'd like
Add a resume_from_checkpoint argument to the Trainer class.
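A minimal sketch of how the proposed argument could be used (the argument name follows #516; the checkpoint path and the model are illustrative):

        from pytorch_lightning import Trainer

        # model = MyModel(dropout=0.5)  # e.g. train the new run with different hyperparameters

        trainer = Trainer(
            # path to the specific checkpoint to resume from (illustrative)
            resume_from_checkpoint='lightning_logs/version_2/checkpoints/_ckpt_epoch_7.ckpt',
        )
        trainer.fit(model)  # weights, optimizer state, and epoch are restored before training continues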

Describe alternatives you've considered
I tried to use restore from the Trainer class, but was not successful because it is meant to be used only when .fit() is called.

Additional context
FYI, I made a PR to demonstrate my idea: #516

dreamgonfly added the feature and help wanted labels Nov 15, 2019
dreamgonfly added commits to dreamgonfly/pytorch-lightning that referenced this issue Nov 18, 2019
williamFalcon pushed a commit that referenced this issue Nov 30, 2019
* Add resume_from_checkpoint

* Fix variable name

* #515 Remove did_restore

* #515 Simplify code

* #515 Update doc for resume_from_checkpoint

* #515 Add on_gpu
@shijianjian
Contributor

shijianjian commented Jan 22, 2020

Thanks for this. But may I ask why we have two different mechanisms for restoring?

According to the docs, the snippet below should be the proper way to restore the whole session:

        logger = TestTubeLogger(
            save_dir='./lightning_logs',
            version=restore_version
        )

Also, there is another one documented for the test method:

LightningModule.load_from_checkpoint()

But now we have another function inside Trainer to restore the weights. A natural question is: which one wins in the end? Will the final weights come from the TestTubeLogger or from the Trainer? Or does the documentation wrongly describe the TestTubeLogger as restoring the hyperparameter settings but not the weights? This confusion forced me to read through the implementation.

To be clearer, wouldn't it make more sense to put the weight restoring in one place, like below:

        # Proposed parameters:
        #   restore_version: int
        #   restore_checkpoint: int or Path-like
        #   mode: one of 'session-only', 'weights-only', 'session-weights'
        logger = TestTubeLogger(
            save_dir='./lightning_logs',
            version=restore_version,
            checkpoint=restore_checkpoint,
            mode=mode
        )

@Borda
Member

Borda commented Jan 22, 2020

the Test-Tube logger is not mandatory for a Lightning project; the test-tube restore is just an alternative option...

@shijianjian
Contributor

@Borda Thanks for clarifying.
So for now, we have at least:

  1. TestTube
  2. LightningModule.load_from_checkpoint
  3. resume_from_checkpoint

As a user, I thought it was recommended to use TestTube, since that is what was documented here. Also, I did not find similar documentation in 6.0 explaining how we should restore from a checkpoint "properly".

Apart from the docs, I hope the API can be refactored and consolidated for simplicity. If TestTube is optional, I personally think we should move all checkpoint-restoring functionality into load_from_checkpoint.
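For clarity, a rough side-by-side of the three as I currently understand them (names taken from the docs and #516; MyModel and the paths are placeholders, and import paths may differ by version):

        from pytorch_lightning import Trainer
        from pytorch_lightning.loggers import TestTubeLogger  # pytorch_lightning.logging on older versions

        # 1. TestTubeLogger: point the logger at an existing experiment version
        logger = TestTubeLogger(save_dir='./lightning_logs', version=3)

        # 2. LightningModule.load_from_checkpoint: rebuild a module (weights + hparams) from a file
        model = MyModel.load_from_checkpoint('path/to/checkpoint.ckpt')

        # 3. Trainer(resume_from_checkpoint=...): restore weights plus optimizer/epoch state, then keep training
        trainer = Trainer(logger=logger, resume_from_checkpoint='path/to/checkpoint.ckpt')
        trainer.fit(model)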

@stevenguh

I have been trying to figure out the right API for resuming from a checkpoint for half a day, and it's not in any of the examples. The current method of resuming is not well documented and not simple enough to use, imo.

@williamFalcon
Contributor

@stevenguh sorry, we just migrated the docs.

https://pytorch-lightning.readthedocs.io/en/latest/lightning-module.html

load_from_checkpoint
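
A minimal usage sketch (MyLightningModule stands in for your own module class; the path is illustrative):

        # load_from_checkpoint is a classmethod on your LightningModule subclass;
        # it rebuilds the module and loads the saved weights and hyperparameters.
        model = MyLightningModule.load_from_checkpoint('path/to/checkpoint.ckpt')
        model.eval()
        model.freeze()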

@stevenguh

Thank you for replying. I am still confused by the differences between load_from_checkpoint in LightningModule and the resume_from_checkpoint parameter in Trainer. It would be nice if there were a working example of resuming. Also, the project advertises auto-restore, but I have no idea how to activate it.

@Vichoko
Contributor

Vichoko commented Aug 23, 2020

I can't even resume training from the last checkpoint with the new Trainer. Is this feature broken?
I also can't find in the docs the correct way to set up auto-resume/auto-load from the last best checkpoint.

TestTubeLogger did the job before, but since I updated the libraries, every time I train it starts from the beginning.
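
For reference, this is roughly what I expected to work (argument names as of the version I'm on; paths are illustrative):

        from pytorch_lightning import Trainer
        from pytorch_lightning.callbacks import ModelCheckpoint

        checkpoint_callback = ModelCheckpoint(filepath='./checkpoints', save_top_k=1)
        trainer = Trainer(
            checkpoint_callback=checkpoint_callback,
            # point at the last/best checkpoint written by the previous run
            resume_from_checkpoint='./checkpoints/epoch=12.ckpt',
        )
        trainer.fit(model)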

@williamFalcon
Contributor

williamFalcon commented Aug 23, 2020
