Add resuming from specific checkpoint #515
Comments
Thanks for this. But may I ask why we have two different mechanisms for restoring? According to the docs, the proper way to restore a whole session is:

```python
logger = TestTubeLogger(
    save_dir='./lightning_logs',
    version=restore_version
)
```

There is also `LightningModule.load_from_checkpoint()`, documented under the test method. And now this PR adds yet another way.

To be clearer, wouldn't it make more sense to expose the weights restorer like this:

- `restore_version`: int
- `restore_checkpoint`: int or Path-like
- `mode`: enum, one of `'session-only'`, `'weights-only'`, `'session-weights'`

```python
logger = TestTubeLogger(
    save_dir='./lightning_logs',
    version=restore_version,
    checkpoint=restore_checkpoint,
    mode=mode
)
```
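The three-mode proposal above can be sketched, framework-free, as a small enum plus a dispatch helper. `RestoreMode` and `plan_restore` are hypothetical names introduced here for illustration, not Lightning API:

```python
from enum import Enum

class RestoreMode(Enum):
    # The three modes from the proposal above (hypothetical, not Lightning API).
    SESSION_ONLY = "session-only"        # restore logger/experiment state only
    WEIGHTS_ONLY = "weights-only"        # restore model weights only
    SESSION_WEIGHTS = "session-weights"  # restore both

def plan_restore(mode):
    """Return which pieces of state a given mode would restore."""
    mode = RestoreMode(mode)  # also accepts the raw string value
    return {
        "session": mode in (RestoreMode.SESSION_ONLY, RestoreMode.SESSION_WEIGHTS),
        "weights": mode in (RestoreMode.WEIGHTS_ONLY, RestoreMode.SESSION_WEIGHTS),
    }
```

An enum keeps the three restore behaviours in one place instead of spread across the logger, the module, and the Trainer.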
The Test-Tube logger is not mandatory for the Lightning project; restoring via test-tube is an alternative option...
@Borda Thanks for the clarification.
As a user, I thought TestTube was the recommended approach, since it was documented here. Also, I did not find similar docs in 6.0 explaining how to restore from a checkpoint "properly". Beyond the docs, I hope the API can be consolidated for simplicity. If TestTube is optional, I personally think all the checkpoint-restoring functionality should move to the Trainer.
I have been trying to figure out the right API to resume from a checkpoint for half a day, and it's not in any of the examples. The current method of resuming is not well documented and not simple enough to use, IMO.
@stevenguh Sorry, we just migrated the docs. See `load_from_checkpoint` at https://pytorch-lightning.readthedocs.io/en/latest/lightning-module.html
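For readers hitting the same confusion, here is a framework-free sketch of the two semantics being discussed (illustrative functions, not Lightning API): a `load_from_checkpoint`-style restore brings back the weights only, while a session restore also brings back the training position.

```python
def restore_weights_only(checkpoint):
    """load_from_checkpoint-style: weights come back, training restarts at epoch 0."""
    return {"model": dict(checkpoint["state_dict"]), "epoch": 0}

def restore_full_session(checkpoint):
    """Logger/Trainer-style session restore: weights AND the epoch counter come back."""
    return {"model": dict(checkpoint["state_dict"]), "epoch": checkpoint["epoch"]}
```

The practical difference: after a weights-only restore you can change hyperparameters freely, but optimizer momentum and LR-scheduler state are gone.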
Thank you for replying. I am still confused by the differences between them.
I can't even resume training from the last checkpoint with the new Trainer. Is this feature broken? TestTubeLogger did the job before, but since I updated the libs, every training run starts from the beginning.
post an example? our tests for this are passing. |
**Is your feature request related to a problem? Please describe.**
In the current version, there is no way to resume training from a specific checkpoint (not just the last one).
Sometimes (very often in my case), one needs to experiment with different hyperparameters (e.g. dropout rate, augmentation) starting from a specific checkpoint.

**Describe the solution you'd like**
Add a `resume_from_checkpoint` argument to the `Trainer` class.

**Describe alternatives you've considered**
I tried to use `restore` from the `Trainer` class, but was not successful because it is only meant to be used internally when `.fit()` is called.

**Additional context**
FYI, I made a PR to demonstrate my idea: #516
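To make the "specific checkpoint, not the last one" case concrete, a small helper can locate the checkpoint file for a given epoch. The filename pattern `_ckpt_epoch_N.ckpt` below matches early Lightning defaults but is an assumption; adjust it for your setup.

```python
import os
import re

def find_checkpoint(ckpt_dir, epoch):
    """Return the path of the checkpoint saved at `epoch`, or None if absent."""
    pattern = re.compile(r"_ckpt_epoch_(\d+)\.ckpt$")
    for name in sorted(os.listdir(ckpt_dir)):
        m = pattern.search(name)
        if m and int(m.group(1)) == epoch:
            return os.path.join(ckpt_dir, name)
    return None
```

The proposed `resume_from_checkpoint` argument could then be fed `find_checkpoint('lightning_logs/version_0/checkpoints', 3)` instead of always picking the newest file.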