Improve ModelCheckpoint #1658
Comments
I second this feature request. I would be satisfied with a model checkpoint callback that saves two sets of checkpoints: one for the latest epoch and one for the top_k according to validation metrics.
I also suggest adding a time-based checkpoint option, i.e. saving a checkpoint every X seconds/hours.
Also, allow users to save checkpoints at a specific step interval, e.g. every 5000 steps.
@daden-ms you might be able to use this Trainer flag: https://pytorch-lightning.readthedocs.io/en/latest/trainer.html#val-check-interval
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
🚀 Feature
Add two optional features:
1. An optional argument (e.g. save_on_shutdown) in ModelCheckpoint to save the current trainer state before shutdown. The value of save_on_shutdown can only be None or the file path for saving.
2. A file (e.g. latest.ckpt) linking to the latest saved model (across multiple runs of training): add an optional argument (e.g. create_link_for_latest), whose value can only be None or the file path for saving.
Motivation
For the first one, if training is interrupted midway, no checkpoint exists for the progress made since the last save, which could be several epochs ago. To continue, I can resume at best from the checkpoint saved at the last epoch.
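To make the first request concrete, here is a minimal, framework-agnostic sketch of the idea. It is not ModelCheckpoint's API: the `save_on_shutdown` context manager, the `get_state` callable, and the toy `train` loop are all hypothetical names, and the sketch assumes that "shutdown" means the training loop exiting through an exception such as `KeyboardInterrupt`.

```python
import contextlib
import pickle
from pathlib import Path


@contextlib.contextmanager
def save_on_shutdown(get_state, path):
    """Write get_state() to `path` if the wrapped block exits via an exception.

    Hypothetical helper, not part of ModelCheckpoint. Passing path=None
    disables it, mirroring the None-or-file-path value proposed above.
    """
    try:
        yield
    except BaseException:  # BaseException also covers KeyboardInterrupt (Ctrl+C)
        if path is not None:
            path = Path(path)
            path.parent.mkdir(parents=True, exist_ok=True)
            tmp = path.with_name(path.name + ".tmp")
            with open(tmp, "wb") as f:
                pickle.dump(get_state(), f)
            tmp.replace(path)  # rename into place so a half-written file is never visible
        raise  # re-raise so the interruption still stops the program


def train(num_epochs, shutdown_path, interrupt_at=None):
    """Toy loop standing in for a trainer; `state` stands in for trainer state."""
    state = {"epoch": 0}
    with save_on_shutdown(lambda: dict(state), shutdown_path):
        for epoch in range(num_epochs):
            if epoch == interrupt_at:
                raise KeyboardInterrupt  # simulated interruption
            state["epoch"] = epoch + 1
    return state
```

With this shape, an interruption during epoch N leaves a file describing the state after epoch N-1, rather than whatever the last regular checkpoint captured. Note that a default SIGTERM does not raise a Python exception, so a real implementation would also need a signal handler for that case.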
For the second one, this is a feature I always implement, though it may not be essential for everyone. It is useful when I train frequently: otherwise I have to navigate all the way to the exact model saved last time. So I create a file called latest.ckpt somewhere easy to reach, linking to the latest model.
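The second request could be sketched with a symbolic link that is re-pointed after every save. This is only an illustration under assumptions: `update_latest_link` is a hypothetical helper, not an existing ModelCheckpoint option, and it assumes a platform where the process is allowed to create symlinks.

```python
import os
from pathlib import Path


def update_latest_link(checkpoint_path, link_path):
    """Point `link_path` at `checkpoint_path`, replacing any previous link.

    Hypothetical helper. Passing link_path=None disables it, mirroring the
    None-or-file-path value proposed for create_link_for_latest.
    """
    if link_path is None:
        return None
    target = Path(checkpoint_path).resolve()  # absolute target, valid from any directory
    link = Path(link_path)
    link.parent.mkdir(parents=True, exist_ok=True)
    tmp = link.with_name(link.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()  # clear a leftover temporary link from an earlier crash
    os.symlink(target, tmp)
    os.replace(tmp, link)  # swap the new link in over the old one
    return link
```

Calling `update_latest_link(saved_path, "latest.ckpt")` after each save keeps one stable path that always resolves to the most recent checkpoint, including across separate training runs that write into different directories.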