
Improve ModelCheckpoint #1658

Closed
cmpute opened this issue Apr 29, 2020 · 5 comments
Labels: feature (Is an improvement or enhancement), good first issue (Good for newcomers), help wanted (Open to be worked on), won't fix (This will not be worked on)

Comments

@cmpute
Contributor

cmpute commented Apr 29, 2020

🚀 Feature

Add two optional features:

  1. Save a checkpoint just before shutdown: add an optional argument (e.g. save_on_shutdown) to ModelCheckpoint that saves the current trainer state before the process exits. The value of save_on_shutdown would be either None or the file path to save to.
  2. Maintain a file (e.g. latest.ckpt) linking to the most recently saved model, across multiple training runs: add an optional argument (e.g. create_link_for_latest), whose value would be either None or the path at which to create the link.
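A minimal sketch of both options, independent of Lightning's internals (the helper names `save_checkpoint` and `install_shutdown_save` are hypothetical, and the checkpoint bytes stand in for a real serialized trainer state):

```python
import atexit
import os
from typing import Callable, Optional


def save_checkpoint(state: bytes, path: str, latest_link: Optional[str] = None) -> None:
    """Write a checkpoint and, if requested, point a stable 'latest' name at it."""
    with open(path, "wb") as f:
        f.write(state)
    if latest_link is not None:
        # Create the symlink under a temporary name, then atomically replace it,
        # so latest_link never points at a half-written target.
        tmp = latest_link + ".tmp"
        if os.path.lexists(tmp):
            os.remove(tmp)
        os.symlink(os.path.abspath(path), tmp)
        os.replace(tmp, latest_link)


def install_shutdown_save(get_state: Callable[[], bytes], path: str) -> None:
    """Register a last-ditch save that runs at normal interpreter exit.

    Note: atexit fires on clean exits and unhandled exceptions (including
    KeyboardInterrupt), but not on SIGKILL or hard crashes.
    """
    atexit.register(lambda: save_checkpoint(get_state(), path))
```

On filesystems without symlink support, the same idea works by copying the file instead of linking to it.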

Motivation

For the first one: if training is interrupted mid-run, no checkpoint captures the work done since the last save, which could be several epochs ago. If I want to continue, the best I can do is resume from the checkpoint saved at the last completed epoch.

For the second one: this is a feature I always implement myself, though it may not be essential for everyone. When I train frequently, I otherwise have to dig through run directories to find the exact model saved last time. So I create a file called latest.ckpt somewhere easy to reach that links to the latest model.

@cmpute cmpute added feature Is an improvement or enhancement help wanted Open to be worked on labels Apr 29, 2020
@benob

benob commented Apr 29, 2020

I second this feature request. I would be satisfied with a model checkpoint callback that saves two sets of checkpoints: one for the latest epoch, and one for the top_k according to validation metrics.
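The latest-plus-top-k bookkeeping could be sketched as a small tracker (the class name `CheckpointTracker` is hypothetical; scores are assumed "higher is better"):

```python
import heapq
from typing import List, Optional, Tuple


class CheckpointTracker:
    """Track the latest checkpoint path plus the top-k paths by metric score."""

    def __init__(self, top_k: int) -> None:
        self.top_k = top_k
        self.latest: Optional[str] = None
        self._heap: List[Tuple[float, str]] = []  # min-heap of (score, path)

    def update(self, path: str, score: float) -> None:
        """Record a new checkpoint; evict the worst-scoring one beyond top_k."""
        self.latest = path
        heapq.heappush(self._heap, (score, path))
        if len(self._heap) > self.top_k:
            heapq.heappop(self._heap)  # drop the lowest score

    @property
    def best(self) -> List[Tuple[float, str]]:
        """Top-k (score, path) pairs, best first."""
        return sorted(self._heap, reverse=True)
```

(For reference, later Lightning releases do expose a `save_last` option on ModelCheckpoint alongside `save_top_k`, which covers this use case.)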

@cmpute
Contributor Author

cmpute commented May 1, 2020

I also suggest adding a time-based checkpointing mode, i.e. save a checkpoint every X seconds/hours.
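The time-based throttle could look like the following sketch (the class name `TimedSaver` is hypothetical; in a real callback, `maybe_save` would be called from a per-batch hook):

```python
import time
from typing import Callable


class TimedSaver:
    """Invoke a save function at most once every `interval_s` seconds."""

    def __init__(self, save_fn: Callable[[], None], interval_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.save_fn = save_fn
        self.interval_s = interval_s
        self.clock = clock  # injectable for testing
        self._last = clock()

    def maybe_save(self) -> bool:
        """Call on every step; saves only when the interval has elapsed."""
        now = self.clock()
        if now - self._last >= self.interval_s:
            self.save_fn()
            self._last = now
            return True
        return False
```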

@daden-ms

daden-ms commented May 8, 2020

Also, allow users to save checkpoints at a fixed step interval, e.g. every 5000 steps.
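The step-based condition is a simple modulo check on the global step (the function name `should_save` is hypothetical):

```python
def should_save(global_step: int, every_n_steps: int) -> bool:
    """True at steps N, 2N, 3N, ...; skips step 0 to avoid an empty checkpoint."""
    return global_step > 0 and global_step % every_n_steps == 0
```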

@Borda Borda added the good first issue Good for newcomers label Aug 4, 2020
@stale

stale bot commented Oct 24, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Oct 24, 2020
@stale stale bot closed this as completed Nov 1, 2020
Projects
None yet
Development

No branches or pull requests

6 participants