
Model checkpointing for sub-epoch frequency #1758

Closed
Anjum48 opened this issue May 8, 2020 · 12 comments · Fixed by #3807
Labels: question (Further information is requested), won't fix (This will not be worked on)

Comments

Anjum48 commented May 8, 2020

❓ Questions and Help

What is your question?

In my Trainer I have val_check_interval=0.25, which is great and I can see that the validation loop is run 4 times per epoch as expected.

Is there a way for ModelCheckpoint to use the validation checks to trigger model saving at a sub-epoch frequency? I can see (by setting verbose=True) that the checkpoint check is only done at epoch end.
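
For context, a minimal sketch of the setup described above (written against the 0.7.x-era API; the callback arguments, filepath template, and `model` are placeholders, not taken from this issue):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Placeholder callback roughly matching the setup described above.
checkpoint_callback = ModelCheckpoint(
    filepath="checkpoints/{epoch}-{val_loss:.3f}",  # hypothetical path template
    monitor="val_loss",
    verbose=True,  # prints whether a checkpoint was saved after each check
    period=1,      # default in this era: at most one save per epoch
)

trainer = Trainer(
    val_check_interval=0.25,                  # run the validation loop 4x per epoch
    checkpoint_callback=checkpoint_callback,  # 0.7.x accepted the callback here
)
# trainer.fit(model)  # validation runs 4x per epoch, but saving is gated by `period`
```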

What's your environment?

  • OS: Linux
  • Packaging: pip (bleeding edge)
  • Version: 0.7.5
ekgren commented May 12, 2020

Agreed. When training on very large datasets (as you do with large transformer models), it would be very nice to be able to save checkpoints on validation!

williamFalcon (Contributor) commented

I thought this was already the case?
Agree that this should be fixed.

artidoro commented

This is a bit of a hack but it seems to work:
#1809 (comment)

Anjum48 (Author) commented May 19, 2020

I think @artidoro has hit on the issue. The second condition is always True when on_validation_end is called again within the same epoch:
https://github.com/PyTorchLightning/pytorch-lightning/blob/3459a546672303204a4ae6efcc2613a90f003903/pytorch_lightning/callbacks/model_checkpoint.py#L214-L216

Perhaps the easiest fix is to set period=-1 by default? Would that break anything else?
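
A paraphrase of the gating logic being referenced (a sketch, not the exact library code at those lines):

```python
# Sketch of the check: with the default period=1, a second validation run
# within the same epoch gives epoch - epoch_last_check == 0, which is < 1,
# so the save is skipped. A negative (or zero) period would never skip.
def should_skip_save(epoch, epoch_last_check, period=1):
    return epoch_last_check is not None and (epoch - epoch_last_check) < period

print(should_skip_save(epoch=0, epoch_last_check=0, period=1))   # True  -> no mid-epoch save
print(should_skip_save(epoch=0, epoch_last_check=0, period=-1))  # False -> save on every val check
```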

awaelchli (Contributor) commented Jul 14, 2020

Currently period is an integer value and the behaviour @Anjum48 described is expected: if we check validation multiple times during one epoch but period=1, we expect a checkpoint to be saved only once. So I don't see any bugs here (but perhaps we could document the edge cases better).

In PL it currently only makes sense to checkpoint at epoch intervals, because training cannot be restored mid-epoch anyway. So the edge case where we have val_check_interval < 1 is unfortunate. In my opinion this case should not save checkpoints mid-epoch at all, only at (val) epoch end, because otherwise it just creates issues with restoring training.

Maybe @jeremyjordan also has comments about this.

jeremyjordan (Contributor) commented

> In my opinion this case should not save checkpoints mid-epoch at all, only at (val) epoch end, because otherwise it just creates issues with restoring training.

I'm confused by this terminology. As I understand it, epoch end refers to when we have made a full pass through the training dataset. Validation end would refer to when we have made a full pass through the validation set, which, depending on the Trainer settings, could happen multiple times per training epoch. I think I understand what you're saying, though: that we should only be saving one checkpoint at the end of an epoch?

I'm also confused about mid-epoch checkpoints. The docs reference:

> You might want to not only load a model but also continue training it. Use this method to restore the trainer state as well. This will continue from the epoch and global step you last left off.

which implies that we can restart training mid-epoch. However, if you look at the training loop it does appear that we always start at the beginning of an epoch and don't respect the global step loaded from a checkpoint.
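
For reference, restoring trainer state in that era went through the resume_from_checkpoint Trainer argument; the checkpoint path below is a placeholder:

```python
from pytorch_lightning import Trainer

# Placeholder path: imagine a checkpoint written by a mid-epoch validation check.
trainer = Trainer(resume_from_checkpoint="checkpoints/epoch=2-val_loss=0.123.ckpt")
# trainer.fit(model)
# As observed above, the training loop appears to restart at the beginning of an
# epoch rather than at the global step stored in such a checkpoint.
```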

awaelchli (Contributor) commented

I used the terminology incorrectly. What I meant to say is simply that we currently save checkpoints on validation_end (for a good reason), but it might be undesirable to do so when validation_end happens mid-epoch, because restoring from such a checkpoint leads to incorrectly restored Trainer state (as you pointed out with global_step). So we need to solve this problem first.
The second issue is the conflicting settings of the Trainer arg "val_check_interval" and the ModelCheckpoint arg "period", which simply do not work together.

stale bot commented Sep 16, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the won't fix label on Sep 16, 2020
teddykoker self-assigned this on Sep 16, 2020
stale bot removed the won't fix label on Sep 16, 2020
teddykoker (Contributor) commented

What is the status on this? We do support sub-epoch checkpointing, but as @awaelchli mentioned, it doesn't restore to the global step?

williamFalcon (Contributor) commented

Yeah... I don't think we can actually restore sub-epoch state though?

Know of a way to do that? We'd have to pull the shuffle state out of the loaders.

teddykoker (Contributor) commented Sep 16, 2020

I don't know; we could save the whole dataloader, but I don't think that makes sense.
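
As an illustration of the smaller alternative (plain PyTorch, not an existing Lightning feature): drive shuffling from an explicit torch.Generator so that only its state needs to be checkpointed, rather than the whole DataLoader. Even then, batches already consumed in the interrupted epoch would not be skipped on resume.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10).float())

# Shuffle order comes from this generator instead of the global RNG.
g = torch.Generator()
g.manual_seed(42)
loader = DataLoader(dataset, batch_size=2, shuffle=True, generator=g)

rng_state = g.get_state()   # could be stored alongside the model checkpoint

# ... later, when resuming ...
g.set_state(rng_state)      # reproduces the same shuffle order for the epoch
```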

teddykoker removed the priority: 0 (High priority task) label on Sep 23, 2020
stale bot commented Oct 23, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label on Oct 23, 2020
stale bot closed this as completed on Oct 30, 2020