
Model checkpointing for sub-epoch frequency #1758

Closed
Anjum48 opened this issue May 8, 2020 · 12 comments · Fixed by #3807
Labels: question (Further information is requested), won't fix (This will not be worked on)

Comments

Anjum48 commented May 8, 2020

❓ Questions and Help

What is your question?

In my Trainer I have val_check_interval=0.25, which is great and I can see that the validation loop is run 4 times per epoch as expected.

Is there a way for ModelCheckpoint to use the validation checks to trigger model saving at a sub-epoch frequency? I can see (by setting verbose=True) that the checkpoint check is only done at epoch end.
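
For context, a minimal sketch of the setup described above (written against the 0.7.x-era API; the callback arguments, filepath template, and `model` are placeholders, not taken from this issue):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Placeholder callback roughly matching the setup described above.
checkpoint_callback = ModelCheckpoint(
    filepath="checkpoints/{epoch}-{val_loss:.3f}",  # hypothetical path template
    monitor="val_loss",
    verbose=True,  # prints whether a checkpoint was saved after each check
    period=1,      # default in this era: at most one save per epoch
)

trainer = Trainer(
    val_check_interval=0.25,                  # run the validation loop 4x per epoch
    checkpoint_callback=checkpoint_callback,  # 0.7.x accepted the callback here
)
# trainer.fit(model)  # validation runs 4x per epoch, but saving is gated by `period`
```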

What's your environment?

  • OS: Linux
  • Packaging: pip (bleeding edge)
  • Version: 0.7.5
ekgren commented May 12, 2020

Agreed. When training on very large datasets (as you do with large transformer models), it would be very nice to be able to save checkpoints on validation!

williamFalcon (Contributor) commented

I thought this was already the case?
Agree that this should be fixed.

artidoro commented

This is a bit of a hack but it seems to work:
#1809 (comment)

Anjum48 (Author) commented May 19, 2020

I think @artidoro has hit on the issue. The second condition is always True when on_validation_end is called again within the same epoch:
https://github.com/PyTorchLightning/pytorch-lightning/blob/3459a546672303204a4ae6efcc2613a90f003903/pytorch_lightning/callbacks/model_checkpoint.py#L214-L216

Perhaps the easiest fix is to set period=-1 by default? Would that break anything else?
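
A paraphrase of the gating logic being referenced (a sketch, not the exact library code at those lines):

```python
# Sketch of the check: with the default period=1, a second validation run
# within the same epoch gives epoch - epoch_last_check == 0, which is < 1,
# so the save is skipped. A negative (or zero) period would never skip.
def should_skip_save(epoch, epoch_last_check, period=1):
    return epoch_last_check is not None and (epoch - epoch_last_check) < period

print(should_skip_save(epoch=0, epoch_last_check=0, period=1))   # True  -> no mid-epoch save
print(should_skip_save(epoch=0, epoch_last_check=0, period=-1))  # False -> save on every val check
```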

awaelchli (Contributor) commented Jul 14, 2020

Currently period is an integer value and the behaviour @Anjum48 described is expected: if we check validation multiple times during one epoch but period=1, we expect a checkpoint to be saved only once. So I don't see any bugs here (but perhaps we could document the edge cases better).

In PL it currently only makes sense to checkpoint at epoch intervals, because training cannot be restored mid-epoch anyway. So the edge case where we have val_check_interval < 1 is unfortunate. In my opinion this case should not save checkpoints mid-epoch at all, only at (val) epoch end, because otherwise it just creates issues with restoring training.

Maybe @jeremyjordan also has comments about this.

jeremyjordan (Contributor) commented

> In my opinion this case should not save checkpoints mid-epoch at all, only at (val) epoch end, because otherwise it just creates issues with restoring training.

I'm confused by this terminology. As I understand it, epoch end refers to when we have made a full pass through the training dataset. Validation end would refer to when we have made a full pass through the validation set, which, depending on the Trainer settings, could happen multiple times per training epoch. I think I understand what you're saying, though: that we should only be saving one checkpoint at the end of an epoch?

I'm also confused about mid-epoch checkpoints. The docs reference:

> You might want to not only load a model but also continue training it. Use this method to restore the trainer state as well. This will continue from the epoch and global step you last left off.

which implies that we can restart training mid-epoch. However, if you look at the training loop it does appear that we always start at the beginning of an epoch and don't respect the global step loaded from a checkpoint.
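
For reference, restoring trainer state in that era went through the resume_from_checkpoint Trainer argument; the checkpoint path below is a placeholder:

```python
from pytorch_lightning import Trainer

# Placeholder path: imagine a checkpoint written by a mid-epoch validation check.
trainer = Trainer(resume_from_checkpoint="checkpoints/epoch=2-val_loss=0.123.ckpt")
# trainer.fit(model)
# As observed above, the training loop appears to restart at the beginning of an
# epoch rather than at the global step stored in such a checkpoint.
```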

awaelchli (Contributor) commented

I used the terminology incorrectly. What I meant to say is simply that we currently save checkpoints on validation_end (for a good reason), but it might be undesirable to do so when validation_end happens mid-epoch, because restoring from such a checkpoint leads to incorrectly restored Trainer state (as you pointed out with global_step). So we need to solve this problem first.
The second issue is the conflicting settings of the Trainer arg "val_check_interval" and the ModelCheckpoint arg "period", which simply do not work together.

stale bot commented Sep 16, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the won't fix label on Sep 16, 2020
teddykoker self-assigned this on Sep 16, 2020
stale bot removed the won't fix label on Sep 16, 2020
teddykoker (Contributor) commented

What is the status on this? We do support sub-epoch checkpointing, but as @awaelchli mentioned, it doesn't restore to the global step?

williamFalcon (Contributor) commented

Yeah... I don't think we can actually restore sub-epoch state though?

Know of a way to do that? We'd have to pull the shuffle state out of the loaders.

teddykoker (Contributor) commented Sep 16, 2020

I don't know; we could save the whole dataloader, but I don't think that makes sense.
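
As an illustration of the smaller alternative (plain PyTorch, not an existing Lightning feature): drive shuffling from an explicit torch.Generator so that only its state needs to be checkpointed, rather than the whole DataLoader. Even then, batches already consumed in the interrupted epoch would not be skipped on resume.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10).float())

# Shuffle order comes from this generator instead of the global RNG.
g = torch.Generator()
g.manual_seed(42)
loader = DataLoader(dataset, batch_size=2, shuffle=True, generator=g)

rng_state = g.get_state()   # could be stored alongside the model checkpoint

# ... later, when resuming ...
g.set_state(rng_state)      # reproduces the same shuffle order for the epoch
```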

teddykoker removed the priority: 0 (High priority task) label on Sep 23, 2020
stale bot commented Oct 23, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label on Oct 23, 2020
stale bot closed this as completed on Oct 30, 2020