
Fix ModelCheckpoint period #3630

Merged
merged 5 commits into from
Sep 29, 2020

Conversation

carmocca
Contributor

@carmocca carmocca commented Sep 23, 2020

What does this PR do?

Fixes #3619
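For context, the linked bug (#3619) is that ModelCheckpoint saved a checkpoint on the first epoch regardless of the configured `period`. A minimal sketch of the intended skip logic, with an illustrative function name (not the actual pytorch_lightning internals):

```python
def should_save(epoch: int, period: int) -> bool:
    """Return True when a periodic checkpoint is due for this 0-indexed epoch.

    Hypothetical helper for illustration; the real fix lives inside
    pytorch_lightning/callbacks/model_checkpoint.py.
    """
    if period < 1:
        # A non-positive period disables periodic saving entirely.
        return False
    # Save only every `period` epochs. Epoch 0 counts as the 1st epoch,
    # so with period=3, saves happen after epochs 2, 5, 8, ...
    # (the buggy behavior also saved on epoch 0).
    return (epoch + 1) % period == 0
```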

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

@pep8speaks

pep8speaks commented Sep 23, 2020

Hello @carmocca! Thanks for updating this PR.

Line 188:13: W503 line break before binary operator
Line 189:13: W503 line break before binary operator
Line 190:13: W503 line break before binary operator
Line 191:13: W503 line break before binary operator
Line 192:13: W503 line break before binary operator

Comment last updated at 2020-09-29 13:04:12 UTC

@mergify mergify bot requested a review from a team September 23, 2020 21:47
@carmocca
Contributor Author

Line 327:13: W503 line break before binary operator
Line 328:13: W503 line break before binary operator
Line 329:13: W503 line break before binary operator
Line 330:13: W503 line break before binary operator
Line 331:13: W503 line break before binary operator

Ignoring this because https://www.flake8rules.com/rules/W503.html says: "line breaks should occur before the binary operator because it keeps all operators aligned."

@codecov

codecov bot commented Sep 23, 2020

Codecov Report

Merging #3630 into master will decrease coverage by 2%.
The diff coverage is 75%.

@@           Coverage Diff            @@
##           master   #3630     +/-   ##
========================================
- Coverage      86%     84%     -2%     
========================================
  Files         110     110             
  Lines        8129    9352   +1223     
========================================
+ Hits         6963    7812    +849     
- Misses       1166    1540    +374     

@awaelchli
Contributor

Please, if you don't mind, could you split the PR in two? As I understand it, these are two independent features/fixes?

@carmocca
Contributor Author

Yes, they are independent. I joined them because they are easier to test together.

I'll split them.

@carmocca carmocca changed the title Fix ModelCheckpoint period and allow None monitor Fix ModelCheckpoint period Sep 23, 2020
@awaelchli
Contributor

awaelchli commented Sep 23, 2020

If you think it is best to have the test combined, you may prioritize one PR, get it reviewed and merged, then focus on the other, integrating the test, if that's possible.

@Borda Borda added the bug Something isn't working label Sep 23, 2020
@carmocca carmocca changed the title Fix ModelCheckpoint period [blocked by #3644] Fix ModelCheckpoint period Sep 23, 2020
@carmocca carmocca changed the title [blocked by #3644] Fix ModelCheckpoint period [blocked by #3633] Fix ModelCheckpoint period Sep 23, 2020
@carmocca
Contributor Author

Okay, this is ready again. Blocked on #3633 because I'm using monitor=None in the test.

@mergify mergify bot requested a review from a team September 25, 2020 13:43
@mergify
Contributor

mergify bot commented Sep 25, 2020

This pull request is now in conflict... :(

@mergify mergify bot requested a review from a team September 25, 2020 14:05
@carmocca carmocca changed the title [blocked by #3633] Fix ModelCheckpoint period Fix ModelCheckpoint period Sep 25, 2020
Contributor

@awaelchli awaelchli left a comment

Solid.
I verified that the test fails on master.

@mergify mergify bot requested a review from a team September 25, 2020 14:42
pytorch_lightning/callbacks/model_checkpoint.py (review thread: outdated, resolved)
@Borda Borda added the ready PRs ready to be merged label Sep 25, 2020
@mergify
Contributor

mergify bot commented Sep 27, 2020

This pull request is now in conflict... :(

@carmocca
Contributor Author

d79bce1 destroyed this PR, so the easiest thing was to rebase, undo the previous commits, and redo them.

Should be ready to go again.

pytorch_lightning/callbacks/model_checkpoint.py (review thread: outdated, resolved)
tests/callbacks/test_model_checkpoint.py (review thread: outdated, resolved)
Contributor

@rohitgr7 rohitgr7 left a comment

LGTM

@awaelchli
Contributor

anything substantial changed since we last reviewed?
we should probably check that test still fails on master. I will do that.

@carmocca
Contributor Author

carmocca commented Sep 27, 2020

anything substantial changed since we last reviewed?
we should probably check that test still fails on master. I will do that.

The only additional change is:

https://github.com/PyTorchLightning/pytorch-lightning/pull/3630/files#diff-63eb50f09b915db4167aed7cb38ac5c0R218

which I think is a bug

@awaelchli
Contributor

anything substantial changed since we last reviewed?
we should probably check that test still fails on master. I will do that.

The only additional change is:

https://github.com/PyTorchLightning/pytorch-lightning/pull/3630/files#diff-63eb50f09b915db4167aed7cb38ac5c0R218

which I think is a bug

why not >= 1? In order to track the top 1, we need to have a monitor, no?

@carmocca
Contributor Author

why not >= 1? In order to track the top 1, we need to have a monitor, no?

Yes, but if we change that, the default ModelCheckpoint parameters become invalid: monitor is None by default while top_k is 1, so the defaults themselves would fail that check.

@williamFalcon
Contributor

monitor is by default None.
top_k should probably be None in this case as well.

To save ALL checkpoints, set top_k=-1.

By default we will only save the last checkpoint (otherwise you WILL blow up your disk).

@carmocca
Contributor Author

I'm okay with that. Just one thing:

If the default behaviour does the same as save_last, why not have save_last=True by default to make it obvious?

@williamFalcon
Contributor

williamFalcon commented Sep 27, 2020

Because save_last is a separate operation...

We have these modes:
a. default
b. save top k
c. save last

When not in default mode, the user CAN save last AND top k, or only one of them. Thus each one is its own behavior that must also be turned on.

Once the user leaves default mode, they'd be very surprised to continue finding the last ckpts saved.
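The three modes described above can be summarized in a small stand-alone sketch. The parameter names mirror ModelCheckpoint's, but the helper itself and its return strings are hypothetical, for explanation only:

```python
def checkpoint_mode(monitor=None, save_top_k=None, save_last=False):
    """Describe which checkpoint files a given configuration would keep.

    Illustrative only; not Lightning's API.
    """
    if monitor is None and save_top_k is None and not save_last:
        # a. default mode: only the most recent checkpoint is kept
        return "default: keep the most recent checkpoint only"
    kept = []
    if save_top_k == -1:
        kept.append("all checkpoints")                    # save everything
    elif save_top_k is not None:
        kept.append(f"top {save_top_k} by '{monitor}'")   # b. save top k
    if save_last:
        kept.append("the last checkpoint")                # c. save last (opt-in)
    return "keep " + " and ".join(kept)
```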

@carmocca
Contributor Author

carmocca commented Sep 27, 2020

Then maybe we should throw a warning:

from pytorch_lightning.utilities import rank_zero_warn

if save_last and monitor is None and save_top_k is not None:
    rank_zero_warn(
        "ModelCheckpoint save_last is set to True but the last checkpoint is already saved"
        f" by having monitor=None and save_top_k={save_top_k}"
    )

(or a similar warning)
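A self-contained restatement of that check, using the standard-library warnings module in place of Lightning's rank_zero_warn (the function name is hypothetical):

```python
import warnings

def warn_redundant_save_last(monitor, save_top_k, save_last):
    """Warn when save_last duplicates what the monitor-less top-k logic keeps.

    With monitor=None, the "top k" checkpoint is simply the most recent one,
    so save_last=True would write a second file with the same contents.
    """
    if save_last and monitor is None and save_top_k is not None:
        warnings.warn(
            "ModelCheckpoint save_last is set to True but the last checkpoint"
            f" is already saved by having monitor=None and save_top_k={save_top_k}"
        )
        return True
    return False
```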

@williamFalcon
Contributor

williamFalcon commented Sep 27, 2020

This is unnecessarily complicated...

save_top_k is NOT the same as save_last...

The last ckpt may not be the best one (i.e. overfitting).

We're going to keep the current behavior, with the only change being that top_k defaults to None.

Again, to be clear:
Once a user stops using the default behavior, they can save top k or the last.
If the user just wants to save the last, they would keep the default behavior.

If the user sets save top k, then they can ALSO opt in to saving the last by enabling that.

I understand what you're saying, but for a user to get the last when they set top k is very confusing... unless they also ask for the last.

Maybe the answer is to ALSO set save_last=None by default.

@carmocca
Contributor Author

carmocca commented Sep 27, 2020

save_top_k is NOT the same as save_last..

Of course it is not always, but the default ModelCheckpoint values, monitor=None and top_k=1 (None in the future), mean that only the checkpoint of the last run epoch is kept.

That means that ModelCheckpoint() does the same as ModelCheckpoint(save_last=True). But in the second case you get two files for the same thing.

Thinking as a new user, I would be a bit confused about this.

The warning I propose is for the case where monitor is None and save_last=True. I'm not proposing that together with having save_last=True by default.

@carmocca
Contributor Author

Maybe the answer is to ALSO set save_last=None by default.

I think so, considering that save_last only makes sense when we have a monitor, and we don't have one by default.

@awaelchli
Contributor

Working on the default save_top_k here: #3680.
I want to get it done ASAP so we can continue on this period fix.

@williamFalcon
Contributor

williamFalcon commented Sep 27, 2020

@awaelchli let's make save_last = None by default as well?

@awaelchli awaelchli changed the title Fix ModelCheckpoint period [blocked by #3680] Fix ModelCheckpoint period Sep 27, 2020
@mergify
Contributor

mergify bot commented Sep 28, 2020

This pull request is now in conflict... :(

@awaelchli awaelchli changed the title [blocked by #3680] Fix ModelCheckpoint period Fix ModelCheckpoint period Sep 28, 2020
@mergify
Contributor

mergify bot commented Sep 28, 2020

This pull request is now in conflict... :(

@carmocca
Contributor Author

Anything holding this back?

@mergify
Contributor

mergify bot commented Sep 29, 2020

This pull request is now in conflict... :(

Already pushed to master

This reverts commit 00d9e77.
@Borda
Member

Borda commented Sep 29, 2020

Anything holding this back?

It shall be fine now...

@awaelchli awaelchli merged commit 3b2efe5 into Lightning-AI:master Sep 29, 2020
@carmocca carmocca deleted the bug/3619_fix-period branch September 29, 2020 13:41
@Borda Borda added this to the 0.10.0 milestone Oct 7, 2020
@carmocca carmocca self-assigned this Nov 1, 2023
Labels
bug Something isn't working ready PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ModelCheckpoint period should not always save on the first epoch
7 participants