fixes for early stopping and checkpoint callbacks #1504
@PyTorchLightning/core-contributors currently, our documentation states that:
However, this is not completely true. We only look at |
Hello @jeremyjordan! Thanks for updating this PR.
Comment last updated at 2020-06-28 06:34:32 UTC |
@Borda any idea why some of the logger tests are failing? |
it seems that the tests are failing in multiple places, not only loggers... let's take it one by one
tests are failing when:
i need to investigate the off by one error, but i'm not sure how the other failing tests are related to the changes in this PR. i want to get these failing tests addressed, then will write more tests for the early stopping callback.
ok, there's one remaining failing test and i've tracked down the issue. there's a thread lock being created when you create the OfflineExperiment, which is preventing the object from being pickle-able. (see #1682)
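For readers unfamiliar with the failure mode described above, here is a minimal sketch of why an object holding a thread lock cannot be pickled, and one common workaround. `ExperimentLike` is a hypothetical stand-in, not the actual OfflineExperiment class, and this is not necessarily how the issue was resolved in #1682.

```python
import pickle
import threading


class ExperimentLike:
    """Hypothetical stand-in for an object that owns a thread lock."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()

    def __getstate__(self):
        # Copy the instance dict and drop the lock, which pickle cannot serialize.
        state = self.__dict__.copy()
        state.pop("_lock", None)
        return state

    def __setstate__(self, state):
        # Restore the plain attributes and create a fresh lock.
        self.__dict__.update(state)
        self._lock = threading.Lock()


def is_picklable(obj):
    """Return True if pickle.dumps succeeds on obj."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False
```

Dropping the lock in `__getstate__` and recreating it in `__setstate__` keeps the object serializable at the cost of not preserving the lock's held/released status, which is usually acceptable when the object is copied to another process.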
@@ -197,7 +197,7 @@ def format_checkpoint_name(self, epoch, metrics, ver=None):
         return filepath

     @rank_zero_only
-    def on_validation_end(self, trainer, pl_module):
+    def on_epoch_end(self, trainer, pl_module):
Hi! Does this change affect checkpointing in the middle of a training epoch? Consider the use case where we train on a large dataset and we want to checkpoint & early stop every X steps instead of every X epochs, for example X = 100, i.e. val_check_interval = 100.
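To make the concern concrete, here is a toy loop (not Lightning's actual training loop; every name in it is hypothetical) that records when each hook would fire. It assumes validation runs every `val_check_interval` training batches and that `on_epoch_end` fires once per epoch.

```python
def run_toy_loop(num_epochs, batches_per_epoch, val_check_interval):
    """Return the ordered list of hook events fired by a toy training loop."""
    events = []
    for epoch in range(num_epochs):
        for batch_idx in range(batches_per_epoch):
            # Run validation every `val_check_interval` training batches.
            if (batch_idx + 1) % val_check_interval == 0:
                events.append(("on_validation_end", epoch, batch_idx))
        # The epoch-level hook fires once, after the last batch.
        events.append(("on_epoch_end", epoch, batches_per_epoch - 1))
    return events


def count(events, hook):
    """Count how many times a given hook name appears in the event list."""
    return sum(1 for name, _, _ in events if name == hook)
```

Under these assumptions, a callback attached to `on_epoch_end` gets one chance per epoch to checkpoint or stop, while one attached to `on_validation_end` gets a chance at every validation run inside the epoch.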
@jeremyjordan is it blocked by another pr? |
@jeremyjordan which pr is blocking this? |
This pull request is now in conflict... :( |
Strange. for me it works fine, I get these timings locally when running
max_steps seems to work. |
@awaelchli are you running on Windows by any chance? that's the only place where tests are passing :D |
Oh right that must be it.. |
This pull request is now in conflict... :( |
@jeremyjordan mind rebase/merge master? and how is the last test? 🐰 |
This pull request is now in conflict... :( |
I merged master into this and getting many failed tests 😭 don't know where to begin but i still have hope this can get merged. Will try to fix them this weekend. |
self.run_evaluation(test_mode=self.testing)
self.call_checkpoint_callback()
@jeremyjordan The tests fail because the evaluation loop is not getting called after the epoch. Did you intend to move it somewhere else?
ahh i think i only meant to remove the call_checkpoint_callback(), this was my mistake - glad you caught that!
if self.logger is not None:
    save_dir = (getattr(self.logger, 'save_dir', None) or
                getattr(self.logger, '_save_dir', None) or
                self.default_root_dir)

    # weights_save_path overrides anything
    if self.weights_save_path is not None:
        save_dir = self.weights_save_path

    version = self.logger.version if isinstance(
        self.logger.version, str) else f'version_{self.logger.version}'
    ckpt_path = os.path.join(
        save_dir,
        self.logger.name,
        version,
        "checkpoints"
@jeremyjordan this code was moved to ModelCheckpoint.on_train_start, and I understand why. However, we have the problem that the logger is already saving a meta.yaml file to the default location before the on_train_start callback is even called and the model checkpoint has the chance to update the weights_save_path.
Any idea how to decouple the checkpoint and logger?
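For reference, the path logic in the excerpt above can be read as the following standalone function. This is a sketch that mirrors the excerpt; the function name `resolve_ckpt_dir` and its signature are hypothetical, and the final `os.path.join` call is completed in the obvious way because the excerpt is cut off.

```python
import os
from types import SimpleNamespace


def resolve_ckpt_dir(logger, default_root_dir, weights_save_path=None):
    """Build the checkpoint directory from logger attributes (illustrative only)."""
    # Prefer the logger's save_dir, then its private _save_dir, then the default root.
    save_dir = (getattr(logger, "save_dir", None)
                or getattr(logger, "_save_dir", None)
                or default_root_dir)

    # weights_save_path overrides anything
    if weights_save_path is not None:
        save_dir = weights_save_path

    # String versions are used as-is; other versions get a "version_" prefix.
    version = (logger.version if isinstance(logger.version, str)
               else f"version_{logger.version}")
    return os.path.join(save_dir, logger.name, version, "checkpoints")


# Usage with a stand-in logger object (SimpleNamespace, not a real logger class).
example_logger = SimpleNamespace(save_dir="logs", name="exp", version=3)
example_dir = resolve_ckpt_dir(example_logger, default_root_dir="root")
```

Note that the result depends on three logger attributes (`save_dir`, `name`, `version`), which is the coupling the comment above asks about.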
it may be unrelated, since it also happens here #2392. not sure
yes, I think we should provide shared configuration in the Trainer initialization and not expect these child objects (loggers and checkpoint callbacks) to reach into each other's attributes. this probably also includes moving some attributes from logging (eg. version) up into the Trainer
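One way to picture this suggestion is sketched below. All class names (`RunConfig`, `ToyLogger`, `ToyCheckpoint`, `ToyTrainer`) are hypothetical and are not part of the Lightning API; the point is only that the trainer builds the shared values once and hands them to both children, so neither reads the other's attributes.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Shared, read-only run settings owned by the trainer (hypothetical)."""
    root_dir: str
    name: str
    version: str

    @property
    def log_dir(self):
        return os.path.join(self.root_dir, self.name, self.version)

    @property
    def ckpt_dir(self):
        return os.path.join(self.log_dir, "checkpoints")


class ToyLogger:
    def __init__(self, config):
        # The logger derives its file location from the shared config only.
        self.meta_path = os.path.join(config.log_dir, "meta.yaml")


class ToyCheckpoint:
    def __init__(self, config):
        # The checkpoint callback never inspects the logger.
        self.dirpath = config.ckpt_dir


class ToyTrainer:
    def __init__(self, root_dir, name="default", version=0):
        # Settings are resolved once, before any child object is created.
        self.config = RunConfig(root_dir, name, f"version_{version}")
        self.logger = ToyLogger(self.config)
        self.checkpoint = ToyCheckpoint(self.config)
```

Because the config is resolved before either child exists, the logger cannot write to a location that the checkpoint callback later changes.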
I was able to fix all tests and merge errors.
What about a callback method |
ae75fa4 to ba6a5ba
This pull request is now in conflict... :( |
Yes, I was thinking the same thing. This callback would just return a state_dict which the Trainer could store. The only thing I am unclear about is how we want to reinitialize the state for other callbacks. If we can expect that the exact same callbacks will be passed to the Trainer then it should be trivial. Or we could expect that you only pass in a single instance of each callback class (eg. Maybe for a first iteration we can just document that for |
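A rough sketch of the idea discussed here, assuming at most one instance per callback class so that the class name can serve as the key. `ToyEarlyStopping` and the two helper functions are hypothetical and do not reflect the API that was eventually merged.

```python
class ToyEarlyStopping:
    """Minimal early-stopping counter that can save and restore its state."""

    def __init__(self, patience=3):
        self.patience = patience
        self.wait = 0
        self.best = float("inf")

    def update(self, value):
        """Record a monitored value; return True when training should stop."""
        if value < self.best:
            self.best = value
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience

    def state_dict(self):
        return {"wait": self.wait, "best": self.best, "patience": self.patience}

    def load_state_dict(self, state):
        self.wait = state["wait"]
        self.best = state["best"]
        self.patience = state["patience"]


def save_callback_states(callbacks):
    # Keyed by class name; assumes a single instance of each callback class.
    return {type(cb).__name__: cb.state_dict() for cb in callbacks}


def restore_callback_states(callbacks, saved):
    # Callbacks with no saved entry keep their freshly initialized state.
    for cb in callbacks:
        state = saved.get(type(cb).__name__)
        if state is not None:
            cb.load_state_dict(state)
```

Keying by class name is the simplest scheme but breaks down as soon as two instances of the same callback class are passed, which is the open question raised in the comment.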
@jeremyjordan we moved the PR to #2391 as it is the repo branch and much easier to maintain by other core... :] |
@awaelchli I created #2401 for us to continue discussion on your comment |
perfect! |
* add state_dict for early stopping * move best attr after monitor_op defined * improve early stopping and model checkpoint callbacks * fix formatting * fix attr init order * clean up setting of default_root_dir attr * logger needs default root dir set first * reorg trainer init * remove direct references to checkpoint callback * more fixes * more bugfixes * run callbacks at epoch end * update tests to use on epoch end * PR cleanup * address failing tests * refactor for homogeneity * fix merge conflict * separate tests * tests for early stopping bug regressions * small fixes * revert model checkpoint change * typo fix * fix tests * update train loop * cannot pass an int as default_save_path * refactor log message * fix test case * appease the linter * fix some doctests * move config to callback * fixes from rebase * fixes from rebase * chlog * docs * reformat * formatting * fix * fix * fixes from rebase * add new test for patience * Update pytorch_lightning/callbacks/model_checkpoint.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update pytorch_lightning/callbacks/model_checkpoint.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update tests/callbacks/test_early_stopping.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * fix formatting * remove enable_early_stop attribute * add state_dict for early stopping * move best attr after monitor_op defined * improve early stopping and model checkpoint callbacks * fix formatting * fix attr init order * clean up setting of default_root_dir attr * logger needs default root dir set first * reorg trainer init * remove direct references to checkpoint callback * more fixes * more bugfixes * run callbacks at epoch end * update tests to use on epoch end * PR cleanup * address failing tests * refactor for homogeneity * fix merge conflict * separate tests * tests for early stopping bug regressions * small fixes * revert model checkpoint change * typo fix * fix tests * update 
train loop * fix test case * appease the linter * fix some doctests * move config to callback * fixes from rebase * fixes from rebase * chlog * docs * reformat * formatting * fix * fix * fixes from rebase * add new test for patience * Update pytorch_lightning/callbacks/model_checkpoint.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update pytorch_lightning/callbacks/model_checkpoint.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update tests/callbacks/test_early_stopping.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * fix formatting * remove enable_early_stop attribute * fix test with new epoch indexing * fix progress bar totals * fix off by one error (see #2289) epoch starts at 0 now * added missing imports * fix hpc_save folderpath * fix formatting * fix tests * small fixes from a rebase * fix * tmpdir * tmpdir * tmpdir * wandb * fix merge conflict * add back evaluation after training * test_resume_early_stopping_from_checkpoint TODO * undo the horovod check * update changelog * remove a duplicate test from merge error * try fix dp_resume test * add the logger fix from master * try remove default_root_dir * try mocking numpy * try import numpy in docs test * fix wandb test * pep 8 fix * skip if no amp * dont mock when doctesting * install extra * fix the resume ES test * undo conf.py changes * revert remove comet pickle from test * Update CHANGELOG.md Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update weights_loading.rst * Update weights_loading.rst * Update weights_loading.rst * renamed flag * renamed flag * revert the None check in logger experiment name/version * add the old comments * _experiment * test chckpointing on DDP * skip the ddp test on windows * cloudpickle * renamed flag * renamed flag * parentheses for clarity * apply suggestion max epochs Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jeremy Jordan <jtjordan@ncsu.edu> Co-authored-by: 
Jirka <jirka@pytorchlightning.ai> Co-authored-by: Jeremy Jordan <13970565+jeremyjordan@users.noreply.github.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: William Falcon <waf2107@columbia.edu>
Before submitting
What does this PR do?
For #1464
For #1463
For #1699
For #2151
Related #1458
check_val_every_n_epoch>1)
Adds tests to prevent future regressions.
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃