
[2/2] Remove training loop force calling early stopping callback #7069

Merged: 8 commits merged into Lightning-AI:master from early-stop-train-2 on Apr 29, 2021

Conversation

ananthsub (Contributor) commented Apr 17, 2021

What does this PR do?

Fixes #7033
Part 2 - this depends on #6944

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following checklist:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

pep8speaks commented Apr 17, 2021

Hello @ananthsub! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-29 01:53:56 UTC

codecov bot commented Apr 17, 2021

Codecov Report

Merging #7069 (29c2e3a) into master (e272bea) will decrease coverage by 0%.
The diff coverage is n/a.

@@          Coverage Diff           @@
##           master   #7069   +/-   ##
======================================
- Coverage      87%     87%   -0%     
======================================
  Files         199     199           
  Lines       12799   12791    -8     
======================================
- Hits        11170   11160   -10     
- Misses       1629    1631    +2     

self._run_early_stopping_check(trainer)

def on_validation_end(self, trainer, pl_module) -> None:
if self._check_on_train_epoch_end or self._should_skip_check(trainer):
Contributor:
Should we run it for both validation and training ?

ananthsub (Author):
No, as the monitor metric might not be available in both training and validation. This lets people mix and match better, similar to what we are doing with checkpointing.
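For context, a minimal sketch of how the dispatch ends up looking, assembled from the hunks in this review (hook signatures simplified; the on_train_epoch_end guard is inferred as the mirror image of the on_validation_end guard shown above):

    # Sketch only: mirrors the two guards discussed in this thread.
    class EarlyStopping(Callback):

        def on_train_epoch_end(self, trainer, pl_module) -> None:
            # Run the check here only when the user opted in via check_on_train_epoch_end=True.
            if not self._check_on_train_epoch_end or self._should_skip_check(trainer):
                return
            self._run_early_stopping_check(trainer)

        def on_validation_end(self, trainer, pl_module) -> None:
            # Otherwise the check runs once at the end of each validation run.
            if self._check_on_train_epoch_end or self._should_skip_check(trainer):
                return
            self._run_early_stopping_check(trainer)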

awaelchli (Contributor), Apr 19, 2021:

Side fact: we actually had, at one point (pre v1.0), the model checkpoint running on both training and validation epoch end, which led to writing the checkpoint twice, once with the old value for the val metric and once with the new one. And then it was of course wrong when val_check_interval != 1.

self._run_early_stopping_check(trainer)

def on_validation_end(self, trainer, pl_module) -> None:
if self._check_on_train_epoch_end or self._should_skip_check(trainer):
Contributor:
Suggested change:
- if self._check_on_train_epoch_end or self._should_skip_check(trainer):
+ if self._should_skip_check(trainer):

Borda (Member) left a comment:

Just to clarify, this is a cumulative change including #6944, correct?

ananthsub (Author), replying:

> Just to clarify, this is a cumulative change including #6944, correct?

Yes, that's correct.

ananthsub (Author):
@carmocca @awaelchli @tchaton @Borda mind taking a look?

Borda (Member) left a comment:

lgtm

@Borda Borda enabled auto-merge (squash) April 28, 2021 20:33
@carmocca carmocca added the ready PRs ready to be merged label Apr 28, 2021
@@ -548,7 +548,7 @@ def training_step(self, batch, batch_idx):
        return output

    model = TestModel()
-   early_stop = EarlyStopping(monitor="loss", patience=0)
+   early_stop = EarlyStopping(monitor="loss", patience=0, check_on_train_epoch_end=True)
Contributor:
Just to bring awareness: this changes the behavior for users. A user who monitors a training metric now needs to set this argument after upgrading, right? Let's make this clear in the changelog, in the "Changed" section.

And are we good to include this in 1.3 during the feature freeze? I assume so, since the milestone is set to 1.3, but just to make it clear to everyone.
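For readers upgrading, a minimal usage sketch of the opt-in, based on the test diff in this PR (the metric name "loss" comes from that test and is assumed to be logged in training_step):

    from pytorch_lightning.callbacks import EarlyStopping

    # Before this change: the early stopping check was also forced at train epoch end,
    # so monitoring a metric logged during training worked without extra arguments.
    early_stop = EarlyStopping(monitor="loss", patience=0)

    # After this change: monitoring a training metric requires opting in explicitly.
    early_stop = EarlyStopping(monitor="loss", patience=0, check_on_train_epoch_end=True)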

ananthsub (Author):
Added to the changelog. @edenlightning @tchaton wdyt for 1.3?

ananthsub (Author):
@awaelchli fwiw this is pretty new behavior: #5208

But definitely, it's a change compared to before, and I don't think there's a way to make this backward compatible. Logging a warning can also get lost. Keeping this around longer blocks the general loop refactor we want to do.

awaelchli (Contributor), Apr 29, 2021:

Yes, it can't be made compatible. I just want to push for a better changelog. Thanks @ananthsub.

We need to anticipate that people will report these changes as bugs when upgrading, and if we have a clear changelog and release notes we can point them to those.

@awaelchli awaelchli disabled auto-merge April 28, 2021 21:54
awaelchli (Contributor) left a comment:
Nice to finally see this TODO getting resolved!

@ananthsub ananthsub merged commit 14b8dd4 into Lightning-AI:master Apr 29, 2021
@ananthsub ananthsub deleted the early-stop-train-2 branch April 29, 2021 20:57
@ananthsub mentioned this pull request on May 3, 2021
Labels: callback, ready (PRs ready to be merged), refactor
Projects: none yet
Development: successfully merging this pull request may close the following issue:

  • Support early stopping during training inside of the early stopping callback

6 participants