Check early stopping metric in the beginning of the training #542
Conversation
not sure about some sections...
RuntimeWarning)
stop_training = True
then you should return True
Not exactly. Return True was there before, and it caused the training to be interrupted if the required metric was not found. Now it just gives a warning and training proceeds as though early stopping were disabled. The point is that the callback should not stop the training if it can't find the metric.
Actually, in the current implementation this branch is not reachable, because we check for the availability of the metric during trainer initialization. But my idea was that if we decide to set early_stopping to True by default, then it can be used to give a warning without stopping the training.
You can also look at #524 for better understanding.
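For concreteness, here is a minimal sketch of a Keras-style early stopping callback that warns and keeps training when the monitored metric is missing. The class name, hook signature, and defaults are assumptions for illustration, not the actual Lightning implementation:

```python
import warnings


class EarlyStoppingSketch:
    """Minimal sketch: warn instead of stopping when the monitored metric is missing."""

    def __init__(self, monitor='val_loss', patience=3):
        self.monitor = monitor
        self.patience = patience
        self.wait = 0
        self.best = float('inf')

    def on_epoch_end(self, epoch, logs=None):
        current = (logs or {}).get(self.monitor)
        if current is None:
            warnings.warn(
                f"Early stopping conditioned on metric `{self.monitor}` "
                f"which is not available. Training will proceed without early stopping.",
                RuntimeWarning)
            return False  # do not stop training, only warn

        if current < self.best:
            self.best = current
            self.wait = 0
        else:
            self.wait += 1
        return self.wait > self.patience  # True means "stop training"
```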
pytorch_lightning/trainer/trainer.py
Outdated
@@ -53,7 +53,7 @@ class Trainer(TrainerIOMixin,
     def __init__(self,
                  logger=True,
                  checkpoint_callback=True,
-                 early_stop_callback=True,
+                 early_stop_callback=False,
why change the default config?
As I wrote in the description, we should discuss whether early stopping should be turned on or off by default. I think it is better for it to be turned off. Again, please look at #524.
Early stopping should default to True - this is the most common use case.
It is best practice for research, with a patience of 3.
The problem is that it is conditioned on val_loss, and it will work only if the user guesses the right name. For this reason it seems to me that the better option is for the user to deliberately enable early stopping. It is also very easy (just set True).
But if you insist, I can suggest the following: we enable early stopping by default, but if there is no val_loss we just warn the user that early stopping will not work, and training proceeds with early stopping disabled. I just don't like that there will be warnings when you run the trainer with the default settings :)
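A rough sketch of that fallback (the helper name and return values are invented for illustration):

```python
import warnings


def resolve_early_stopping_sketch(early_stop_callback, val_metrics):
    """Hypothetical helper: early stopping enabled by default, but degraded to
    disabled (with a warning) when val_loss is not among the returned metrics."""
    if early_stop_callback is False:
        return False  # explicitly disabled by the user

    if 'val_loss' not in val_metrics:
        warnings.warn(
            "val_loss was not found in the validation metrics; "
            "early stopping will not work and training proceeds without it.",
            RuntimeWarning)
        return False

    return True  # early stopping active, monitoring val_loss


# usage sketch
assert resolve_early_stopping_sketch(True, {'val_loss': 0.3}) is True
assert resolve_early_stopping_sketch(True, {'some_other_metric': 0.3}) is False
```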
pytorch_lightning/trainer/trainer.py
Outdated
@@ -185,6 +185,8 @@ def __init__(self,
         # creates a default one if none passed in
         self.early_stop_callback = None
         self.configure_early_stopping(early_stop_callback, logger)
+        if self.enable_early_stop:
+            self.nb_sanity_val_steps = max(1, self.nb_sanity_val_steps)
maybe max(1, nb_sanity_val_steps)
since earlier you have
if self.fast_dev_run:
    self.nb_sanity_val_steps = 1
But exactly for that reason it should be max(1, self.nb_sanity_val_steps) :)
We just take the previously determined final self.nb_sanity_val_steps and set it to 1 if it is less than 1.
If we did it as you suggested, then self.nb_sanity_val_steps would be equal to the user-defined value in fast dev run mode, but it should be 1.
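A small, self-contained sketch of the initialization order being discussed (the attribute names follow the thread; the class itself is only an illustration, not the real Trainer):

```python
class TrainerInitSketch:
    """Minimal sketch of the initialization order under discussion (not the real Trainer)."""

    def __init__(self, nb_sanity_val_steps=5, fast_dev_run=False, enable_early_stop=True):
        self.fast_dev_run = fast_dev_run
        self.enable_early_stop = enable_early_stop

        self.nb_sanity_val_steps = nb_sanity_val_steps
        if self.fast_dev_run:
            self.nb_sanity_val_steps = 1  # fast dev run forces a single sanity step

        if self.enable_early_stop:
            # max(1, self.nb_sanity_val_steps) keeps the fast_dev_run override (1);
            # max(1, nb_sanity_val_steps) would overwrite it with the user value again
            self.nb_sanity_val_steps = max(1, self.nb_sanity_val_steps)


assert TrainerInitSketch(nb_sanity_val_steps=5, fast_dev_run=True).nb_sanity_val_steps == 1
assert TrainerInitSketch(nb_sanity_val_steps=0, fast_dev_run=False).nb_sanity_val_steps == 1
```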
Let's not do this. People need to have the option of turning sanity_val_check off
But how then will we check that early stopping will work correctly? (Note that we force this check only if early stopping is turned on.)
I understand what you're saying, but forcing the sanity check on EVERYONE will certainly block some esoteric research or production cases, so we can't do this.
But I think this is on the user at this point. If they turned off the sanity check then it's on them and they are willingly exposing themselves to these kinds of issues... but for people who keep it on, we use what you suggest.
Done
@kuynzereb ok, a few things:
I think it's good to check that the metric exists at the beginning of training, which we can do at the first instance of a val result:
Hmmm, okay, I think I got it. Our position is: if the user turns off the validation sanity check, then they are responsible for any failures that can arise after the very first validation loop.
exactly
@@ -113,6 +113,13 @@ def run_training_epoch(self):
            if self.fast_dev_run or should_check_val:
                self.run_evaluation(test=self.testing)

+           if (self.enable_early_stop and
+                   self.callback_metrics.get(self.early_stop_callback.monitor) is None):
+               raise RuntimeError(f"Early stopping was configured to monitor "
👍
I guess the tests fail because early stopping is turned on by default but some test models don't return the correct val_loss.
I have fixed the tests. While doing that I discovered some more problems:
it will not happen and there will be validation runs.
@kuynzereb awesome! mind fixing the conflicts?
Yep, I will try to fix the conflicts ASAP.
Well, I have quite rethought the whole thing, so @Borda and @williamFalcon please look at this. I have added ... It allows us just to call ... Finally, now I have set ... And if you set ...
well, let me see the latest changes...
If early_stop_callback=None I would expect it to be turned off even if the user wants val_loss or any other metric to be returned and logged.
Yeah, it is indeed somewhat ambiguous. But at the same time it is a good practice to use val_loss. The problem is that:
So, one possible solution is to handle the default early stopping in a special way. And the only way I know to distinguish between a user-set and the default value is to set it to None. Another solution may be not to crash training at all, even if early stopping was deliberately turned on, and just give warnings.
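A rough sketch of how the None-as-default option could look (argument handling and names are invented for illustration, not the merged behaviour):

```python
import warnings


def interpret_early_stop_arg(early_stop_callback=None, val_metrics=()):
    """Sketch of the 'None as default' idea: None marks the default case so it can
    be handled more leniently than an explicit True."""
    if early_stop_callback is True:
        # explicitly requested by the user: a missing val_loss is a hard error
        if 'val_loss' not in val_metrics:
            raise RuntimeError("Early stopping requested but val_loss is not returned.")
        return 'enabled'
    if early_stop_callback is False:
        return 'disabled'
    if early_stop_callback is None:
        # default value: enable only if val_loss is actually available, otherwise warn
        if 'val_loss' in val_metrics:
            return 'enabled'
        warnings.warn("val_loss not found; continuing without early stopping.",
                      RuntimeWarning)
        return 'disabled'
    return 'custom callback'  # a user-supplied callback instance
```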
@neggert @Borda @kuynzereb we good with this change?
@kuynzereb could you rebase master to make sure you have the latest test fixes on GPU?
I have merged master and updated the docs
I would say that rebase is a better option, but merging master should also be fine...
As was discussed in #524, we should check the availability of the metric required by early stopping at the very beginning of training. To do so, I force a validation sanity check if early stopping is enabled and check the obtained callback metrics.
Also, this PR sets the default early_stop_callback to False. It was briefly discussed in #524 and #536. I suggest we finally decide whether it should be turned on or off. I think it is better to turn it off because early stopping is conditioned on val_loss: if the user doesn't know about early stopping, then it will be pure coincidence whether it works; in some cases it will and in some cases it will not. And if the user knows about early stopping, then they can easily turn it on themselves.
The changes can be tested with the following code:
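A minimal sketch along these lines (not the exact snippet from the PR), using the API names referenced in this thread (validation_end, early_stop_callback); the toy model deliberately returns a metric other than val_loss, so the new check should fail right after the forced sanity run:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class NoValLossModel(pl.LightningModule):
    """Toy model whose validation_end deliberately does NOT return val_loss."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {'loss': nn.functional.mse_loss(self(x), y)}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        return {'some_other_metric': nn.functional.mse_loss(self(x), y)}

    def validation_end(self, outputs):
        # note: no 'val_loss' key here, so the early stopping check should complain
        avg = torch.stack([o['some_other_metric'] for o in outputs]).mean()
        return {'some_other_metric': avg}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

    def _dataloader(self):
        x = torch.randn(64, 4)
        y = torch.randn(64, 1)
        return DataLoader(TensorDataset(x, y), batch_size=8)

    def train_dataloader(self):
        return self._dataloader()

    def val_dataloader(self):
        return self._dataloader()


# with early stopping enabled, the metric check should raise right after the sanity run
trainer = pl.Trainer(early_stop_callback=True)
trainer.fit(NoValLossModel())
```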