EarlyStopping checkpointed state is lagging one epoch behind #1464
@lizhitwo thanks for this very detailed bug report! looking into it... at a high level, i think that we should keep the concern of the callback state contained within the callback itself. we can follow the pytorch convention of having methods for saving and loading state (i.e. `state_dict` / `load_state_dict`).
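For illustration, a minimal sketch of what that PyTorch-style convention could look like for a callback; the class and attribute names here are illustrative, not Lightning's actual API at the time:

```python
# Illustrative only: a callback that owns its state and exposes
# PyTorch-style state_dict / load_state_dict methods, so the trainer
# can serialize it without reaching into the callback's internals.

class EarlyStoppingSketch:
    def __init__(self, patience: int = 3):
        self.patience = patience
        self.wait = 0
        self.best_score = float("inf")

    def state_dict(self) -> dict:
        # everything needed to resume exactly where training left off
        return {"wait": self.wait, "best_score": self.best_score}

    def load_state_dict(self, state: dict) -> None:
        self.wait = state["wait"]
        self.best_score = state["best_score"]

# usage: checkpoint = {"early_stopping": callback.state_dict()}
```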
@Borda do you know why we need this? (edit: going to remove it)
the core problem for this issue is the ordering of callback execution: the checkpoint is saved before the early-stopping callback updates its state for the current epoch, so the state that ends up in the checkpoint is one epoch stale.
cc @PyTorchLightning/core-contributors, any suggestions? i'm not a huge fan of either of the approaches i've come up with so far.
I would like to propose that we do the following:
@lizhitwo would this be a suitable solution for your needs?
This would be correct only when the training ends normally. When the user hits Ctrl+C to interrupt, the previous checkpoints that are already saved are still lagging one epoch behind.

I would also advise against putting more undocumented checkpoint naming conventions into Lightning. I am already confused about why Lightning overrides my checkpoint name unless I format it in a particular way.

Question: is there a reason why the early stopping callback can't be split into two, with the status update moved before the checkpoint? You could update its state before checkpointing, then checkpoint, and then query its state to decide whether early stopping should be performed, e.g. via a flag.
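For what it's worth, here is a self-contained sketch of the split being asked about, with the state update happening before the checkpoint is written and the stop decision queried afterwards (all names here are hypothetical, not Lightning hooks):

```python
# Hypothetical ordering: 1) update early-stopping state, 2) checkpoint,
# 3) query the state to decide whether to stop.

class EarlyStoppingState:
    def __init__(self, patience: int = 3):
        self.patience, self.wait, self.best = patience, 0, float("inf")
        self.should_stop = False

    def update(self, current: float) -> None:
        if current < self.best:
            self.best, self.wait = current, 0
        else:
            self.wait += 1
            self.should_stop = self.wait >= self.patience

def on_validation_end(es, save_checkpoint, val_loss):
    es.update(val_loss)        # state now reflects *this* epoch
    save_checkpoint(es)        # whatever is saved is already up to date
    return es.should_stop      # the stopping decision comes last

# usage
es = EarlyStoppingState(patience=2)
for loss in [1.0, 0.9, 0.95, 0.97, 0.99]:
    if on_validation_end(es, lambda s: None, loss):
        break
```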
yeah you're right, we need a better solution here. i think we can also improve how the checkpoint naming is done but i'll leave that for a separate issue.
it's a good question. this is a bit challenging because our checkpoint callback is currently set up to run every time we iterate over the validation set. this can happen once per epoch, multiple times per epoch, or once every n epochs, depending on how the user has defined various arguments in their Trainer. here's how i think we should proceed:
cc @lizhitwo and @Borda, want to weigh in on whether this is a good plan?
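For context on the cadences mentioned above, a sketch of the Trainer arguments that control how often validation (and therefore the checkpoint callback) runs; the argument names are from my reading of the 0.7-era docs, so double-check them against your version:

```python
from pytorch_lightning import Trainer

# validate once per training epoch (the default)
trainer_a = Trainer(val_check_interval=1.0)

# validate four times within each training epoch
trainer_b = Trainer(val_check_interval=0.25)

# validate (and thus checkpoint) only every 5th epoch
trainer_c = Trainer(check_val_every_n_epoch=5)
```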
I think that in some cases you want to monitor both train and valid...
do you mean a Trainer state like we discussed several times before? I think there is one more thing we should think about, namely specifying the "unit" for evaluation, e.g. whether early stopping is counted per epoch or per validation run. cc: @PyTorchLightning/core-contributors
could you provide an example? in 90% of the cases the user is going to want to monitor something like the validation loss.
i'm thinking of setting an attribute on the trainer to signal that training should stop.
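A sketch of that attribute idea as I read it: the early-stopping callback only flips a flag on the trainer, and the training loop decides when to act on it. The attribute name `should_stop` is an assumption here, not necessarily what was implemented:

```python
# Illustrative only: the callback signals, the loop decides.

class EarlyStoppingFlag:
    def __init__(self, patience: int = 3):
        self.patience, self.wait, self.best = patience, 0, float("inf")

    def on_validation_end(self, trainer, val_loss: float) -> None:
        if val_loss < self.best:
            self.best, self.wait = val_loss, 0
        else:
            self.wait += 1
        if self.wait >= self.patience:
            trainer.should_stop = True  # assumed attribute name; the loop checks it later
```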
I think this works, as long as the early-stopping state is updated before the checkpoint is saved; then no matter when the checkpoint is done, the early-stopping state is clean.

I think monitoring both train and val is better left to users, since they need to specify how the criterion is computed anyway. They can choose to compute it during either val or train and add it to the log in one of them.
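A sketch of what leaving this to the user could look like: compute one combined quantity yourself, log it under a single key, and point the early-stopping monitor at that key (the weighting and the key name here are arbitrary examples, not anything prescribed by Lightning):

```python
import torch
from pytorch_lightning.callbacks import EarlyStopping

def combined_criterion(train_loss: torch.Tensor, val_loss: torch.Tensor) -> torch.Tensor:
    # arbitrary example weighting; the user decides how train and val mix
    return 0.3 * train_loss + 0.7 * val_loss

# log the result under one key from your validation hook, then monitor it;
# Lightning itself never has to know how the criterion is built
early_stop = EarlyStopping(monitor="combined_loss", patience=3)
```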
yes, that was a behavioral regression introduced by #1528. i will fix it as well, thanks for catching it! we clearly need better tests to spot these errors.
@jeremyjordan submit asap so we can get it in 0.7.4? @lizhitwo thanks for catching that!
actually... this may not be a trivial fix
in https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L459 we need to check the flag whenever we save weights; if it is set, we need to stop training but still make sure we run all the remaining actions for that step. In https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L369 we actually want to exit the loop. But anyhow, @jeremyjordan, if you figure this out let's get this into 0.7.4
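A rough control-flow sketch of the two places being described, not the actual Lightning source: one check around the point where weights are saved, where the remaining bookkeeping must still run, and one check where the epoch loop truly exits:

```python
class StubTrainer:
    should_stop = False

def run_training_epoch(trainer):
    # stand-in for the inner loop; pretend a callback requested a stop here
    trainer.should_stop = True

def save_weights_and_log(trainer):
    # stand-in for checkpointing/logging that must still happen this epoch
    pass

def fit(trainer, max_epochs: int = 10):
    for epoch in range(max_epochs):
        run_training_epoch(trainer)
        save_weights_and_log(trainer)   # site 1: runs even when a stop was requested
        if trainer.should_stop:         # site 2: the loop actually exits here
            break

fit(StubTrainer())
```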
yes i'm hoping to get this all wrapped up this weekend, it's a bit of a tricky one
🐛 Bug
Currently, EarlyStopping's state is updated after the checkpoint callback runs, so what gets saved into the checkpoint is the previous epoch's state.
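A minimal, self-contained illustration of the lag (names are illustrative): the checkpoint captures the counters before the early-stopping callback updates them for the current epoch:

```python
class ES:
    def __init__(self):
        self.wait, self.best = 0, float("inf")

    def update(self, loss: float) -> None:
        if loss < self.best:
            self.best, self.wait = loss, 0
        else:
            self.wait += 1

es, saved_states = ES(), []
for epoch, val_loss in enumerate([1.0, 0.9, 0.95]):
    saved_states.append({"epoch": epoch, "wait": es.wait})  # checkpoint written first (buggy order)
    es.update(val_loss)                                     # state updated afterwards

print(saved_states)  # the epoch-2 entry still has wait == 0, while es.wait is now 1
```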
To Reproduce
This is somewhat related to #1463 so I am going to use the same code.
Steps to reproduce the behavior:

1. Install from master: `pip install git+https://github.com/PytorchLightning/pytorch-lightning.git@master --upgrade`
2. Let the model train until convergence.
3. Reload the model and see how it continues:
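(The original reproduction code is not included here; below is a hedged sketch of the wiring, using 0.7-era argument names as I understand them. `MyLightningModule` stands in for the model from #1463 and is not defined here, and the checkpoint path is a placeholder.)

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

# 1) train until early stopping triggers, saving checkpoints along the way
trainer = Trainer(
    max_epochs=100,
    early_stop_callback=EarlyStopping(monitor="val_loss", patience=2),
    checkpoint_callback=ModelCheckpoint(filepath="checkpoints/"),
)
trainer.fit(MyLightningModule())  # placeholder model

# 2) resume from the saved checkpoint; because the stored EarlyStopping state
#    is one epoch behind, training continues instead of stopping immediately
resumed = Trainer(
    max_epochs=100,
    early_stop_callback=EarlyStopping(monitor="val_loss", patience=2),
    resume_from_checkpoint="checkpoints/<saved>.ckpt",  # placeholder path
)
resumed.fit(MyLightningModule())
```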
The `early_stopping` callback prints its restored state, which is one epoch behind, and training keeps going.
Expected behavior
The `early_stopping` callback should print the up-to-date state, and the model should not be trained again at all, since `self.wait >= self.patience`.
If the model is loaded from an interrupted save, then it should still train after resuming, but with a corrected `self.wait`.
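A tiny sketch of the expected check on resume (illustrative names only): if the restored counter already meets the patience threshold, no further epochs should run; otherwise training resumes with the corrected counter:

```python
def epochs_should_continue(restored_wait: int, patience: int) -> bool:
    # training should only resume while the stopping condition is unmet
    return restored_wait < patience

assert epochs_should_continue(restored_wait=1, patience=3)       # interrupted run: keep training
assert not epochs_should_continue(restored_wait=3, patience=3)   # converged run: stop immediately
```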
Environment
This was run on Google Colab:
https://colab.research.google.com/drive/1ZdiFf6ksNpgsqOdSKM6lMO0yIhqpnTHD
Additional context
Somewhat related to #1463.