Add more early stopping options #6795
Yes, with this fix `min_epochs` / `min_steps` in `Trainer` will force this amount of training progression without stopping.
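A minimal usage sketch, assuming the standard `Trainer` arguments:

```python
import pytorch_lightning as pl

# Training will not stop early (e.g. via EarlyStopping) before both
# 10 epochs and 1000 optimizer steps have completed.
trainer = pl.Trainer(min_epochs=10, min_steps=1000, max_epochs=100)
```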
Amazing, thanks!
For the convergence/divergence, I could think of something like the following (which is what I believe you implemented above):

```python
# stop training if val_acc reaches 0.95 (conditioned on patience??)
EarlyStopping(monitor="val_acc", mode="max", boundary=0.95)

# stop training if val_acc reaches a value below 0.1 (divergence) or a value above 0.95;
# continue training as long as we are within the band
EarlyStopping(monitor="val_acc", mode="max", boundary=(0.1, 0.95))
```

However, I must say I am a bit skeptical about the usefulness of such thresholding. In my opinion, the patience + min_delta criterion sufficiently covers both cases of convergence and divergence.
I have no doubt that is true in your applications. But I think it is important to understand the use case of why this is essential before you design any code around it. To me, the patience criterion is only useful to stop a failing experiment early and try something different.

Forget completely about computer vision or whatever your application is, and even forget about machine learning. Think back to when you have used a serious optimizer or solver trying to maximize or minimize a function, solve a system of equations, etc. For that, look at https://nlopt.readthedocs.io/en/latest/NLopt_Reference/#return-values or https://www.artelys.com/docs/knitro/3_referenceManual/knitromatlabReference.html#return-codes-exit-flags or https://coin-or.github.io/Ipopt/OUTPUT.html or pretty much any of them.

In many applications, pytorch lightning is used in a way very similar to these optimizers. For example, you know exactly what the solution should be (e.g. everything should be zero if solving a big system of equations, the gradient should be zero if solving an optimization problem, etc.). For anything I do now, and likely will ever do, I am trying to effectively solve a big system of nonlinear equations and am evaluating the residual. It isn't successful unless that residual is below some stopping threshold, 1e-6 or whatever. Just to be clear, that is the analogy here to early stopping.
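As a sketch of that analogy (a hypothetical `ResidualStopping` callback, not an existing PL API), stopping once a monitored residual drops below a tolerance could look something like:

```python
import pytorch_lightning as pl


class ResidualStopping(pl.Callback):
    """Stop once a monitored residual falls below a tolerance,
    analogous to an optimizer's convergence criterion (hypothetical sketch)."""

    def __init__(self, monitor: str = "residual", tol: float = 1e-6):
        self.monitor = monitor
        self.tol = tol

    def on_validation_end(self, trainer, pl_module):
        value = trainer.callback_metrics.get(self.monitor)
        if value is not None and value < self.tol:
            # converged: treat as success, like an optimizer return code
            trainer.should_stop = True
```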
Hopefully that helps. These are just completely different use cases than you may be used to, but trust me when I say they are not particular to my usage. Anyone who uses PL as an optimizer would need this stuff. @luiscape I think it might make sense to get involved a little here to think about how this would work with the sort of grid controller we discussed. Multistart optimization methods/hyperparameter tuning would need this sort of thing. So with that, the boundary thing is not the right interface for a couple of reasons:
🚀 Feature
Additional early stopping features
@tchaton
First, `max_time`, which should probably sit in parallel to `max_epochs` in the main trainer loop. Why an additional one? Because (1) you never have any idea how long an epoch will take, especially if you tinker with hyperparameters; and (2) sometimes you want to give a fixed amount of time and see which version of the model does the best within it. (A rough callback sketch follows at the end of this section.)

Second, a few variations on the `EarlyStopping` callback, which is based on a metric.

For both of these, I think it is useful to have a `min_epochs` or similar option to ensure that training doesn't stop right away. I think that is what #6705 is supposed to do though, so it isn't needed here?

Finally, I think it would be great in PL to have a way to log the reason for stopping so that it can be seen in the logs and be available within the grid experiments view. I am not sure of the way to do that, but maybe the callback could save a string in the logs?
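A rough sketch of the `max_time` idea as a callback (a hypothetical `MaxTimeStopping`; hook signatures vary slightly across PL versions):

```python
import time

import pytorch_lightning as pl


class MaxTimeStopping(pl.Callback):
    """Stop training once a wall-clock budget is exhausted (hypothetical sketch)."""

    def __init__(self, max_seconds: float):
        self.max_seconds = max_seconds
        self.start_time = None

    def on_train_start(self, trainer, pl_module):
        self.start_time = time.monotonic()

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # check the budget every batch so we can stop mid-epoch
        if time.monotonic() - self.start_time > self.max_seconds:
            trainer.should_stop = True
```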
Implementation
I implemented these two features in the current callback with something like:
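(The original snippet was not preserved in this thread; the following is a hypothetical reconstruction using made-up `stopping_threshold` / `divergence_threshold` parameters.)

```python
from pytorch_lightning.callbacks import EarlyStopping


class ThresholdEarlyStopping(EarlyStopping):
    """EarlyStopping extended with convergence/divergence thresholds
    (hypothetical reconstruction, not the author's actual code)."""

    def __init__(self, stopping_threshold=None, divergence_threshold=None, **kwargs):
        super().__init__(**kwargs)
        # stop as soon as the monitored metric crosses this value (success)
        self.stopping_threshold = stopping_threshold
        # stop as soon as the monitored metric crosses this value (failure)
        self.divergence_threshold = divergence_threshold
```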
Then I added in something like:
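(Again a hypothetical reconstruction of the check itself; `self.monitor_op` is the comparison the callback already uses for its mode — `torch.gt` for `mode="max"`, `torch.lt` for `mode="min"`.)

```python
import torch


def _check_thresholds(self, current: torch.Tensor) -> bool:
    """Return True if training should stop based on the thresholds."""
    if self.stopping_threshold is not None and self.monitor_op(
        current, self.stopping_threshold
    ):
        return True  # metric crossed the success threshold
    if self.divergence_threshold is not None and self.monitor_op(
        -current, -self.divergence_threshold
    ):
        return True  # metric moved past the divergence threshold in the bad direction
    return False
```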