default EarlyStopping callback should not fail on missing val_loss data #524
I agree that it is quite unpleasant when the default early stopping callback unexpectedly stops the training because it can't find val_loss.
I'm guessing you were doing check_val_every_n_epoch > 1.
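For anyone unfamiliar with that flag, here is a minimal sketch of the configuration being guessed at above (not taken from the original report): with check_val_every_n_epoch=2, validation only runs every other epoch, so there are epochs where no fresh val_loss exists for the early stopping callback to read.

```python
# Sketch only: illustrate how val_loss can be absent on some epochs.
import pytorch_lightning as pl

trainer = pl.Trainer(
    max_epochs=10,
    check_val_every_n_epoch=2,  # validation (and hence val_loss) only every 2nd epoch
)
```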
Wow, indeed, there is a third problem. However, please note that callback metrics are now no longer replaced by new ones but updated. It was fixed in #492.
So I would suggest the following:
@williamFalcon, what do you think?
Isn't it possible that the user returns val_loss only in some epochs, e.g., only every other epoch (intentionally or not)?
Yeah, it is a problem if no val_loss is returned.
@awaelchli ... maybe modify the early stopping to skip the check when that key is missing?
very reasonable imo. @kuynzereb do you see any problem with this?
Nope, it sounds good to me too. But we will need to explicitly remove this key from the callback metrics at the start of each epoch, otherwise it will always be available (right now it always stores the metric from the last validation loop).
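A minimal sketch of the behaviour being proposed, assuming the public Callback hooks of a reasonably recent PyTorch Lightning release; the subclass name is mine, and this is not the fix that eventually landed:

```python
# Sketch only: skip the early stopping check entirely when the monitored
# key is absent from the callback metrics, instead of failing the run.
# `SafeEarlyStopping` is a hypothetical name, not part of the library.
from pytorch_lightning.callbacks import EarlyStopping


class SafeEarlyStopping(EarlyStopping):
    def on_validation_end(self, trainer, pl_module):
        # If the monitored metric (e.g. "val_loss") was not logged this
        # time around, do nothing rather than stopping or raising.
        if self.monitor not in trainer.callback_metrics:
            return
        super().on_validation_end(trainer, pl_module)
```

Passing callbacks=[SafeEarlyStopping(monitor="val_loss")] to the Trainer would then leave epochs without a val_loss untouched.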
@awaelchli or @kuynzereb mind submitting a PR?
I can look into it.
@awaelchli any updates?
Lost track of this after I ran into some unexpected behaviors. Will try to get back to it, but it seems @kuynzereb has a better overview of early stopping than me.
It seems that we can just add a condition that skips the early stopping check when val_loss is missing from the callback metrics.
Hello. I am still getting a similar problem. Has this been confirmed as solved? |
I confirm the problem re-appears in pytorch-lightning versions 1.4.0 and 1.4.1. The early stop callback is always checked at the end of the first epoch, so if the monitored val_loss has not been logged yet at that point, the callback fails.
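In case it helps anyone hitting this on 1.4.x before a fix lands, a hedged workaround sketch: I believe recent EarlyStopping versions expose a check_on_train_epoch_end flag that controls whether the check also runs at the end of the training epoch; exact availability and defaults may differ by version.

```python
# Possible workaround (assuming the flag is available in your version):
# only run the early stopping check after validation, not at the end of
# every training epoch where val_loss may not exist yet.
from pytorch_lightning.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",
    patience=3,
    check_on_train_epoch_end=False,  # defer the check to validation end
)
```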
Dear @TuBui, @veritas9872,
Would you mind trying out master?
pip install git+https://github.com/PyTorchLightning/pytorch-lightning.git
If the error persists, please re-open this issue.
Best,
thanks for the quick update. |
@TuBui I cannot pip install 1.5.0, how should I use it?
@yinrong check tchaton's answer.
Describe the bug
My training script failed overnight — this is the last thing I see in the logs before the instance shut down:
It seems like we intended this to be a "warning" but it appears that it interrupted my training script. Do you think that's possible, or could it be something else? I had 2 training scripts running on 2 different instances last night and both shut down in this way, with this RuntimeWarning as the last line in the logs. Is it possible that the default EarlyStopping callback killed my script because I didn't log a val_loss tensor somewhere it could find it?

To be clear, it is not my intention to use EarlyStopping at all, so I was quite surprised to wake up today and find my instance shut down and training interrupted, and no clear sign of a bug on my end. Did you intend this to interrupt the trainer? If so, how do we feel about changing that plan so that the default EarlyStopping callback has no effect when it can't find a val_loss metric?
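For completeness, here is how one would make the monitored metric visible to an EarlyStopping callback in current PyTorch Lightning versions. This reflects the newer self.log API and explicit callback wiring, not necessarily the release used in the original report; the model body is only a placeholder.

```python
# Sketch for newer PyTorch Lightning versions: log val_loss explicitly so an
# EarlyStopping callback can find it in trainer.callback_metrics.
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.log("val_loss", loss)  # makes val_loss visible to callbacks
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# Early stopping is only active if you add the callback yourself here.
trainer = pl.Trainer(callbacks=[EarlyStopping(monitor="val_loss")])
```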