default EarlyStopping callback should not fail on missing val_loss data #524

Closed
colllin opened this issue Nov 18, 2019 · 20 comments · Fixed by #743
Labels
bug Something isn't working

Comments

@colllin

colllin commented Nov 18, 2019

Describe the bug
My training script failed overnight — this is the last thing I see in the logs before the instance shut down:

python3.7/site-packages/pytorch_lightning/callbacks/pt_callbacks.py:128: RuntimeWarning: Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: avg_val_loss_total,avg_val_jacc10,avg_val_ce
  RuntimeWarning)

It seems like we intended this to be a "warning" but it appears that it interrupted my training script. Do you think that's possible, or could it be something else? I had 2 training scripts running on 2 different instances last night and both shut down in this way, with this RuntimeWarning as the last line in the logs.

Is it possible that the default EarlyStopping callback killed my script because I didn't log a val_loss tensor somewhere it could find it? To be clear, it is not my intention to use EarlyStopping at all, so I was quite surprised to wake up today and find my instance shut down and training interrupted, and no clear sign of a bug on my end.

Did you intend this to interrupt the trainer? If so, how do we feel about changing that plan so that the default EarlyStopping callback has no effect when it can't find a val_loss metric?

@colllin colllin added the bug Something isn't working label Nov 18, 2019
@kuynzereb
Contributor

I agree that it is quite unpleasant when the default early stopping callback unexpectedly stops training because it can't find val_loss. It is also unpleasant that you only find out the required metric is missing at the end of the first full training epoch (and, moreover, training stops at that point). So I would separate this into two different problems:

  1. The default early stopping should not stop training. We should either disable it when no val_loss is found, or simply disable it by default altogether.
  2. We should check at the very beginning of training that the metric required by early stopping is produced by the validation loop. Currently it is checked only at the end of the first training epoch, and if the metric is not present, training stops.

@AS-researcher6

AS-researcher6 commented Nov 20, 2019

I'm guessing you were using check_val_every_n_epoch > 1.
This error occurs because callback_metrics is what early stopping uses, and it is cleared and re-filled at every training step logging. A hacky solution I have found is to save the last val_loss as a model attribute self.val_loss and return it at every training step, e.g.:
    output = {
        'loss': loss,
        'log': log_dict,
        'progress_bar': prog_dict,
        'val_loss': self.val_loss,
    }
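
A fuller sketch of this workaround (the model, loss function, and hook names below are illustrative placeholders, not taken from the thread; the aggregate validation hook is validation_end in Lightning versions of that era and validation_epoch_end in later ones):

    import torch
    import pytorch_lightning as pl

    class CachedValLossModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 1)
            # Cache the most recent validation loss so it can be re-reported
            # from training_step even on epochs where validation does not run.
            self.val_loss = torch.tensor(float('inf'))

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = torch.nn.functional.mse_loss(self.layer(x), y)
            return {
                'loss': loss,
                'log': {'train_loss': loss},
                'progress_bar': {'train_loss': loss},
                # Re-expose the cached value so early stopping always finds
                # a 'val_loss' key in callback_metrics.
                'val_loss': self.val_loss,
            }

        def validation_step(self, batch, batch_idx):
            x, y = batch
            return {'val_loss': torch.nn.functional.mse_loss(self.layer(x), y)}

        def validation_end(self, outputs):
            avg = torch.stack([o['val_loss'] for o in outputs]).mean()
            self.val_loss = avg  # update the cache for subsequent training steps
            return {'val_loss': avg, 'log': {'val_loss': avg}}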

@kuynzereb
Contributor

Wow, indeed, there is a third problem:
3) It is not clear how early stopping should work when check_val_every_n_epoch > 1.

However, please note that callback metrics are no longer replaced by new ones but updated. That was fixed in #492.

@kuynzereb
Contributor

So I would suggest the following:

  1. By default the early stop callback is turned on, but if there is no val_loss we just warn the user that early stopping will not work, and training proceeds as though there were no early stop callback.
  2. If the early stop callback is explicitly specified by the user, then we force a validation sanity check and examine the metrics it produces. If the metric required by the early stop callback is not present, we raise an error. (A rough sketch of both behaviors follows below.)

@williamFalcon, what do you think?
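
A rough sketch of how points 1 and 2 could fit together (this is not the actual Lightning implementation; the function name and arguments are made up for illustration):

    import warnings

    def check_early_stopping_metric(callback_metrics, monitor='val_loss',
                                    user_specified=False):
        # Illustrative check, imagined to run right after the validation
        # sanity check rather than at the end of the first training epoch.
        if monitor in callback_metrics:
            return True  # metric is available, early stopping can run normally
        if user_specified:
            # The user explicitly asked for early stopping on a metric that is
            # never logged: fail fast with a clear error.
            raise RuntimeError(
                f"Early stopping conditioned on metric `{monitor}` which is not "
                f"available. Available metrics are: {list(callback_metrics)}")
        # Default callback: warn once and let training proceed without it.
        warnings.warn(
            f"Metric `{monitor}` not found; the default early stopping callback "
            f"will be disabled and training will continue.", RuntimeWarning)
        return False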

@awaelchli
Contributor

awaelchli commented Nov 25, 2019

Isn't it possible that the user returns val_loss only in some epochs, e.g., only every other epoch (intentionally or not)?

@kuynzereb
Contributor

Yeah, it is a problem if no val_loss is returned in some epochs. In that case early stopping will behave quite strangely. This happens, for example, when check_val_every_n_epoch > 1.

@williamFalcon
Contributor

williamFalcon commented Nov 25, 2019

@awaelchli ... maybe modify the early stopping to skip the check when that key is missing?
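
A minimal sketch of that idea, using a simplified stand-in rather than the real Lightning callback:

    import warnings

    class SkippingEarlyStopping:
        # Simplified early-stopping logic for illustration only;
        # not the real pytorch_lightning callback.
        def __init__(self, monitor='val_loss', patience=3):
            self.monitor = monitor
            self.patience = patience
            self.wait = 0
            self.best = float('inf')

        def on_epoch_end(self, epoch, metrics):
            if self.monitor not in metrics:
                # Proposed behavior: skip the check instead of stopping training.
                warnings.warn(
                    f"`{self.monitor}` not found in metrics for epoch {epoch}; "
                    f"skipping the early-stopping check.", RuntimeWarning)
                return False  # do not stop
            current = metrics[self.monitor]
            if current < self.best:
                self.best = current
                self.wait = 0
            else:
                self.wait += 1
            return self.wait >= self.patience  # True means "stop training"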

@awaelchli
Contributor

very reasonable imo. @kuynzereb do you see any problem with this?

@kuynzereb
Contributor

Nope, it sounds good to me too. But we will need to explicitly remove this key from the callback metrics at the start of each epoch, otherwise it will always be available (right now it always stores the metric from the last validation loop).

fellnerse added a commit to fellnerse/forgerydetection that referenced this issue Nov 27, 2019
@williamFalcon
Contributor

@awaelchli or @kuynzereb mind submitting a PR?

@awaelchli
Contributor

I can look into it.

@williamFalcon
Contributor

@awaelchli any updates?

@awaelchli
Contributor

Lost track of this after I ran into some unexpected behaviors. Will try to get back to it, but it seems @kuynzereb has a better overview of early stopping than I do.

@kuynzereb
Contributor

It seems that we can just add a condition that early_stop_callback.on_epoch_end() should be called only if current_epoch % check_val_every_n_epoch == 0
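
Something like the following gating condition, sketched with placeholder names for the surrounding training loop:

    max_epochs = 10
    check_val_every_n_epoch = 3
    should_stop = False

    for current_epoch in range(max_epochs):
        # ... run the training epoch here ...
        if current_epoch % check_val_every_n_epoch == 0:
            # Validation ran this epoch, so a fresh val_loss is available and
            # it is safe to consult the early-stopping callback.
            # should_stop = early_stop_callback.on_epoch_end(current_epoch, metrics)
            pass
        if should_stop:
            break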

@veritas9872

Hello. I am still getting a similar problem. Has this been confirmed as solved?

@TuBui

TuBui commented Sep 14, 2021

I confirm the problem reappears in pytorch-lightning versions 1.4.0 and 1.4.1. The early stop callback is always checked at the end of the first epoch, so if check_val_every_n_epoch > 1 the job will fail.
It runs fine on version 1.3.1, though.

@tchaton
Contributor

tchaton commented Sep 14, 2021

Dear @TuBui, @veritas9872,

Would you mind trying out master?

pip install git+https://github.com/PyTorchLightning/pytorch-lightning.git

If the error persists, please re-open this issue.

Best,
T.C

@TuBui

TuBui commented Sep 14, 2021

Thanks for the quick update. pytorch-lightning 1.5.0.dev0 (the current master branch) works.

@yinrong

yinrong commented Oct 4, 2021

@TuBui I cannot pip install 1.5.0; how should I use it?

@TuBui

TuBui commented Oct 4, 2021

@yinrong check tchaton's answer above.
