Add patience argument to Trainer #4186

Conversation

@thesamuel commented May 6, 2020

This closes #4894.

Summary

Often, we want to stop training if the loss does not improve for a number of epochs. This PR adds a "patience" argument: a limit on the number of consecutive evaluations with a non-improving eval loss before training is stopped early.

It is implemented by other NLP frameworks, such as AllenNLP (see trainer.py and metric_tracker.py).

Motivation

This feature enables faster fine-tuning by breaking out of the training loop early, and it spares users the toil of monitoring metrics on TensorBoard.

Caveats

Often, models are evaluated once per epoch, but run_lm_finetuning.py has an option to evaluate after a set number of model update steps (dictated by --logging_steps if --evaluate_during_training is true). Because of this, I've elected to tie patience to the number of evaluations without improvement in loss.
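
To make the rule concrete, here is a minimal, self-contained sketch of the patience check. The `should_stop_early` helper is illustrative only and not part of this PR; the counters mirror the PR's `best_eval_loss` / `evals_without_improvement` variables:

```python
def should_stop_early(eval_losses, patience):
    """Return True once `patience` or more consecutive evaluations fail to
    improve on the best eval loss seen so far."""
    if patience <= 0:
        # patience disabled
        return False
    best_eval_loss = None
    evals_without_improvement = 0
    for loss in eval_losses:
        if best_eval_loss is None or loss < best_eval_loss:
            best_eval_loss = loss
            evals_without_improvement = 0  # reset the counter on improvement
        else:
            evals_without_improvement += 1
    return evals_without_improvement >= patience


# Example: three evaluations in a row without improvement trigger early stopping.
assert should_stop_early([2.0, 1.5, 1.6, 1.7, 1.8], patience=3) is True
assert should_stop_early([2.0, 1.5, 1.6, 1.4], patience=3) is False
```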

@thesamuel (Author)

This supersedes #2840, where I added patience to the outdated run_language_modeling.py script.

@BramVanroy requested review from LysandreJik and thomwolf and removed the request for LysandreJik, June 15, 2020
@BramVanroy (Collaborator)

Looking good! Can you add a reference in your original post noting that this closes #4894? Thanks

@julien-c (Member) left a comment

Looks good! Small suggestions below.

Comment on lines +312 to +313

```python
best_eval_loss = None
evals_without_improvement = 0
```

nit: prefix those with patience_ as they're specific to this feature.

Suggested change:

```diff
-best_eval_loss = None
-evals_without_improvement = 0
+patience_best_eval_loss = None
+patience_evals_without_improvement = 0
+patience_should_stop = False
```

Comment on lines +363 to +365

```python
logger.info(
    f"Patience threshold ({self.args.patience}) exceeded, stopping training"
)
```

Suggested change:

```diff
+patience_should_stop = True
 logger.info(
     f"Patience threshold ({self.args.patience}) exceeded, stopping training"
 )
```

Comment on lines +396 to +397

```python
if ((self.args.max_steps > 0 and global_step > self.args.max_steps) or
        (self.args.patience > 0 and evals_without_improvement >= self.args.patience)):
```

Suggested change:

```diff
-if ((self.args.max_steps > 0 and global_step > self.args.max_steps) or
-        (self.args.patience > 0 and evals_without_improvement >= self.args.patience)):
+if ((self.args.max_steps > 0 and global_step > self.args.max_steps) or
+        patience_should_stop):
```

On the corresponding check after the inner loop:

```diff
         epoch_iterator.close()
         break
-    if self.args.max_steps > 0 and global_step > self.args.max_steps:
+    if ((self.args.max_steps > 0 and global_step > self.args.max_steps) or
+            (self.args.patience > 0 and evals_without_improvement >= self.args.patience)):
```

same here

@kevin-yauris

Hello, when will this feature be merged? I would like to use it. Thank you.

@BramVanroy (Collaborator)

> Hello, when will this feature be merged? I would like to use it. Thank you.

There are some requested changes that @thesamuel should address before this can be merged.

@misrasaurabh1 (Contributor)

Bump. Early stopping is critical for an automated Trainer that reliably gives us the best model. The current way of choosing when to stop training seems to be specifying a static train_epochs, but how long a model needs to train depends on far too many factors (learning rate, data complexity, model, model size, optimizer, and so on) for it to be reasonable to ask the user to specify the number of epochs in advance.
I believe the current assumption is that people train with very small learning rates so that the loss always keeps decreasing, if slowly, but in my experience (and on my data) this is a sub-optimal schedule that takes too much time. I find that training with higher learning rates and larger batch sizes, and stopping at the early-stopping point, yields equally good if not better models. This does, however, require early stopping.

@LysandreJik requested a review from sgugger, September 7, 2020
@PhilipMay (Contributor) commented Sep 9, 2020

I would like to use this early stopping for downstream training.
The current implementation only stops training by monitoring the loss. IMO it should also be possible to monitor other metrics like F1 and ROC-AUC.

I would also like to add a feature that stores the model each time the monitored metric improves and then optionally loads that model after training, so that later evaluation can be done on this "best" model.
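
A rough sketch of what such a tracker could look like (the `MetricTracker` class and its names here are purely illustrative, not code from this PR or from transformers):

```python
# Hypothetical sketch: track an arbitrary metric, remember the best checkpoint,
# and signal early stopping after `patience` evaluations without improvement.
class MetricTracker:
    def __init__(self, metric_name="eval_f1", greater_is_better=True, patience=3):
        self.metric_name = metric_name
        self.greater_is_better = greater_is_better
        self.patience = patience
        self.best_value = None
        self.best_checkpoint = None
        self.evals_without_improvement = 0

    def update(self, metrics, checkpoint_path):
        """Record one evaluation; return True if training should stop."""
        value = metrics[self.metric_name]
        improved = self.best_value is None or (
            value > self.best_value if self.greater_is_better else value < self.best_value
        )
        if improved:
            self.best_value = value
            self.best_checkpoint = checkpoint_path  # the "best" model to reload later
            self.evals_without_improvement = 0
        else:
            self.evals_without_improvement += 1
        return self.evals_without_improvement >= self.patience
```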

@thesamuel @julien-c @kevin-yauris what do you think?

@sgugger (Collaborator) commented Sep 9, 2020

I plan to work on this once I'm finished with the Funnel Transformer model @PhilipMay (so end of this week, beginning of the next).

@PhilipMay (Contributor) commented Sep 9, 2020

> I plan to work on this once I'm finished with the Funnel Transformer model @PhilipMay (so end of this week, beginning of the next).

@sgugger That would be awesome. Maybe you want to get some inspiration from the FARM training loop, which is pretty nice IMO:

https://github.com/deepset-ai/FARM/blob/master/farm/train.py#L262-L370

@PhilipMay (Contributor)

I just found this PR that was already merged: #7431
I think it solved this...

@sgugger (Collaborator) commented Sep 29, 2020

Not quite, but it makes implementing it easier.

@PhilipMay (Contributor)

> Not quite, but it makes implementing it easier.

Yes - you are right. The patience part is still missing.

stale bot commented Nov 29, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label, Nov 29, 2020
@BramVanroy (Collaborator)

@sgugger Should we keep this open? You wrote in this thread that you would work on this if you found the time, but I am not sure whether you plan to use another PR for that.

stale bot removed the wontfix label, Nov 30, 2020
@sgugger (Collaborator) commented Nov 30, 2020

There has been a PR merged adding the EarlyStoppingCallback (#8581) so I think this can be closed now.
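
A minimal usage sketch of that callback (assuming a `model`, `train_dataset`, and `eval_dataset` are already defined; exact argument names may differ between transformers versions):

```python
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    evaluation_strategy="steps",       # evaluate periodically so there is a metric to track
    eval_steps=500,
    load_best_model_at_end=True,       # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,           # lower eval_loss is better
)

trainer = Trainer(
    model=model,                       # assumed to be defined elsewhere
    args=training_args,
    train_dataset=train_dataset,       # assumed to be defined elsewhere
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```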

@sgugger closed this on Nov 30, 2020
@thesamuel (Author)

Thanks @cbrochtrup @sgugger! Sorry I didn't get around to this...

@cbrochtrup

You're welcome, happy to help!

Development

Successfully merging this pull request may close these issues:

🚀 Add early stopping to the trainer