
Trainer.train argument resume_from_last_checkpoint #10280

Closed
tanmay17061 opened this issue Feb 19, 2021 · 3 comments · Fixed by #10334
tanmay17061 (Contributor) commented Feb 19, 2021

🚀 Feature request

Trainer.train accepts a resume_from_checkpoint argument, which requires the user to explicitly provide the checkpoint location to continue training from.
A resume_from_last_checkpoint option could be useful for resuming training by picking the latest checkpoint from the output_dir of the TrainingArguments passed.

Motivation

  1. The checkpoint directory is created by the library, so the user needs to navigate to it to find the value to pass as resume_from_checkpoint.
  2. The user may simply want to resume from the last valid checkpoint because their training was disrupted (a common reason for wanting to resume training). All they know is the output_dir they provided initially.

This motivates adding a resume_from_last_checkpoint=True argument to the Trainer.train(...) call, which would pick the latest checkpoint from args.output_dir. FYI, the get_last_checkpoint function from trainer_utils can be used to do exactly this.
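For context, this is roughly how it can be done manually today (a minimal sketch; `trainer` and `training_args` are assumed to be an already-configured Trainer and its TrainingArguments):

```python
from transformers.trainer_utils import get_last_checkpoint

# Find the most recent checkpoint-* folder the Trainer wrote to output_dir
# (returns None if no checkpoint exists there).
last_checkpoint = get_last_checkpoint(training_args.output_dir)

# Resume from it explicitly.
trainer.train(resume_from_checkpoint=last_checkpoint)
```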

Your contribution

I can raise a PR if it is a useful feature to have!

@tanmay17061 tanmay17061 changed the title Trainer argument resume_from_latest_checkpoint Trainer.train argument resume_from_latest_checkpoint Feb 19, 2021
@tanmay17061 tanmay17061 changed the title Trainer.train argument resume_from_latest_checkpoint Trainer.train argument resume_from_last_checkpoint Feb 20, 2021
sgugger (Collaborator) commented Feb 22, 2021

Instead of adding a new argument, I would reuse the existing resume_from_checkpoint and change its type to bool or str/PathLike. If it's a bool and set to True, we then use get_last_checkpoint to get the last checkpoint in args.output_dir. Does that sound good to you?
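A minimal sketch of that resolution logic inside Trainer.train, assuming self.args.output_dir and get_last_checkpoint as discussed above (illustrative only, not the exact merged code):

```python
# Inside Trainer.train(self, resume_from_checkpoint=None, ...):
# resolve a bare True to the newest checkpoint directory before loading.
if isinstance(resume_from_checkpoint, bool) and resume_from_checkpoint:
    resume_from_checkpoint = get_last_checkpoint(self.args.output_dir)
    if resume_from_checkpoint is None:
        raise ValueError(
            f"No valid checkpoint found in output directory ({self.args.output_dir})"
        )
```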

tanmay17061 (Contributor, Author) commented
Yes, SGTM. I have raised a PR doing exactly that. Do let me know if any other changes are required as well!

PS: Can you also review my other PR introducing save_strategy in TrainingArguments? That PR is the last one needed to round out the save_strategy, evaluation_strategy and logging_strategy enhancements.

Thanks!

tanmay17061 added a commit to tanmay17061/transformers that referenced this issue Feb 22, 2021
Enhance resume_from_checkpoint argument of Trainer.train to accept
bool type. If True given, last saved checkpoint in self.args.output_dir
will be loaded. (huggingface#10280)
sgugger pushed a commit that referenced this issue Feb 22, 2021
Enhance resume_from_checkpoint argument of Trainer.train to accept
bool type. If True given, last saved checkpoint in self.args.output_dir
will be loaded. (#10280)
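With that change merged, resuming from the latest checkpoint in output_dir should reduce to a call like the following (illustrative usage, assuming a configured trainer):

```python
# Automatically picks up the newest checkpoint-* folder in args.output_dir.
trainer.train(resume_from_checkpoint=True)
```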
npoulose commented

Is it possible to train while adding a new category to the dataset using this resume_from_checkpoint argument?
