
Trainer.train argument resume_from_last_checkpoint #10280

Closed
tanmay17061 opened this issue Feb 19, 2021 · 3 comments · Fixed by #10334
tanmay17061 (Contributor) commented Feb 19, 2021

🚀 Feature request

Trainer.train accepts a resume_from_checkpoint argument, which requires the user to explicitly provide the checkpoint location to continue training from.
A resume_from_last_checkpoint option could be useful for resuming training by picking the latest checkpoint from the output_dir of the TrainingArguments passed.

Motivation

  1. The checkpoint directory is created by the library, so the user needs to navigate to it to find the value to pass as resume_from_checkpoint.
  2. The user may simply want to resume from the last valid checkpoint because their training was disrupted (a common reason for wanting to resume training). All they know is the output_dir they provided initially.

This motivates adding a resume_from_last_checkpoint=True argument to the Trainer.train(...) call, which would pick the latest checkpoint from args.output_dir. FYI, the get_last_checkpoint function from trainer_utils can be used to do exactly this.
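For context, this is roughly how it can be done manually today (a minimal sketch; `trainer` and `training_args` are assumed to be an already-configured Trainer and its TrainingArguments):

```python
from transformers.trainer_utils import get_last_checkpoint

# Find the most recent checkpoint-* folder the Trainer wrote to output_dir
# (returns None if no checkpoint exists there).
last_checkpoint = get_last_checkpoint(training_args.output_dir)

# Resume from it explicitly.
trainer.train(resume_from_checkpoint=last_checkpoint)
```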

Your contribution

I can raise a PR if it is a useful feature to have!

@tanmay17061 tanmay17061 changed the title Trainer argument resume_from_latest_checkpoint Trainer.train argument resume_from_latest_checkpoint Feb 19, 2021
@tanmay17061 tanmay17061 changed the title Trainer.train argument resume_from_latest_checkpoint Trainer.train argument resume_from_last_checkpoint Feb 20, 2021
sgugger (Collaborator) commented Feb 22, 2021

Instead of adding a new argument, I would reuse the existing resume_from_checkpoint and change its type to bool or str/PathLike. If it's a bool and set to True, we then use get_last_checkpoint to get the last checkpoint in args.output_dir. Does that sound good to you?
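A minimal sketch of that resolution logic inside Trainer.train, assuming self.args.output_dir and get_last_checkpoint as discussed above (illustrative only, not the exact merged code):

```python
# Inside Trainer.train(self, resume_from_checkpoint=None, ...):
# resolve a bare True to the newest checkpoint directory before loading.
if isinstance(resume_from_checkpoint, bool) and resume_from_checkpoint:
    resume_from_checkpoint = get_last_checkpoint(self.args.output_dir)
    if resume_from_checkpoint is None:
        raise ValueError(
            f"No valid checkpoint found in output directory ({self.args.output_dir})"
        )
```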

tanmay17061 (Contributor, Author) commented
Yes, SGTM. I have raised a PR doing exactly that. Do let me know if any other changes are required as well!

PS: Can you also review my other PR introducing save_strategy in TrainingArguments? That PR is the last one needed to round out the save_strategy, evaluation_strategy and logging_strategy enhancements.

Thanks!

tanmay17061 added a commit to tanmay17061/transformers that referenced this issue Feb 22, 2021
Enhance resume_from_checkpoint argument of Trainer.train to accept
bool type. If True given, last saved checkpoint in self.args.output_dir
will be loaded. (huggingface#10280)
sgugger pushed a commit that referenced this issue Feb 22, 2021
Enhance resume_from_checkpoint argument of Trainer.train to accept
bool type. If True given, last saved checkpoint in self.args.output_dir
will be loaded. (#10280)
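With that change merged, resuming from the latest checkpoint in output_dir should reduce to a call like the following (illustrative usage, assuming a configured trainer):

```python
# Automatically picks up the newest checkpoint-* folder in args.output_dir.
trainer.train(resume_from_checkpoint=True)
```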
npoulose commented

Is it possible to train while adding a new category to the dataset using this resume_from_checkpoint argument?
