output management checkpoints and final model #630
The aim of this PR is to provide options for separating model checkpoints from the final `model.tar.gz` saved to S3. In [this](https://github.com/huggingface/notebooks/blob/master/sagemaker/08_distributed_summarization_bart_t5/sagemaker-notebook.ipynb) example, training should produce a model of roughly 500 MB, but every checkpoint is saved as well, resulting in a 12 GB `model.tar.gz` training artifact that contains all your checkpoints, most of them likely useless for deployment.

If you want to persist checkpoints to S3 while keeping only the final model in the `model.tar.gz` training artifact, you can point the Hugging Face Trainer's `output_dir` to `/opt/ml/checkpoints` and, at the end of your script, save the final model to the model location with `trainer.save_model("/opt/ml/model")`. Currently, pointing the Hugging Face Trainer's `output_dir` to `/opt/ml/checkpoints` and setting the save location with `trainer.save_model("/opt/ml/model")` does NOT save the final model to S3, resulting in an empty `output` folder in the linked example.

It would be useful to add a `Training output management` section to this documentation after `Prepare a :hugging_face: Transformers fine-tuning script`.
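The pattern described above can be sketched as a training script. This is a minimal illustration, not the PR's proposed documentation text; the omitted `Trainer` and `TrainingArguments` parameters follow the linked notebook.

```python
# train.py -- sketch: checkpoints go to /opt/ml/checkpoints,
# only the final model goes to /opt/ml/model.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="/opt/ml/checkpoints",  # checkpoints written here, kept out of model.tar.gz
    # ... remaining training arguments as in the notebook
)

trainer = Trainer(
    args=training_args,
    # ... model, tokenizer, and datasets as in the notebook
)
trainer.train()

# Save only the final model to /opt/ml/model; SageMaker packages this
# directory into the model.tar.gz training artifact uploaded to S3.
trainer.save_model("/opt/ml/model")
```

On the estimator side, the SageMaker estimator's `checkpoint_s3_uri` parameter controls the S3 location that `/opt/ml/checkpoints` is synced to during training, so checkpoints persist to S3 without inflating the final artifact.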
tagging @philschmid for visibility!
Thank you for opening the PR. That's a great idea! I left some comments and suggestions.
Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
Replace **so** with **to**: "to an Amazon S3 location."
Committed your suggested changes. The section is ready. Thanks for your suggestions.
LGTM!