output management checkpoints and final model #630

Merged
merged 6 commits into from
Feb 1, 2022

Contributor

@mlonaws mlonaws commented Jan 28, 2022

The aim of this PR is to provide options to separate model checkpoints from the final `model.tar.gz` saved to S3.

In [this](https://github.com/huggingface/notebooks/blob/master/sagemaker/08_distributed_summarization_bart_t5/sagemaker-notebook.ipynb) example, the trained model should be ~500MB, but the training job saves all checkpoints, which results in a 12GB `model.tar.gz` training artifact containing every checkpoint, most of them likely useless for deployment.

If you want to persist checkpoints to S3 while saving only the final model in the `model.tar.gz` training artifact, you can point the Hugging Face Trainer’s `output_dir` to `/opt/ml/checkpoints` and, at the end of your script, save a specific model to the model location with `trainer.save_model("/opt/ml/model")`. Currently, in [this](https://github.com/huggingface/notebooks/blob/master/sagemaker/08_distributed_summarization_bart_t5/sagemaker-notebook.ipynb) example, pointing the Trainer’s `output_dir` to `/opt/ml/checkpoints` and saving with `trainer.save_model("/opt/ml/model")` does NOT save the final model to S3, which results in an empty `output` folder.
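For illustration, here is a minimal sketch of the layout described above, not the exact script from the linked notebook. It assumes the training script receives an already-built `model` and `train_dataset`. The `resolve_output_dirs` helper and the `CHECKPOINT_DIR` override are hypothetical names introduced only for this example. `SM_MODEL_DIR` is the environment variable SageMaker sets to the model directory inside the training container.

```python
import os
from pathlib import Path

# Default paths inside a SageMaker training container.
DEFAULT_MODEL_DIR = "/opt/ml/model"
DEFAULT_CHECKPOINT_DIR = "/opt/ml/checkpoints"


def resolve_output_dirs(env=None):
    """Return (checkpoint_dir, model_dir) for the training script.

    Hypothetical helper: reads SM_MODEL_DIR when SageMaker provides it and
    falls back to the default container paths otherwise.
    """
    env = os.environ if env is None else env
    model_dir = Path(env.get("SM_MODEL_DIR", DEFAULT_MODEL_DIR))
    # CHECKPOINT_DIR is an illustrative override, not a SageMaker variable.
    checkpoint_dir = Path(env.get("CHECKPOINT_DIR", DEFAULT_CHECKPOINT_DIR))
    return checkpoint_dir, model_dir


def train(model, train_dataset, env=None):
    """Train with checkpoints and the final model kept in separate directories."""
    # Imported lazily so the path helper above works without transformers.
    from transformers import Trainer, TrainingArguments

    checkpoint_dir, model_dir = resolve_output_dirs(env)
    args = TrainingArguments(
        output_dir=str(checkpoint_dir),  # intermediate checkpoints land here
        save_strategy="epoch",
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    trainer.train()
    # Only the final model is written to the directory that SageMaker
    # packages into model.tar.gz.
    trainer.save_model(str(model_dir))
    return trainer
```

For the checkpoints themselves to reach S3, the estimator that launches the job would also need a `checkpoint_s3_uri`; that part is outside this sketch.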

It would be useful to add a `Training output management` section to this documentation after `Prepare a :hugging_face: Transformers fine-tuning script`.

@julien-c
Member

tagging @philschmid for visibility!

Member

@philschmid philschmid left a comment

Thank you for opening the PR. That's a great idea! I left some comments and suggestions.

docs/sagemaker/train.md (review thread: outdated, resolved)
docs/sagemaker/train.md (review thread: outdated, resolved)
mlonaws and others added 3 commits January 29, 2022 18:52
Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
docs/sagemaker/train.md (review thread: outdated, resolved)
mlonaws and others added 2 commits January 31, 2022 18:22
Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
replace **so** with **to** an Amazon S3 location.
Contributor Author

@mlonaws mlonaws left a comment

Committed your suggested changes. The section is ready. Thanks for your suggestions.

Member

@philschmid philschmid left a comment

LGTM!

@philschmid philschmid merged commit 8912ccd into huggingface:main Feb 1, 2022
3 participants