output management checkpoints and final model #630
The aim of this PR is to provide options for separating model checkpoints from the final `model.tar.gz` saved to S3. In [this](https://github.com/huggingface/notebooks/blob/master/sagemaker/08_distributed_summarization_bart_t5/sagemaker-notebook.ipynb) example, training should produce a model of roughly 500 MB, but every checkpoint is saved as well, resulting in a 12 GB `model.tar.gz` training artifact that contains all your checkpoints, most of them likely useless for deployment.

If you want to persist checkpoints to S3 while keeping only the final model in the `model.tar.gz` training artifact, you can point the Hugging Face Trainer's `output_dir` to `/opt/ml/checkpoints` and, at the end of your script, save the final model to the model location with `trainer.save_model("/opt/ml/model")`. Currently, pointing the Hugging Face Trainer's `output_dir` to `/opt/ml/checkpoints` and setting the save location with `trainer.save_model("/opt/ml/model")` does NOT save the final model to S3, resulting in an empty `output` folder in the linked example.

It would be useful to add a `Training output management` section to this documentation after `Prepare a :hugging_face: Transformers fine-tuning script`.
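The pattern described above can be sketched as a training script. This is a minimal illustration, not the PR's proposed documentation text; the omitted `Trainer` and `TrainingArguments` parameters follow the linked notebook.

```python
# train.py -- sketch: checkpoints go to /opt/ml/checkpoints,
# only the final model goes to /opt/ml/model.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="/opt/ml/checkpoints",  # checkpoints written here, kept out of model.tar.gz
    # ... remaining training arguments as in the notebook
)

trainer = Trainer(
    args=training_args,
    # ... model, tokenizer, and datasets as in the notebook
)
trainer.train()

# Save only the final model to /opt/ml/model; SageMaker packages this
# directory into the model.tar.gz training artifact uploaded to S3.
trainer.save_model("/opt/ml/model")
```

On the estimator side, the SageMaker estimator's `checkpoint_s3_uri` parameter controls the S3 location that `/opt/ml/checkpoints` is synced to during training, so checkpoints persist to S3 without inflating the final artifact.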
tagging @philschmid for visibility!
Thank you for opening the PR. That's a great idea! I left some comments and suggestions.
Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
Replace **so** with **to**: "to an Amazon S3 location."
Committed your suggested changes. The section is ready. Thanks for your suggestions.
LGTM!