output management checkpoints and final model #630

Merged · 6 commits · Feb 1, 2022
9 changes: 8 additions & 1 deletion docs/sagemaker/train.md
@@ -99,6 +99,13 @@ _Note that SageMaker doesn’t support argparse actions. For example, if you wan

Look [here](https://github.com/huggingface/notebooks/blob/master/sagemaker/01_getting_started_pytorch/scripts/train.py) for a complete example of a 🤗 Transformers training script.

## Training Output Management

If `output_dir` in the `TrainingArguments` is set to `/opt/ml/model`, the Trainer saves all training artifacts there, including logs, checkpoints, and models. Amazon SageMaker archives the whole `/opt/ml/model` directory as `model.tar.gz` and uploads it to Amazon S3 at the end of the training job. Depending on your hyperparameters and `TrainingArguments`, this can produce a large artifact (> 5GB), which can slow down deployment for Amazon SageMaker Inference.
You can control how checkpoints, logs, and artifacts are saved by customizing the [TrainingArguments](https://huggingface.co/docs/transformers/master/en/main_classes/trainer#transformers.TrainingArguments). For example, setting `save_total_limit` limits the total number of checkpoints kept: older checkpoints in `output_dir` are deleted whenever a new one is saved and the limit is reached.
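As a minimal sketch, the relevant `TrainingArguments` might look like this (the `save_strategy` value here is just an illustrative choice, not a requirement):

```python
from transformers import TrainingArguments

# Save everything under /opt/ml/model so SageMaker archives it as
# model.tar.gz, but keep at most two checkpoints on disk at any time.
training_args = TrainingArguments(
    output_dir="/opt/ml/model",
    save_strategy="epoch",   # write a checkpoint at the end of each epoch
    save_total_limit=2,      # older checkpoints in output_dir are deleted
)
```

Pass these arguments to the `Trainer` as usual; only the checkpoints that survive `save_total_limit` end up in the final `model.tar.gz`.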

If you are using the HuggingFace framework estimator, you need to specify a checkpoint output path through hyperparameters: set `output_dir` to `/opt/ml/checkpoints` in `hyperparameters`, and point `checkpoint_s3_uri` to an S3 location in your estimator ([see Use Checkpoints on Amazon SageMaker Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints.html)).
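A sketch of what this could look like; the role, bucket, entry-point script, and version pins below are placeholder assumptions to adjust for your setup:

```python
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point="train.py",          # placeholder: your training script
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role="<your-sagemaker-execution-role>",
    transformers_version="4.12",     # placeholder version pins
    pytorch_version="1.9",
    py_version="py38",
    # SageMaker syncs /opt/ml/checkpoints to this S3 location during training
    checkpoint_s3_uri="s3://<your-bucket>/checkpoints",
    hyperparameters={"output_dir": "/opt/ml/checkpoints"},
)
```

With this setup, checkpoints are continuously synced to `checkpoint_s3_uri` during training instead of only being uploaded inside `model.tar.gz` at the end of the job.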

## Create a Hugging Face Estimator

Run 🤗 Transformers training scripts on SageMaker by creating a [Hugging Face Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#huggingface-estimator). The Estimator handles end-to-end SageMaker training. There are several parameters you should define in the Estimator:
@@ -339,4 +346,4 @@ huggingface_estimator = HuggingFace(
hyperparameters = hyperparameters)
```

📓 Open the [notebook](https://github.com/huggingface/notebooks/blob/master/sagemaker/06_sagemaker_metrics/sagemaker-notebook.ipynb) for an example of how to capture metrics in SageMaker.