32 changes: 19 additions & 13 deletions README.md
@@ -4,19 +4,21 @@
- [Examples](#examples)
- [How It Works](#how-it-works)
- [Docs](#docs)
- [SageMaker Debugger in action](#sagemaker-debugger-in-action)


## Overview
Amazon SageMaker Debugger is an offering from AWS which help you automate the debugging of machine learning training jobs.
This library powers Amazon SageMaker Debugger, and helps you develop better, faster, and cheaper models by catching common errors quickly.
It allows you to save tensors from training jobs and makes these tensors available for analysis, all through a flexible and powerful API.
It supports TensorFlow, PyTorch, MXNet, and XGBoost on Python 3.6+.

- Zero Script Change experience on SageMaker when using supported versions of SageMaker Framework containers or AWS Deep Learning containers
- Zero Script Change experience on SageMaker when using [supported containers](docs/sagemaker.md#zero-script-change)
- Full visibility into any tensor that is part of the training process
- Real-time training job monitoring through Rules
- Automated anomaly detection and state assertions
- Interactive exploration of saved tensors
- Automated anomaly detection and state assertions through built-in and custom Rules on SageMaker
- Actions on your training jobs based on the status of Rules
- Interactive exploration of saved tensors
- Distributed training support
- TensorBoard support

@@ -95,18 +97,16 @@ print(f"Loss values during evaluation were {trial.tensor('CrossEntropyLoss:0').v

## How It Works

Amazon SageMaker Debugger uses a `Hook` to store the values of tensors throughout the training process.
Another process called a `Rule` job simultaneously monitors and validates these outputs to ensure
Amazon SageMaker Debugger uses the construct of a `Hook` to save the values of requested tensors throughout the training process. You can then set up a `Rule` job that simultaneously monitors and validates these tensors to ensure
that training is progressing as expected.
A rule might check for vanishing gradients, or exploding tensor values, or poor weight initialization.
If a rule is triggered, it will raise a CloudWatch event, saving you time and money.
A rule might check for vanishing gradients, exploding tensor values, or poor weight initialization. Rules are attached to Amazon CloudWatch events, so when a rule is triggered it changes the state of the corresponding CloudWatch event. You can configure any action on that CloudWatch event, such as stopping the training job, saving you time and money.
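
For concreteness, here is a minimal sketch of attaching a built-in rule to a SageMaker training job through the SageMaker Python SDK. The entry point, IAM role, and instance settings are placeholder values, and keyword names can differ slightly between SDK versions:

```python
from sagemaker.debugger import Rule, rule_configs
from sagemaker.pytorch import PyTorch

# Attach a built-in rule; SageMaker runs it as a separate Rule job that
# watches the tensors saved by the Hook while training is in progress.
estimator = PyTorch(
    entry_point="train.py",            # placeholder training script
    role="<your-sagemaker-role-arn>",  # placeholder IAM role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.3.1",
    py_version="py3",
    rules=[Rule.sagemaker(rule_configs.vanishing_gradient())],
)
estimator.fit()  # if the rule triggers, its status change raises a CloudWatch event
```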

Amazon SageMaker Debugger can be used inside or outside of SageMaker. There are three main use cases:
- SageMaker Zero-Script-Change: Here you specify which rules to use when setting up the estimator and run your existing script, no changes needed. See the first example above.
- SageMaker Bring-Your-Own-Container: Here you specify the rules to use, and modify your training script.
- Non-SageMaker: Here you write custom rules (or manually analyze the tensors) and modify your training script. See the second example above.
Amazon SageMaker Debugger can be used inside or outside of SageMaker. However, the built-in rules that AWS provides are available only for SageMaker training jobs. Usage falls into the following scenarios:
- **SageMaker Zero-Script-Change**: Here you specify which rules to use when setting up the estimator and run your existing script; no changes are needed. See the first example above.
- **SageMaker Bring-Your-Own-Container**: Here you specify the rules to use, and modify your training script minimally to enable SageMaker Debugger.
- **Non-SageMaker**: Here you write custom rules (or manually analyze the tensors) and modify your training script minimally to enable SageMaker Debugger. See the second example above.

The reason for different setups is that SageMaker Zero-Script-Change (via Deep Learning Containers) uses custom framework forks of TensorFlow, PyTorch, MXNet, and XGBoost to save tensors automatically.
The reason for the different setups is that SageMaker Zero-Script-Change (via AWS Deep Learning Containers) uses custom framework forks of TensorFlow, PyTorch, MXNet, and XGBoost, which add our Hook to the training job and save the requested tensors automatically.
These framework forks are not available in custom containers or non-SM environments, so you must modify your training script in these environments.
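
For the Bring-Your-Own-Container and non-SageMaker cases, the modification is small. A minimal sketch for PyTorch, assuming your script defines a model `net` and that you pick a local output directory:

```python
import smdebug.pytorch as smd

# Save tensors every 100 steps. Outside SageMaker you choose out_dir;
# inside SageMaker containers the output location is configured for you.
hook = smd.Hook(
    out_dir="/tmp/smdebug_demo",  # placeholder path
    save_config=smd.SaveConfig(save_interval=100),
)
hook.register_module(net)  # capture tensors flowing through the model's modules
```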

## Docs
@@ -115,8 +115,14 @@ These framework forks are not available in custom containers or non-SM environme
| --- | --- |
| [SageMaker Training](docs/sagemaker.md) | If you are a SageMaker user, we recommend you start with this page on how to run SageMaker training jobs with SageMaker Debugger |
| Frameworks <ul><li>[TensorFlow](docs/tensorflow.md)</li><li>[PyTorch](docs/pytorch.md)</li><li>[MXNet](docs/mxnet.md)</li><li>[XGBoost](docs/xgboost.md)</li></ul> | See the frameworks pages for details on what's supported and how to modify your training script if applicable |
| [APIs for Saving Tensors](docs/api.md) | Full description of our APIs on saving tensors |
| [Programming Model for Analysis](docs/analysis.md) | Describes the programming model provided by our APIs, which allows you to interactively explore saved tensors and write your own Rules to monitor your training jobs (see the sketch below this table) |
| [APIs](docs/api.md) | Full description of our APIs for saving tensors |
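
To make the analysis programming model concrete, here is a hedged sketch of loading saved tensors and writing a toy custom rule. The path, tensor-name regex, and threshold are illustrative assumptions, not part of the library:

```python
from smdebug.trials import create_trial
from smdebug.rules import Rule, invoke_rule

trial = create_trial("/tmp/smdebug_demo")  # local or s3:// path the Hook wrote to
print(trial.tensor_names())                # list everything that was saved

# Toy custom rule: fire when any gradient's max magnitude exceeds a threshold.
class GradientTooLarge(Rule):
    def __init__(self, base_trial, threshold=10.0):
        super().__init__(base_trial)
        self.threshold = float(threshold)

    def invoke_at_step(self, step):
        # Returning True marks the rule condition as met at this step
        for tname in self.base_trial.tensor_names(regex="gradient"):
            if self.base_trial.tensor(tname).reduction_value(step, "max", abs=True) > self.threshold:
                return True
        return False

invoke_rule(GradientTooLarge(trial), start_step=0, end_step=100)
```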


## SageMaker Debugger in action
- Using SageMaker Debugger with XGBoost in SageMaker Studio to save feature importance values and plot them in a notebook during training. ![](docs/resources/xgboost_feature_importance.png?raw=true)
- Using SageMaker Debugger with TensorFlow in SageMaker Studio to run built-in rules and visualize the loss. ![](docs/resources/tensorflow_rules_loss.png?raw=true)



## License
4 changes: 2 additions & 2 deletions docs/pytorch.md
@@ -107,12 +107,12 @@ for (inputs, labels) in trainloader:
    optimizer.zero_grad()
    outputs = net(inputs)
    loss = F.cross_entropy(outputs, labels)

    #######################################
    # Manually record the loss
    hook.record_tensor_value(tensor_name="loss", tensor_value=loss)
    #######################################

    loss.backward()
    optimizer.step()
```
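
The snippet above assumes a `hook` object already exists earlier in the script. Once training has written tensors out, the recorded loss can be read back for analysis; a sketch, assuming the Hook's `out_dir` was `/tmp/smdebug_pytorch` and that the value surfaces under the name `loss` (check `trial.tensor_names()` to confirm):

```python
from smdebug.trials import create_trial

trial = create_trial("/tmp/smdebug_pytorch")  # placeholder: the Hook's out_dir
print(trial.tensor_names())                   # confirm how the recorded loss is named
steps = trial.tensor("loss").steps()          # steps at which the loss was recorded
print(trial.tensor("loss").value(step_num=steps[-1]))  # most recent recorded loss
```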
Binary file added docs/resources/tensorflow_rules_loss.png
Binary file added docs/resources/xgboost_feature_importance.png