36 commits
- 69b3943 Update README.md (mchoi8739, Apr 20, 2020)
- 90ab484 response to the comment from vandanavk (mchoi8739, Apr 21, 2020)
- 55d735e fixing a typo (mchoi8739, Apr 21, 2020)
- fd39891 Fixing the example links (mchoi8739, Apr 23, 2020)
- f69a206 Staging for preview (mchoi8739, Apr 27, 2020)
- d1543aa fixed the doc responding to the comments (mchoi8739, Apr 28, 2020)
- 23aab4e Update README.md (mchoi8739, Apr 27, 2020)
- d6b74cb fix minor things (mchoi8739, Apr 28, 2020)
- 4821ded re-arange and edit README and sagemaker markdown files (mchoi8739, Apr 29, 2020)
- b1170d5 fixing few typos (mchoi8739, Apr 29, 2020)
- 4c5cbaf update README.md / add BYOC example (mchoi8739, Apr 30, 2020)
- ca3e7a9 minor fix (mchoi8739, Apr 30, 2020)
- 5e549aa fixed links (mchoi8739, Apr 30, 2020)
- 8f3a171 Update README.md (mchoi8739, Apr 30, 2020)
- 52a2a90 Update README.md (mchoi8739, Apr 30, 2020)
- fedefc0 Update README.md (mchoi8739, Apr 30, 2020)
- ef0707d Update README.md (mchoi8739, Apr 30, 2020)
- 77e0d34 Update README.md (mchoi8739, Apr 30, 2020)
- 3c37a48 Update README.md (mchoi8739, Apr 30, 2020)
- 5dceb53 Update README.md (mchoi8739, Apr 30, 2020)
- 05039cc Update README.md (mchoi8739, Apr 30, 2020)
- 309c4a0 Update README.md (mchoi8739, Apr 30, 2020)
- 97a1fb3 Update README.md (mchoi8739, Apr 30, 2020)
- a29c6c9 Update README.md (mchoi8739, Apr 30, 2020)
- 4733be3 Update README.md (mchoi8739, Apr 30, 2020)
- 7781628 Update README.md (mchoi8739, Apr 30, 2020)
- 1e87842 Update README.md (mchoi8739, Apr 30, 2020)
- 9e3e14c Update README.md (mchoi8739, Apr 30, 2020)
- ed1b75b Update README.md (mchoi8739, Apr 30, 2020)
- 5059f99 Update README.md (mchoi8739, Apr 30, 2020)
- 7935b87 Update README.md (mchoi8739, Apr 30, 2020)
- fd467b4 a few changes and re-ordering (mchoi8739, Apr 30, 2020)
- d08795b model pruning resnet image (mchoi8739, Apr 30, 2020)
- c5e0963 update README.md (mchoi8739, Apr 30, 2020)
- 5fd2281 fix issues (mchoi8739, May 5, 2020)
- 8cc5cd9 sync up (mchoi8739, Jul 22, 2020)
233 changes: 193 additions & 40 deletions README.md
@@ -2,20 +2,35 @@
[![codecov](https://codecov.io/gh/awslabs/sagemaker-debugger/branch/master/graph/badge.svg)](https://codecov.io/gh/awslabs/sagemaker-debugger)
[![PyPI](https://badge.fury.io/py/smdebug.svg)](https://badge.fury.io/py/smdebug)

## Table of Contents

## Table of Contents

- [Overview](#overview)
- [Examples](#examples)
- [How It Works](#how-it-works)
- [Docs](#docs)
- [SageMaker Debugger in action](#sagemaker-debugger-in-action)
- [SageMaker Debugger in Action](#sagemaker-debugger-in-action)
- [Install](#install-sagemaker-debugger)
- [How It Works](#how-it-works)
- [Examples](#examples)
- [Further Documentation](#further-documentation)
- [Release Notes](#release-notes)


## Overview
Amazon SageMaker Debugger is an offering from AWS which help you automate the debugging of machine learning training jobs.
This library powers Amazon SageMaker Debugger, and helps you develop better, faster and cheaper models by catching common errors quickly.
It allows you to save tensors from training jobs and makes these tensors available for analysis, all through a flexible and powerful API.
It supports TensorFlow, PyTorch, MXNet, and XGBoost on Python 3.6+.
[Amazon SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html) automates the debugging of machine learning training jobs. Through a flexible and powerful API, Debugger lets you
run your own training script unchanged (the Zero Script Change experience) using the built-in `Hook` and `Rule` features to capture tensors,
gives you the flexibility to build custom Hooks and Rules that configure exactly which tensors to save,
and makes those tensors available for analysis by saving them to an [Amazon S3](https://aws.amazon.com/s3/?nc=sn&loc=0) bucket.

The `smdebug` library powers Debugger by reading the saved tensors from the S3 bucket during the training job.
`smdebug` retrieves and filters the tensors that Debugger generates, such as gradients, weights, and biases.
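
For example, once a training job has saved tensors, they can be retrieved and filtered with the `smdebug` trial API; a minimal sketch (the S3 path and tensor naming pattern are placeholders):

```python
from smdebug.trials import create_trial

# Point the trial at the S3 prefix where Debugger saved tensors (placeholder path).
trial = create_trial("s3://my-bucket/debugger-output")

# Filter the saved tensors, for example keep only gradients, and read one value.
gradient_names = trial.tensor_names(regex="gradients/.*")  # assumed naming pattern
first_step = trial.steps()[0]
print(trial.tensor(gradient_names[0]).value(first_step))
```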

Debugger helps you develop better, faster, and cheaper models by requiring only minimal changes to your estimator, tracing tensors, catching anomalies during training, and supporting iterative model pruning.

Debugger supports the TensorFlow, PyTorch, MXNet, and XGBoost frameworks.
The following list summarizes the main functionalities of Debugger:

- Zero Script Change experience on SageMaker when using [supported containers](docs/sagemaker.md#zero-script-change)
- Zero Script Change experience on SageMaker when using [supported containers](#zero-script-change)
- Full visibility into any tensor part of the training process
- Real-time training job monitoring through Rules
- Automated anomaly detection and state assertions through built-in and custom Rules on SageMaker
@@ -24,12 +39,126 @@ It supports TensorFlow, PyTorch, MXNet, and XGBoost on Python 3.6+.
- Distributed training support
- TensorBoard support

See [How it works](#how-it-works) for more details.


## SageMaker Debugger in Action
- Through the iterative model pruning process using Debugger and `smdebug`, you can identify the importance of weights and prune neurons whose importance falls below a threshold you define. This process allows you to train the model with significantly fewer neurons, which means a lighter, more efficient, faster, and cheaper model without compromising accuracy.
![Debugger Iterative Model Pruning using ResNet](docs/resources/results_resnet.png?raw=true)
See the [Using SageMaker Debugger and SageMaker Experiments for iterative model pruning](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/pytorch_iterative_model_pruning/iterative_model_pruning_resnet.ipynb) notebook for visualizations and further information.
- Use Debugger with XGBoost in SageMaker Studio to save feature importance values and plot them in a notebook during training; a minimal plotting sketch follows this list. ![Debugger XGBoost Visualization Example](docs/resources/xgboost_feature_importance.png?raw=true)
- Use Debugger with TensorFlow in SageMaker Studio to run built-in rules and visualize the loss. ![Debugger TensorFlow Visualization Example](docs/resources/tensorflow_rules_loss.png?raw=true)
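
The following is a minimal plotting sketch for the XGBoost case (the S3 path and tensor name are placeholders, not values produced by the notebook above):

```python
import matplotlib.pyplot as plt
from smdebug.trials import create_trial

# Read the feature importance values that Debugger saved during training.
trial = create_trial("s3://my-bucket/debugger-output")  # placeholder S3 path
steps = trial.steps()
values = [trial.tensor("feature_importance/weight/f0").value(s) for s in steps]  # assumed tensor name

plt.plot(steps, values)
plt.xlabel("step")
plt.ylabel("feature importance of f0")
plt.show()
```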


## Install SageMaker Debugger

The `smdebug` library runs on Python 3.6+. Install `smdebug` with pip:

```
pip install smdebug
```

### Debugger Usage and Supported Frameworks
There are two ways to enable SageMaker Debugger while training on SageMaker: Zero Script Change and Bring Your Own Training Container (BYOC).

#### Zero Script Change

You can use your own training script with the [AWS Deep Learning Containers (DLC)](https://aws.amazon.com/machine-learning/containers/) for the TensorFlow, PyTorch, MXNet, and XGBoost frameworks. The AWS DLCs enable you to use Debugger with no changes to your training script by automatically adding SageMaker Debugger's `Hook`.
The following table shows the framework versions currently supported for the Zero Script Change experience.

| Framework | Version |
| --- | --- |
| [TensorFlow](docs/tensorflow.md) | 1.15, 1.15.2, 2.1 |
| [MXNet](docs/mxnet.md) | 1.6 |
| [PyTorch](docs/pytorch.md) | 1.3, 1.4 |
| [XGBoost](docs/xgboost.md) | >=0.90-2 [As Built-in algorithm](docs/xgboost.md#use-xgboost-as-a-built-in-algorithm)|

For the full list and information of the AWS DLCs, see [Deep Learning Containers Images](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html#deep-learning-containers-images-table).
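
With a DLC, you can also choose which tensor collections to save entirely from the SageMaker Python SDK, with no change to the training script; a minimal sketch (the bucket name and save interval are placeholders):

```python
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

hook_config = DebuggerHookConfig(
    s3_output_path="s3://my-bucket/debugger-output",  # hypothetical S3 location
    collection_configs=[
        CollectionConfig(name="weights"),
        CollectionConfig(name="gradients", parameters={"save_interval": "100"}),
    ],
)
# Pass debugger_hook_config=hook_config to the SageMaker estimator; the DLC
# adds the hook to your training script automatically.
```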


#### Bring Your Own Training Container

`smdebug` also supports framework versions beyond those listed in the previous Zero Script Change section. To use Debugger in your own training container, add a minimal modification to your training script.
The currently supported framework versions are listed in the following table.

| Framework | Versions |
| --- | --- |
| [TensorFlow](docs/tensorflow.md) | 1.14, 1.15, 2.0.1, 2.1.0 |
| Keras (with TensorFlow backend) | 2.3 |
| [MXNet](docs/mxnet.md) | 1.4, 1.5, 1.6 |
| [PyTorch](docs/pytorch.md) | 1.2, 1.3, 1.4 |
| [XGBoost](docs/xgboost.md) | [As Framework](docs/xgboost.md#use-xgboost-as-a-framework) |

### Support for Distributed Training and Known Limitations

<table>
<thead>
<tr>
<th colspan=3>
Distributed Training
</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan=2>Horovod</td>
<td>Supported</td>
<td>TF 1.15, PT 1.4, MX 1.6</td>
</tr>
<tr>
<td>Not supported</td>
<td>TF 2.x, PT 1.5</td>
</tr>
<tr>
<td>Parameter Server-based</td>
<td colspan=2>Not supported</td>
</tr>
</tbody>
</table>
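
For the supported Horovod setups, a common pattern is to save tensors from a single worker to avoid duplicate writes; a sketch, assuming the `include_workers` parameter of the `smdebug` hook API and a placeholder output path:

```python
import smdebug.tensorflow as smd

# Save tensors from one worker only during Horovod training;
# "include_workers" is assumed to accept "one" or "all".
hook = smd.KerasHook(out_dir="/opt/ml/output/tensors", include_workers="one")
```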

## How It Works

Amazon SageMaker Debugger uses the construct of a `Hook` to save the values of requested tensors throughout the training process. You can then set up a `Rule` job that simultaneously monitors and validates these tensors to ensure
that training is progressing as expected.

A `Rule` checks for conditions such as vanishing gradients, exploding tensor values, or poor weight initialization. Rules are attached to Amazon CloudWatch events, so when a rule is triggered it changes the state of the CloudWatch event.
You can configure any action on the CloudWatch event, such as stopping the training job, saving you time and money.
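
As a sketch of such an action (the event wiring, for example through a Lambda function, is assumed to be set up separately), a handler can stop the job when a rule reports an issue:

```python
import boto3

# Minimal sketch of an action triggered by a Debugger rule status change.
# The training job name would come from the CloudWatch event payload.
sm_client = boto3.client("sagemaker")

def stop_if_issues_found(training_job_name):
    description = sm_client.describe_training_job(TrainingJobName=training_job_name)
    statuses = description.get("DebugRuleEvaluationStatuses", [])
    if any(s.get("RuleEvaluationStatus") == "IssuesFound" for s in statuses):
        # Stop the job to avoid paying for training that is not progressing.
        sm_client.stop_training_job(TrainingJobName=training_job_name)
```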

Debugger can be used inside or outside of SageMaker. However, the built-in rules that AWS provides are only available for SageMaker training. Usage scenarios fall into the following three cases.

#### Using SageMaker Debugger with Zero Script Change of Your Training Script

Here you specify which rules to use when setting up the estimator and run your existing script without any change. For an example, see [Run a Rule with Zero Script Change](#run-a-rule-with-zero-script-change).

#### Using SageMaker Debugger on Bring Your Own Container

You can use Debugger with your training script on your own container by making only a minimal modification to your training script to add Debugger's `Hook`.
For an example template of code that uses Debugger on your own container with the TensorFlow 2.x framework, see [Run Debugger in Your Own Container](#run-debugger-in-your-own-container).
See the following instruction pages to set up Debugger in your preferred framework.
- [TensorFlow](docs/tensorflow.md)
- [MXNet](docs/mxnet.md)
- [PyTorch](docs/pytorch.md)
- [XGBoost](docs/xgboost.md)

#### Using SageMaker Debugger on a Non-SageMaker Environment

Here you write custom rules (or manually analyze the tensors) and modify your training script minimally to enable Debugger in a non-SageMaker environment, such as your local machine. For an example, see [Run Debugger Locally](#run-debugger-locally).
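
As an illustration of a custom rule, here is a minimal sketch following the `smdebug` analysis API; the collection name, threshold, and local path are placeholders:

```python
from smdebug.rules.rule import Rule
from smdebug.rules.rule_invoker import invoke_rule
from smdebug.trials import create_trial

class GradientTooLarge(Rule):
    # Flags a step when the mean absolute value of any gradient exceeds a threshold.
    def __init__(self, base_trial, threshold=10.0):
        super().__init__(base_trial)
        self.threshold = float(threshold)

    def invoke_at_step(self, step):
        for tname in self.base_trial.tensor_names(collection="gradients"):
            abs_mean = self.base_trial.tensor(tname).reduction_value(step, "mean", abs=True)
            if abs_mean > self.threshold:
                return True  # rule condition met at this step
        return False

trial = create_trial("~/smd_outputs/")  # local out_dir written by the hook
invoke_rule(GradientTooLarge(trial), start_step=0, end_step=None)
```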

The reason for different setups is that Zero Script Change (via AWS Deep Learning Containers) uses custom framework forks of TensorFlow, PyTorch, MXNet, and XGBoost which add the `Hook` to the training job and save requested tensors automatically.
These framework forks are not available in custom containers or non-SageMaker environments, so you must modify your training script in these environments.


## Examples
### Notebooks
We have a bunch of [example notebooks](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger) here demonstrating different functionality of SageMaker Debugger.

### Running a Rule with Zero Script Change on SageMaker
This example uses a zero-script-change experience, where you can use your training script as-is. Refer [Running SageMaker jobs with Amazon SageMaker Debugger](docs/sagemaker.md) for more details on this.
### SageMaker Notebook Examples

To find a collection of demonstrations using Debugger, see [SageMaker Debugger Example Notebooks](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger).

#### Run a Rule with Zero Script Change

This example shows how to use Debugger with Zero Script Change to run
your training script on a SageMaker DLC.

```python
import sagemaker as sm
from sagemaker.debugger import rule_configs, Rule, CollectionConfig
@@ -48,7 +177,7 @@ rule = Rule.sagemaker(

# Pass the rule to the estimator
sagemaker_simple_estimator = sm.tensorflow.TensorFlow(
    entry_point="script.py",
    entry_point="script.py",  # replace script.py with your own training script
    role=sm.get_execution_role(),
    framework_version="1.15",
    py_version="py3",
@@ -65,18 +194,45 @@ print(f"Saved these tensors: {trial.tensor_names()}")
print(f"Loss values during evaluation were {trial.tensor('CrossEntropyLoss:0').values(mode=smd.modes.EVAL)}")
```

That's it! Amazon SageMaker will automatically monitor your training job for you with the Rules specified and create a CloudWatch
event which tracks the status of the Rule, so you can take any action based on them.
That's it! When you configure the `sagemaker_simple_estimator`,
you simply specify the `entry_point` to your training script Python file.
When you run `sagemaker_simple_estimator.fit()`,
SageMaker automatically monitors your training job with the specified Rules and creates a CloudWatch event that tracks the status of each Rule,
so you can take action based on it.
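
For example, while or after the job runs, you can check the rule evaluation status from the SageMaker Python SDK; a short sketch (the summary keys follow the `DescribeTrainingJob` response):

```python
# Inspect the status of the Debugger rules attached to the training job.
for summary in sagemaker_simple_estimator.latest_training_job.rule_job_summary():
    print(summary["RuleConfigurationName"], ":", summary["RuleEvaluationStatus"])
```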

If you want greater configuration and control, we offer that too. Head over [here](docs/sagemaker.md) for more information.
If you want additional configuration and control, see [Running SageMaker jobs with Debugger](docs/sagemaker.md) for more information.

### Running Locally
Requires Python 3.6+, and this example uses tf.keras. Run
```
pip install smdebug
#### Run Debugger in Your Own Container

The following example shows how to set up the `hook` to train a model with Debugger in your own container.
This example is for containers that use the TensorFlow 2.x framework with GradientTape to configure the `hook`.

```python
import tensorflow as tf
import smdebug.tensorflow as smd

# args, n_epochs, dataset, cce (loss), opt (optimizer), and train_acc_metric
# come from the full example script referenced below.
hook = smd.KerasHook(out_dir=args.out_dir)

model = tf.keras.models.Sequential([ ... ])
for epoch in range(n_epochs):
    for data, labels in dataset:
        dataset_labels = labels
        # wrap the tape to capture tensors
        with hook.wrap_tape(tf.GradientTape(persistent=True)) as tape:
            logits = model(data, training=True)  # logits shape, e.g. (32, 10)
            loss_value = cce(labels, logits)
        grads = tape.gradient(loss_value, model.variables)
        opt.apply_gradients(zip(grads, model.variables))
        acc = train_acc_metric(dataset_labels, logits)
        # manually save metric values
        hook.record_tensor_value(tensor_name="accuracy", tensor_value=acc)
```

To use Amazon SageMaker Debugger, simply add a callback hook:
To see the full script, refer to the [tf_keras_gradienttape.py](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow2/scripts/tf_keras_gradienttape.py) example script.
For a notebook example of using BYOC in PyTorch, see [Using Amazon SageMaker Debugger with Your Own PyTorch Container](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/pytorch_custom_container/pytorch_byoc_smdebug.ipynb).
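
When the same container runs as a SageMaker training job, the hook configuration, including `out_dir`, can be read from the JSON configuration that SageMaker writes for the job instead of being hard-coded; a minimal sketch:

```python
import smdebug.tensorflow as smd

# Inside a SageMaker training job, build the hook from the job's JSON
# configuration rather than passing out_dir explicitly.
hook = smd.KerasHook.create_from_json_file()
```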

#### Run Debugger Locally
Running Debugger locally requires Python 3.6+. This example uses tf.keras.

To use Debugger, simply add a callback `hook`:
```python
import smdebug.tensorflow as smd
hook = smd.KerasHook(out_dir='~/smd_outputs/')
@@ -97,35 +253,32 @@ print(f"Saved these tensors: {trial.tensor_names()}")
print(f"Loss values during evaluation were {trial.tensor('CrossEntropyLoss:0').values(mode=smd.modes.EVAL)}")
```

## How It Works

Amazon SageMaker Debugger uses the construct of a `Hook` to save the values of requested tensors throughout the training process. You can then setup a `Rule` job which simultaneously monitors and validates these tensors to ensure
that training is progressing as expected.
A rule might check for vanishing gradients, or exploding tensor values, or poor weight initialization. Rules are attached to CloudWatch events, so that when a rule is triggered it changes the state of the CloudWatch event. You can configure any action on the CloudWatch event, such as to stop the training job saving you time and money.

Amazon SageMaker Debugger can be used inside or outside of SageMaker. However the built-in rules that AWS provides are only available for SageMaker training. Scenarios of usage can be classified into the following:
- **SageMaker Zero-Script-Change**: Here you specify which rules to use when setting up the estimator and run your existing script, no changes needed. See the first example above.
- **SageMaker Bring-Your-Own-Container**: Here you specify the rules to use, and modify your training script minimally to enable SageMaker Debugger.
- **Non-SageMaker**: Here you write custom rules (or manually analyze the tensors) and modify your training script minimally to enable SageMaker Debugger. See the second example above.

The reason for different setups is that SageMaker Zero-Script-Change (via AWS Deep Learning Containers) uses custom framework forks of TensorFlow, PyTorch, MXNet, and XGBoost which add our Hook to the training job and save requested tensors automatically.
These framework forks are not available in custom containers or non-SM environments, so you must modify your training script in these environments.

## Docs
## Further Documentation

| Section | Description |
| --- | --- |
| [SageMaker Training](docs/sagemaker.md) | If you use SageMaker, we recommend that you start with this page, which explains how to run SageMaker training jobs with SageMaker Debugger |
| Frameworks <ul><li>[TensorFlow](docs/tensorflow.md)</li><li>[PyTorch](docs/pytorch.md)</li><li>[MXNet](docs/mxnet.md)</li><li>[XGBoost](docs/xgboost.md)</li></ul> | See the frameworks pages for details on what's supported and how to modify your training script if applicable |
| [APIs for Saving Tensors](docs/api.md) | Full description of our APIs on saving tensors |
| [Programming Model for Analysis](docs/analysis.md) | For description of the programming model provided by our APIs which allows you to perform interactive exploration of tensors saved as well as to write your own Rules monitoring your training jobs. |
| [Programming Model for Analysis](docs/analysis.md) | Describes the programming model provided by the APIs, which enables interactive exploration of saved tensors and lets you write your own Rules to monitor your training jobs. |


## Release Notes

### [Latest release v0.7.2](https://github.com/awslabs/sagemaker-debugger/releases)

- Introducing experimental support for TF 2.x training scripts using GradientTape:
with this update, SageMaker Debugger captures weights, biases, loss, metrics, and gradients from custom training jobs that use GradientTape in TF 2.x.
An example that applies GradientTape to a custom ResNet training script through TensorFlow's Keras interface is provided at [tf_keras_gradienttape.py](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow2/scripts/tf_keras_gradienttape.py).
GradientTape does not work with the Zero Script Change experience at this time.

*Note*: Training scripts using GradientTape for higher-order gradients or multiple tapes are not supported. Distributed training scripts that use GradientTape are not supported at this time.

## SageMaker Debugger in action
- Using SageMaker Debugger with XGBoost in SageMaker Studio to save feature importance values and plot them in a notebook during training. ![](docs/resources/xgboost_feature_importance.png?raw=true)
- Using SageMaker Debugger with TensorFlow in SageMaker Studio to run built-in rules and visualize the loss. ![](docs/resources/tensorflow_rules_loss.png?raw=true)
- Support `SyncOnReadVariable` in mirrored strategy: fixes a bug that occurred because the `SyncOnRead` distributed variable was not supported with `smdebug`. This also enables the use of `smdebug` with training scripts that use the TF 2.x MirroredStrategy with the `fit()` API.

- Turn off hook and write only from one worker for unsupported distributed training techniques: fixes a crash when distributed training in the PyTorch framework is implemented using the generic multiprocessing library, which is not a method supported by `smdebug`. This fix handles that case and ensures that tensors are saved.

- Bug fix: PyTorch: register only if tensors require gradients: users observed a crash when training with pre-trained embeddings, which do not need gradient updates. This fix checks whether a gradient update is required and registers a backward hook only in those cases.

## License
This library is licensed under the Apache 2.0 License.
Binary file added docs/resources/results_resnet.png
6 changes: 3 additions & 3 deletions docs/sagemaker.md
@@ -1,9 +1,6 @@
## Running SageMaker jobs with Amazon SageMaker Debugger

## Outline
- [Enabling SageMaker Debugger](#enabling-sagemaker-debugger)
- [Zero Script Change](#zero-script-change)
- [Bring your own training container](#bring-your-own-training-container)
- [Configuring SageMaker Debugger](#configuring-sagemaker-debugger)
- [Saving data](#saving-data)
- [Saving built-in collections that we manage](#saving-built-in-collections-that-we-manage)
@@ -17,6 +14,8 @@
- [TensorBoard Visualization](#tensorboard-visualization)
- [Example Notebooks](#example-notebooks)

<<<<<<< HEAD
=======
## Enabling SageMaker Debugger
There are two ways in which you can enable SageMaker Debugger while training on SageMaker.

@@ -56,6 +55,7 @@ This library `smdebug` itself supports versions other than the ones listed above
- [MXNet](mxnet.md)
- [XGBoost](xgboost.md)

>>>>>>> upstream/master
## Configuring SageMaker Debugger

Regardless of which of the two above ways you have enabled SageMaker Debugger, you can configure it using the SageMaker Python SDK. There are two aspects to this configuration.