diff --git a/README.md b/README.md
index 650dc1f16..665f2b213 100644
--- a/README.md
+++ b/README.md
@@ -2,21 +2,36 @@
[codecov](https://codecov.io/gh/awslabs/sagemaker-debugger)
[PyPI version](https://badge.fury.io/py/smdebug)
+## Table of Contents
+
- [Overview](#overview)
-- [Examples](#examples)
+- [Install the smdebug library](#install-the-smdebug-library)
+- [Debugger-supported Frameworks](#debugger-supported-frameworks)
- [How It Works](#how-it-works)
-- [Docs](#docs)
+- [Examples](#examples)
- [SageMaker Debugger in action](#sagemaker-debugger-in-action)
-
+- [Further Documentation and References](#further-documentation-and-references)
+
## Overview
-Amazon SageMaker Debugger is an offering from AWS which help you automate the debugging of machine learning training jobs.
-This library powers Amazon SageMaker Debugger, and helps you develop better, faster and cheaper models by catching common errors quickly.
-It allows you to save tensors from training jobs and makes these tensors available for analysis, all through a flexible and powerful API.
-It supports TensorFlow, PyTorch, MXNet, and XGBoost on Python 3.6+.
+[Amazon SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html) automates the debugging of machine learning training jobs. Debugger lets you
+run your own training script unchanged (the Zero Script Change experience) while its built-in features—`Hook` and `Rule`—capture tensors,
+gives you the flexibility to build customized Hooks and Rules to configure which tensors are saved,
+and makes those tensors available for analysis by saving them in an [Amazon S3](https://aws.amazon.com/s3/?nc=sn&loc=0) bucket,
+all through a flexible and powerful API.
+
+The `smdebug` library powers Debugger by retrieving the tensors saved to the S3 bucket during the training job.
+`smdebug` retrieves and filters the tensors that Debugger generates, such as gradients, weights, and biases.
-- Zero Script Change experience on SageMaker when using [supported containers](docs/sagemaker.md#zero-script-change)
-- Full visibility into any tensor part of the training process
+Debugger helps you develop better, faster, and cheaper models by requiring only minimal changes to your estimator, tracing tensors, catching anomalies during training, and supporting iterative model pruning.
+
+Debugger supports TensorFlow, PyTorch, MXNet, and XGBoost frameworks.
+
+The following list is a summary of the main functionalities of Debugger:
+
+- Run and debug training jobs of your model on SageMaker when using [supported containers](#debugger-supported-frameworks)
+- No changes needed to your training script if using AWS Deep Learning Containers with Debugger fully integrated
+- Minimal changes to your training script if using AWS containers with script mode or custom containers
+- Full visibility into any tensor retrieved from targeted parts of the training jobs
- Real-time training job monitoring through Rules
- Automated anomaly detection and state assertions through built-in and custom Rules on SageMaker
- Actions on your training jobs based on the status of Rules
@@ -24,12 +39,101 @@ It supports TensorFlow, PyTorch, MXNet, and XGBoost on Python 3.6+.
- Distributed training support
- TensorBoard support
+See [How it works](#how-it-works) for more details.
+
+---
+
+## Install the smdebug library
+
+The `smdebug` library runs on Python 3. Install using the following command:
+
+```bash
+pip install smdebug
+```
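+
+To verify the installation, a quick check from a Python shell (a sketch; assumes the package exposes `__version__`):
+
+```python
+import smdebug
+
+# Print the installed version of the library
+print(smdebug.__version__)
+```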
+
+---
+
+## Debugger-supported Frameworks
+For a complete overview of Amazon SageMaker Debugger and how it works, see the [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) developer guide.
+
+### AWS Deep Learning Containers with zero code change
+Debugger is installed by default in AWS Deep Learning Containers with TensorFlow, PyTorch, MXNet, and XGBoost. The following framework containers enable you to use Debugger with no changes to your training script by automatically adding [SageMaker Debugger's Hook](docs/api.md#glossary).
+
+The following framework versions are available in AWS Deep Learning Containers for the zero script change experience.
+
+| Framework | Version |
+| --- | --- |
+| [TensorFlow](docs/tensorflow.md) | 1.15, 2.1, 2.2 |
+| [MXNet](docs/mxnet.md) | 1.6 |
+| [PyTorch](docs/pytorch.md) | 1.4, 1.5 |
+| [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 ([As a built-in algorithm](docs/xgboost.md#use-xgboost-as-a-built-in-algorithm))|
+
+### AWS training containers with script mode
+
+The `smdebug` library supports framework versions beyond those listed above when you use AWS containers with script mode. If you want to use SageMaker Debugger with one of the following framework versions, you need to make minimal changes to your training script.
+
+| Framework | Versions |
+| --- | --- |
+| [TensorFlow](docs/tensorflow.md) | 1.13, 1.14, 1.15, 2.1, 2.2 |
+| Keras (with TensorFlow backend) | 2.3 |
+| [MXNet](docs/mxnet.md) | 1.4, 1.5, 1.6 |
+| [PyTorch](docs/pytorch.md) | 1.2, 1.3, 1.4, 1.5 |
+| [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 (As a framework)|
+
+### Debugger on custom containers or local machines
+You can also use the full set of Debugger features in custom containers with the SageMaker Python SDK. Furthermore, `smdebug` is an open source library, so you can install it on your local machine for any advanced use cases that cannot be run in the SageMaker environment and for constructing `smdebug` custom hooks and rules.
+
+---
+
+## How It Works
+
+Amazon SageMaker Debugger uses the construct of a `Hook` to save the values of requested tensors throughout the training process. You can then set up a `Rule` job which simultaneously monitors and validates these tensors to ensure
+that training is progressing as expected.
+
+A `Rule` checks for issues such as vanishing gradients, exploding tensor values, or poor weight initialization. Rules are attached to Amazon CloudWatch events, so that when a rule is triggered it changes the state of the CloudWatch event.
+You can configure any action on the CloudWatch event, such as stopping the training job, saving you time and money.
+
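+For illustration, the following sketch polls a training job's Debugger rule evaluation statuses with `boto3` and stops the job when a rule reports issues. The job name is a placeholder, and a CloudWatch-triggered action could perform the same step automatically:
+
+```python
+import boto3
+
+sm_client = boto3.client("sagemaker")
+job_name = "my-training-job"  # placeholder: your training job name
+
+# DescribeTrainingJob reports the evaluation status of each attached rule
+description = sm_client.describe_training_job(TrainingJobName=job_name)
+for status in description.get("DebugRuleEvaluationStatuses", []):
+    if status["RuleEvaluationStatus"] == "IssuesFound":
+        # Stop the job to avoid spending compute on a failing run
+        sm_client.stop_training_job(TrainingJobName=job_name)
+```
+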
+Debugger can be used inside or outside of SageMaker. However, the built-in rules that AWS provides are only available for SageMaker training. Usage scenarios can be classified into the following four cases.
+
+#### Using SageMaker Debugger on AWS Deep Learning Containers with zero training script change
+
+Use Debugger's built-in hook configurations and rules while setting up the estimator, and monitor your training job.
+
+For a full guide and examples of using the built-in rules, see [Running a Rule with zero script change on AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/use-debugger-built-in-rules.html).
+
+To see a complete list of built-in rules and their functionalities, see [List of Debugger Built-in Rules](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html).
+
+#### Using SageMaker Debugger on AWS training containers with script mode
+
+You can use Debugger with your training script on your own container, making only minimal modifications to add Debugger's `Hook`.
+For an example code template to use Debugger on your own container with the TensorFlow 2.x framework, see [Run Debugger in custom container](#run-debugger-in-custom-container).
+See the following instruction pages to set up Debugger in your preferred framework; a generic sketch of the change follows the list.
+ - [TensorFlow](docs/tensorflow.md)
+ - [MXNet](docs/mxnet.md)
+ - [PyTorch](docs/pytorch.md)
+ - [XGBoost](docs/xgboost.md)
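+
+The following is a generic sketch of the kind of minimal change involved, shown for TensorFlow; the framework pages above give the exact details:
+
+```python
+import smdebug.tensorflow as smd
+
+# On SageMaker, the hook configuration is read from a JSON file that the
+# platform writes into the training container
+hook = smd.KerasHook.create_from_json_file()
+```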
+
+#### Using SageMaker Debugger on custom containers
+
+Debugger is available for any deep learning models that you bring to Amazon SageMaker. The AWS CLI, the SageMaker Estimator API, and the Debugger APIs enable you to use any Docker base images to build and customize containers to train and debug your models. To use Debugger with customized containers, go to [Use Debugger in Custom Training Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html).
+
+#### Using SageMaker Debugger on a non-SageMaker environment
+
+Using the smdebug library, you can create custom hooks and rules (or manually analyze the tensors) and modify your training script to enable tensor analysis in a non-SageMaker environment, such as your local machine. For an example of this, see [Run Debugger locally](#run-debugger-locally).
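+
+For instance, once a hook has saved tensors to an output directory, a minimal sketch of local analysis with smdebug (the output path is a placeholder) looks like this:
+
+```python
+from smdebug.trials import create_trial
+
+# Point the trial at the directory the hook wrote tensors to
+trial = create_trial("~/smd_outputs/")
+print(trial.tensor_names())
+```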
+
+---
+
## Examples
-### Notebooks
-We have a bunch of [example notebooks](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger) here demonstrating different functionality of SageMaker Debugger.
-### Running a Rule with Zero Script Change on SageMaker
-This example uses a zero-script-change experience, where you can use your training script as-is. Refer [Running SageMaker jobs with Amazon SageMaker Debugger](docs/sagemaker.md) for more details on this.
+### SageMaker Notebook Examples
+
+To find a collection of demonstrations using Debugger, see [SageMaker Debugger Example Notebooks](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger).
+
+#### Run Debugger rules with zero script change
+
+This example shows how to use Debugger with the Zero Script Change experience, running
+your training script as-is on a SageMaker DLC.
+
```python
import sagemaker as sm
from sagemaker.debugger import rule_configs, Rule, CollectionConfig
@@ -48,7 +152,7 @@ rule = Rule.sagemaker(
# Pass the rule to the estimator
sagemaker_simple_estimator = sm.tensorflow.TensorFlow(
- entry_point="script.py",
+    entry_point="script.py", # replace script.py with your own training script
role=sm.get_execution_role(),
framework_version="1.15",
py_version="py3",
@@ -65,18 +169,45 @@ print(f"Saved these tensors: {trial.tensor_names()}")
print(f"Loss values during evaluation were {trial.tensor('CrossEntropyLoss:0').values(mode=smd.modes.EVAL)}")
```
-That's it! Amazon SageMaker will automatically monitor your training job for you with the Rules specified and create a CloudWatch
-event which tracks the status of the Rule, so you can take any action based on them.
+That's it! When you configure the `sagemaker_simple_estimator`,
+you simply specify the `entry_point` to your training script Python file.
+When you run the `sagemaker_simple_estimator.fit()` API,
+SageMaker automatically monitors your training job with the specified Rules and creates a CloudWatch event that tracks the status of each Rule,
+so you can take any action based on it.
-If you want greater configuration and control, we offer that too. Head over [here](docs/sagemaker.md) for more information.
+If you want additional configuration and control, see [Running SageMaker jobs with Debugger](docs/sagemaker.md) for more information.
-### Running Locally
-Requires Python 3.6+, and this example uses tf.keras. Run
-```
-pip install smdebug
+#### Run Debugger in custom container
+
+The following example shows how to construct a `hook` to debug a training model using Debugger in your own container.
+This example is for containers using the TensorFlow 2.x framework, with `GradientTape` used to configure the `hook`.
+
+```python
+import tensorflow as tf
+import smdebug.tensorflow as smd
+
+hook = smd.KerasHook(out_dir=args.out_dir)
+
+model = tf.keras.models.Sequential([ ... ])
+# `cce` (loss function), `opt` (optimizer), `train_acc_metric` (Keras metric),
+# `n_epochs`, and `dataset` are assumed to be defined elsewhere in the script
+for epoch in range(n_epochs):
+    for data, labels in dataset:
+        dataset_labels = labels
+        # wrap the tape to capture tensors
+        with hook.wrap_tape(tf.GradientTape(persistent=True)) as tape:
+            logits = model(data, training=True)  # (32, 10)
+            loss_value = cce(labels, logits)
+        grads = tape.gradient(loss_value, model.variables)
+        opt.apply_gradients(zip(grads, model.variables))
+        acc = train_acc_metric(dataset_labels, logits)
+        # manually save metric values
+        hook.record_tensor_value(tensor_name="accuracy", tensor_value=acc)
```
-To use Amazon SageMaker Debugger, simply add a callback hook:
+To see the full script, refer to the [tf_keras_gradienttape.py](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow2/scripts/tf_keras_gradienttape.py) example script.
+For a notebook example of using BYOC in PyTorch, see [Using Amazon SageMaker Debugger with Your Own PyTorch Container](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/pytorch_custom_container/pytorch_byoc_smdebug.ipynb).
+
+#### Run Debugger locally
+This example shows how to use Debugger for the Keras `model.fit()` API.
+
+To use Debugger, simply add a callback `hook`:
```python
import smdebug.tensorflow as smd
hook = smd.KerasHook(out_dir='~/smd_outputs/')
@@ -97,34 +228,28 @@ print(f"Saved these tensors: {trial.tensor_names()}")
print(f"Loss values during evaluation were {trial.tensor('CrossEntropyLoss:0').values(mode=smd.modes.EVAL)}")
```
-## How It Works
+---
-Amazon SageMaker Debugger uses the construct of a `Hook` to save the values of requested tensors throughout the training process. You can then setup a `Rule` job which simultaneously monitors and validates these tensors to ensure
-that training is progressing as expected.
-A rule might check for vanishing gradients, or exploding tensor values, or poor weight initialization. Rules are attached to CloudWatch events, so that when a rule is triggered it changes the state of the CloudWatch event. You can configure any action on the CloudWatch event, such as to stop the training job saving you time and money.
+## SageMaker Debugger in Action
+- Through the model pruning process using Debugger and `smdebug`, you can iteratively identify the importance of weights and cut neurons below a threshold you define. This process allows you to train the model with significantly fewer neurons, which means a lighter, more efficient, faster, and cheaper model without compromising accuracy. The following accuracy versus number of parameters graph is produced in Studio. It shows that the model accuracy started at about 0.9 with 12 million parameters (the data point moves from right to left as pruning proceeds), improved during the first few pruning iterations, maintained that accuracy until the number of parameters was cut down to 6 million, and started sacrificing accuracy afterwards.
+
+
+
+Debugger provides tools to monitor this kind of training process and gives you complete control over your model. See the [Using SageMaker Debugger and SageMaker Experiments for iterative model pruning](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/pytorch_iterative_model_pruning/iterative_model_pruning_resnet.ipynb) notebook for the full example and more information.
-Amazon SageMaker Debugger can be used inside or outside of SageMaker. However the built-in rules that AWS provides are only available for SageMaker training. Scenarios of usage can be classified into the following:
-- **SageMaker Zero-Script-Change**: Here you specify which rules to use when setting up the estimator and run your existing script, no changes needed. See the first example above.
-- **SageMaker Bring-Your-Own-Container**: Here you specify the rules to use, and modify your training script minimally to enable SageMaker Debugger.
-- **Non-SageMaker**: Here you write custom rules (or manually analyze the tensors) and modify your training script minimally to enable SageMaker Debugger. See the second example above.
+- Use Debugger with XGBoost in SageMaker Studio to save feature importance values and plot them in a notebook during training. 
-The reason for different setups is that SageMaker Zero-Script-Change (via AWS Deep Learning Containers) uses custom framework forks of TensorFlow, PyTorch, MXNet, and XGBoost which add our Hook to the training job and save requested tensors automatically.
-These framework forks are not available in custom containers or non-SM environments, so you must modify your training script in these environments.
+- Use Debugger with TensorFlow in SageMaker Studio to run built-in rules and visualize the loss. 
-## Docs
+---
+
+## Further Documentation and References
| Section | Description |
| --- | --- |
| [SageMaker Training](docs/sagemaker.md) | SageMaker users, we recommend you start with this page on how to run SageMaker training jobs with SageMaker Debugger |
| Frameworks: [TensorFlow](docs/tensorflow.md), [PyTorch](docs/pytorch.md), [MXNet](docs/mxnet.md), [XGBoost](docs/xgboost.md) | See the frameworks pages for details on what's supported and how to modify your training script if applicable |
| [APIs for Saving Tensors](docs/api.md) | Full description of our APIs on saving tensors |
-| [Programming Model for Analysis](docs/analysis.md) | For description of the programming model provided by our APIs which allows you to perform interactive exploration of tensors saved as well as to write your own Rules monitoring your training jobs. |
-
-
-## SageMaker Debugger in action
-- Using SageMaker Debugger with XGBoost in SageMaker Studio to save feature importance values and plot them in a notebook during training. 
-- Using SageMaker Debugger with TensorFlow in SageMaker Studio to run built-in rules and visualize the loss. 
-
+| [Programming Model for Analysis](docs/analysis.md) | A description of the programming model provided by the APIs that enables you to interactively explore saved tensors and to write your own Rules to monitor your training jobs. |
+
## License
diff --git a/docs/api.md b/docs/api.md
index 8ebb26110..778cf3e46 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -47,24 +47,37 @@ you're in. Defaults to "global".
## Hook
### Creating a Hook
-Note that when using Zero Script Change supported containers in SageMaker, you generally do not need to create your hook object except for some advanced use cases where you need access to the hook.
+By using AWS Deep Learning Containers, you can directly run your own training script without any additional effort to make it compatible with the SageMaker Python SDK. For a detailed developer guide, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html).
-`HookClass` or `hook_class` below will be `Hook` for PyTorch, MXNet, and XGBoost. It will be one of `KerasHook`, `SessionHook` or `EstimatorHook` for TensorFlow.
+However, for some advanced use cases where you need access to customized tensors from targeted parts of a training script, you can manually construct the hook object. The smdebug library provides hook classes to make this process simple and compatible with the SageMaker ecosystem and Debugger.
-The framework in `smd` import below refers to one of `tensorflow`, `mxnet`, `pytorch` or `xgboost`.
-
-#### Hook when using SageMaker Python SDK
+#### Hook when using the SageMaker Python SDK
If you create a SageMaker job and specify the hook configuration in the SageMaker Estimator API
as described in [AWS Docs](https://docs.aws.amazon.com/sagemaker/latest/dg/train-model.html),
-a JSON file containing the hook configuration will be automatically written to the training container. In such a case, you can create a hook from that configuration file by calling
+a JSON file containing the hook configuration will be automatically written to the training container by the CreateTrainingJob API operation.
+
+To capture tensors from your training model, add the following code at the top of the training script or in its main function.
```python
-import smdebug.{framework} as smd
-hook = smd.{hook_class}.create_from_json_file()
+import smdebug.Framework as smd
+hook = smd.HookClass.create_from_json_file()
```
-with no arguments and then use the hook Python API in your script.
+
+Depending on your choice of framework, `HookClass` needs to be replaced by one of `KerasHook`, `SessionHook`, or `EstimatorHook` for TensorFlow, and by `Hook` for PyTorch, MXNet, and XGBoost.
+
+The `Framework` in the `smdebug.Framework` import refers to one of `tensorflow`, `mxnet`, `pytorch`, or `xgboost`.
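+
+For example, with TensorFlow and Keras, the two placeholder lines above become the following (a sketch; the other frameworks follow the same pattern):
+
+```python
+import smdebug.tensorflow as smd
+
+hook = smd.KerasHook.create_from_json_file()
+```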
+
+After choosing a framework and defining the hook object, embed the hooks into the target parts of your training script to retrieve tensors and use them with the SageMaker Debugger Python SDK.
+
+For more information about constructing the hook for your framework of choice and adding it to your model, see the following pages.
+
+* [TensorFlow hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/tensorflow.md)
+* [MXNet hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/mxnet.md)
+* [PyTorch hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/pytorch.md)
+* [XGBoost hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/xgboost.md)
#### Configuring Hook using SageMaker Python SDK
-Parameters to the Hook are passed as below when using the SageMaker Python SDK.
+After you make the minimal changes to your training script, you can configure the hook by passing parameters to the SageMaker Python SDK's `DebuggerHookConfig` class.
+
```python
from sagemaker.debugger import DebuggerHookConfig
hook_config = DebuggerHookConfig(
@@ -73,7 +86,9 @@ hook_config = DebuggerHookConfig(
"parameter": "value"
})
```
-The parameters can be one of the following. The meaning of these parameters will be clear as you review the sections of documentation below. Note that all parameters below have to be strings. So for any parameter which accepts a list (such as save_steps, reductions, include_regex), the value needs to be given as strings separated by a comma between them.
+
+The available hook parameters are listed below. The meaning of these parameters will become clear as you review the sections of documentation that follow. Note that all parameters have to be strings, so for any parameter that accepts a list (such as save_steps, reductions, include_regex), the value must be given as a comma-separated string.
+
```
dry_run
save_all
@@ -147,7 +162,8 @@ Note that `smd` import below translates to `import smdebug.{framework} as smd`.
|`set_mode(mode)`| value of the enum `smd.modes` | Sets mode of the job, can be one of `smd.modes.TRAIN`, `smd.modes.EVAL`, `smd.modes.PREDICT` or `smd.modes.GLOBAL`. Refer [Modes](#modes) for more on that. |
|`create_from_json_file(` `json_file_path=None)` | `json_file_path (str)` | Takes the path of a file which holds the json configuration of the hook, and creates hook from that configuration. This is an optional parameter. If this is not passed it tries to get the file path from the value of the environment variable `SMDEBUG_CONFIG_FILE_PATH` and defaults to `/opt/ml/input/config/debughookconfig.json`. When training on SageMaker you do not have to specify any path because this is the default path that SageMaker writes the hook configuration to.
|`close()` | - | Closes all files that are currently open by the hook |
-| `save_scalar(` `name, ` `value, ` `sm_metric=False)` | `name (str)` `value (float)` `sm_metric (bool)`| Saves a scalar value by the given name. Passing `sm_metric=True` flag also makes this scalar available as a SageMaker Metric to show up in SageMaker Studio. Note that when `sm_metric` is False, this scalar always resides only in your AWS account, but setting it to True saves the scalar also on AWS servers. The default value of `sm_metric` for this method is False. |
+| `save_scalar()` | `name (str)`, `value (float)`, `sm_metric (bool)` | Saves a scalar value by the given name. Passing `sm_metric=True` flag also makes this scalar available as a SageMaker Metric to show up in SageMaker Studio. Note that when `sm_metric` is False, this scalar always resides only in your AWS account, but setting it to True saves the scalar also on AWS servers. The default value of `sm_metric` for this method is False. |
+
### TensorFlow specific Hook API
Note that there are three types of Hooks in TensorFlow: SessionHook, EstimatorHook and KerasHook based on the TensorFlow interface being used for training. [This page](tensorflow.md) shows examples of each of these.
@@ -157,12 +173,12 @@ Note that there are three types of Hooks in TensorFlow: SessionHook, EstimatorHo
| `wrap_optimizer(optimizer)` | `optimizer` (tf.train.Optimizer or tf.keras.Optimizer) | Returns the same optimizer object passed with a couple of identifying markers to help `smdebug`. This returned optimizer should be used for training. | When not using Zero Script Change environments, calling this method on your optimizer is necessary for SageMaker Debugger to identify and save gradient tensors. Note that this method returns the same optimizer object passed and does not change your optimization logic. If the hook is of type `KerasHook`, you can pass in either an object of type `tf.train.Optimizer` or `tf.keras.Optimizer`. If the hook is of type `SessionHook` or `EstimatorHook`, the optimizer can only be of type `tf.train.Optimizer`. This new
| `add_to_collection(` `collection_name, variable)` | `collection_name (str)` : name of the collection to add to. `variable` parameter to pass to the collection's `add` method. | `None` | Calls the `add` method of a collection object. See [this section](#collection) for more. |
-APIs specific to training scripts using TF 2.x GradientTape ([Example](tensorflow.md#TF 2.x GradientTape example)):
+The following hook APIs are specific to training scripts using the TF 2.x GradientTape ([example](tensorflow.md#keras-gradienttape-example-for-tf-20-and-above)):
| Method | Arguments | Returns | Behavior |
| --- | --- | --- | --- |
| `wrap_tape(tape)` | `tape` (tensorflow.python.eager.backprop.GradientTape) | Returns a tape object with three identifying markers to help `smdebug`. This returned tape should be used for training. | When not using Zero Script Change environments, calling this method on your tape is necessary for SageMaker Debugger to identify and save gradient tensors. Note that this method returns the same tape object passed.
-| `record_tensor_value(` `tensor_name, tensor_value)` | `tensor_name (str)` : name of the tensor to save. `tensor_value` EagerTensor to save. | `None` | Manually save metrics tensors while using TF 2.x GradientTape. |
+| `save_tensor()`| `tensor_name (str)`, `tensor_value (float)`, `collections_to_write (str)` | - | Manually save metrics tensors while using TF 2.x GradientTape. Note: `record_tensor_value()` is deprecated.|
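+
+For example, inside a `wrap_tape` block you can save an accuracy metric as follows (a sketch; `acc` is assumed to be a metric value computed during training):
+
+```python
+hook.save_tensor(tensor_name="accuracy", tensor_value=acc, collections_to_write="metrics")
+```
+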
### MXNet specific Hook API
@@ -217,6 +233,7 @@ The names of these collections are all lower case strings.
| `losses` | TensorFlow, PyTorch, MXNet | Saves the loss for the model |
| `metrics` | TensorFlow's KerasHook, XGBoost | For KerasHook, saves the metrics computed by Keras for the model. For XGBoost, the evaluation metrics computed by the algorithm. |
| `outputs` | TensorFlow's KerasHook | Matches the outputs of the model |
+| `layers` | TensorFlow's KerasHook | Input and output of intermediate convolutional layers |
| `sm_metrics` | TensorFlow | You can add scalars that you want to show up in SageMaker Metrics to this collection. SageMaker Debugger will save these scalars both to the out_dir of the hook, as well as to SageMaker Metric. Note that the scalars passed here will be saved on AWS servers outside of your AWS account. |
| `optimizer_variables` | TensorFlow's KerasHook | Matches all optimizer variables, currently only supported in Keras. |
| `hyperparameters` | XGBoost | [Booster paramameters](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html) |
diff --git a/docs/mxnet.md b/docs/mxnet.md
index fb42ef8c4..e54d50055 100644
--- a/docs/mxnet.md
+++ b/docs/mxnet.md
@@ -35,8 +35,9 @@ If using SageMaker, you will configure the hook in SageMaker's python SDK using
#### 2. Register the model to the hook
Call `hook.register_block(net)`.
-#### 3. (Optional) Configure Collections, SaveConfig and ReductionConfig
-See the [Common API](api.md) page for details on how to do this.
+#### 3. Take actions using the hook APIs
+
+For a full list of actions that the hook APIs offer to construct hooks and save tensors, see [Common hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#common-hook-api) and [MXNet specific hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#mxnet-specific-hook-api).
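+
+A minimal sketch of creating and registering the hook outside SageMaker (the output path and the Gluon `net` are assumptions):
+
+```python
+import smdebug.mxnet as smd
+
+# On SageMaker, prefer smd.Hook.create_from_json_file()
+hook = smd.Hook(out_dir="/tmp/smd_outputs")
+hook.register_block(net)
+```
+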
---
diff --git a/docs/pytorch.md b/docs/pytorch.md
index 5bb1beff6..5a9f380df 100644
--- a/docs/pytorch.md
+++ b/docs/pytorch.md
@@ -36,8 +36,9 @@ Call `hook.register_module(net)`.
If using a loss which is a subclass of `nn.Module`, call `hook.register_loss(loss_criterion)` once before starting training.\
If using a loss which is a subclass of `nn.functional`, call `hook.record_tensor_value(loss)` after each training step.
-#### 4. (Optional) Configure Collections, SaveConfig and ReductionConfig
-See the [Common API](api.md) page for details on how to do this.
+#### 4. Take actions using the hook APIs
+
+For a full list of actions that the hook APIs offer to construct hooks and save tensors, see [Common hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#common-hook-api) and [PyTorch specific hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#pytorch-specific-hook-api).
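+
+A minimal sketch of creating and registering the hook outside SageMaker (the output path, `net`, and `loss_criterion` are assumptions):
+
+```python
+import smdebug.pytorch as smd
+
+# On SageMaker, prefer smd.Hook.create_from_json_file()
+hook = smd.Hook(out_dir="/tmp/smd_outputs")
+hook.register_module(net)
+hook.register_loss(loss_criterion)  # if the loss is a subclass of nn.Module
+```
+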
---
diff --git a/docs/resources/results_resnet.png b/docs/resources/results_resnet.png
new file mode 100644
index 000000000..614b92b99
Binary files /dev/null and b/docs/resources/results_resnet.png differ
diff --git a/docs/sagemaker.md b/docs/sagemaker.md
index e2968f8d8..22a5ea00c 100644
--- a/docs/sagemaker.md
+++ b/docs/sagemaker.md
@@ -1,9 +1,6 @@
## Running SageMaker jobs with Amazon SageMaker Debugger
-## Outline
-- [Enabling SageMaker Debugger](#enabling-sagemaker-debugger)
- - [Zero Script Change](#zero-script-change)
- - [Bring your own training container](#bring-your-own-training-container)
+### Outline
- [Configuring SageMaker Debugger](#configuring-sagemaker-debugger)
- [Saving data](#saving-data)
- [Saving built-in collections that we manage](#saving-built-in-collections-that-we-manage)
@@ -17,44 +14,6 @@
- [TensorBoard Visualization](#tensorboard-visualization)
- [Example Notebooks](#example-notebooks)
-## Enabling SageMaker Debugger
-There are two ways in which you can enable SageMaker Debugger while training on SageMaker.
-
-### Zero Script Change
-We have equipped the official Framework containers on SageMaker with custom versions of supported frameworks TensorFlow, PyTorch, MXNet and XGBoost. These containers enable you to use SageMaker Debugger with no changes to your training script, by automatically adding [SageMaker Debugger's Hook](api.md#glossary).
-
-Here's a list of frameworks and versions which support this experience.
-
-| Framework | Version |
-| --- | --- |
-| [TensorFlow](tensorflow.md) | 1.15, 2.1, 2.2 |
-| [MXNet](mxnet.md) | 1.6 |
-| [PyTorch](pytorch.md) | 1.4, 1.5 |
-| [XGBoost](xgboost.md) | >=0.90-2 [As Built-in algorithm](xgboost.md#use-xgboost-as-a-built-in-algorithm)|
-
-More details for the deep learning frameworks on which containers these are can be found here: [SageMaker Framework Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) and [AWS Deep Learning Containers](https://aws.amazon.com/machine-learning/containers/). You do not have to specify any training container image if you want to use them on SageMaker. You only need to specify the version above to use these containers.
-
-### Bring your own training container
-
-This library `smdebug` itself supports versions other than the ones listed above. If you want to use SageMaker Debugger with a version different from the above, you will have to orchestrate your training script with a few lines. Before we discuss how these changes look like, let us take a look at the versions supported.
-
-| Framework | Versions |
-| --- | --- |
-| [TensorFlow](tensorflow.md) | 1.13, 1.14, 1.15, 2.1, 2.2 |
-| Keras (with TensorFlow backend) | 2.3 |
-| [MXNet](mxnet.md) | 1.4, 1.5, 1.6 |
-| [PyTorch](pytorch.md) | 1.2, 1.3, 1.4, 1.5 |
-| [XGBoost](xgboost.md) | 0.90-2, 1.0-1 |
-
-#### Setting up SageMaker Debugger with your script on your container
-
-- Ensure that you are using Python3 runtime as `smdebug` only supports Python3.
-- Install `smdebug` binary through `pip install smdebug`
-- Make some minimal modifications to your training script to add SageMaker Debugger's Hook. Please refer to the framework pages linked below for instructions on how to do that.
- - [TensorFlow](tensorflow.md)
- - [PyTorch](pytorch.md)
- - [MXNet](mxnet.md)
- - [XGBoost](xgboost.md)
## Configuring SageMaker Debugger
@@ -185,17 +144,8 @@ Note that passing a `CollectionConfig` object to the Rule as `collections_to_sav
is equivalent to passing it to the `DebuggerHookConfig` object as `collection_configs`.
This is just a shortcut for your convenience.
-#### Built in Rules
-The Built-in Rules, or SageMaker Rules, are described in detail on [this page](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html)
-
-
-Scope of Validity | Rules |
-|---|---|
-| Generic Deep Learning models (TensorFlow, Apache MXNet, and PyTorch) |- [`dead_relu`](https://docs.aws.amazon.com/sagemaker/latest/dg/dead-relu.html)
- [`exploding_tensor`](https://docs.aws.amazon.com/sagemaker/latest/dg/exploding-tensor.html)
- [`poor_weight_initialization`](https://docs.aws.amazon.com/sagemaker/latest/dg/poor-weight-initialization.html)
- [`saturated_activation`](https://docs.aws.amazon.com/sagemaker/latest/dg/saturated-activation.html)
- [`vanishing_gradient`](https://docs.aws.amazon.com/sagemaker/latest/dg/vanishing-gradient.html)
- [`weight_update_ratio`](https://docs.aws.amazon.com/sagemaker/latest/dg/weight-update-ratio.html)
|
-| Generic Deep learning models (TensorFlow, MXNet, and PyTorch) and the XGBoost algorithm | - [`all_zero`](https://docs.aws.amazon.com/sagemaker/latest/dg/all-zero.html)
- [`class_imbalance`](https://docs.aws.amazon.com/sagemaker/latest/dg/class-imbalance.html)
- [`confusion`](https://docs.aws.amazon.com/sagemaker/latest/dg/confusion.html)
- [`loss_not_decreasing`](https://docs.aws.amazon.com/sagemaker/latest/dg/loss-not-decreasing.html)
- [`overfit`](https://docs.aws.amazon.com/sagemaker/latest/dg/overfit.html)
- [`overtraining`](https://docs.aws.amazon.com/sagemaker/latest/dg/overtraining.html)
- [`similar_across_runs`](https://docs.aws.amazon.com/sagemaker/latest/dg/similar-across-runs.html)
- [`tensor_variance`](https://docs.aws.amazon.com/sagemaker/latest/dg/tensor-variance.html)
- [`unchanged_tensor`](https://docs.aws.amazon.com/sagemaker/latest/dg/unchanged-tensor.html)
|
-| Deep learning applications |- [`check_input_images`](https://docs.aws.amazon.com/sagemaker/latest/dg/checkinput-mages.html)
- [`nlp_sequence_ratio`](https://docs.aws.amazon.com/sagemaker/latest/dg/nlp-sequence-ratio.html)
|
-| XGBoost algorithm | - [`tree_depth`](https://docs.aws.amazon.com/sagemaker/latest/dg/tree-depth.html)
|
-
+#### Built-in Rules
+To find a full list of built-in rules that you can use with the SageMaker Python SDK, see the [List of Debugger Built-in Rules](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html) page.
#### Running built-in SageMaker Rules
You can run a SageMaker built-in Rule as follows using the `Rule.sagemaker` method.
@@ -234,10 +184,11 @@ sagemaker_estimator = sm.tensorflow.TensorFlow(
framework_version="1.15",
py_version="py3",
# smdebug-specific arguments below
- rules=[exploding_tensor_rule, vanishing_gradient_rule],
+ rules=[exploding_tensor_rule, vanishing_gradient_rule]
)
sagemaker_estimator.fit()
```
+
#### Custom Rules
You can write your own rule custom made for your application and provide it, so SageMaker can monitor your training job using your rule. To do so, you need to understand the programming model that `smdebug` provides. Our page on [Programming Model for Analysis](analysis.md) describes the APIs that we provide to help you write your own rule.
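+
+As an illustrative sketch, a custom rule subclasses `Rule` from `smdebug` and implements `invoke_at_step`; the class name, the threshold, and the collection used below are assumptions for this example:
+
+```python
+from smdebug.rules.rule import Rule
+
+class GradientTooLarge(Rule):
+    def __init__(self, base_trial, threshold=10.0):
+        super().__init__(base_trial)
+        self.threshold = float(threshold)
+
+    def invoke_at_step(self, step):
+        # Returning True marks the rule as triggered at this step
+        for tname in self.base_trial.tensor_names(collection="gradients"):
+            mean = self.base_trial.tensor(tname).reduction_value(step, "mean", abs=True)
+            if mean > self.threshold:
+                return True
+        return False
+```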
diff --git a/docs/tensorflow.md b/docs/tensorflow.md
index 968dd5230..1f8d7e5d9 100644
--- a/docs/tensorflow.md
+++ b/docs/tensorflow.md
@@ -3,70 +3,116 @@
## Contents
- [Support](#support)
- [How to Use](#how-to-use)
-- [tf.keras Example](#tfkeras)
-- [MonitoredSession Example](#monitoredsession)
-- [Estimator Example](#estimator)
-- [Full API](#full-api)
+- [Code Structure Samples](#examples)
+- [References](#references)
---
## Support
+**Zero script change experience** — No modification is needed to your training script to enable the Debugger features while using the [official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html).
+
+**Script mode experience** — The smdebug library supports training jobs using the TensorFlow framework with script mode through its API operations. This option requires minimal changes to your training script, and the smdebug library provides hook features to help you implement Debugger and analyze tensors.
+
### Versions
-- Zero Script Change experience where you need no modifications to your training script is supported in the official [SageMaker Framework Container for TensorFlow 1.15](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html), or the [AWS Deep Learning Container for TensorFlow 1.15](https://aws.amazon.com/machine-learning/containers/).
+For a full list of TensorFlow framework versions supported by Debugger, see [AWS Deep Learning Containers and SageMaker training containers](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers).
-- This library itself supports the following versions when you use our API which requires a few minimal changes to your training script: TensorFlow 1.14, 1.15, 2.0+. Keras 2.3.
+### Distributed training supported by Debugger
+- Horovod and Mirrored Strategy multi-GPU distributed training is supported.
+- Parameter server based distributed training is currently not supported.
-### Interfaces
-- TF 1.x:
- - [Estimator](https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/estimator)
- - [tf.keras](https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/keras)
- - [MonitoredSession](https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/train/MonitoredSession?hl=en)
-- TF 2.x:
- - [Estimator](https://www.tensorflow.org/versions/r2.1/api_docs/python/tf/estimator)
- - [tf.keras](https://www.tensorflow.org/versions/r2.1/api_docs/python/tf/keras)
+---
+## How to Use
+### Debugger with AWS Deep Learning Containers and zero script change
-### Distributed training
-- [MirroredStrategy](https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/distribute/MirroredStrategy) or [Contrib MirroredStrategy](https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/contrib/distribute/MirroredStrategy)
+The Debugger features are all integrated into the AWS Deep Learning Containers, and you can run your training script with zero script change. For example code using a high-level SageMaker TensorFlow estimator with Debugger, see [Debugger in TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow).
-We will very quickly follow up with support for Parameter Server based training.
+### Debugger with AWS training containers and script mode
----
+If you want to run your own training script and debug using the SageMaker TensorFlow framework with script mode, the smdebug client library provides hook constructors that you can add to the training script to retrieve tensors.
-## How to Use
-### Using Zero Script Change containers
-In this case, you don't need to do anything to get the hook running. You are encouraged to configure the hook from the SageMaker python SDK so you can run different jobs with different configurations without having to modify your script. If you want access to the hook to configure certain things which can not be configured through the SageMaker SDK, you can retrieve the hook as follows.
-```
+#### 1. Create a hook
+
+To create a hook, add the following code to your training script.
+
+```python
import smdebug.tensorflow as smd
hook = smd.{hook_class}.create_from_json_file()
```
-Note that you can create the hook from smdebug's python API as is being done in the next section even in such containers.
-### Bring your own container experience
-#### 1. Create a hook
-If using SageMaker, you will configure the hook in SageMaker's python SDK using the Estimator class. Instantiate it with
-`smd.{hook_class}.create_from_json_file()`. Otherwise, call the hook class constructor, `smd.{hook_class}()`. Details are below for tf.keras, MonitoredSession, or Estimator.
+Depending on the TensorFlow API your model uses, you need to choose a hook class. There are three hook constructor classes that you can pick to replace `{hook_class}`: `KerasHook`, `SessionHook`, and `EstimatorHook`.
+
+#### KerasHook
+
+Use if you use the Keras `model.fit()` API. This is available for all versions of Keras and TensorFlow. `KerasHook` covers the eager execution modes and the gradient tape feature introduced in TensorFlow version 2.0. For example, you can create the Keras hook by adding the following code to your training script.
+```python
+hook = smd.KerasHook.create_from_json_file()
+```
+To learn how to fully implement the hook to your training script, see the [Keras with the TensorFlow gradient tape and the smdebug hook example scripts](https://github.com/awslabs/sagemaker-debugger/tree/master/examples/tensorflow2/scripts).
+
+> **Note**: If you use the AWS Deep Learning Containers for zero script change, Debugger collects most tensors regardless of the eager execution mode, through its high-level API.
+
+#### SessionHook
+
+Use if your model is created in TensorFlow version 1.x with the low-level approach, not using the Keras API. This is for the TensorFlow 1.x monitored training session API, `tf.train.MonitoredSession()`.
+
+```python
+hook = smd.SessionHook.create_from_json_file()
+```
+
+To learn how to fully implement the hook into your training script, see the [TensorFlow monitored training session with the smdebug hook example script](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow/sagemaker_byoc/simple.py).
+
+> **Note**: The official TensorFlow library deprecated the `tf.train.MonitoredSession()` API in favor of `tf.function()` in TF 2.0 and above. You can use `SessionHook` for `tf.function()` in TF 2.0 and above.
+
+#### EstimatorHook
+
+Use if you have a model using the `tf.estimator` API. Available for any TensorFlow framework version that supports the `tf.estimator` API.
+
+```python
+hook = smd.EstimatorHook.create_from_json_file()
+```
+
+To learn how to fully implement the hook into your training script, see the [simple MNIST training script with the TensorFlow estimator](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow/sagemaker_byoc/simple.py).
#### 2. Register the hook to your model
-The argument is `callbacks=[hook]` for tf.keras. It is `hooks=[hook]` for MonitoredSession and Estimator.
-#### 3. Wrap the optimizer
-If you would like to save `gradients`, wrap your optimizer with the hook as follows `optimizer = hook.wrap_optimizer(optimizer)`. This does not modify your optimization logic, and returns the same optimizer instance passed to the method.
+To collect tensors from the hooks that you implemented, add `callbacks=[hook]` to the Keras `model.fit()` API and `hooks=[hook]` to the `MonitoredSession()`, `tf.function()`, and `tf.estimator()` APIs.
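+
+For example, with the Keras API (a sketch; the model, data, and `args` are assumed):
+
+```python
+model.fit(x_train, y_train, epochs=args.epochs, callbacks=[hook])
+```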
+
+#### 3. Wrap the optimizer and the gradient tape
+
+The smdebug TensorFlow hook provides TensorFlow-specific tools to manually retrieve `gradients` tensors.
+
+If you want to save `gradients` from the optimizer of your model, wrap it with the hook as follows:
+```python
+optimizer = hook.wrap_optimizer(optimizer)
+```
+
+If you want to save `gradients` from the TensorFlow gradient tape feature, wrap it as follows:
+```python
+with hook.wrap_tape(tf.GradientTape(persistent=True)) as tape:
+```
+
+These wrappers capture the gradient tensors without affecting your optimization logic at all.
+
+For examples of code structure to apply the hook wrappers, see the [Examples](#examples) section.
+
+#### 4. Take actions using the hook APIs
-#### 4. (Optional) Configure Collections, SaveConfig and ReductionConfig
-See the [Common API](api.md) page for details on how to do this.
+For a full list of actions that the hook APIs offer to construct hooks and save tensors, see [Common hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#common-hook-api) and [TensorFlow specific hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#tensorflow-specific-hook-api).
+
+>**Note**: The `inputs`, `outputs`, and `layers` collections are not currently available for TensorFlow 2.1.
---
## Examples
-We have three Hooks for different interfaces of TensorFlow. The following is needed to enable SageMaker Debugger on non Zero Script Change supported containers. Refer [SageMaker training](sagemaker.md) on how to use the Zero Script Change experience.
+The following examples show the three different hook constructions of TensorFlow and the minimal changes needed to enable SageMaker Debugger when using the AWS containers with script mode. To learn how to use the high-level Debugger features with zero script change on AWS Deep Learning Containers, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html).
-## tf.keras
-### Example
+### Keras API (tf.keras)
```python
import smdebug.tensorflow as smd
+
hook = smd.KerasHook(out_dir=args.out_dir)
model = tf.keras.models.Sequential([ ... ])
@@ -79,9 +125,10 @@ model.fit(x_train, y_train, epochs=args.epochs, callbacks=[hook])
model.evaluate(x_test, y_test, callbacks=[hook])
```
-### TF 2.x GradientTape example
+### Keras GradientTape example for TF 2.0 and above
```python
import smdebug.tensorflow as smd
+
hook = smd.KerasHook(out_dir=args.out_dir)
model = tf.keras.models.Sequential([ ... ])
@@ -99,12 +146,10 @@ model = tf.keras.models.Sequential([ ... ])
hook.record_tensor_value(tensor_name="accuracy", tensor_value=acc)
```
----
-
-## MonitoredSession
-### Example
+### Monitored Session (tf.train.MonitoredSession)
```python
import smdebug.tensorflow as smd
+
hook = smd.SessionHook(out_dir=args.out_dir)
loss = tf.reduce_mean(tf.matmul(...), name="loss")
@@ -119,12 +164,10 @@ sess = tf.train.MonitoredSession(hooks=[hook])
sess.run([loss, ...])
```
----
-
-## Estimator
-### Example
+### Estimator (tf.estimator.Estimator)
```python
import smdebug.tensorflow as smd
+
hook = smd.EstimatorHook(out_dir=args.out_dir)
train_input_fn, eval_input_fn = ...
@@ -140,6 +183,20 @@ estimator.evaluate(input_fn=eval_input_fn, steps=args.steps, hooks=[hook])
---
-## Full API
+## References
+
+### The smdebug API for saving tensors
See the [API for saving tensors](api.md) page for details about the Hooks, Collection, SaveConfig, and ReductionConfig.
See the [Analysis](analysis.md) page for details about analyzing a training job.
+
+### TensorFlow References
+- TF 1.x:
+ - [tf.estimator](https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/estimator)
+ - [tf.keras](https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/keras)
+ - [tf.train.MonitoredSession](https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/train/MonitoredSession?hl=en)
+- TF 2.1:
+ - [tf.estimator](https://www.tensorflow.org/versions/r2.1/api_docs/python/tf/estimator)
+ - [tf.keras](https://www.tensorflow.org/versions/r2.1/api_docs/python/tf/keras)
+- TF 2.2:
+ - [tf.estimator](https://www.tensorflow.org/api_docs/python/tf/estimator)
+ - [tf.keras](https://www.tensorflow.org/versions/r2.2/api_docs/python/tf)
diff --git a/docs/xgboost.md b/docs/xgboost.md
index 14a8220bc..2ec65157a 100644
--- a/docs/xgboost.md
+++ b/docs/xgboost.md
@@ -80,6 +80,7 @@ def __init__(
validation_data = None,
)
```
+
Initializes the hook. Pass this object as a callback to `xgboost.train()`.
* `out_dir` (str): A path into which tensors and metadata will be written.
* `export_tensorboard` (bool): Whether to use TensorBoard logs.