From aa789c7dd0ff43dd06f271309726071b65bc016d Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Sun, 9 Aug 2020 23:47:35 -0700 Subject: [PATCH 01/28] update TF 2.2 smdebug features --- docs/tensorflow.md | 128 ++++++++++++++++++++++++++++++++++++--------- 1 file changed, 103 insertions(+), 25 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 1f8d7e5d9..ee4cc1ac9 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -2,7 +2,9 @@ ## Contents - [Support](#support) -- [How to Use](#how-to-use) +- [How to Use Debugger with TensorFlow](#how-to-use) + - [Debugger with AWS Deep Learning Containers](#debugger-dlc) + - [Debugger with other AWS training containers and custom containers](#debugger-script-change) - [Code Structure Samples](#examples) - [References](#references) @@ -10,48 +12,111 @@ ## Support -**Zero script change experience** — No modification is needed to your training script to enable the Debugger features while using the [official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html). - -**Script mode experience** — The smdebug library supports training jobs with the TensorFlow framework and script mode through its API operations. This option requires minimal changes to your training script, and the smdebug library provides you hook features to help implement Debugger and analyze tensors. +### Supported TensorFlow Versions -### Versions +The SageMaker Debugger python SDK and `smdebug` library now fully support TensorFlow 2.2 with the latest version release. Using Debugger, you can retrieve tensors from your TensorFlow models with either eager or non-eager mode, with Keras API or the pure TensorFlow framework. For a full list of TensorFlow framework versions to use Debugger, see [AWS Deep Learning Containers and SageMaker training containers](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). 
+**Zero script change experience** — No modification is needed to your training script to enable the Debugger features while using the [official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html).
+
+**Script mode experience** — The smdebug library supports training jobs with the TensorFlow framework and script mode through its API operations. This option requires minimal changes to your training script to register Debugger hooks, and the smdebug library provides hook features to help you implement Debugger and analyze saved tensors.
+
### Distributed training supported by Debugger
- Horovod and Mirrored Strategy multi-GPU distributed trainings are supported.
- Parameter server based distributed training is currently not supported.
---
-## How to Use
-### Debugger with AWS Deep Learning Containers and zero script change
+## How to Use Debugger
+
+### Debugger with AWS Deep Learning Containers
+
+The Debugger built-in rules and hook features are fully integrated into the AWS Deep Learning Containers, and you can run your training script without any script changes. When running training jobs on those Deep Learning Containers, Debugger registers its hooks automatically to your training script in order to retrieve tensors. For a comprehensive guide to using the high-level SageMaker TensorFlow estimator with Debugger, see [Debugger in TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow).
-The Debugger features are all integrated into the AWS Deep Learning Containers, and you can run your training script with zero script change. To find a high-level SageMaker TensorFlow estimator with Debugger example code, see [Debugger in TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow).
+The following code sample shows how to set up a SageMaker TensorFlow estimator with Debugger.
+
+```python
+from sagemaker.tensorflow import TensorFlow
+from sagemaker.debugger import Rule, DebuggerHookConfig, CollectionConfig, rule_configs
+
+tf_estimator = TensorFlow(
+    entry_point = "tf-train.py",
+    role = "SageMakerRole",
+    instance_count = 1,
+    instance_type = "ml.p2.xlarge",
+    framework_version = "2.2",
+    py_version = "py37",
+
+    # Debugger-specific Parameters
+    rules = [
+        Rule.sagemaker(rule_configs.vanishing_gradient()),
+        Rule.sagemaker(rule_configs.loss_not_decreasing()),
+        ...
+    ],
+    debugger_hook_config = DebuggerHookConfig(
+        collection_configs=[
+            CollectionConfig(name="inputs"),
+            CollectionConfig(name="outputs"),
+            CollectionConfig(name="layers"),
+            CollectionConfig(name="gradients"),
+            ...
+        ]
+    )
+)
+tf_estimator.fit("s3://bucket/path/to/training/data")
+```
+
+Available tensor collections that you can retrieve from TensorFlow training jobs for zero script change are as follows:
+
+| Name | Description|
+| --- | --- |
+| all | Matches all tensors. |
+| default | Includes "metrics", "losses", and "sm_metrics". |
+| metrics | For KerasHook, saves the metrics computed by Keras for the model. |
+| losses | Saves all losses of the model. |
+| sm_metrics | You can add scalars that you want to show up in SageMaker Metrics to this collection. SageMaker Debugger will save these scalars both to the out_dir of the hook, as well as to SageMaker Metrics. Note that the scalars passed here will be saved on AWS servers outside of your AWS account. |
+| inputs | Matches all inputs to the model. |
+| outputs | Matches all outputs of the model, such as predictions (logits) and labels. |
+| layers | Matches all inputs and outputs of intermediate layers. |
+| gradients | Matches all gradients of the model. In TensorFlow, when not using zero script change environments, you must use hook.wrap_optimizer() or hook.wrap_tape(). |
+| weights | Matches all weights of the model. |
+| biases | Matches all biases of the model. |
+| optimizer_variables | Matches all optimizer variables, currently only supported for Keras. |
+
+>**Note**: The `inputs`, `outputs`, and `layers` collections are not currently available for TensorFlow 2.1.

-### Debugger with AWS training containers and script mode
+### Debugger with other AWS training containers and custom containers
+
+If you want to run your own training script or custom container, there are two available options. One option is to use the SageMaker TensorFlow estimator in script mode on other AWS training containers (the SageMaker TensorFlow estimator runs in script mode by default from TensorFlow 2.1, so you do not need to specify the `script_mode` parameter). Another option is to use your custom container with your training script and push the container to Amazon ECR. In both cases, you need to manually register the Debugger hook to your training script. Depending on the TensorFlow models and API operations in your script, you need to pick the right hook class as introduced in the following steps.
+
+1. [Create a hook](#create-a-hook)
+  * [KerasHook](#kerashook)
+  * [SessionHook](#sessionhook)
+  * [EstimatorHook](#estimatorhook)
+2. [Wrap the optimizer and the gradient tape with the hook to retrieve gradient tensors](#wrap-opt-with-hook)
+3. [Register the hook to model.fit()](#register-a-hook)

-In case you want to run your own training script and debug using the SageMaker TensorFlow framework with script mode and Debugger, the smdebug client library provides the hook constructor that you can add to the training script and retrieve tensors.

#### 1. Create a hook

To create the hook constructor, add the following code to your training script. This will enable the `smdebug` tools for TensorFlow and create a TensorFlow hook object.
```python import smdebug.tensorflow as smd hook = smd.{hook_class}.create_from_json_file() ``` -Depending on the TensorFlow versions for your model, you need to choose a hook class. There are three hook constructor classes that you can pick and replace `{hook_class}`: `KerasHook`, `SessionHook`, and `EstimatorHook`. +Depending on TensorFlow versions and Keras API that was used in your training script, you need to choose the right hook class. There are three hook constructors for TensorFlow that you can choose: `KerasHook`, `SessionHook`, and `EstimatorHook`. #### KerasHook -Use if you use the Keras `model.fit()` API. This is available for all frameworks and versions of Keras and TensorFlow. `KerasHook` covers the eager execution modes and the gradient tape feature that are introduced from the TensorFlow framework version 2.0. For example, you can set the Keras hook constructor by adding the following code into your training script. +Use `KerasHook` if you use the Keras model zoo and a Keras `model.fit()` API. This is available for the Keras with TensorFlow backend interface. `KerasHook` covers the eager execution modes and the gradient tape features that are introduced from the TensorFlow framework version 2.0. You can set the smdebug Keras hook constructor by adding the following code into your training script. Place this code line before `model.compile()`. + ```python hook = smd.KerasHook.create_from_json_file() ``` + To learn how to fully implement the hook to your training script, see the [Keras with the TensorFlow gradient tape and the smdebug hook example scripts](https://github.com/awslabs/sagemaker-debugger/tree/master/examples/tensorflow2/scripts). -> **Note**: If you use the AWS Deep Learning Containers for zero script change, Debugger collects the most of tensors regardless the eager execution modes, through its high-level API. 
+>**Note**: If you use the AWS Deep Learning Containers for zero script change, Debugger collects most of the tensors regardless of the eager execution mode, through its high-level API.

#### SessionHook

@@ -63,7 +128,7 @@
hook = smd.SessionHook.create_from_json_file()
```

To learn how to fully implement the hook into your training script, see the [TensorFlow monitored training session with the smdebug hook example script](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow/sagemaker_byoc/simple.py).

-> **Note**: The official TensorFlow library deprecated the `tf.train.MonitoredSessions()` API in favor of `tf.function()` in TF 2.0 and above. You can use `SessionHook` for `tf.function()` in TF 2.0 and above.
+>**Note**: The official TensorFlow library deprecated the `tf.train.MonitoredSession()` API in favor of `tf.function()` in TF 2.0 and above. You can use `SessionHook` for `tf.function()` in TF 2.0 and above.

#### EstimatorHook

@@ -75,34 +140,47 @@
hook = smd.EstimatorHook.create_from_json_file()
```

To learn how to fully implement the hook into your training script, see the [simple MNIST training script with the TensorFlow estimator](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow/sagemaker_byoc/simple.py).

-#### 2. Register the hook to your model
-
-To collect the tensors from the hooks that you implemented, add `callbacks=[hook]` to the Keras `model.fit()` API and `hooks=[hook]` for the `MonitoredSession()`, `tf.function()`, and `tf.estimator()` APIs.
-
-#### 3. Wrap the optimizer and the gradient tape
+#### 2. Wrap the optimizer and the gradient tape to retrieve gradient tensors

The smdebug TensorFlow hook provides tools to manually retrieve `gradients` tensors specific to the TensorFlow framework.
-If you want to save `gradients` from the optimizer of your model, wrap it with the hook as follows: +If you want to save `gradients`, for example, from the Keras Adam optimizer, wrap it with the hook as follows: ```python +optimizer = tf.keras.optimizers.Adam(learning_rate=args.lr) optimizer = hook.wrap_optimizer(optimizer) ``` -If you want to save `gradients` from the TensorFlow gradient tape feature, wrap it as follows: +If you want to save `gradients` from the TensorFlow gradient tape feature, wrap `tf.GradientTape` with the `hook.wrap_tape` method and save using the `hook.save_tensor` function. The input of `hook.save_tensor` is in (tensor_name, tensor_value, collections_to_write="default") format. For example: ```python with hook.wrap_tape(tf.GradientTape(persistent=True)) as tape: + logits = model(data, training=True) + loss_value = cce(labels, logits) +hook.save_tensor("y_labels", labels, "outputs") +grads = tape.gradient(loss_value, model.variables) ``` -These wrappers capture the gradient tensors, not affecting your optimization logic at all. +These smdebug hook wrapper functions capture the gradient tensors, not affecting your optimization logic at all. For examples of code structure to apply the hook wrappers, see the [Examples](#examples) section. +#### 3. Register the hook to model.fit() + +To collect the tensors from the hooks that you registered, add `callbacks=[hook]` to the Keras `model.fit()` API. This will pass the SageMaker Debugger hook as a Keras callback. Similarly, add `hooks=[hook]` to the `MonitoredSession()`, `tf.function()`, and `tf.estimator()` APIs. For example: + +```python +model.fit(X_train, Y_train, + batch_size=batch_size, + epochs=epoch, + validation_data=(X_valid, Y_valid), + shuffle=True, + # smdebug modification: Pass the hook as a Keras callback + callbacks=[hook]) +``` + #### 4. 
Take actions using the hook APIs For a full list of actions that the hook APIs offer to construct hooks and save tensors, see [Common hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#common-hook-api) and [TensorFlow specific hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#tensorflow-specific-hook-api). ->**Note**: The `inputs`, `outputs`, and `layers` collections are not currently available for TensorFlow 2.1. - --- ## Examples From df74588878da974b5a4ebbe5b32cac2d1a74f0e7 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Mon, 10 Aug 2020 00:57:40 -0700 Subject: [PATCH 02/28] add details --- docs/tensorflow.md | 69 ++++++++++++++++++++++++++-------------------- 1 file changed, 39 insertions(+), 30 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index ee4cc1ac9..6b768effd 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -1,25 +1,21 @@ # Tensorflow ## Contents -- [Support](#support) +- [What SageMaker Debugger Supports](#support) - [How to Use Debugger with TensorFlow](#how-to-use) - [Debugger with AWS Deep Learning Containers](#debugger-dlc) - [Debugger with other AWS training containers and custom containers](#debugger-script-change) -- [Code Structure Samples](#examples) +- [Code Samples](#examples) - [References](#references) --- -## Support +## What SageMaker Debugger Supports -### Supported TensorFlow Versions +The SageMaker Debugger python SDK and `smdebug` library now fully support TensorFlow 2.2 with the latest version release (v0.9.1). Using Debugger, you can access tensors from any kind of TensorFlow models, from the Keras model zoo to your custom model. +You can simply run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. 
No matter what your TensorFlow models use Keras APIs or pure TensorFlow API, in eager mode or non-eager mode, you can directly run them on the AWS Deep Learning Containers. -The SageMaker Debugger python SDK and `smdebug` library now fully support TensorFlow 2.2 with the latest version release. Using Debugger, you can retrieve tensors from your TensorFlow models with either eager or non-eager mode, with Keras API or the pure TensorFlow framework. -For a full list of TensorFlow framework versions to use Debugger, see [AWS Deep Learning Containers and SageMaker training containers](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). - -**Zero script change experience** — No modification is needed to your training script to enable the Debugger features while using the [official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html). - -**Script mode experience** — The smdebug library supports training jobs with the TensorFlow framework and script mode through its API operations. This option requires minimal changes to your training script to register Debugger hooks, and the smdebug library provides you hook features to help implement Debugger and analyze saved tensors. +Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [AWS Deep Learning Containers and SageMaker training containers](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). ### Distributed training supported by Debugger - Horovod and Mirrored Strategy multi-GPU distributed trainings are supported. 
@@ -29,7 +25,7 @@ For a full list of TensorFlow framework versions to use Debugger, see [AWS Deep ## How to Use Debugger -### Debugger with AWS Deep Learning Containers +### Debugger on AWS Deep Learning Containers with TensorFlow The Debugger built-in rules and hook features are fully integrated into the AWS Deep Learning Containers, and you can run your training script without any script changes. When running training jobs on those Deep Learning Containers, Debugger registers its hooks automatically to your training script in order to retrieve tensors. To find a comprehensive guide of using the high-level SageMaker TensorFlow estimator with Debugger, see [Debugger in TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow). @@ -63,8 +59,16 @@ tf_estimator = TensorFlow( ) tf_estimator.fit("s3://bucket/path/to/training/data") ``` +>**Note**: The SageMaker TensorFlow estimator and the Debugger collections in the example are based on the latest SageMaker python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command line. +```bash +pip install -U sagemaker +pip install -U smdebug +``` +If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. + +#### Available Tensor Collections for TensorFlow -Available tensor collections that you can retrieve from TensorFlow training jobs for zero script change are as follows: +The following table lists the pre-configured tensor collections for TensorFlow models. | Name | Description| | --- | --- | @@ -83,19 +87,22 @@ Available tensor collections that you can retrieve from TensorFlow training jobs >**Note**: The `inputs`, `outputs`, and `layers` collections are not currently available for TensorFlow 2.1. 
-### Debugger with other AWS training containers and custom containers +### Debugger on SageMaker TensorFlow training containers or custom containers -If you want to run your own training script or custom container, there are two available options. One option is to use the SageMaker TensorFlow with script change on other AWS training containers (the SageMaker TensorFlow estimator is in script mode by default from TensorFlow 2.1, so you do not need to specify `script_mode` parameter). Another option is to use your custom container with your training script and push the container to Amazon ECR. In both cases, you need to manually register the Debugger hook to your training script. Depending on the TensorFlow models and API operations in your script, you need to pick the right hook class as introduced in the following steps. +If you want to run your own training script or custom containers other than the AWS Deep Learning Containers in the previous option, there are two alternatives. +- Alternative 1: Use the SageMaker TensorFlow training containers with training script modification +- Alternative 2: Use your custom container with modified training script and push the container to Amazon ECR. +In both cases, you need to manually register the Debugger hook to your training script. Depending on the TensorFlow and Keras API operations used to construct your model, you need to pick the right TensorFlow hook class, register the hook, and save tensors. 1. [Create a hook](#create-a-hook) - * [KerasHook](#kerashook) - * [SessionHook](#sessionhook) - * [EstimatorHook](#estimatorhook) + - [KerasHook](#kerashook) + - [SessionHook](#sessionhook) + - [EstimatorHook](#estimatorhook) 2. [Wrap the optimizer and the gradient tape with the hook to retrieve gradient tensors](#wrap-opt-with-hook) 3. [Register the hook to model.fit()](#register-a-hook) -#### 1. Create a hook +#### 1. Create a hook To create the hook constructor, add the following code to your training script. 
This will enable the `smdebug` tools for TensorFlow and create a TensorFlow hook object.

@@ -150,13 +157,15 @@
optimizer = tf.keras.optimizers.Adam(learning_rate=args.lr)
optimizer = hook.wrap_optimizer(optimizer)
```

-If you want to save `gradients` from the TensorFlow gradient tape feature, wrap `tf.GradientTape` with the `hook.wrap_tape` method and save using the `hook.save_tensor` function. The input of `hook.save_tensor` is in (tensor_name, tensor_value, collections_to_write="default") format. For example:
+If you want to save gradient and output tensors from the TensorFlow `GradientTape` feature, wrap `tf.GradientTape` with the smdebug `hook.wrap_tape` method and save them using the `hook.save_tensor` function. The input of `hook.save_tensor` is in (tensor_name, tensor_value, collections_to_write="default") format. For example:

```python
with hook.wrap_tape(tf.GradientTape(persistent=True)) as tape:
    logits = model(data, training=True)
    loss_value = cce(labels, logits)
hook.save_tensor("y_labels", labels, "outputs")
+hook.save_tensor("predictions", logits, "outputs")
grads = tape.gradient(loss_value, model.variables)
+hook.save_tensor("grads", grads, "gradients")
```

These smdebug hook wrapper functions capture the gradient tensors without affecting your optimization logic.

@@ -169,12 +178,12 @@
To collect the tensors from the hooks that you registered, add `callbacks=[hook]`

```python
model.fit(X_train, Y_train,
-    batch_size=batch_size,
-    epochs=epoch,
-    validation_data=(X_valid, Y_valid),
-    shuffle=True,
-    # smdebug modification: Pass the hook as a Keras callback
-    callbacks=[hook])
+          batch_size=batch_size,
+          epochs=epoch,
+          validation_data=(X_valid, Y_valid),
+          shuffle=True,
+          # smdebug modification: Pass the hook as a Keras callback
+          callbacks=[hook])
```

#### 4.
Take actions using the hook APIs @@ -191,7 +200,7 @@ The following examples show the three different hook constructions of TensorFlow ```python import smdebug.tensorflow as smd -hook = smd.KerasHook(out_dir=args.out_dir) +hook = smd.KerasHook.create_from_json_file() model = tf.keras.models.Sequential([ ... ]) model.compile( @@ -207,7 +216,7 @@ model.evaluate(x_test, y_test, callbacks=[hook]) ```python import smdebug.tensorflow as smd -hook = smd.KerasHook(out_dir=args.out_dir) +hook = smd.KerasHook.create_from_json_file() model = tf.keras.models.Sequential([ ... ]) for epoch in range(n_epochs): @@ -221,14 +230,14 @@ model = tf.keras.models.Sequential([ ... ]) opt.apply_gradients(zip(grads, model.variables)) acc = train_acc_metric(dataset_labels, logits) # manually save metric values - hook.record_tensor_value(tensor_name="accuracy", tensor_value=acc) + hook.save_tensor(tensor_name="accuracy", tensor_value=acc, collections_to_write="default") ``` ### Monitored Session (tf.train.MonitoredSession) ```python import smdebug.tensorflow as smd -hook = smd.SessionHook(out_dir=args.out_dir) +hook = smd.SessionHook.create_from_json_file() loss = tf.reduce_mean(tf.matmul(...), name="loss") optimizer = tf.train.AdamOptimizer(args.lr) @@ -246,7 +255,7 @@ sess.run([loss, ...]) ```python import smdebug.tensorflow as smd -hook = smd.EstimatorHook(out_dir=args.out_dir) +hook = smd.EstimatorHook.create_from_json_file() train_input_fn, eval_input_fn = ... estimator = tf.estimator.Estimator(...) 
From 2fa0fdbfa25b0d2fee739f42ca40823604d54d11 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Mon, 10 Aug 2020 01:55:24 -0700 Subject: [PATCH 03/28] Update code samples/notes for new pySDK and smdebug/add and fix links --- docs/tensorflow.md | 35 +++++++++++++++++++++-------------- 1 file changed, 21 insertions(+), 14 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 6b768effd..dac815d2b 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -3,8 +3,8 @@ ## Contents - [What SageMaker Debugger Supports](#support) - [How to Use Debugger with TensorFlow](#how-to-use) - - [Debugger with AWS Deep Learning Containers](#debugger-dlc) - - [Debugger with other AWS training containers and custom containers](#debugger-script-change) + - [Debugger on AWS Deep Learning Containers with TensorFlow](#debugger-dlc) + - [Debugger on SageMaker Training Containers and Custom Containers](#debugger-script-change) - [Code Samples](#examples) - [References](#references) @@ -12,10 +12,10 @@ ## What SageMaker Debugger Supports -The SageMaker Debugger python SDK and `smdebug` library now fully support TensorFlow 2.2 with the latest version release (v0.9.1). Using Debugger, you can access tensors from any kind of TensorFlow models, from the Keras model zoo to your custom model. -You can simply run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. No matter what your TensorFlow models use Keras APIs or pure TensorFlow API, in eager mode or non-eager mode, you can directly run them on the AWS Deep Learning Containers. +SageMaker Debugger python SDK (v2.0) and its client library `smdebug` library (v0.9.1) now fully support TensorFlow 2.2 with the latest version release. Using Debugger, you can access tensors from any kind of TensorFlow models, from the Keras model zoo to your custom model. 
You can simply run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. Whether your TensorFlow models use the Keras API or the pure TensorFlow API, in eager or non-eager mode, you can run them directly on the AWS Deep Learning Containers.

-Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [AWS Deep Learning Containers and SageMaker training containers](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers).
+Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [SageMaker containers to use Debugger with script mode](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). For a complete guide to using custom containers, go to [Use Debugger in Custom Training Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html).

### Distributed training supported by Debugger
- Horovod and Mirrored Strategy multi-GPU distributed trainings are supported.
@@ -23,13 +23,13 @@ Debugger and its client library `smdebug` support debugging your training job on --- -## How to Use Debugger +## How to Use Debugger ### Debugger on AWS Deep Learning Containers with TensorFlow -The Debugger built-in rules and hook features are fully integrated into the AWS Deep Learning Containers, and you can run your training script without any script changes. When running training jobs on those Deep Learning Containers, Debugger registers its hooks automatically to your training script in order to retrieve tensors. To find a comprehensive guide of using the high-level SageMaker TensorFlow estimator with Debugger, see [Debugger in TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow). +The Debugger built-in rules and hook features are fully integrated into the AWS Deep Learning Containers, and you can run your training script without any script changes. When running training jobs on those Deep Learning Containers, Debugger registers its hooks automatically to your training script in order to retrieve tensors. To find a comprehensive guide of using the high-level SageMaker TensorFlow estimator with Debugger, go to the [Amazon SageMaker Debugger with TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow) developer guide. -The following code sample is how to set a SageMaker TensorFlow estimator with Debugger. +The following code sample is the base structure of a SageMaker TensorFlow estimator with Debugger. ```python from sagemaker.tensorflow import TensorFlow @@ -59,16 +59,16 @@ tf_estimator = TensorFlow( ) tf_estimator.fit("s3://bucket/path/to/training/data") ``` ->**Note**: The SageMaker TensorFlow estimator and the Debugger collections in the example are based on the latest SageMaker python SDK v2.0 and `smdebug` v0.9.1. 
It is highly recommended to upgrade the packages by executing the following command lines.
```bash
pip install -U sagemaker
pip install -U smdebug
```
-If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel.
+If you are using Jupyter Notebook, put an exclamation mark (`!`) at the front of each command line and restart your kernel. For more information about breaking changes in the SageMaker python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html).

#### Available Tensor Collections for TensorFlow

-The following table lists the pre-configured tensor collections for TensorFlow models.
+The following table lists the pre-configured tensor collections for TensorFlow models. You can pick any of these tensor collections by specifying the `name` parameter of `CollectionConfig()` as shown in the base code sample.

| Name | Description|
| --- | --- |

@@ -85,9 +85,13 @@ The following table lists the pre-configured tensor collections for TensorFlow m
| biases | Matches all biases of the model. |
| optimizer_variables | Matches all optimizer variables, currently only supported for Keras. |
+For more information about adjusting the tensor collection parameters, see [Save Tensors Using Debugger Modified Built-in Collections](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-data.html#debugger-save-modified-built-in-collections).
+
+For a full list of available tensor collection parameters, see [Configuring Collection using SageMaker Python SDK](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk).
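Beyond the built-in collections listed above, a custom collection can be configured through the same `CollectionConfig` interface by passing a `parameters` dictionary. The following is a sketch only; the collection name, regex, output path, and save interval are illustrative values, not taken from this document:

```python
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig

# Hypothetical custom collection: save tensors whose names match the
# regex, every 100 steps.
custom_collection = CollectionConfig(
    name="relu_activations",          # illustrative name
    parameters={
        "include_regex": ".*relu.*",  # illustrative regex
        "save_interval": "100",
    },
)

debugger_hook_config = DebuggerHookConfig(
    s3_output_path="s3://bucket/path/for/debug-output",  # placeholder
    collection_configs=[custom_collection],
)
```

The resulting `debugger_hook_config` object is passed to the SageMaker TensorFlow estimator through its `debugger_hook_config` parameter, in the same way as the built-in collections.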
>**Note**: The `inputs`, `outputs`, and `layers` collections are not currently available for TensorFlow 2.1.

-### Debugger on SageMaker TensorFlow training containers or custom containers
+### Debugger on SageMaker Training Containers and Custom Containers

If you want to run your own training script or custom containers other than the AWS Deep Learning Containers in the previous option, there are two alternatives.
- Alternative 1: Use the SageMaker TensorFlow training containers with training script modification

@@ -192,9 +196,9 @@ For a full list of actions that the hook APIs offer to construct hooks and save

---

-## Examples
+## Code Samples

-The following examples show the three different hook constructions of TensorFlow. The following examples show what minimal changes have to be made to enable SageMaker Debugger while using the AWS containers with script mode. To learn how to use the high-level Debugger features with zero script change on AWS Deep Learning Containers, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html).
+The following examples show the base structures of hook registration in various TensorFlow training scripts. If you want to take advantage of the high-level Debugger features with zero script change on AWS Deep Learning Containers, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html).
### Keras API (tf.keras) ```python @@ -208,7 +212,10 @@ model.compile( loss='sparse_categorical_crossentropy', ) # Add the hook as a callback +hook.set_mode(mode=smd.modes.TRAIN) model.fit(x_train, y_train, epochs=args.epochs, callbacks=[hook]) + +hook.set_mode(mode=smd.modes.EVAL) model.evaluate(x_test, y_test, callbacks=[hook]) ``` From 6857d6caecfc0c079896e4a64f11db275798559e Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Mon, 10 Aug 2020 09:37:50 -0700 Subject: [PATCH 04/28] add 'New features' note --- docs/tensorflow.md | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index dac815d2b..c3aa1a6c0 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -12,11 +12,18 @@ ## What SageMaker Debugger Supports -SageMaker Debugger python SDK (v2.0) and its client library `smdebug` library (v0.9.1) now fully support TensorFlow 2.2 with the latest version release. Using Debugger, you can access tensors from any kind of TensorFlow models, from the Keras model zoo to your custom model. +SageMaker Debugger python SDK (v2.0) and its client library `smdebug` library (v0.9.1) now fully support TensorFlow 2.2 with the latest version release. Using Debugger, you can access tensors of any kind of TensorFlow models, from the Keras model zoo to your custom model, and save them using Debugger built-in or custom tensor collections. You can simply run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. No matter what your TensorFlow models use Keras API or pure TensorFlow API, in eager mode or non-eager mode, you can directly run them on the AWS Deep Learning Containers. Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. 
In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [SageMaker containers to use Debugger with script mode](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). For a complete guide of using custom containers, go to [Use Debugger in Custom Training Containers ](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html). +### New features +- The latest TensorFlow version fully covered by Debugger is `2.2.0`. +- Debug training jobs with the TensorFlow framework or Keras TensorFlow. +- Debug training jobs with the TensorFlow framework in eager or non-eager model. +- New built-in tensor collections: model `inputs`, `outputs`, `layers`, `gradients`. +- New hook APIs to save tensors, in addition to scalars: `save_tensors`, `save_scalar`. + ### Distributed training supported by Debugger - Horovod and Mirrored Strategy multi-GPU distributed trainings are supported. - Parameter server based distributed training is currently not supported. @@ -40,7 +47,7 @@ tf_estimator = TensorFlow( role = "SageMakerRole", instance_count = 1, instance_type = "ml.p2.xlarge", - framework_version = "2.2", + framework_version = "2.2.0", py_version = "py37" # Debugger-specific Parameters From 8be632a8da6a498a0539bb5d69f625b775eb76e2 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Mon, 10 Aug 2020 09:49:24 -0700 Subject: [PATCH 05/28] minor fix --- docs/api.md | 2 +- docs/tensorflow.md | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/api.md b/docs/api.md index 778cf3e46..6edb47540 100644 --- a/docs/api.md +++ b/docs/api.md @@ -163,6 +163,7 @@ Note that `smd` import below translates to `import smdebug.{framework} as smd`. |`create_from_json_file(`
` json_file_path=None)` | `json_file_path (str)` | Takes the path of a file that holds the JSON configuration of the hook, and creates the hook from that configuration. This is an optional parameter.
If this is not passed it tries to get the file path from the value of the environment variable `SMDEBUG_CONFIG_FILE_PATH` and defaults to `/opt/ml/input/config/debughookconfig.json`. When training on SageMaker you do not have to specify any path because this is the default path that SageMaker writes the hook configuration to. |`close()` | - | Closes all files that are currently open by the hook | | `save_scalar()` | `name (str)`
`value (float)`
`sm_metric (bool)`| Saves a scalar value by the given name. Passing the `sm_metric=True` flag also makes this scalar available as a SageMaker Metric to show up in SageMaker Studio. Note that when `sm_metric` is False, this scalar always resides only in your AWS account, but setting it to True saves the scalar also on AWS servers. The default value of `sm_metric` for this method is False. |
+| `save_tensor()`| tensor_name (str), tensor_value (float), collections_to_write (str) | - | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor of `save_tensor()`.|
 
 ### TensorFlow specific Hook API
 
@@ -178,7 +179,6 @@ The following hook APIs are specific to training scripts using the TF 2.x Gradie
 | Method | Arguments | Returns | Behavior |
 | --- | --- | --- | --- |
 | `wrap_tape(tape)` | `tape` (tensorflow.python.eager.backprop.GradientTape) | Returns a tape object with three identifying markers to help `smdebug`. This returned tape should be used for training. | When not using Zero Script Change environments, calling this method on your tape is necessary for SageMaker Debugger to identify and save gradient tensors. Note that this method returns the same tape object passed.
-| `save_tensor()`| tensor_name (str), tensor_value (float), collections_to_write (str) | - | Manually save metrics tensors while using TF 2.x GradientTape. Note: `record_tensor_value()` is deprecated.|
 
 ### MXNet specific Hook API
 
diff --git a/docs/tensorflow.md b/docs/tensorflow.md
index c3aa1a6c0..4cf860ac7 100644
--- a/docs/tensorflow.md
+++ b/docs/tensorflow.md
@@ -20,8 +20,8 @@ Debugger and its client library `smdebug` support debugging your training job on
 ### New features
 - The latest TensorFlow version fully covered by Debugger is `2.2.0`.
 - Debug training jobs with the TensorFlow framework or Keras TensorFlow.
-- Debug training jobs with the TensorFlow framework in eager or non-eager model.
-- New built-in tensor collections: model `inputs`, `outputs`, `layers`, `gradients`. +- Debug training jobs with the TensorFlow eager or non-eager mode. +- New built-in tensor collections: `inputs`, `outputs`, `layers`, `gradients`. - New hook APIs to save tensors, in addition to scalars: `save_tensors`, `save_scalar`. ### Distributed training supported by Debugger @@ -67,11 +67,11 @@ tf_estimator = TensorFlow( tf_estimator.fit("s3://bucket/path/to/training/data") ``` >**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest SageMaker python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command lines. -```bash +>```bash pip install -U sagemaker pip install -U smdebug ``` -If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. For more information about breaking changes of the SageMaker python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). +>If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. For more information about breaking changes of the SageMaker python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). 
#### Available Tensor Collections for TensorFlow From d787f4b923a4d8bd7199a4d9b3cbd14471956236 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Mon, 10 Aug 2020 09:50:31 -0700 Subject: [PATCH 06/28] minor fix --- docs/tensorflow.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 4cf860ac7..314e0f550 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -67,11 +67,11 @@ tf_estimator = TensorFlow( tf_estimator.fit("s3://bucket/path/to/training/data") ``` >**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest SageMaker python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command lines. ->```bash +``` pip install -U sagemaker pip install -U smdebug ``` ->If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. For more information about breaking changes of the SageMaker python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). +If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. For more information about breaking changes of the SageMaker python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). 
#### Available Tensor Collections for TensorFlow From 6c00d2a0a5e9b5087d33656c031cec84cf1409ba Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Mon, 10 Aug 2020 09:52:27 -0700 Subject: [PATCH 07/28] fix formatting --- docs/tensorflow.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 314e0f550..1dcdd502b 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -66,12 +66,13 @@ tf_estimator = TensorFlow( ) tf_estimator.fit("s3://bucket/path/to/training/data") ``` ->**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest SageMaker python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command lines. -``` -pip install -U sagemaker -pip install -U smdebug -``` -If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. For more information about breaking changes of the SageMaker python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). + +>**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest SageMaker Python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command lines. + ``` + pip install -U sagemaker + pip install -U smdebug + ``` +>If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. For more information about breaking changes of the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). 
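Since the note above assumes the post-2.0 SageMaker Python SDK (whose estimator API differs from v1.x), a small stdlib check like the following can fail fast when a notebook is still running an older package. The `meets_minimum` helper is hypothetical, not part of either package:

```python
def meets_minimum(installed, required):
    """Compare two dotted version strings numerically, e.g. '2.0.1' >= '2.0.0'."""
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(installed) >= parse(required)

# e.g. guard against the pre-2.0 sagemaker SDK before constructing an estimator
print(meets_minimum("2.0.1", "2.0.0"))   # → True
print(meets_minimum("1.72.0", "2.0.0"))  # → False
```

In practice you would pass `sagemaker.__version__` and `smdebug.__version__` as the `installed` argument.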
#### Available Tensor Collections for TensorFlow From 4b6e0deab62d81004c4a7e9993798cbeb33bf424 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Mon, 10 Aug 2020 12:14:21 -0700 Subject: [PATCH 08/28] minor fix --- docs/tensorflow.md | 31 ++++++++++++++++++------------- 1 file changed, 18 insertions(+), 13 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 1dcdd502b..fde9d4aea 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -67,16 +67,16 @@ tf_estimator = TensorFlow( tf_estimator.fit("s3://bucket/path/to/training/data") ``` ->**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest SageMaker Python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command lines. - ``` - pip install -U sagemaker - pip install -U smdebug - ``` ->If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. For more information about breaking changes of the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). +**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest SageMaker Python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command lines. +``` +pip install -U sagemaker +pip install -U smdebug +``` +If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. For more information about breaking changes of the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). #### Available Tensor Collections for TensorFlow -The following table lists the pre-configured tensor collections for TensorFlow models. 
You can pick any tensor collections by specifying the `name` parameter of `CollectionConfig()` as shown in the base code sample. +The following table lists the pre-configured tensor collections for TensorFlow models. You can pick any tensor collections by specifying the `name` parameter of `CollectionConfig()` as shown in the base code sample. SageMaker Debugger will save these tensors to the default out_dir of the hook. | Name | Description| | --- | --- | @@ -84,11 +84,11 @@ The following table lists the pre-configured tensor collections for TensorFlow m | default | Includes "metrics", "losses", and "sm_metrics". | | metrics | For KerasHook, saves the metrics computed by Keras for the model. | | losses | Saves all losses of the model. | -| sm_metrics | You can add scalars that you want to show up in SageMaker Metrics to this collection. SageMaker Debugger will save these scalars both to the out_dir of the hook, as well as to SageMaker Metric. Note that the scalars passed here will be saved on AWS servers outside of your AWS account. | -| inputs | Matches all input to the model. | -| outputs | Matches all outputs of the model, such as predictions (logits) and labels. | +| sm_metrics | Saves scalars that you want to include in the SageMaker metrics collection. | +| inputs | Matches all model inputs to the model. | +| outputs | Matches all model outputs of the model, such as predictions (logits) and labels. | | layers | Matches all inputs and outputs of intermediate layers. | -| gradients | Matches all gradients of the model. In TensorFlow when not using zero script change environments, must use hook.wrap_optimizer() or hook.wrap_tape(). | +| gradients | Matches all gradients of the model. | | weights | Matches all weights of the model. | | biases | Matches all biases of the model. | | optimizer_variables | Matches all optimizer variables, currently only supported for Keras. 
| @@ -97,13 +97,15 @@ For more information about adjusting the tensor collection parameters, see [Save For a full list of available tensor collection parameters, see [Configuring Collection using SageMaker Python SDK](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk). ->**Note**: The `inputs`, `outputs`, and `layers` collections are not currently available for TensorFlow 2.1. +>**Note**: The `inputs`, `outputs`, and `layers` collections are currently not available for TensorFlow 2.1. ### Debugger on SageMaker Training Containers and Custom Containers If you want to run your own training script or custom containers other than the AWS Deep Learning Containers in the previous option, there are two alternatives. + - Alternative 1: Use the SageMaker TensorFlow training containers with training script modification - Alternative 2: Use your custom container with modified training script and push the container to Amazon ECR. + In both cases, you need to manually register the Debugger hook to your training script. Depending on the TensorFlow and Keras API operations used to construct your model, you need to pick the right TensorFlow hook class, register the hook, and save tensors. 1. [Create a hook](#create-a-hook) @@ -116,11 +118,14 @@ In both cases, you need to manually register the Debugger hook to your training #### 1. Create a hook - To create the hook constructor, add the following code to your training script. This will enable the `smdebug` tools for TensorFlow and create a TensorFlow hook object. + To create the hook constructor, add the following code to your training script. This will enable the `smdebug` tools for TensorFlow and create a TensorFlow `hook` object. When executing the fit() API for training, specify the smdebug `hook` as callbacks. ```python import smdebug.tensorflow as smd hook = smd.{hook_class}.create_from_json_file() +... +model.fit(... 
+ callbacks=[hook]) ``` Depending on TensorFlow versions and Keras API that was used in your training script, you need to choose the right hook class. There are three hook constructors for TensorFlow that you can choose: `KerasHook`, `SessionHook`, and `EstimatorHook`. From 54c12ce6c0874e7535c285319784a142b5cabab4 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Mon, 10 Aug 2020 12:31:55 -0700 Subject: [PATCH 09/28] lint --- docs/tensorflow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index fde9d4aea..4f9528569 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -13,7 +13,7 @@ ## What SageMaker Debugger Supports SageMaker Debugger python SDK (v2.0) and its client library `smdebug` library (v0.9.1) now fully support TensorFlow 2.2 with the latest version release. Using Debugger, you can access tensors of any kind of TensorFlow models, from the Keras model zoo to your custom model, and save them using Debugger built-in or custom tensor collections. -You can simply run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. No matter what your TensorFlow models use Keras API or pure TensorFlow API, in eager mode or non-eager mode, you can directly run them on the AWS Deep Learning Containers. +You can simply run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. No matter what your TensorFlow models use Keras API or pure TensorFlow API, in eager mode or non-eager mode, you can directly run them on the AWS Deep Learning Containers. Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. 
In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [SageMaker containers to use Debugger with script mode](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). For a complete guide of using custom containers, go to [Use Debugger in Custom Training Containers ](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html). From 9e079dd9213c224c6e0a1bf8c30841b7df7f9a74 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Wed, 12 Aug 2020 17:33:46 -0700 Subject: [PATCH 10/28] lint --- docs/tensorflow.md | 118 ++++++++++++++++++++++----------------------- 1 file changed, 57 insertions(+), 61 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 4f9528569..7138ccff4 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -10,33 +10,28 @@ --- -## What SageMaker Debugger Supports +## Amazon SageMaker Debugger Support for TensorFlow -SageMaker Debugger python SDK (v2.0) and its client library `smdebug` library (v0.9.1) now fully support TensorFlow 2.2 with the latest version release. Using Debugger, you can access tensors of any kind of TensorFlow models, from the Keras model zoo to your custom model, and save them using Debugger built-in or custom tensor collections. -You can simply run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. No matter what your TensorFlow models use Keras API or pure TensorFlow API, in eager mode or non-eager mode, you can directly run them on the AWS Deep Learning Containers. +Amazon SageMaker Debugger python SDK (v2.0) and its client library `smdebug` library (v0.9.1) now fully support TensorFlow 2.2 with the latest version release. 
Using Debugger, you can access tensors from any kind of TensorFlow model, from the Keras model zoo to your own custom model, and save them using Debugger built-in or custom tensor collections. You can run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. Whether your TensorFlow models use the Keras API or the pure TensorFlow API (in eager or non-eager mode), you can directly run them on the AWS Deep Learning Containers.
 
-Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [SageMaker containers to use Debugger with script mode](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). For a complete guide of using custom containers, go to [Use Debugger in Custom Training Containers ](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html).
+Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [SageMaker containers to use Debugger with script mode](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). For a complete guide for using custom containers, see [Use Debugger in Custom Training Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html).
-### New features -- The latest TensorFlow version fully covered by Debugger is `2.2.0`. -- Debug training jobs with the TensorFlow framework or Keras TensorFlow. -- Debug training jobs with the TensorFlow eager or non-eager mode. -- New built-in tensor collections: `inputs`, `outputs`, `layers`, `gradients`. -- New hook APIs to save tensors, in addition to scalars: `save_tensors`, `save_scalar`. - -### Distributed training supported by Debugger -- Horovod and Mirrored Strategy multi-GPU distributed trainings are supported. -- Parameter server based distributed training is currently not supported. +### New Features supported by Debugger +- The latest TensorFlow version fully covered by Debugger is 2.2.0 +- Debug training jobs with the TensorFlow framework or Keras TensorFlow +- Debug training jobs with the TensorFlow eager or non-eager mode +- New built-in tensor collections: `inputs`, `outputs`, `layers`, `gradients` +- New hook APIs to save tensors, in addition to scalars: `save_tensors`, `save_scalar` --- -## How to Use Debugger +## Using Debugger -### Debugger on AWS Deep Learning Containers with TensorFlow +### Using Debugger on AWS Deep Learning Containers with TensorFlow -The Debugger built-in rules and hook features are fully integrated into the AWS Deep Learning Containers, and you can run your training script without any script changes. When running training jobs on those Deep Learning Containers, Debugger registers its hooks automatically to your training script in order to retrieve tensors. To find a comprehensive guide of using the high-level SageMaker TensorFlow estimator with Debugger, go to the [Amazon SageMaker Debugger with TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow) developer guide. +The Debugger built-in rules and hook features are fully integrated with the AWS Deep Learning Containers. You can run your training script without any script changes. 
When running training jobs on those Deep Learning Containers, Debugger registers its hooks automatically to your training script in order to retrieve tensors. To find a comprehensive guide of using the high-level SageMaker TensorFlow estimator with Debugger, see [Amazon SageMaker Debugger with TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow) in the Amazon SageMaker Developer Guide. -The following code sample is the base structure of a SageMaker TensorFlow estimator with Debugger. +The following code example provides the base structure for a SageMaker TensorFlow estimator with Debugger. ```python from sagemaker.tensorflow import TensorFlow @@ -45,8 +40,8 @@ from sagemaker.debugger import Rule, DebuggerHookConfig, CollectionConfig, rule_ tf_estimator = TensorFlow( entry_point = "tf-train.py", role = "SageMakerRole", - instance_count = 1, - instance_type = "ml.p2.xlarge", + train_instance_count = 1, + train_instance_type = "ml.p2.xlarge", framework_version = "2.2.0", py_version = "py37" @@ -67,46 +62,46 @@ tf_estimator = TensorFlow( tf_estimator.fit("s3://bucket/path/to/training/data") ``` -**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest SageMaker Python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command lines. +**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest `smdebug` library v0.9.1. We highly recommend that you upgrade the packages by running the following commands at the command line: ``` pip install -U sagemaker pip install -U smdebug ``` -If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. 
For more information about breaking changes of the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). +If you are using a Jupyter Notebook, put an exclamation mark (!) at the beginning of the code string and restart your kernel. For more information about the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). -#### Available Tensor Collections for TensorFlow +#### Using Tensor Collections with TensorFlow -The following table lists the pre-configured tensor collections for TensorFlow models. You can pick any tensor collections by specifying the `name` parameter of `CollectionConfig()` as shown in the base code sample. SageMaker Debugger will save these tensors to the default out_dir of the hook. +The following table lists the pre-configured tensor collections for TensorFlow models. You can pick any tensor collections by specifying the `name` parameter of `CollectionConfig()` as shown in the previous base code example. SageMaker Debugger will save these tensors to the default out_dir of the hook. | Name | Description| | --- | --- | -| all | Matches all tensors. | -| default | Includes "metrics", "losses", and "sm_metrics". | -| metrics | For KerasHook, saves the metrics computed by Keras for the model. | -| losses | Saves all losses of the model. | -| sm_metrics | Saves scalars that you want to include in the SageMaker metrics collection. | -| inputs | Matches all model inputs to the model. | -| outputs | Matches all model outputs of the model, such as predictions (logits) and labels. | -| layers | Matches all inputs and outputs of intermediate layers. | -| gradients | Matches all gradients of the model. | -| weights | Matches all weights of the model. | -| biases | Matches all biases of the model. | -| optimizer_variables | Matches all optimizer variables, currently only supported for Keras. 
| - -For more information about adjusting the tensor collection parameters, see [Save Tensors Using Debugger Modified Built-in Collections ](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-data.html#debugger-save-modified-built-in-collections). +| `all` | Matches all tensors. | +| `default` | Includes `metrics`, `losses`, and `sm_metrics`. | +| `metrics` | For KerasHook, saves the metrics computed by Keras for the model. | +| `losses` | Saves all losses of the model. | +| `sm_metrics` | Saves scalars that you want to include in the SageMaker metrics collection. | +| `inputs` | Matches all model inputs to the model. | +| `outputs` | Matches all model outputs of the model, such as predictions (logits) and labels. | +| `layers` | Matches all inputs and outputs of intermediate layers. | +| `gradients` | Matches all gradients of the model. | +| `weights` | Matches all weights of the model. | +| `biases` | Matches all biases of the model. | +| `optimizer_variables` | Matches all optimizer variables, currently only supported for Keras. | + +For more information about adjusting the tensor collection parameters, see [Save Tensors Using Debugger Modified Built-in Collections](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-data.html#debugger-save-modified-built-in-collections). For a full list of available tensor collection parameters, see [Configuring Collection using SageMaker Python SDK](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk). ->**Note**: The `inputs`, `outputs`, and `layers` collections are currently not available for TensorFlow 2.1. +>**Note**: The `inputs`, `outputs`, and `layers` collections are currently not available for TensorFlow 2.1.0. 
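Because collections are selected purely by the `name` string passed to `CollectionConfig()`, a typo silently selects nothing. The sketch below (a hypothetical plain-Python guard, not part of the SageMaker SDK) validates requested names against the built-in collection names listed in the table above:

```python
# Built-in TensorFlow collection names from the table above.
BUILTIN_COLLECTIONS = {
    "all", "default", "metrics", "losses", "sm_metrics", "inputs",
    "outputs", "layers", "gradients", "weights", "biases",
    "optimizer_variables",
}

def check_collection_names(names):
    """Raise early if a requested collection name is not a built-in one."""
    unknown = sorted(set(names) - BUILTIN_COLLECTIONS)
    if unknown:
        raise ValueError(f"unknown collection name(s): {unknown}")
    return list(names)

print(check_collection_names(["weights", "gradients"]))  # → ['weights', 'gradients']
```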
-### Debugger on SageMaker Training Containers and Custom Containers +### Using Debugger on SageMaker Training Containers and Custom Containers -If you want to run your own training script or custom containers other than the AWS Deep Learning Containers in the previous option, there are two alternatives. +If you want to run your own training script or custom containers other than the AWS Deep Learning Containers in the previous option, you can use any of the following options: -- Alternative 1: Use the SageMaker TensorFlow training containers with training script modification -- Alternative 2: Use your custom container with modified training script and push the container to Amazon ECR. +- Option 1 - Use the SageMaker TensorFlow training containers with training script modification +- Option 2 - Use your custom container with modified training script and push the container to Amazon ECR. -In both cases, you need to manually register the Debugger hook to your training script. Depending on the TensorFlow and Keras API operations used to construct your model, you need to pick the right TensorFlow hook class, register the hook, and save tensors. +For both options, you need to manually register the Debugger hook to your training script. Depending on the TensorFlow and Keras API operations used to construct your model, you need to pick the right TensorFlow hook class, register the hook, and then save the tensors. 1. [Create a hook](#create-a-hook) - [KerasHook](#kerashook) @@ -116,9 +111,9 @@ In both cases, you need to manually register the Debugger hook to your training 3. [Register the hook to model.fit()](#register-a-hook) -#### 1. Create a hook +#### Step 1: Create a hook - To create the hook constructor, add the following code to your training script. This will enable the `smdebug` tools for TensorFlow and create a TensorFlow `hook` object. When executing the fit() API for training, specify the smdebug `hook` as callbacks. 
+ To create the hook constructor, add the following code to your training script. This enables the `smdebug` tools for TensorFlow and creates a TensorFlow `hook` object. When you run the `fit()` API for training, specify the smdebug `hook` as `callbacks`, as shown following: ```python import smdebug.tensorflow as smd @@ -128,23 +123,23 @@ model.fit(... callbacks=[hook]) ``` -Depending on TensorFlow versions and Keras API that was used in your training script, you need to choose the right hook class. There are three hook constructors for TensorFlow that you can choose: `KerasHook`, `SessionHook`, and `EstimatorHook`. +Depending on the TensorFlow versions and the Keras API that you use in your training script, you need to choose the right hook class. The hook constructors for TensorFlow that you can choose are `KerasHook`, `SessionHook`, and `EstimatorHook`. #### KerasHook -Use `KerasHook` if you use the Keras model zoo and a Keras `model.fit()` API. This is available for the Keras with TensorFlow backend interface. `KerasHook` covers the eager execution modes and the gradient tape features that are introduced from the TensorFlow framework version 2.0. You can set the smdebug Keras hook constructor by adding the following code into your training script. Place this code line before `model.compile()`. +If you use the Keras model zoo and a Keras `model.fit()` API, use `KerasHook`. `KerasHook` is available for the Keras model with the TensorFlow backend interface. `KerasHook` covers the eager execution modes and the gradient tape features that are introduced in the TensorFlow framework version 2.0. You can set the smdebug Keras hook constructor by adding the following code to your training script. 
Place this code line before `model.compile()`: ```python hook = smd.KerasHook.create_from_json_file() ``` -To learn how to fully implement the hook to your training script, see the [Keras with the TensorFlow gradient tape and the smdebug hook example scripts](https://github.com/awslabs/sagemaker-debugger/tree/master/examples/tensorflow2/scripts). +To learn how to fully implement the hook in your training script, see the [Keras with the TensorFlow gradient tape and the smdebug hook example scripts](https://github.com/awslabs/sagemaker-debugger/tree/master/examples/tensorflow2/scripts). ->**Note**: If you use the AWS Deep Learning Containers for zero script change, Debugger collects the most of tensors regardless the eager execution modes, through its high-level API. +>**Note**: If you use the AWS Deep Learning Containers for zero script change, Debugger collects most of the tensors through its high-level API, regardless of the eager execution modes. #### SessionHook -Use if your model is created in TensorFlow version 1.x with the low-level approach, not using the Keras API. This is for the TensorFlow 1.x monitored training session API, `tf.train.MonitoredSessions()`. +If your model is created in TensorFlow version 1.x with the low-level approach (not using the Keras API), use `SessionHook`. `SessionHook` is for the TensorFlow 1.x monitored training session API, `tf.train.MonitoredSessions()`, as shown following: ```python hook = smd.SessionHook.create_from_json_file() @@ -152,11 +147,11 @@ hook = smd.SessionHook.create_from_json_file() To learn how to fully implement the hook into your training script, see the [TensorFlow monitored training session with the smdebug hook example script](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow/sagemaker_byoc/simple.py). ->**Note**: The official TensorFlow library deprecated the `tf.train.MonitoredSessions()` API in favor of `tf.function()` in TF 2.0 and above. 
You can use `SessionHook` for `tf.function()` in TF 2.0 and above. +>**Note**: The official TensorFlow library deprecated the `tf.train.MonitoredSessions()` API in favor of `tf.function()` in TensorFlow 2.0 and later. You can use `SessionHook` for `tf.function()` in TensorFlow 2.0 and later. #### EstimatorHook -Use if you have a model using the `tf.estimator()` API. Available for any TensorFlow framework versions that supports the `tf.estimator()` API. +If you have a model using the `tf.estimator()` API, use `EstimatorHook`. `EstimatorHook` is available for any TensorFlow framework versions that support the `tf.estimator()` API, as shown following: ```python hook = smd.EstimatorHook.create_from_json_file() @@ -164,11 +159,12 @@ hook = smd.EstimatorHook.create_from_json_file() To learn how to fully implement the hook into your training script, see the [simple MNIST training script with the Tensorflow estimator](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow/sagemaker_byoc/simple.py). -#### 2. Wrap the optimizer and the gradient tape to retrieve gradient tensors +#### Step 2: Wrap the optimizer and the gradient tape to retrieve gradient tensors + +The smdebug TensorFlow hook provides tools to manually retrieve `gradients` tensors specific to the TensorFlow framework. -The smdebug TensorFlow hook provides tools to manually retrieve `gradients` tensors specific for the TensorFlow framework. +If you want to save `gradients` (for example, from the Keras Adam optimizer) wrap it with the hook as shown following: -If you want to save `gradients`, for example, from the Keras Adam optimizer, wrap it with the hook as follows: ```python optimizer = tf.keras.optimizers.Adam(learning_rate=args.lr) optimizer = hook.wrap_optimizer(optimizer) @@ -187,9 +183,9 @@ hook.save_tensor("grads", grads, "gradients") These smdebug hook wrapper functions capture the gradient tensors, not affecting your optimization logic at all. 
-For examples of code structure to apply the hook wrappers, see the [Examples](#examples) section. +For examples of code structures that you can use to apply the hook wrappers, see the [Code Examples](#examples) section. -#### 3. Register the hook to model.fit() +#### Step 3: Register the hook to model.fit() To collect the tensors from the hooks that you registered, add `callbacks=[hook]` to the Keras `model.fit()` API. This will pass the SageMaker Debugger hook as a Keras callback. Similarly, add `hooks=[hook]` to the `MonitoredSession()`, `tf.function()`, and `tf.estimator()` APIs. For example: @@ -203,7 +199,7 @@ model.fit(X_train, Y_train, callbacks=[hook]) ``` -#### 4. Take actions using the hook APIs +#### Step 4: Perform actions using the hook APIs For a full list of actions that the hook APIs offer to construct hooks and save tensors, see [Common hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#common-hook-api) and [TensorFlow specific hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#tensorflow-specific-hook-api). @@ -211,7 +207,7 @@ For a full list of actions that the hook APIs offer to construct hooks and save ## Code Samples -The following examples show the base structures of hook registration in various TensorFlow training scripts. If you want to take the benefit of the high-level Debugger features with zero script change on AWS Deep Learning Containers, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html). +The following code examples show the base structures that you can use for hook registration in various TensorFlow training scripts. If you want to use the high-level Debugger features with zero script change on AWS Deep Learning Containers, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html). 
### Keras API (tf.keras) ```python @@ -232,7 +228,7 @@ hook.set_mode(mode=smd.modes.EVAL) model.evaluate(x_test, y_test, callbacks=[hook]) ``` -### Keras GradientTape example for TF 2.0 and above +### Keras GradientTape example for TensorFlow 2.0 and later ```python import smdebug.tensorflow as smd From 4afb5fc3c43e577cca82772424ae5ab02896a123 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Wed, 12 Aug 2020 17:44:49 -0700 Subject: [PATCH 11/28] minor structure change --- docs/tensorflow.md | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 7138ccff4..b7c9a3ece 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -2,9 +2,9 @@ ## Contents - [What SageMaker Debugger Supports](#support) -- [How to Use Debugger with TensorFlow](#how-to-use) - - [Debugger on AWS Deep Learning Containers with TensorFlow](#debugger-dlc) - - [Debugger on SageMaker Training Containers and Custom Containers](#debugger-script-change) +- [Debugger on AWS Deep Learning Containers with TensorFlow](#debugger-dlc) + - [Debugger Built-in Tensor Collections for TensorFlow](#tf-built-in-collection) +- [Debugger on SageMaker Training Containers and Custom Containers](#debugger-script-change) - [Code Samples](#examples) - [References](#references) @@ -25,9 +25,7 @@ Debugger and its client library `smdebug` support debugging your training job on --- -## Using Debugger - -### Using Debugger on AWS Deep Learning Containers with TensorFlow +## Using Debugger on AWS Deep Learning Containers with TensorFlow The Debugger built-in rules and hook features are fully integrated with the AWS Deep Learning Containers. You can run your training script without any script changes. When running training jobs on those Deep Learning Containers, Debugger registers its hooks automatically to your training script in order to retrieve tensors. 
To find a comprehensive guide of using the high-level SageMaker TensorFlow estimator with Debugger, see [Amazon SageMaker Debugger with TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow) in the Amazon SageMaker Developer Guide. @@ -69,7 +67,7 @@ pip install -U smdebug ``` If you are using a Jupyter Notebook, put an exclamation mark (!) at the beginning of the code string and restart your kernel. For more information about the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). -#### Using Tensor Collections with TensorFlow +### Debugger Built-in Tensor Collections for TensorFlow **Note**: The `inputs`, `outputs`, and `layers` collections are currently not available for TensorFlow 2.1.0. -### Using Debugger on SageMaker Training Containers and Custom Containers +--- + +## Using Debugger on SageMaker Training Containers and Custom Containers If you want to run your own training script or custom containers other than the AWS Deep Learning Containers in the previous option, you can use any of the following options: -- Option 1 - Use the SageMaker TensorFlow training containers with training script modification -- Option 2 - Use your custom container with modified training script and push the container to Amazon ECR. +- **Option 1** - Use the SageMaker TensorFlow training containers with training script modification +- **Option 2** - Use your custom container with modified training script and push the container to Amazon ECR. For both options, you need to manually register the Debugger hook to your training script. Depending on the TensorFlow and Keras API operations used to construct your model, you need to pick the right TensorFlow hook class, register the hook, and then save the tensors. @@ -111,7 +111,7 @@ For both options, you need to manually register the Debugger hook to your traini 3. 
[Register the hook to model.fit()](#register-a-hook) -#### Step 1: Create a hook +### Step 1: Create a hook To create the hook constructor, add the following code to your training script. This enables the `smdebug` tools for TensorFlow and creates a TensorFlow `hook` object. When you run the `fit()` API for training, specify the smdebug `hook` as `callbacks`, as shown following: @@ -159,7 +159,7 @@ hook = smd.EstimatorHook.create_from_json_file() To learn how to fully implement the hook into your training script, see the [simple MNIST training script with the Tensorflow estimator](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow/sagemaker_byoc/simple.py). -#### Step 2: Wrap the optimizer and the gradient tape to retrieve gradient tensors +### Step 2: Wrap the optimizer and the gradient tape to retrieve gradient tensors The smdebug TensorFlow hook provides tools to manually retrieve `gradients` tensors specific to the TensorFlow framework. @@ -185,7 +185,7 @@ These smdebug hook wrapper functions capture the gradient tensors, not affecting For examples of code structures that you can use to apply the hook wrappers, see the [Code Examples](#examples) section. -#### Step 3: Register the hook to model.fit() +### Step 3: Register the hook to model.fit() To collect the tensors from the hooks that you registered, add `callbacks=[hook]` to the Keras `model.fit()` API. This will pass the SageMaker Debugger hook as a Keras callback. Similarly, add `hooks=[hook]` to the `MonitoredSession()`, `tf.function()`, and `tf.estimator()` APIs. 
For example: @@ -199,7 +199,7 @@ model.fit(X_train, Y_train, callbacks=[hook]) ``` -#### Step 4: Perform actions using the hook APIs +### Step 4: Perform actions using the hook APIs For a full list of actions that the hook APIs offer to construct hooks and save tensors, see [Common hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#common-hook-api) and [TensorFlow specific hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#tensorflow-specific-hook-api). From 9c20ef2a509d57b9d00a82c2caa988338ba78839 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Wed, 12 Aug 2020 17:46:22 -0700 Subject: [PATCH 12/28] minor fix --- docs/tensorflow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index b7c9a3ece..89dec7e41 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -67,7 +67,7 @@ pip install -U smdebug ``` If you are using a Jupyter Notebook, put an exclamation mark (!) at the beginning of the code string and restart your kernel. For more information about the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). -### Debugger Built-in Tensor Collections for TensorFlow The following table lists the pre-configured tensor collections for TensorFlow models. You can pick any tensor collections by specifying the `name` parameter of `CollectionConfig()` as shown in the previous base code example. SageMaker Debugger will save these tensors to the default out_dir of the hook. 
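After a training job finishes, the tensors that the hook saved to `out_dir` can be inspected offline with the `smdebug` trials API. The following is a minimal sketch that assumes a completed job; the path is a placeholder for the hook's local `out_dir` or the job's S3 output path:

```python
from smdebug.trials import create_trial

# Placeholder path: the hook's out_dir, or "s3://bucket/debug-output/<job>".
trial = create_trial("/opt/ml/output/tensors")

# List what was saved in the "losses" collection and read the latest value.
loss_names = trial.tensor_names(collection="losses")
print(loss_names)
last_step = trial.steps()[-1]
print(trial.tensor(loss_names[0]).value(last_step))
```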
From 293f7704910d30e730240114c9e0566feed3a603 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Wed, 12 Aug 2020 20:58:52 -0700 Subject: [PATCH 13/28] minor fix --- docs/tensorflow.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 89dec7e41..92e08d673 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -5,7 +5,7 @@ - [Debugger on AWS Deep Learning Containers with TensorFlow](#debugger-dlc) - [Debugger Built-in Tensor Collections for TensorFlow](#tf-built-in-collection) - [Debugger on SageMaker Training Containers and Custom Containers](#debugger-script-change) -- [Code Samples](#examples) +- [Code Examples](#examples) - [References](#references) --- @@ -205,7 +205,7 @@ For a full list of actions that the hook APIs offer to construct hooks and save --- -## Code Samples +## Code Examples The following code examples show the base structures that you can use for hook registration in various TensorFlow training scripts. If you want to use the high-level Debugger features with zero script change on AWS Deep Learning Containers, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html). From 4996feb6c02b73a763349ff309c37c6459b83621 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 09:55:06 -0700 Subject: [PATCH 14/28] incorporate comments --- docs/tensorflow.md | 21 +++++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 92e08d673..efac54234 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -12,7 +12,7 @@ ## Amazon SageMaker Debugger Support for TensorFlow -Amazon SageMaker Debugger python SDK (v2.0) and its client library `smdebug` library (v0.9.1) now fully support TensorFlow 2.2 with the latest version release. 
Using Debugger, you can access tensors of any kind for TensorFlow models, from the Keras model zoo to your own custom model, and save them using Debugger built-in or custom tensor collections. You can run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. It doesn't matter whether your TensorFlow models use Keras API or pure TensorFlow API (in eager mode or non-eager mode), you can directly run them on the AWS Deep Learning Containers. +Amazon SageMaker Debugger python SDK (v2.0) and its client library `smdebug` library (v0.9.2) now fully support TensorFlow 2.2 with the latest version release. Using Debugger, you can access tensors of any kind for TensorFlow models, from the Keras model zoo to your own custom model, and save them using Debugger built-in or custom tensor collections. You can run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. It doesn't matter whether your TensorFlow models use Keras API or pure TensorFlow API (in eager mode or non-eager mode), you can directly run them on the AWS Deep Learning Containers. Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [SageMaker containers to use Debugger with script mode](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). 
For a complete guide for using custom containers, see [Use Debugger in Custom Training Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html). @@ -60,7 +60,7 @@ tf_estimator = TensorFlow( tf_estimator.fit("s3://bucket/path/to/training/data") ``` -**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest `smdebug` library v0.9.1. We highly recommend that you upgrade the packages by running the following commands at the command line: +**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest `smdebug` library. We highly recommend that you upgrade the packages by running the following commands at the command line: ``` pip install -U sagemaker pip install -U smdebug @@ -123,7 +123,7 @@ model.fit(... callbacks=[hook]) ``` -Depending on the TensorFlow versions and the Keras API that you use in your training script, you need to choose the right hook class. The hook constructors for TensorFlow that you can choose are `KerasHook`, `SessionHook`, and `EstimatorHook`. +Depending on the TensorFlow versions and the Keras API that you use in your training script, you need to choose the right hook class. The hook constructors for TensorFlow that you can choose are `smd.KerasHook`, `smd.SessionHook`, and `smd.EstimatorHook`. #### KerasHook @@ -210,6 +210,9 @@ For a full list of actions that the hook APIs offer to construct hooks and save The following code examples show the base structures that you can use for hook registration in various TensorFlow training scripts. If you want to use the high-level Debugger features with zero script change on AWS Deep Learning Containers, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html). ### Keras API (tf.keras) + +The following code example shows how to register the smdebug `KerasHook` for the Keras `model.fit()`. 
You can also set the hook mode to track stored tensors in different phases of training job. For a list of available hook modes, see [smdebug modes](#api.md#modes). + ```python import smdebug.tensorflow as smd @@ -221,6 +224,7 @@ model.compile( loss='sparse_categorical_crossentropy', ) # Add the hook as a callback +# Set hook.set_mode to set tensors to be stored in different phases of training job, such as TRAIN and EVAL hook.set_mode(mode=smd.modes.TRAIN) model.fit(x_train, y_train, epochs=args.epochs, callbacks=[hook]) @@ -229,6 +233,9 @@ model.evaluate(x_test, y_test, callbacks=[hook]) ``` ### Keras GradientTape example for TensorFlow 2.0 and later + +The following code example shows how to register the smdebug `KerasHook` by wrapping the TensorFlow `GradientTape()` with the smdebug `hook.wrap_tape()` API. + ```python import smdebug.tensorflow as smd @@ -250,6 +257,9 @@ model = tf.keras.models.Sequential([ ... ]) ``` ### Monitored Session (tf.train.MonitoredSession) + +The following code example shows how to register the smdebug `SessionHook`. + ```python import smdebug.tensorflow as smd @@ -268,6 +278,9 @@ sess.run([loss, ...]) ``` ### Estimator (tf.estimator.Estimator) + +The following code example shows how to register the smdebug `EstimatorHook`. You can also set the hook mode to track stored tensors in different phases of training job. For a list of available hook modes, see [smdebug modes](#api.md#modes). + ```python import smdebug.tensorflow as smd @@ -276,7 +289,7 @@ hook = smd.EstimatorHook.create_from_json_file() train_input_fn, eval_input_fn = ... estimator = tf.estimator.Estimator(...) -# Set the mode and pass the hook as callback +# Set hook.set_mode to set tensors to be stored in different phases of training job, such as TRAIN and EVAL. 
hook.set_mode(mode=smd.modes.TRAIN) estimator.train(input_fn=train_input_fn, steps=args.steps, hooks=[hook]) From 782e8c6b957a77d4f89be923a76cc209d7d75e1f Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 10:33:45 -0700 Subject: [PATCH 15/28] incorporate comments / lift limitation note --- README.md | 2 -- docs/tensorflow.md | 19 +++++++------------ 2 files changed, 7 insertions(+), 14 deletions(-) diff --git a/README.md b/README.md index 62fe15e61..665f2b213 100644 --- a/README.md +++ b/README.md @@ -68,8 +68,6 @@ The following frameworks are available AWS Deep Learning Containers with the dee | [PyTorch](docs/pytorch.md) | 1.4, 1.5 | | [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 ([As a built-in algorithm](docs/xgboost.md#use-xgboost-as-a-built-in-algorithm))| ->**Note**: Limited support of the zero script change experience for TensorFlow 2.2. The tensor collections `layers`, `inputs`, `outputs`, and `gradients` are currently not available. - ### AWS training containers with script mode The `smdebug` library supports frameworks other than the ones listed above while using AWS containers with script mode. If you want to use SageMaker Debugger with one of the following framework versions, you need to make minimal changes to your training script. diff --git a/docs/tensorflow.md b/docs/tensorflow.md index efac54234..551f8243f 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -113,15 +113,7 @@ For both options, you need to manually register the Debugger hook to your traini ### Step 1: Create a hook - To create the hook constructor, add the following code to your training script. This enables the `smdebug` tools for TensorFlow and creates a TensorFlow `hook` object. When you run the `fit()` API for training, specify the smdebug `hook` as `callbacks`, as shown following: - -```python -import smdebug.tensorflow as smd -hook = smd.{hook_class}.create_from_json_file() -... -model.fit(... 
- callbacks=[hook]) -``` +To create the hook constructor, add the following code to your training script. This enables the `smdebug` tools for TensorFlow and creates a TensorFlow `hook` object. When you run the `fit()` API for training, specify the smdebug `hook` as `callbacks`, as shown in the following subsections. Depending on the TensorFlow versions and the Keras API that you use in your training script, you need to choose the right hook class. The hook constructors for TensorFlow that you can choose are `smd.KerasHook`, `smd.SessionHook`, and `smd.EstimatorHook`. @@ -130,6 +122,7 @@ Depending on the TensorFlow versions and the Keras API that you use in your trai If you use the Keras model zoo and a Keras `model.fit()` API, use `KerasHook`. `KerasHook` is available for the Keras model with the TensorFlow backend interface. `KerasHook` covers the eager execution modes and the gradient tape features that are introduced in the TensorFlow framework version 2.0. You can set the smdebug Keras hook constructor by adding the following code to your training script. Place this code line before `model.compile()`: ```python +import smdebug.tensorflow as smd hook = smd.KerasHook.create_from_json_file() ``` @@ -142,6 +135,7 @@ To learn how to fully implement the hook in your training script, see the [Keras If your model is created in TensorFlow version 1.x with the low-level approach (not using the Keras API), use `SessionHook`. `SessionHook` is for the TensorFlow 1.x monitored training session API, `tf.train.MonitoredSessions()`, as shown following: ```python +import smdebug.tensorflow as smd hook = smd.SessionHook.create_from_json_file() ``` @@ -154,6 +148,7 @@ To learn how to fully implement the hook into your training script, see the [Ten If you have a model using the `tf.estimator()` API, use `EstimatorHook`. 
`EstimatorHook` is available for any TensorFlow framework versions that support the `tf.estimator()` API, as shown following:

```python
+import smdebug.tensorflow as smd
hook = smd.EstimatorHook.create_from_json_file()
```

@@ -211,7 +206,7 @@ The following code examples show the base structures that you can use for hook r

### Keras API (tf.keras)

-The following code example shows how to register the smdebug `KerasHook` for the Keras `model.fit()`. You can also set the hook mode to track stored tensors in different phases of training job. For a list of available hook modes, see [smdebug modes](#api.md#modes).
+The following code example shows how to register the smdebug `KerasHook` for the Keras `model.fit()`. You can also set the hook mode to track stored tensors in different phases of the training job. For a list of available hook modes, see [smdebug modes](api.md#modes).

```python
import smdebug.tensorflow as smd
@@ -279,7 +274,7 @@ sess.run([loss, ...])

### Estimator (tf.estimator.Estimator)

-The following code example shows how to register the smdebug `EstimatorHook`. You can also set the hook mode to track stored tensors in different phases of training job. For a list of available hook modes, see [smdebug modes](#api.md#modes).
+The following code example shows how to register the smdebug `EstimatorHook`. You can also set the hook mode to track stored tensors in different phases of the training job. For a list of available hook modes, see [smdebug modes](api.md#modes).

```python
import smdebug.tensorflow as smd
@@ -289,7 +284,7 @@ hook = smd.EstimatorHook.create_from_json_file()
train_input_fn, eval_input_fn = ...
estimator = tf.estimator.Estimator(...)

-# Set hook.set_mode to set tensors to be stored in different phases of training job, such as TRAIN and EVAL. 
+# Set hook.set_mode to set tensors to be stored in different phases of training job, such as TRAIN and EVAL hook.set_mode(mode=smd.modes.TRAIN) estimator.train(input_fn=train_input_fn, steps=args.steps, hooks=[hook]) From aa7fcc540592863b315080d08ac1e4a54e948c7f Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 11:11:48 -0700 Subject: [PATCH 16/28] incorporate comments --- docs/api.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/api.md b/docs/api.md index 6edb47540..bfb2f549a 100644 --- a/docs/api.md +++ b/docs/api.md @@ -163,7 +163,7 @@ Note that `smd` import below translates to `import smdebug.{framework} as smd`. |`create_from_json_file(`
` json_file_path=None)` | `json_file_path (str)` | Takes the path of a file which holds the json configuration of the hook, and creates hook from that configuration. This is an optional parameter.
If this is not passed it tries to get the file path from the value of the environment variable `SMDEBUG_CONFIG_FILE_PATH` and defaults to `/opt/ml/input/config/debughookconfig.json`. When training on SageMaker you do not have to specify any path because this is the default path that SageMaker writes the hook configuration to. |`close()` | - | Closes all files that are currently open by the hook | | `save_scalar()` | `name (str)`
`value (float)`
`sm_metric (bool)`| Saves a scalar value by the given name. Passing `sm_metric=True` flag also makes this scalar available as a SageMaker Metric to show up in SageMaker Studio. Note that when `sm_metric` is False, this scalar always resides only in your AWS account, but setting it to True saves the scalar also on AWS servers. The default value of `sm_metric` for this method is False. | -| `save_tensor()`| tensor_name (str), tensor_value (float), collections_to_write (str) | - | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor or `save_tensor()`.| +| `save_tensor()`| tensor_name (str), tensor_value (float), collections_to_write (str or list[str]) | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor or `save_tensor()`.| ### TensorFlow specific Hook API From 83ad97019d23666b02defbd063092b962f763621 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 11:25:13 -0700 Subject: [PATCH 17/28] include pypi links --- docs/tensorflow.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 551f8243f..f3abde5f2 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -12,7 +12,12 @@ ## Amazon SageMaker Debugger Support for TensorFlow -Amazon SageMaker Debugger python SDK (v2.0) and its client library `smdebug` library (v0.9.2) now fully support TensorFlow 2.2 with the latest version release. Using Debugger, you can access tensors of any kind for TensorFlow models, from the Keras model zoo to your own custom model, and save them using Debugger built-in or custom tensor collections. You can run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. 
It doesn't matter whether your TensorFlow models use Keras API or pure TensorFlow API (in eager mode or non-eager mode), you can directly run them on the AWS Deep Learning Containers. +Amazon SageMaker Debugger python SDK and its client library `smdebug` now fully support TensorFlow 2.2 with the latest version release. + +- [Amazon SageMaker SDK PyPI](https://pypi.org/project/sagemaker/) +- [The latest smdebug PyPI release](https://pypi.org/project/smdebug/) + +Using Debugger, you can access tensors of any kind for TensorFlow models, from the Keras model zoo to your own custom model, and save them using Debugger built-in or custom tensor collections. You can run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. It doesn't matter whether your TensorFlow models use Keras API or pure TensorFlow API (in eager mode or non-eager mode), you can directly run them on the AWS Deep Learning Containers. Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [SageMaker containers to use Debugger with script mode](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). For a complete guide for using custom containers, see [Use Debugger in Custom Training Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html). 
From 3f2beff7472e1b248d215e181384559e3a86a687 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 12:19:01 -0700 Subject: [PATCH 18/28] minor fix --- docs/api.md | 2 +- docs/tensorflow.md | 18 +++++++++--------- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/docs/api.md b/docs/api.md index bfb2f549a..a01b33750 100644 --- a/docs/api.md +++ b/docs/api.md @@ -163,7 +163,7 @@ Note that `smd` import below translates to `import smdebug.{framework} as smd`. |`create_from_json_file(`
` json_file_path=None)` | `json_file_path (str)` | Takes the path of a file which holds the json configuration of the hook, and creates hook from that configuration. This is an optional parameter.
If this is not passed it tries to get the file path from the value of the environment variable `SMDEBUG_CONFIG_FILE_PATH` and defaults to `/opt/ml/input/config/debughookconfig.json`. When training on SageMaker you do not have to specify any path because this is the default path that SageMaker writes the hook configuration to. |`close()` | - | Closes all files that are currently open by the hook | | `save_scalar()` | `name (str)`
`value (float)`
`sm_metric (bool)`| Saves a scalar value by the given name. Passing `sm_metric=True` flag also makes this scalar available as a SageMaker Metric to show up in SageMaker Studio. Note that when `sm_metric` is False, this scalar always resides only in your AWS account, but setting it to True saves the scalar also on AWS servers. The default value of `sm_metric` for this method is False. | -| `save_tensor()`| tensor_name (str), tensor_value (float), collections_to_write (str or list[str]) | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor or `save_tensor()`.| +| `save_tensor()`| `tensor_name (str)`, `tensor_value (float)`, `collections_to_write (str or list[str])` | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor or `save_tensor()`.| ### TensorFlow specific Hook API diff --git a/docs/tensorflow.md b/docs/tensorflow.md index f3abde5f2..272530b82 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -10,11 +10,11 @@ --- -## Amazon SageMaker Debugger Support for TensorFlow +## Amazon SageMaker Debugger Support for TensorFlow Amazon SageMaker Debugger python SDK and its client library `smdebug` now fully support TensorFlow 2.2 with the latest version release. -- [Amazon SageMaker SDK PyPI](https://pypi.org/project/sagemaker/) +- [Amazon SageMaker Python SDK PyPI](https://pypi.org/project/sagemaker/) - [The latest smdebug PyPI release](https://pypi.org/project/smdebug/) Using Debugger, you can access tensors of any kind for TensorFlow models, from the Keras model zoo to your own custom model, and save them using Debugger built-in or custom tensor collections. You can run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. 
It doesn't matter whether your TensorFlow models use Keras API or pure TensorFlow API (in eager mode or non-eager mode), you can directly run them on the AWS Deep Learning Containers. @@ -30,7 +30,7 @@ Debugger and its client library `smdebug` support debugging your training job on --- -## Using Debugger on AWS Deep Learning Containers with TensorFlow +## Using Debugger on AWS Deep Learning Containers with TensorFlow The Debugger built-in rules and hook features are fully integrated with the AWS Deep Learning Containers. You can run your training script without any script changes. When running training jobs on those Deep Learning Containers, Debugger registers its hooks automatically to your training script in order to retrieve tensors. To find a comprehensive guide of using the high-level SageMaker TensorFlow estimator with Debugger, see [Amazon SageMaker Debugger with TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow) in the Amazon SageMaker Developer Guide. @@ -72,7 +72,7 @@ pip install -U smdebug ``` If you are using a Jupyter Notebook, put an exclamation mark (!) at the beginning of the code string and restart your kernel. For more information about the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). -### Debugger Built-in Tensor Collections for TensorFlow +### Debugger Built-in Tensor Collections for TensorFlow The following table lists the pre-configured tensor collections for TensorFlow models. You can pick any tensor collections by specifying the `name` parameter of `CollectionConfig()` as shown in the previous base code example. SageMaker Debugger will save these tensors to the default out_dir of the hook. 
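The built-in tensor collections listed above are selected by name through `CollectionConfig()`, and each collection can carry its own save parameters. A hedged sketch of a hook configuration that mixes default and tuned collections (collection names come from the table above; the S3 path and `save_interval` value are illustrative):

```python
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

# Pick built-in collections by name; parameter values are passed as strings.
hook_config = DebuggerHookConfig(
    s3_output_path="s3://your-bucket/debugger-output",  # illustrative path
    collection_configs=[
        CollectionConfig(name="weights"),
        CollectionConfig(name="losses", parameters={"save_interval": "50"}),
        # Subject to the TensorFlow version limitations noted in this document:
        CollectionConfig(name="gradients"),
    ],
)
```

This `hook_config` is then passed as `debugger_hook_config` to the TensorFlow estimator shown earlier in the document.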
@@ -99,7 +99,7 @@ For a full list of available tensor collection parameters, see [Configuring Coll --- -## Using Debugger on SageMaker Training Containers and Custom Containers +## Using Debugger on SageMaker Training Containers and Custom Containers If you want to run your own training script or custom containers other than the AWS Deep Learning Containers in the previous option, you can use any of the following options: @@ -116,7 +116,7 @@ For both options, you need to manually register the Debugger hook to your traini 3. [Register the hook to model.fit()](#register-a-hook) -### Step 1: Create a hook +### Step 1: Create a hook To create the hook constructor, add the following code to your training script. This enables the `smdebug` tools for TensorFlow and creates a TensorFlow `hook` object. When you run the `fit()` API for training, specify the smdebug `hook` as `callbacks`, as shown in the following subsections. @@ -159,7 +159,7 @@ hook = smd.EstimatorHook.create_from_json_file() To learn how to fully implement the hook into your training script, see the [simple MNIST training script with the Tensorflow estimator](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow/sagemaker_byoc/simple.py). -### Step 2: Wrap the optimizer and the gradient tape to retrieve gradient tensors +### Step 2: Wrap the optimizer and the gradient tape to retrieve gradient tensors The smdebug TensorFlow hook provides tools to manually retrieve `gradients` tensors specific to the TensorFlow framework. @@ -185,7 +185,7 @@ These smdebug hook wrapper functions capture the gradient tensors, not affecting For examples of code structures that you can use to apply the hook wrappers, see the [Code Examples](#examples) section. -### Step 3: Register the hook to model.fit() +### Step 3: Register the hook to model.fit() To collect the tensors from the hooks that you registered, add `callbacks=[hook]` to the Keras `model.fit()` API. 
This will pass the SageMaker Debugger hook as a Keras callback. Similarly, add `hooks=[hook]` to the `MonitoredSession()`, `tf.function()`, and `tf.estimator()` APIs. For example: @@ -205,7 +205,7 @@ For a full list of actions that the hook APIs offer to construct hooks and save --- -## Code Examples +## Code Examples The following code examples show the base structures that you can use for hook registration in various TensorFlow training scripts. If you want to use the high-level Debugger features with zero script change on AWS Deep Learning Containers, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html). From fd1b1c2296b36c065da2d4fbf1a0ba4265a7f811 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 14:08:44 -0700 Subject: [PATCH 19/28] incorporate comments --- docs/api.md | 2 +- docs/tensorflow.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/api.md b/docs/api.md index a01b33750..bbbd0fa52 100644 --- a/docs/api.md +++ b/docs/api.md @@ -163,7 +163,7 @@ Note that `smd` import below translates to `import smdebug.{framework} as smd`. |`create_from_json_file(`
` json_file_path=None)` | `json_file_path (str)` | Takes the path of a file which holds the json configuration of the hook, and creates hook from that configuration. This is an optional parameter.
If this is not passed it tries to get the file path from the value of the environment variable `SMDEBUG_CONFIG_FILE_PATH` and defaults to `/opt/ml/input/config/debughookconfig.json`. When training on SageMaker you do not have to specify any path because this is the default path that SageMaker writes the hook configuration to. |`close()` | - | Closes all files that are currently open by the hook | | `save_scalar()` | `name (str)`
`value (float)`
`sm_metric (bool)`| Saves a scalar value by the given name. Passing `sm_metric=True` flag also makes this scalar available as a SageMaker Metric to show up in SageMaker Studio. Note that when `sm_metric` is False, this scalar always resides only in your AWS account, but setting it to True saves the scalar also on AWS servers. The default value of `sm_metric` for this method is False. |
-| `save_tensor()`| `tensor_name (str)`, `tensor_value (float)`, `collections_to_write (str or list[str])` | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor of `save_tensor()`.|
+| `save_tensor()`| `tensor_name (str)`, `tensor_value (np.ndarray)`, `collections_to_write (str or list[str])` | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor of `save_tensor()`.|

### TensorFlow specific Hook API

diff --git a/docs/tensorflow.md b/docs/tensorflow.md
index 272530b82..9332aab4c 100644
--- a/docs/tensorflow.md
+++ b/docs/tensorflow.md
@@ -95,7 +95,7 @@ For more information about adjusting the tensor collection parameters, see [Save
 For a full list of available tensor collection parameters, see [Configuring Collection using SageMaker Python SDK](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk).

->**Note**: The `inputs`, `outputs`, and `layers` collections are currently not available for TensorFlow 2.1.0.
+>**Note**: The `inputs`, `outputs`, `gradients`, and `layers` built-in collections are currently available for TensorFlow 2.2.0.
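The `save_tensor()` row above documents `collections_to_write` as accepting either a single collection name or a list of names. The accepted argument shapes can be sketched in plain Python (an illustration of the documented signature, not smdebug's actual implementation):

```python
def normalize_collections(collections_to_write):
    """Accept a single collection name (str) or a list of names, return a list,
    mirroring the str-or-list[str] contract documented for save_tensor()."""
    if isinstance(collections_to_write, str):
        return [collections_to_write]
    return list(collections_to_write)

print(normalize_collections("losses"))               # ['losses']
print(normalize_collections(["weights", "biases"]))  # ['weights', 'biases']
```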
--- From 463f0b4fd5d21d1afb51a7a91b46f613c14e301f Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 14:12:20 -0700 Subject: [PATCH 20/28] incorporate comments --- docs/tensorflow.md | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 9332aab4c..0827cd9c8 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -83,20 +83,18 @@ The following table lists the pre-configured tensor collections for TensorFlow m | `metrics` | For KerasHook, saves the metrics computed by Keras for the model. | | `losses` | Saves all losses of the model. | | `sm_metrics` | Saves scalars that you want to include in the SageMaker metrics collection. | -| `inputs` | Matches all model inputs to the model. | -| `outputs` | Matches all model outputs of the model, such as predictions (logits) and labels. | -| `layers` | Matches all inputs and outputs of intermediate layers. | -| `gradients` | Matches all gradients of the model. | | `weights` | Matches all weights of the model. | | `biases` | Matches all biases of the model. | | `optimizer_variables` | Matches all optimizer variables, currently only supported for Keras. | +| `inputs` | Matches all model inputs to the model. (Available only for TensorFlow 2.2.0)| +| `outputs` | Matches all model outputs of the model, such as predictions (logits) and labels. (Available only for TensorFlow 2.2.0)| +| `layers` | Matches all inputs and outputs of intermediate layers. (Available only for TensorFlow 2.2.0)| +| `gradients` | Matches all gradients of the model. (Available only for TensorFlow 2.2.0)| For more information about adjusting the tensor collection parameters, see [Save Tensors Using Debugger Modified Built-in Collections](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-data.html#debugger-save-modified-built-in-collections). 
For a full list of available tensor collection parameters, see [Configuring Collection using SageMaker Python SDK](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk). ->**Note**: The `inputs`, `outputs`, `gradients`, and `layers` built-in collections are currently available for TensorFlow 2.2.0. - --- ## Using Debugger on SageMaker Training Containers and Custom Containers From 72e48dfd39fc6dc3b6b98e81984b2864b029c08d Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 14:40:25 -0700 Subject: [PATCH 21/28] incorporate comments --- docs/tensorflow.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 0827cd9c8..c3f076905 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -83,18 +83,20 @@ The following table lists the pre-configured tensor collections for TensorFlow m | `metrics` | For KerasHook, saves the metrics computed by Keras for the model. | | `losses` | Saves all losses of the model. | | `sm_metrics` | Saves scalars that you want to include in the SageMaker metrics collection. | +| `inputs` | Matches all model inputs to the model. | +| `outputs` | Matches all model outputs of the model, such as predictions (logits) and labels. | +| `layers` | Matches all inputs and outputs of intermediate layers. | +| `gradients` | Matches all gradients of the model. | | `weights` | Matches all weights of the model. | | `biases` | Matches all biases of the model. | | `optimizer_variables` | Matches all optimizer variables, currently only supported for Keras. | -| `inputs` | Matches all model inputs to the model. (Available only for TensorFlow 2.2.0)| -| `outputs` | Matches all model outputs of the model, such as predictions (logits) and labels. (Available only for TensorFlow 2.2.0)| -| `layers` | Matches all inputs and outputs of intermediate layers. 
(Available only for TensorFlow 2.2.0)| -| `gradients` | Matches all gradients of the model. (Available only for TensorFlow 2.2.0)| For more information about adjusting the tensor collection parameters, see [Save Tensors Using Debugger Modified Built-in Collections](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-data.html#debugger-save-modified-built-in-collections). For a full list of available tensor collection parameters, see [Configuring Collection using SageMaker Python SDK](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk). +>**Note**: The `inputs`, `outputs`, `gradients`, and `layers` built-in collections are currently available for TensorFlow versions <2.0 and ==2.2.0. + --- ## Using Debugger on SageMaker Training Containers and Custom Containers From 557eae10968ffe5cb31b1fa8277aff5f5a51350e Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 14:57:01 -0700 Subject: [PATCH 22/28] version addition --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 665f2b213..ac87bd0cb 100644 --- a/README.md +++ b/README.md @@ -63,9 +63,9 @@ The following frameworks are available AWS Deep Learning Containers with the dee | Framework | Version | | --- | --- | -| [TensorFlow](docs/tensorflow.md) | 1.15, 2.1, 2.2 | +| [TensorFlow](docs/tensorflow.md) | 1.15, 2.1.0, 2.2.0, 2.3.0 | | [MXNet](docs/mxnet.md) | 1.6 | -| [PyTorch](docs/pytorch.md) | 1.4, 1.5 | +| [PyTorch](docs/pytorch.md) | 1.4, 1.5, 1.6 | | [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 ([As a built-in algorithm](docs/xgboost.md#use-xgboost-as-a-built-in-algorithm))| ### AWS training containers with script mode @@ -74,10 +74,10 @@ The `smdebug` library supports frameworks other than the ones listed above while | Framework | Versions | | --- | --- | -| [TensorFlow](docs/tensorflow.md) | 1.13, 1.14, 1.15, 2.1, 2.2 | +| [TensorFlow](docs/tensorflow.md) | 1.13, 1.14, 
1.15, 2.1.0, 2.2.0, 2.3.0 | | Keras (with TensorFlow backend) | 2.3 | | [MXNet](docs/mxnet.md) | 1.4, 1.5, 1.6 | -| [PyTorch](docs/pytorch.md) | 1.2, 1.3, 1.4, 1.5 | +| [PyTorch](docs/pytorch.md) | 1.2, 1.3, 1.4, 1.5, 1.6 | | [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 (As a framework)| ### Debugger on custom containers or local machines From 1eee9c626e20ca7c3d77880c6c29d432ea298981 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 15:01:50 -0700 Subject: [PATCH 23/28] version addition --- docs/api.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/api.md b/docs/api.md index bbbd0fa52..92ac9ecc0 100644 --- a/docs/api.md +++ b/docs/api.md @@ -163,7 +163,7 @@ Note that `smd` import below translates to `import smdebug.{framework} as smd`. |`create_from_json_file(`
` json_file_path=None)` | `json_file_path (str)` | Takes the path of a file which holds the json configuration of the hook, and creates hook from that configuration. This is an optional parameter.
If this is not passed it tries to get the file path from the value of the environment variable `SMDEBUG_CONFIG_FILE_PATH` and defaults to `/opt/ml/input/config/debughookconfig.json`. When training on SageMaker you do not have to specify any path because this is the default path that SageMaker writes the hook configuration to. |`close()` | - | Closes all files that are currently open by the hook | | `save_scalar()` | `name (str)`
`value (float)`
`sm_metric (bool)`| Saves a scalar value by the given name. Passing `sm_metric=True` flag also makes this scalar available as a SageMaker Metric to show up in SageMaker Studio. Note that when `sm_metric` is False, this scalar always resides only in your AWS account, but setting it to True saves the scalar also on AWS servers. The default value of `sm_metric` for this method is False. |
-| `save_tensor()`| `tensor_name (str)`, `tensor_value (np.ndarray)`, `collections_to_write (str or list[str])` | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor of `save_tensor()`.|
+| `save_tensor()`| `tensor_name (str)`, `tensor_value (numpy.array or numpy.ndarray)`, `collections_to_write (str or list[str])` | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor of `save_tensor()`.|

### TensorFlow specific Hook API

From fd62feb202cecc60452bede941348d33fe6500e9 Mon Sep 17 00:00:00 2001
From: Miyoung Choi
Date: Mon, 31 Aug 2020 15:32:15 -0700
Subject: [PATCH 24/28] add footnote about limitation

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index ac87bd0cb..23ffcb520 100644
--- a/README.md
+++ b/README.md
@@ -68,6 +68,8 @@ The following frameworks are available AWS Deep Learning Containers with the dee
 | [PyTorch](docs/pytorch.md) | 1.4, 1.5, 1.6 |
 | [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 ([As a built-in algorithm](docs/xgboost.md#use-xgboost-as-a-built-in-algorithm))|

+>**Note**: Debugger with zero script change is partially available for TensorFlow v2.1.0 and v2.3.0. The `inputs`, `outputs`, `gradients`, and `layers` built-in collections are currently not available for these TensorFlow versions.
+
 ### AWS training containers with script mode

 The `smdebug` library supports frameworks other than the ones listed above while using AWS containers with script mode.
If you want to use SageMaker Debugger with one of the following framework versions, you need to make minimal changes to your training script.

From 19754a1a63e645b20d0fcfc873397ef31894be07 Mon Sep 17 00:00:00 2001
From: Miyoung Choi
Date: Mon, 31 Aug 2020 15:46:32 -0700
Subject: [PATCH 25/28] add details

---
 docs/tensorflow.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/docs/tensorflow.md b/docs/tensorflow.md
index c3f076905..498959b00 100644
--- a/docs/tensorflow.md
+++ b/docs/tensorflow.md
@@ -64,6 +64,14 @@ tf_estimator = TensorFlow(
 )
 tf_estimator.fit("s3://bucket/path/to/training/data")
 ```
+>**Note**: The SageMaker TensorFlow estimator and the Debugger collections in the example are based on the latest SageMaker python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command line.
+```bash
+pip install -U sagemaker
+pip install -U smdebug
+```
+If you are using a Jupyter Notebook, put an exclamation mark (!) at the front of the code lines and restart your kernel.
+
+#### Available Tensor Collections for TensorFlow

 **Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest `smdebug` library.
We highly recommend that you upgrade the packages by running the following commands at the command line: ``` From dd13c6cb41fd598e5f5391e1301cd3cdb1e9f8db Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Mon, 31 Aug 2020 15:58:45 -0700 Subject: [PATCH 26/28] add footnote --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index ac87bd0cb..c2cfd7d58 100644 --- a/README.md +++ b/README.md @@ -68,6 +68,8 @@ The following frameworks are available AWS Deep Learning Containers with the dee | [PyTorch](docs/pytorch.md) | 1.4, 1.5, 1.6 | | [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 ([As a built-in algorithm](docs/xgboost.md#use-xgboost-as-a-built-in-algorithm))| +**Note**: Debugger with zero script change is partially available for TensorFlow v2.1.0 and v2.3.0. The `inputs`, `outputs`, `gradients`, and `layers` built-in collections are currently not available for these TensorFlow versions. + ### AWS training containers with script mode The `smdebug` library supports frameworks other than the ones listed above while using AWS containers with script mode. If you want to use SageMaker Debugger with one of the following framework versions, you need to make minimal changes to your training script. 
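The footnote added above ties the `inputs`, `outputs`, `gradients`, and `layers` built-in collections to specific TensorFlow versions. A training script that requests those collections could fail fast with a small guard of this kind (illustrative helper; the supported-version set is an assumption drawn from the footnote and may change with later releases):

```python
# Assumption from the footnote above: the full set of built-in collections
# (inputs/outputs/gradients/layers) is available on TF 2.2.0, but not on
# the partially supported v2.1.0 and v2.3.0.
FULL_COLLECTION_TF_VERSIONS = {"2.2.0"}

def full_collections_available(tf_version):
    """Return True if inputs/outputs/gradients/layers collections are usable."""
    return tf_version in FULL_COLLECTION_TF_VERSIONS

print(full_collections_available("2.2.0"))  # True
print(full_collections_available("2.3.0"))  # False
```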
From 4d86970e14f9a397a61d35e9e5cea74f389e5882 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Tue, 1 Sep 2020 10:00:12 -0700 Subject: [PATCH 27/28] retrigger CI From 0decbe9cf005b45b198de7452a8321f39805279e Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Tue, 1 Sep 2020 13:37:44 -0700 Subject: [PATCH 28/28] fix version numbers --- docs/tensorflow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 498959b00..2f50a9fc2 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -64,7 +64,7 @@ tf_estimator = TensorFlow( ) tf_estimator.fit("s3://bucket/path/to/training/data") ``` ->**Note**: The SageMaker TensorFlow estimator and the Debugger collections in the example are based on the latest SageMaker python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command line. +>**Note**: The SageMaker TensorFlow estimator and the Debugger collections in the example are based on the SageMaker python SDK v2 and `smdebug` v0.9.2. It is highly recommended to upgrade the packages by executing the following command line. ```bash pip install -U sagemaker pip install -U smdebug