diff --git a/README.md b/README.md
index 5074df07a..d824a9a2e 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,9 @@
# Amazon SageMaker Debugger
- [Overview](#overview)
-- [Examples](#sagemaker-example)
+- [Examples](#examples)
- [How It Works](#how-it-works)
+- [Docs](#docs)
## Overview
Amazon SageMaker Debugger is an offering from AWS which help you automate the debugging of machine learning training jobs.
@@ -15,6 +16,7 @@ It supports TensorFlow, PyTorch, MXNet, and XGBoost on Python 3.6+.
- Real-time training job monitoring through Rules
- Automated anomaly detection and state assertions
- Interactive exploration of saved tensors
+- Actions on your training jobs based on the status of Rules
- Distributed training support
- TensorBoard support
@@ -51,6 +53,12 @@ sagemaker_simple_estimator = sm.tensorflow.TensorFlow(
)
sagemaker_simple_estimator.fit()
+tensors_path = sagemaker_simple_estimator.latest_job_debugger_artifacts_path()
+
+import smdebug as smd
+trial = smd.trials.create_trial(tensors_path)
+print(f"Saved these tensors: {trial.tensor_names()}")
+print(f"Loss values during evaluation were {trial.tensor('CrossEntropyLoss:0').values(mode=smd.modes.EVAL)}")
```
That's it! Amazon SageMaker will automatically monitor your training job for you with the Rules specified and create a CloudWatch
@@ -101,12 +109,15 @@ Amazon SageMaker Debugger can be used inside or outside of SageMaker. There are
The reason for different setups is that SageMaker Zero-Script-Change (via Deep Learning Containers) uses custom framework forks of TensorFlow, PyTorch, MXNet, and XGBoost to save tensors automatically.
These framework forks are not available in custom containers or non-SM environments, so you must modify your training script in these environments.
-See the [SageMaker page](docs/sagemaker.md) for details on SageMaker Zero-Code-Change and Bring-Your-Own-Container (BYOC) experience.\
-See the frameworks pages for details on modifying the training script:
-- [TensorFlow](docs/tensorflow.md)
-- [PyTorch](docs/pytorch.md)
-- [MXNet](docs/mxnet.md)
-- [XGBoost](docs/xgboost.md)
+## Docs
+
+| Section | Description |
+| --- | --- |
+| [SageMaker Training](docs/sagemaker.md) | SageMaker users, we recommend you start with this page on how to run SageMaker training jobs with SageMaker Debugger |
+| Frameworks: [TensorFlow](docs/tensorflow.md), [PyTorch](docs/pytorch.md), [MXNet](docs/mxnet.md), [XGBoost](docs/xgboost.md) | See the frameworks pages for details on what's supported and how to modify your training script if applicable |
+| [Programming Model for Analysis](docs/analysis.md) | Describes the programming model provided by our APIs, which lets you interactively explore saved tensors and write your own Rules to monitor your training jobs |
+| [APIs](docs/api.md) | Full description of our APIs |
+
## License
This library is licensed under the Apache 2.0 License.
diff --git a/docs/sagemaker.md b/docs/sagemaker.md
index 094f9bdd5..aa5bce967 100644
--- a/docs/sagemaker.md
+++ b/docs/sagemaker.md
@@ -29,7 +29,7 @@ Here's a list of frameworks and versions which support this experience.
| [TensorFlow](tensorflow.md) | 1.15 |
| [MXNet](mxnet.md) | 1.6 |
| [PyTorch](pytorch.md) | 1.3 |
-| [XGBoost](xgboost.md) | |
+| [XGBoost](xgboost.md) | >=0.90-2 [As Built-in algorithm](xgboost.md#use-xgboost-as-a-built-in-algorithm)|
More details for the deep learning frameworks on which containers these are can be found here: [SageMaker Framework Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) and [AWS Deep Learning Containers](https://aws.amazon.com/machine-learning/containers/). You do not have to specify any training container image if you want to use them on SageMaker. You only need to specify the version above to use these containers.
@@ -43,7 +43,7 @@ This library `smdebug` itself supports versions other than the ones listed above
| Keras (with TensorFlow backend) | 2.3 |
| [MXNet](mxnet.md) | 1.4, 1.5, 1.6 |
| [PyTorch](pytorch.md) | 1.2, 1.3 |
-| [XGBoost](xgboost.md) | |
+| [XGBoost](xgboost.md) | [As Framework](xgboost.md#use-xgboost-as-a-framework) |
#### Setting up SageMaker Debugger with your script on your container
@@ -189,7 +189,7 @@ The Built-in Rules, or SageMaker Rules, are described in detail on [this page](h
Scope of Validity | Rules |
|---|---|
| Generic Deep Learning models (TensorFlow, Apache MXNet, and PyTorch) |
|
diff --git a/docs/xgboost.md b/docs/xgboost.md
index 35c63ba03..14a8220bc 100644
--- a/docs/xgboost.md
+++ b/docs/xgboost.md
@@ -10,9 +10,9 @@
### Use XGBoost as a built-in algorithm
The XGBoost algorithm can be used 1) as a built-in algorithm, or 2) as a framework such as MXNet, PyTorch, or Tensorflow.
-If SageMaker XGBoost is used as a built-in algorithm in container verision `0.90-2` or later, Amazon SageMaker Debugger will be available by default (i.e., zero code change experience).
+If SageMaker XGBoost is used as a built-in algorithm in container version `0.90-2` or later, Amazon SageMaker Debugger will be available by default (i.e., zero code change experience).
See [XGBoost Algorithm AWS docmentation](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) for more information on how to use XGBoost as a built-in algorithm.
-See [Amazon SageMaker Debugger examples](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger) for sample notebooks that demonstrate debugging and monitoring capabilities of Aamazon SageMaker Debugger.
+See [Amazon SageMaker Debugger examples](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger) for sample notebooks that demonstrate debugging and monitoring capabilities of Amazon SageMaker Debugger.
See [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/) for more information on how to configure the Amazon SageMaker Debugger from the Python SDK.
### Use XGBoost as a framework
diff --git a/examples/mxnet/README.md b/examples/mxnet/README.md
new file mode 100644
index 000000000..9818d371a
--- /dev/null
+++ b/examples/mxnet/README.md
@@ -0,0 +1,2 @@
+## Example Notebooks
+Please refer to the example notebooks in the [Amazon SageMaker Examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger).
diff --git a/examples/mxnet/notebooks/MNISTSimpleInteractiveAnalysis.ipynb b/examples/mxnet/notebooks/MNISTSimpleInteractiveAnalysis.ipynb
deleted file mode 100644
index 67a818ebe..000000000
--- a/examples/mxnet/notebooks/MNISTSimpleInteractiveAnalysis.ipynb
+++ /dev/null
@@ -1,652 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Simple Interactive Analysis in Tornasole\n",
- "This notebook will demonstrate the simplest kind of interactive analysis that can be run in smdebug. It will focus on the [vanishing/exploding gradient](https://medium.com/learn-love-ai/the-curious-case-of-the-vanishing-exploding-gradient-bf58ec6822eb) problems on a simple MNIST digit recognition."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Setup"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Some basic setup that's always helpful"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%load_ext autoreload\n",
- "%autoreload 2"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Make sure that MXNet is accessible! If you are on the EC2 Deep Learning AMI, you will probably want\n",
- "to activate the right MXNet environment\n",
- "```\n",
- "sh> source activate mxnet_p36\n",
- "```\n",
- "You'll probably have to restart this notebook after doing this"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's import some basic libraries for ML"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "import mxnet as mx\n",
- "from mxnet import gluon, autograd\n",
- "from mxnet.gluon import nn\n",
- "import matplotlib.pyplot as plt"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's copy the Tornasole libraries to this instance, this step has to be executed only once. \n",
- "Please make sure that the AWS account you are using can access the `tornasole-external-preview-use1` bucket.\n",
- "\n",
- "To do so you'll need the appropriate AWS credentials. There are several ways of doing this:\n",
- "- inject temporary credentials \n",
- "- if running on EC2, use [EC2 roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html) that can access all S3 buckets\n",
- "- (preferred) run this notebook on a [SageMaker notebook instance](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html)\n",
- "\n",
- "The code below downloads the necessary `.whl` files and installs them in the current environment. Only run the first time!\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "#WARNING - uncomment this code only if you haven't done this before\n",
- "#!aws s3 sync s3://tornasole-external-preview-use1/sdk/ts-binaries/tornasole_mxnet/py3/latest/ tornasole_mxnet/\n",
- "#!pip install tornasole_mxnet/*\n",
- "\n",
- "# If you run into a version conflict with boto, run the following\n",
- "#!pip uninstall -y botocore boto3 aioboto3 aiobotocore && pip install botocore==1.12.91 boto3==1.9.91 aiobotocore==0.10.2 aioboto3==6.4.1"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Model Training and Gradient Analysis\n",
- "At this point we have all the ingredients installed on our machine. We can now start training.\n",
- "\n",
- "The goal of this notebook is to show how to detect the Vanishing Gradient problem. We will first do it manually and then automatic."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from smdebug.mxnet import SessionHook, SaveConfig\n",
- "from smdebug.trials import LocalTrial"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can change the logging level if appropriate "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "#import logging\n",
- "#logging.getLogger(\"tornasole\").setLevel(logging.WARNING)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can define a simple network - it doesn't really matter what it is.\n",
- "Importantly - we **add the Tornasole Hook**. This hook will be run at every batch and will save selected tensors (in this case, all of them) to the desired directory (in this case, `'{base_loc}/{run_id}'`.\n",
- "\n",
- "`{base_loc}` can be either a path on a local file system (for instance, `./ts_output/`) or an S3 bucket/object (`s3://mybucket/myprefix/`).\n",
- "\n",
- "See the documentation for more details."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import smdebug.mxnet as smd\n",
- "def create_net( tornasole_save_interval, base_loc, run_id ):\n",
- " net = nn.Sequential(prefix='sequential_')\n",
- " with net.name_scope():\n",
- " net.add(nn.Dense(128, activation='relu'))\n",
- " net.add(nn.Dense(64, activation='relu'))\n",
- " net.add(nn.Dense(10))\n",
- "\n",
- " # Create and add the hook. Arguments:\n",
- " # - save data in './{base_loc}/{run_id} - Note: s3 is also supported\n",
- " # - save every 100 batches\n",
- " # - save every tensor: inputs/outputs to each layer, as well as gradients\n",
- " trial_dir = base_loc + run_id\n",
- " hook = SessionHook(out_dir=trial_dir,\n",
- " save_config=SaveConfig(save_interval=100), \n",
- " save_all=True)\n",
- " hook.register_hook(net)\n",
- " return net"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "And we create a simple training script. No Tornasole-specific code here, this is a slightly modified version of the [digit recognition](https://github.com/apache/incubator-mxnet/blob/master/example/gluon/mnist/mnist.py) example on the MXNet website."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "def transformer(data, label):\n",
- " data = data.reshape((-1,)).astype(np.float32)/255\n",
- " return data, label\n",
- "\n",
- "def test(ctx, val_data):\n",
- " metric = mx.metric.Accuracy()\n",
- " for data, label in val_data:\n",
- " data = data.as_in_context(ctx)\n",
- " label = label.as_in_context(ctx)\n",
- " output = net(data)\n",
- " metric.update([label], [output])\n",
- " return metric.get()\n",
- "\n",
- "def train(net, epochs, ctx, learning_rate, momentum):\n",
- " train_data = gluon.data.DataLoader(\n",
- " gluon.data.vision.MNIST('./data', train=True, transform=transformer),\n",
- " batch_size=100, shuffle=True, last_batch='discard')\n",
- "\n",
- " val_data = gluon.data.DataLoader(\n",
- " gluon.data.vision.MNIST('./data', train=False, transform=transformer),\n",
- " batch_size=100, shuffle=False)\n",
- " \n",
- " # Collect all parameters from net and its children, then initialize them.\n",
- " net.initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)\n",
- " # Trainer is for updating parameters with gradient.\n",
- " trainer = gluon.Trainer(net.collect_params(), 'sgd',\n",
- " {'learning_rate': learning_rate, 'momentum': momentum})\n",
- " metric = mx.metric.Accuracy()\n",
- " loss = gluon.loss.SoftmaxCrossEntropyLoss()\n",
- "\n",
- " for epoch in range(epochs):\n",
- " # reset data iterator and metric at begining of epoch.\n",
- " metric.reset()\n",
- " for i, (data, label) in enumerate(train_data):\n",
- " # Copy data to ctx if necessary\n",
- " data = data.as_in_context(ctx)\n",
- " label = label.as_in_context(ctx)\n",
- " # Start recording computation graph with record() section.\n",
- " # Recorded graphs can then be differentiated with backward.\n",
- " with autograd.record():\n",
- " output = net(data)\n",
- " L = loss(output, label)\n",
- " L.backward()\n",
- " # take a gradient step with batch_size equal to data.shape[0]\n",
- " trainer.step(data.shape[0])\n",
- " # update metric at last.\n",
- " metric.update([label], [output])\n",
- "\n",
- " if i % 100 == 0 and i > 0:\n",
- " name, acc = metric.get()\n",
- " print('[Epoch %d Batch %d] Training: %s=%f'%(epoch, i, name, acc))\n",
- "\n",
- " name, acc = metric.get()\n",
- " print('[Epoch %d] Training: %s=%f'%(epoch, name, acc))\n",
- "\n",
- " name, val_acc = test(ctx, val_data)\n",
- " print('[Epoch %d] Validation: %s=%f'%(epoch, name, val_acc))\n",
- "\n",
- " net.save_parameters('mnist.params')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Clear up from previous runs, we remove old data (warning - we assume that we have set `ts_output` as the directory into which we send data)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!rm -rf ./ts_output/"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "At this point we are ready to train. We will train this simple model.\n",
- "\n",
- "For the purposes of this example, we will name this run as `'good'` because we know it will converge to a good solution. If you have a GPU on your machine, you can change `ctx=mx.gpu(0)`.\n",
- "\n",
- "Behind the scenes, the `SessionHook` is saving the data requested."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "net = create_net( tornasole_save_interval=100, base_loc='./ts_output/', run_id='good')\n",
- "train(net=net, epochs=4, ctx=mx.cpu(), learning_rate=0.1, momentum=0.9)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Data Analysis - Manual\n",
- "Now that we have trained the system we can analyze the data. Notice that this notebook focuses on after-the-fact analysis. Tornasole also provides a collection of tools to do automatic analysis as the training run is progressing, which will be covered in a different notebook."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We import a basic analysis library, which defines a concept of `Trial`. A `Trial` is a single training run, which is depositing values in a local directory (`LocalTrial`) or S3 (`S3Trial`). In this case we are using a `LocalTrial` - if you wish, you can change the output from `./ts_output` to `s3://mybucket/myprefix` and use `S3Trial` instead of `LocalTrial`."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "And we read the data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "good_trial = LocalTrial( 'myrun', './ts_output/good/')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can list all the tensors we know something about. Each one of these names is the name of a tensor - the name is a combination of the layer name (which, in these cases, is auto-assigned by MXNet) and whether it's an input/output/gradient."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "good_trial.tensor_names()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Another interesting feature is *collections*. Collection represents a set of tensors that are grouped together by a condition. See [more docs on Collections](https://github.com/awslabs/tornasole_core/blob/alpha/docs/mxnet/api.md#collection). For example, here we can inspect which tensors got into collection named '*gradients*'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "good_trial.tensors_in_collection('gradients')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For each tensor we can ask for which steps we have data - in this case, every 100 steps"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "good_trial.tensor('gradient/sequential_dense0_weight').steps()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can obtain each tensor at each step as a `numpy` array"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "type(good_trial.tensor('gradient/sequential_dense0_weight').value(300))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Gradient Analysis"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can also create a simple function that prints the `np.mean` of the `np.abs` of each gradient. We expect each gradient to get smaller over time, as the system converges to a good solution. Now, remember that this is an interactive analysis - we are showing these tensors to give an idea of the data. \n",
- "\n",
- "Later on in this notebook we will run an automated analysis."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Define a function that, for the given tensor name, walks through all \n",
- "# the batches for which we have data and computes mean(abs(tensor)).\n",
- "# Returns the set of steps and the values\n",
- "\n",
- "def get_data(trial, tname):\n",
- " tensor = trial.tensor(tname)\n",
- " steps = tensor.steps()\n",
- " vals = []\n",
- " for s in steps:\n",
- " val = tensor.value(s)\n",
- " val = np.mean(np.abs(val))\n",
- " vals.append(val)\n",
- " return steps, vals"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "def plot_gradients( lt ):\n",
- " for tname in lt.tensor_names():\n",
- " if not 'gradient' in tname: continue\n",
- " steps, data = get_data(lt, tname)\n",
- " plt.plot( steps, data, label=tname)\n",
- " plt.legend()\n",
- " plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can plot these gradiends. Notice how they are (mostly!) decreasing. We should investigate the spikes!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "plot_gradients(good_trial)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can also print inputs and outputs from the model. For instance, let's print the 83th sample of the 2700th batch, as seen by the network. \n",
- "\n",
- "Notice that we have to reshape the input data from a (784,) array to a (28,28) array and multiply by 255 - the exact inverse of the transformation we did above."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# The raw tensor\n",
- "raw_t = good_trial.tensor('sequential_input_0').value(2700)[83]\n",
- "# We have to undo the transformations in 'transformer' above. First of all, multiply by 255\n",
- "raw_t = raw_t * 255\n",
- "# Then reshape from a 784-long vector to a 28x28 square.\n",
- "input_image = raw_t.reshape(28,28)\n",
- "plt.imshow(input_image, cmap=plt.get_cmap('gray'))\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can also plot the relative values emitted by the network. Notice that the last layer is of type `Dense(10)`: it will emit 10 separate confidences, one for each 0-9 digit. The one with the highest output is the predicted value.\n",
- "\n",
- "We can capture and plot the network output for the same sample."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "plt.plot(good_trial.tensor('sequential_output0').value(2700)[83], 'bo')\n",
- "plt.show()\n",
- "print( 'The network predicted the value: {}'.format(np.argmax(good_trial.tensor('sequential_output0').value(2700)[83])))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Vanishing Gradient"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We have now worked through some of the basics. Let's pretend we are debugging a real problem: the [Vanishing Gradient](https://en.wikipedia.org/wiki/Vanishing_gradient_problem). When training a network, if the `learning_rate` is too high we will end up with a Vanishing Gradient. Let's set `learning_rate=1`.\n",
- "\n",
- "Notice how the accuracy remains at around ~10% - no better than random."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "net = create_net( tornasole_save_interval=100, base_loc='./ts_output/', run_id='bad')\n",
- "train(net=net, epochs=4, ctx=mx.cpu(), learning_rate=1, momentum=0.9)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "bad_trial = LocalTrial( 'myrun', './ts_output/bad/')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can plot the gradients - notice how every single one of them (apart from one) goes to zero and stays there!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "plot_gradients(bad_trial)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The `VanishingGradient` rule provided by Tornasole alerts for this automatically."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "input_image = (bad_trial.tensor('sequential_input_0').value(2700)[83]*255).reshape(28,28)\n",
- "plt.imshow(input_image, cmap=plt.get_cmap('gray'))\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "plt.plot(bad_trial.tensor('sequential_output0').value(2700)[83], 'bo')\n",
- "plt.show()\n",
- "print( 'The network predicted the value: {}'.format(np.argmax(bad_trial.tensor('sequential_output0').value(2700)[83])))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Data Analysis - Automatic\n",
- "So far we have conducted a human analysis, but the real power of Tornasole comes from having automatic monitoring of training runs. To do so we will build a SageMaker-based system that monitors existing runs in real time. Data traces deposited in S3 are the exchange mechanism: \n",
- "- the training system deposits data into s3://mybucket/myrun/\n",
- "- the monitoring system watches and reads data from s3://mybucket/myrun/\n",
- "\n",
- "In this example we will simulate that situation. The only difference from SageMaker-based system is that data traces have been stored locally, not in S3, so we will use previously created `LocalTrial` objects and run rule on them."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from smdebug.rules.generic import VanishingGradient\n",
- "from smdebug.rules.rule_invoker import invoke_rule"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "vr = VanishingGradient(base_trial=good_trial, threshold=0.0001)\n",
- "invoke_rule(vr, end_step=2700)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "vr_bad = VanishingGradient(base_trial=bad_trial, threshold=0.0001)\n",
- "invoke_rule(vr_bad, end_step=2700)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "This concludes this notebook. For more information see the documentation at \n",
- "- https://github.com/awslabs/tornasole_core\n"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "conda_mxnet_p36",
- "language": "python",
- "name": "conda_mxnet_p36"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.5"
- },
- "pycharm": {
- "stem_cell": {
- "cell_type": "raw",
- "metadata": {
- "collapsed": false
- },
- "source": []
- }
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/examples/mxnet/notebooks/mxnet-realtime-analysis.ipynb b/examples/mxnet/notebooks/mxnet-realtime-analysis.ipynb
deleted file mode 100644
index 417b27592..000000000
--- a/examples/mxnet/notebooks/mxnet-realtime-analysis.ipynb
+++ /dev/null
@@ -1,453 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Debugging SageMaker Training Jobs In Real Time with Tornasole"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Overview"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Tornasole is a new capability of Amazon SageMaker that allows debugging machine learning training. \n",
- "It lets you go beyond just looking at scalars like losses and accuracies during training and gives \n",
- "you full visibility into all tensors 'flowing through the graph' during training. Tornasole helps you to monitor your training in near real time using rules and would provide you alerts, once it has detected inconsistency in training flow.\n",
- "\n",
- "Using Tornasole is a two step process: Saving tensors and Analysis. Let's look at each one of them closely.\n",
- "\n",
- "### Saving tensors\n",
- "\n",
- "Tensors define the state of the training job at any particular instant in its lifecycle. Tornasole exposes a library which allows you to capture these tensors and save them for analysis.\n",
- "\n",
- "### Analysis\n",
- "\n",
- "There are two ways to get to tensors and run analysis on them. One way is to use concept called ***Rules***. Please refer to [DeveloperGuide_Rules.md](../../../../rules/DeveloperGuide_Rules.md) for more details about rules based approach to analysis. Focus of this notebook is on another way of analysis: **Manual**.\n",
- "\n",
- "Manual analysis is what you use when there are no rules available to detect type of an issue you are running into and you need to get to raw tensors in order to understand what data is travelling through your model duing training and, hopefully, root cause a problem or two with your training job.\n",
- "\n",
- "Manual analysis is powered by Tornasole API - a framework that allows to retrieve tensors and scalas (e.g. debugging data) saved during training job via few lines of code. One of the most powerful features provided by it is real time access to data - you can get tensors and scalars ***while your training job is running***.\n",
- "\n",
- "This example guides you through installation of the required components for emitting tensors in a \n",
- "SageMaker training job and using Tornasole API to access those tensors while training is running. We will use small gluon CNN model and train it on FashionMNIST dataset. While job is running we will retrieve activations of first convolutional layer from each 100 batches and visualize them. Also we will visualize weights of that level after the job is done."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Setup\n",
- "\n",
- "As a first step, we'll do the installation of required tools which will allow emission of tensors (saving tensors) and provide access to Tornasole API to retrieve them."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!aws s3 sync s3://tornasole-external-preview-use1/ ./tornasole\n",
- "!pip install ./smdebug/sdk/ts-binaries/tornasole_mxnet/py3/latest/tornasole-0.3.4-py2.py3-none-any.whl --user\n",
- "!pip -q install ./smdebug/sdk/sagemaker-tornasole-latest.tar.gz\n",
- "!aws configure add-model --service-model file:///home/ec2-user/SageMaker/smdebug/sdk/sagemaker-smdebug.json --service-name sagemaker"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Training MXNet models in SageMaker with Tornasole\n",
- "\n",
- "We'll be training a small mxnet CNN model with FashonMNIST dataset in this notebook with Tornasole enabled. This will be done using SageMaker MXNet 1.4.1 Container with Script Mode. Note that Tornasole currently only works with python3, so be sure to set `py_version='py3'` when creating SageMaker Estimator.\n",
- "\n",
- "Let us first train with a simple training script mnist_gluon_realtime_visualize_demo.py with Tornasole enabled in SageMaker using the SageMaker Estimator API. In this example, for simplicity sake, Tornasole will capture all tensors as specified in its configuration every 100 steps (1 step is 1 batch). While training job is running we will use Tornasole API to access saved tensors in real time and visualize them. We will rely on Tornasole to take care of downloading fresh set of tensors every time we query for them.\n",
- "\n",
- "## Enable Tornasole in the training script\n",
- "\n",
- "Integrating Tornasole into the training job can be accomplished by following steps below.\n",
- "\n",
- "### Import the hook package\n",
- "Import the SessionHook class along with other helper classes in your training script as shown below\n",
- "\n",
- "```\n",
- "from smdebug.mxnet import SessionHook\n",
- "from smdebug import SaveConfig, modes\n",
- "```\n",
- "\n",
- "### Instantiate and initialize hook\n",
- "\n",
- "```\n",
- " # Create SaveConfig object that instructs engine to log graph tensors every 100 steps (1 step == 1 batch).\n",
- " save_config = SaveConfig(save_interval=100)\n",
- " # Create a hook that logs ***all*** tensors while training the model.\n",
- " hook = SessionHook(save_config=save_config, save_all=True)\n",
- "```\n",
- "\n",
-    "### Register the Tornasole hook with the model before training starts\n",
-    "\n",
-    "*NOTE: The hook can only be registered with non-hybrid Gluon models.*\n",
-    "\n",
-    "\n",
-    "After creating or loading the desired model, you can register the Tornasole hook with the model as shown below.\n",
- "\n",
- "```\n",
- "# Create a Gluon Model.\n",
- "net = create_gluon_model()\n",
- "\n",
- "# Create a hook for logging all tensors.\n",
- "hook = create_hook()\n",
- "\n",
-    "# Apply the hook to the model (i.e. instruct the engine to recognize the hook configuration\n",
-    "# and enable the mode in which the engine logs graph tensors).\n",
- "hook.register_hook(net)\n",
- "```\n",
- "\n",
- "#### Set the mode\n",
- "Tornasole has the concept of modes (TRAIN, EVAL, PREDICT) to separate out different modes of the jobs.\n",
- "Set the mode you are running in your job. Every time the mode changes in your job, please set the current mode. This helps you group steps by mode, for easier analysis. Setting the mode is optional but recommended. If you do not specify this, Tornasole saves all steps under a `GLOBAL` mode. \n",
- "```\n",
-    "hook.set_mode(modes.TRAIN)\n",
- "```\n",
- "\n",
-    "Refer to [DeveloperGuide_MXNet.md](../../DeveloperGuide_MXNet.md) for more details on the APIs Tornasole provides to help you save tensors.\n",
- "\n",
- "### Docker Images with Tornasole\n",
- "\n",
-    "We have built SageMaker MXNet containers with smdebug. You can pull them from ECR for use in SageMaker. The image names are given below. Please use the image from the region in which you want your jobs to run."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%load_ext autoreload\n",
- "%autoreload 2\n",
- "import sagemaker\n",
- "import boto3\n",
- "import os\n",
- "from sagemaker.mxnet import MXNet\n",
- "from smdebug.mxnet import modes\n",
- "\n",
-    "# Use the region in which this notebook is running\n",
- "TAG='latest'\n",
- "REGION = boto3.Session().region_name\n",
- "os.environ['AWS_REGION'] = REGION\n",
- "\n",
- "cpu_docker_image_name= '072677473360.dkr.ecr.{}.amazonaws.com/tornasole-preprod-mxnet-1.4.1-cpu:{}'.format(REGION, TAG)\n",
- "#gpu_docker_image_name= '072677473360.dkr.ecr.{}.amazonaws.com/tornasole-preprod-mxnet-1.4.1-gpu:{}'.format(REGION, TAG)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Configuring the inputs for the training job\n",
- "\n",
-    "Now we'll call the SageMaker MXNet Estimator to kick off a training job with Tornasole functionality enabled.\n",
- "\n",
- "The *entry_point_script* points to the MXNet training script that has the SessionHook integrated.\n",
- "\n",
- "The *hyperparameters* are the parameters that will be passed to the training script."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "entry_point_script = '../scripts/mnist_gluon_realtime_visualize_demo.py'\n",
- "hyperparameters = {'batch-size': 256, 'learning_rate': 0.1, 'epochs': 10}\n",
- "base_job_name = 'mxnet-TS-realtime-analysis'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "sagemaker_simple_estimator = MXNet(role=sagemaker.get_execution_role(),\n",
- " base_job_name=base_job_name,\n",
- " train_instance_count=1,\n",
- " train_instance_type='ml.m4.xlarge',\n",
- " image_name=cpu_docker_image_name,\n",
- " entry_point=entry_point_script,\n",
- " hyperparameters=hyperparameters,\n",
- " framework_version='1.4.1',\n",
- " py_version='py3',\n",
-    "                           # following parameter is necessary to instruct SageMaker \n",
- " # that debugging data generated by Tornasole needs to be \n",
- " # uploaded to S3 bucket in your account. This way we can \n",
- " # access it while training job is running.\n",
- " debug=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# This is a fire and forget event. By setting wait=False, we just submit the job to run in the background.\n",
- "# SageMaker will spin off one training job and release control to next cells in the notebook.\n",
- "# Please follow this notebook to see status of the training job.\n",
- "sagemaker_simple_estimator.fit(wait=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "### Result\n",
- "\n",
-    "As a result of the above command, SageMaker will spin off one training job for you, and it will produce the tensors to be analyzed. This job runs in the background, so you do not have to wait for it to complete before continuing with the rest of the notebook. Because the training job is asynchronous, we need to monitor its status so that we don't request debugging tensors too early. Tensors are produced only during the training phase of a SageMaker training job, so let's wait until that phase begins.\n",
-    "\n",
-    "### Checking on the training job status\n",
-    "\n",
-    "We can check the status of the training job by running the following code. It checks the status of the SageMaker training job every five seconds. Once the job has started its training cycle, control is released to the next cells in the notebook."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
-    "# A helper function to render status updates on a single line\n",
- "import time\n",
- "import sys\n",
- "from time import gmtime, strftime\n",
- "\n",
- "def print_same_line(s):\n",
- " sys.stdout.write('\\r{}: {}'.format(strftime('%X', gmtime()), s))\n",
- " sys.stdout.flush()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Below command will give the status of training job\n",
- "# Note: In the output of below command you will see DebugConfig parameter \n",
- "import time\n",
- "\n",
- "job_name = sagemaker_simple_estimator.latest_training_job.name\n",
- "print('Training job name: ' + job_name)\n",
- "\n",
- "client = sagemaker_simple_estimator.sagemaker_session.sagemaker_client\n",
- "\n",
- "description = client.describe_training_job(TrainingJobName=job_name)\n",
- "\n",
- "if description['TrainingJobStatus'] != 'Completed':\n",
-    "    while description['SecondaryStatus'] not in {'Training', 'Completed', 'Failed', 'Stopped'}:\n",
- " description = client.describe_training_job(TrainingJobName=job_name)\n",
- " primary_status = description['TrainingJobStatus']\n",
- " secondary_status = description['SecondaryStatus']\n",
- " print_same_line('Current job status: [PrimaryStatus: {}, SecondaryStatus: {}]'.format(primary_status, secondary_status))\n",
- " time.sleep(5)\n",
- "\n",
- "# uncomment next line to see full details of training job \n",
- "# client.describe_training_job(TrainingJobName=job_name)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Retrieving and Analyzing tensors\n",
- "\n",
-    "Before getting to the analysis, here are some notes on the Tornasole concepts that help with it.\n",
-    "- ***Trial*** - the centerpiece of the Tornasole API when it comes to accessing tensors. It is a top-level abstraction that represents a single run of a training job. All tensors emitted by a training job are associated with its Trial.\n",
-    "- ***Step*** - the next level of abstraction. In Tornasole, a step represents a single batch of a training job. Each trial has multiple steps. Each tensor is associated with multiple steps and has a particular value at each of them.\n",
-    "- ***Tensor*** - an object that represents an actual tensor saved during a training job. *Note* - it could be a scalar as well.\n",
-    "- ***Mode*** - each DL engine performs forward and backward passes during a training job, and the tensors generated by the model during those passes are saved. However, in addition to training itself, the DL engine also uses forward passes for the validation phase, and tensors generated during those forward passes are saved as well. Tornasole introduces the concept of Training Mode to differentiate between the tensors of each phase.\n",
-    "\n",
-    "For more details on the aforementioned concepts, as well as on the Tornasole API in general (including examples), please refer to the [Rules API](../../docs/rules/readme.md).\n",
-    "\n",
-    "Below, you can find several methods to help with retrieving and plotting tensors. In *get_data* we use the concepts described above to retrieve data. We expect steps_range to contain one or more steps (batches) for which we want to get tensors. Please note that we retrieve only tensors saved by the main training loop and exclude tensors saved during validation (`mode=modes.TRAIN` takes care of that). The two other methods are helpers for plotting tensors."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "import matplotlib.pyplot as plt\n",
- "\n",
- "def get_data(trial, tname, batch_index, steps_range):\n",
- " tensor = trial.tensor(tname)\n",
- " vals = []\n",
- " for s in steps_range:\n",
- " val = tensor.value(step_num=s, mode=modes.TRAIN)[batch_index][0]\n",
- " vals.append(val)\n",
- " return vals\n",
- "\n",
- "def create_plots(steps_range):\n",
- " fig, axs = plt.subplots(nrows=1, ncols=len(steps_range), constrained_layout=True, figsize=(2*len(steps_range), 2),\n",
- " subplot_kw={'xticks': [], 'yticks': []})\n",
- " return fig, axs\n",
- "\n",
- "def plot_tensors(trial, layer, batch_index, steps_range):\n",
- " if len(steps_range) > 0: \n",
- " fig, axs = create_plots(steps_range)\n",
- " vals = get_data(trial, layer, batch_index, steps_range)\n",
- "\n",
- " for ax, image, step in zip(axs.flat if isinstance(axs, np.ndarray) else np.array([axs]), vals, steps_range):\n",
- " ax.imshow(image, cmap='gray')\n",
- " ax.set_title(str(step))\n",
- " plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "Now that we have methods to get data and plot it, let's get to it. The goal of the next block is to instantiate a ***Trial***, the central access point for all Tornasole API calls that get tensors. We do that by inspecting the currently running training job and extracting the necessary parameters from its debug config to tell Tornasole where the data we are looking for is located. A couple of notes:\n",
-    "- Tensors are stored in your own S3 bucket, which you can navigate to and inspect manually if desired.\n",
-    "- You might notice a slight delay before the trial object is created (last line in the cell). This is normal, as Tornasole monitors the corresponding bucket and waits until tensors appear in it. The delay is caused by the less-than-instantaneous upload of tensors from the training container to your S3 bucket. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "from urllib.parse import urlparse\n",
- "import smdebug.trials\n",
- "from smdebug.trials import S3Trial\n",
- "import logging\n",
- "\n",
- "description = client.describe_training_job(TrainingJobName=job_name)\n",
- "s3_output_path = description[\"DebugConfig\"][\"DebugHookConfig\"][\"S3OutputPath\"]\n",
- "parse_result = urlparse(s3_output_path)\n",
- "bucket_name = parse_result.netloc\n",
- "prefix_name = parse_result.path.strip('/')\n",
- "\n",
- "logging.getLogger(\"tornasole\").setLevel(logging.INFO)\n",
- "\n",
- "# this is where we create a Trial object that allows access to saved tensors\n",
- "trial = S3Trial(base_job_name, bucket_name, prefix_name)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# feel free to inspect all tensors logged by uncommenting below line\n",
- "# trial.tensor_names()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Visualize tensors of a running training job\n",
-    "Now to the final part of our example. Below we wait until Tornasole has downloaded an initial chunk of tensors for us to look at. Once that first chunk is ready, we keep fetching new chunks every 5 seconds and plot their tensors one under another."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Below we select the very first tensor from every batch.\n",
- "# Feel free to modify this and select another tensor from the batch.\n",
- "batch_index = 0\n",
- "\n",
- "# This is a name of a tensor to retrieve data of.\n",
- "# Variable is called `layer` as this tensor happens to be output of first convolutional layer.\n",
- "layer = 'conv0_output0'\n",
- "\n",
-    "steps = []\n",
-    "while len(steps) == 0:\n",
-    "    # trial.steps returns all steps that have been downloaded by Tornasole to date.\n",
-    "    # It doesn't represent all steps that will be available once the training job is complete -\n",
-    "    # it is a snapshot of the current state of the system. If you call it after the training job is done\n",
-    "    # you will get all steps available.\n",
- " steps = trial.steps(mode=modes.TRAIN)\n",
- " print_same_line('Waiting for tensors to become available...')\n",
- " time.sleep(3)\n",
- "print('\\nDone')\n",
- "\n",
- "print('Getting tensors and plotting...')\n",
- "rendered_steps = []\n",
- "# trial.training_ended is a way to keep monitoring for a state of a training job as seen by smdebug.\n",
- "# When SageMaker completes training job, trial becomes aware of it.\n",
- "while not trial.training_ended():\n",
- " steps = trial.steps(mode=modes.TRAIN)\n",
- " # quick way to get diff between two lists\n",
- " steps_to_render = list(set(steps).symmetric_difference(set(rendered_steps)))\n",
- " # plot only tensors from newer chunk\n",
- " plot_tensors(trial, layer, batch_index, steps_to_render)\n",
- " rendered_steps.extend(steps_to_render)\n",
- " time.sleep(5)\n",
- "print('\\nDone')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Additional visualizations\n",
- "\n",
-    "Now that we have finished plotting tensors during the training job run, let's plot some more tensors. This time we get all of them at once, as the training job has finished and Tornasole is aware of all tensors emitted by it. Let's visualize the tensors representing the weights of the first convolutional layer (i.e. its kernels). By inspecting each row of plotted tensors from left to right, you can see how each kernel progressively \"learned\" its values."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Let's visualize weights of the first convolutional layer.\n",
- "layer = 'conv0_weight'\n",
- "\n",
- "for i in range(0, trial.tensor(layer).value(step_num=trial.tensor(layer).steps(mode=modes.TRAIN)[0], mode=modes.TRAIN).shape[0]):\n",
- " plot_tensors(trial, layer, i, trial.tensor(layer).steps(mode=modes.TRAIN))"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "conda_mxnet_p36",
- "language": "python",
- "name": "conda_mxnet_p36"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.5"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/examples/mxnet/sagemaker-notebooks/mxnet.ipynb b/examples/mxnet/sagemaker-notebooks/mxnet.ipynb
deleted file mode 100644
index b4ce8c674..000000000
--- a/examples/mxnet/sagemaker-notebooks/mxnet.ipynb
+++ /dev/null
@@ -1,711 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Debugging SageMaker Training Jobs with Tornasole"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Overview"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Tornasole is a new capability of Amazon SageMaker that allows debugging machine learning training. \n",
- "It lets you go beyond just looking at scalars like losses and accuracies during training and gives \n",
-    "you full visibility into all tensors 'flowing through the graph' during training. Tornasole helps you monitor your training in near real time using rules and alerts you once it detects an inconsistency in the training flow.\n",
- "\n",
- "Using Tornasole is a two step process: Saving tensors and Analysis. Let's look at each one of them closely.\n",
- "\n",
- "### Saving tensors\n",
- "\n",
-    "Tensors define the state of the training job at any particular instant in its lifecycle. Tornasole exposes a library that allows you to capture these tensors and save them for analysis.\n",
- "\n",
- "### Analysis\n",
- "\n",
- "Analysis of the tensors emitted is captured by the Tornasole concept called ***Rules***. On a very broad level, \n",
-    "a rule is Python code used to detect certain conditions during training. Some of the conditions that a data scientist training a deep learning model may care about are gradients getting too large or too small, overfitting, and so on.\n",
-    "Tornasole will come pre-packaged with certain rules. Users can write their own rules using the Tornasole APIs.\n",
-    "You can also analyze raw tensor data outside of the Rules construct in, say, a SageMaker notebook, using Tornasole's full set of APIs. \n",
-    "Please refer to [DeveloperGuide_Rules.md](../../../../rules/DeveloperGuide_Rules.md) for more details about analysis.\n",
- "\n",
- "This example guides you through installation of the required components for emitting tensors in a \n",
- "SageMaker training job and applying a rule over the tensors to monitor the live status of the job."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Setup\n",
- "\n",
-    "As a first step, we'll install the required tools, which allow emission of tensors (saving tensors) and application of rules to analyze them."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "download: s3://tornasole-external-preview-use1/sdk/sagemaker-1.35.2.dev0.tar.gz to ./sagemaker-1.35.2.dev0.tar.gz\n",
- "Processing ./sagemaker-1.35.2.dev0.tar.gz\n",
- "Requirement already satisfied: boto3>=1.9.169 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from sagemaker==1.35.2.dev0) (1.9.213)\n",
- "Requirement already satisfied: numpy>=1.9.0 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from sagemaker==1.35.2.dev0) (1.14.5)\n",
- "Requirement already satisfied: protobuf>=3.1 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from sagemaker==1.35.2.dev0) (3.5.2)\n",
- "Requirement already satisfied: scipy>=0.19.0 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from sagemaker==1.35.2.dev0) (1.1.0)\n",
- "Requirement already satisfied: urllib3<1.25,>=1.21 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from sagemaker==1.35.2.dev0) (1.23)\n",
- "Requirement already satisfied: protobuf3-to-dict>=0.1.5 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from sagemaker==1.35.2.dev0) (0.1.5)\n",
- "Requirement already satisfied: docker-compose>=1.23.0 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from sagemaker==1.35.2.dev0) (1.24.1)\n",
- "Requirement already satisfied: requests<2.21,>=2.20.0 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from sagemaker==1.35.2.dev0) (2.20.0)\n",
- "Requirement already satisfied: enum34>=1.1.6 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from sagemaker==1.35.2.dev0) (1.1.6)\n",
- "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from boto3>=1.9.169->sagemaker==1.35.2.dev0) (0.9.4)\n",
- "Requirement already satisfied: s3transfer<0.3.0,>=0.2.0 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from boto3>=1.9.169->sagemaker==1.35.2.dev0) (0.2.1)\n",
- "Requirement already satisfied: botocore<1.13.0,>=1.12.213 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from boto3>=1.9.169->sagemaker==1.35.2.dev0) (1.12.213)\n",
- "Requirement already satisfied: six>=1.9 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from protobuf>=3.1->sagemaker==1.35.2.dev0) (1.11.0)\n",
- "Requirement already satisfied: setuptools in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from protobuf>=3.1->sagemaker==1.35.2.dev0) (39.1.0)\n",
- "Requirement already satisfied: docker[ssh]<4.0,>=3.7.0 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from docker-compose>=1.23.0->sagemaker==1.35.2.dev0) (3.7.3)\n",
- "Requirement already satisfied: backports.ssl-match-hostname>=3.5; python_version < \"3.5\" in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from docker-compose>=1.23.0->sagemaker==1.35.2.dev0) (3.5.0.1)\n",
- "Requirement already satisfied: PyYAML<4.3,>=3.10 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from docker-compose>=1.23.0->sagemaker==1.35.2.dev0) (3.12)\n",
- "Requirement already satisfied: texttable<0.10,>=0.9.0 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from docker-compose>=1.23.0->sagemaker==1.35.2.dev0) (0.9.1)\n",
- "Requirement already satisfied: dockerpty<0.5,>=0.4.1 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from docker-compose>=1.23.0->sagemaker==1.35.2.dev0) (0.4.1)\n",
- "Requirement already satisfied: ipaddress>=1.0.16; python_version < \"3.3\" in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from docker-compose>=1.23.0->sagemaker==1.35.2.dev0) (1.0.22)\n",
- "Requirement already satisfied: websocket-client<1.0,>=0.32.0 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from docker-compose>=1.23.0->sagemaker==1.35.2.dev0) (0.56.0)\n",
- "Requirement already satisfied: docopt<0.7,>=0.6.1 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from docker-compose>=1.23.0->sagemaker==1.35.2.dev0) (0.6.2)\n",
- "Requirement already satisfied: jsonschema<3,>=2.5.1 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from docker-compose>=1.23.0->sagemaker==1.35.2.dev0) (2.6.0)\n",
- "Requirement already satisfied: cached-property<2,>=1.2.0 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from docker-compose>=1.23.0->sagemaker==1.35.2.dev0) (1.5.1)\n",
- "Requirement already satisfied: idna<2.8,>=2.5 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from requests<2.21,>=2.20.0->sagemaker==1.35.2.dev0) (2.6)\n",
- "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from requests<2.21,>=2.20.0->sagemaker==1.35.2.dev0) (3.0.4)\n",
- "Requirement already satisfied: certifi>=2017.4.17 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from requests<2.21,>=2.20.0->sagemaker==1.35.2.dev0) (2019.6.16)\n",
- "Requirement already satisfied: futures<4.0.0,>=2.2.0; python_version == \"2.6\" or python_version == \"2.7\" in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from s3transfer<0.3.0,>=0.2.0->boto3>=1.9.169->sagemaker==1.35.2.dev0) (3.2.0)\n",
- "Requirement already satisfied: docutils<0.16,>=0.10 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from botocore<1.13.0,>=1.12.213->boto3>=1.9.169->sagemaker==1.35.2.dev0) (0.14)\n",
- "Requirement already satisfied: python-dateutil<3.0.0,>=2.1; python_version >= \"2.7\" in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from botocore<1.13.0,>=1.12.213->boto3>=1.9.169->sagemaker==1.35.2.dev0) (2.7.3)\n",
- "Requirement already satisfied: docker-pycreds>=0.4.0 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from docker[ssh]<4.0,>=3.7.0->docker-compose>=1.23.0->sagemaker==1.35.2.dev0) (0.4.0)\n",
- "Requirement already satisfied: paramiko>=2.4.2; extra == \"ssh\" in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from docker[ssh]<4.0,>=3.7.0->docker-compose>=1.23.0->sagemaker==1.35.2.dev0) (2.6.0)\n",
- "Requirement already satisfied: functools32 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from jsonschema<3,>=2.5.1->docker-compose>=1.23.0->sagemaker==1.35.2.dev0) (3.2.3.post2)\n",
- "Requirement already satisfied: pynacl>=1.0.1 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from paramiko>=2.4.2; extra == \"ssh\"->docker[ssh]<4.0,>=3.7.0->docker-compose>=1.23.0->sagemaker==1.35.2.dev0) (1.3.0)\n",
- "Requirement already satisfied: cryptography>=2.5 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from paramiko>=2.4.2; extra == \"ssh\"->docker[ssh]<4.0,>=3.7.0->docker-compose>=1.23.0->sagemaker==1.35.2.dev0) (2.5)\n",
- "Requirement already satisfied: bcrypt>=3.1.3 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from paramiko>=2.4.2; extra == \"ssh\"->docker[ssh]<4.0,>=3.7.0->docker-compose>=1.23.0->sagemaker==1.35.2.dev0) (3.1.7)\n",
- "Requirement already satisfied: cffi>=1.4.1 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from pynacl>=1.0.1->paramiko>=2.4.2; extra == \"ssh\"->docker[ssh]<4.0,>=3.7.0->docker-compose>=1.23.0->sagemaker==1.35.2.dev0) (1.11.5)\n",
- "Requirement already satisfied: asn1crypto>=0.21.0 in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from cryptography>=2.5->paramiko>=2.4.2; extra == \"ssh\"->docker[ssh]<4.0,>=3.7.0->docker-compose>=1.23.0->sagemaker==1.35.2.dev0) (0.24.0)\n",
- "Requirement already satisfied: pycparser in /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages (from cffi>=1.4.1->pynacl>=1.0.1->paramiko>=2.4.2; extra == \"ssh\"->docker[ssh]<4.0,>=3.7.0->docker-compose>=1.23.0->sagemaker==1.35.2.dev0) (2.18)\n",
- "Building wheels for collected packages: sagemaker\n",
- " Running setup.py bdist_wheel for sagemaker ... \u001b[?25ldone\n",
- "\u001b[?25h Stored in directory: /home/ec2-user/.cache/pip/wheels/93/13/04/5fee9d22051b05a9789db3ce41ac6f7c50de67b47419007761\n",
- "Successfully built sagemaker\n",
- "\u001b[31mtyping-extensions 3.7.4 has requirement typing>=3.7.4, but you'll have typing 3.6.4 which is incompatible.\u001b[0m\n",
- "Installing collected packages: sagemaker\n",
- " Found existing installation: sagemaker 1.35.2.dev0\n",
- " Uninstalling sagemaker-1.35.2.dev0:\n",
- " Successfully uninstalled sagemaker-1.35.2.dev0\n",
- "Successfully installed sagemaker-1.35.2.dev0\n",
- "\u001b[33mYou are using pip version 10.0.1, however version 19.2.3 is available.\n",
- "You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n",
- "download: s3://tornasole-external-preview-use1/sdk/sagemaker-smdebug.json to ./sagemaker-smdebug.json\n",
- "\n",
- "No JSON object could be decoded\n"
- ]
- }
- ],
- "source": [
- "!aws s3 cp s3://tornasole-external-preview-use1/sdk/sagemaker-1.35.2.dev0.tar.gz .\n",
- "!pip install sagemaker-1.35.2.dev0.tar.gz\n",
- "!aws s3 cp s3://tornasole-external-preview-use1/sdk/sagemaker-smdebug.json .\n",
- "!aws configure add-model --service-model sagemaker-smdebug.json --service-name sagemaker"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Training MXNet models in SageMaker with Tornasole\n",
- "\n",
-    "We'll be training an MXNet Gluon model on the FashionMNIST dataset in this notebook with Tornasole enabled and monitor the training jobs with Tornasole's Rules. This will be done using the SageMaker MXNet 1.4.1 container in Script Mode. Note that Tornasole currently works only with Python 3, so be sure to set `py_version='py3'` when creating the SageMaker Estimator.\n",
-    "\n",
-    "\n",
-    "Let us first train a simple example training script, mnist_gluon_vg_demo.py, with Tornasole enabled in SageMaker using the SageMaker Estimator API, along with a VanishingGradient Rule to monitor the training job in real time. A Tornasole Rule is essentially Python code that analyzes tensors saved by Tornasole and validates some condition. The VanishingGradient rule is a first-party (1P) rule provided by smdebug. During training, Tornasole will capture tensors as specified in its configuration, and the VanishingGradient Rule job will monitor whether any gradient tensor has reached 0. The rule will emit a CloudWatch event if it finds a vanishing gradient tensor during training.\n",
- "\n",
- "## Enable Tornasole in the training script\n",
- "\n",
-    "Integrating Tornasole into the training job can be accomplished by following the steps below.\n",
-    "\n",
-    "### Import the hook package\n",
-    "Import the SessionHook class along with other helper classes in your training script as shown below.\n",
- "\n",
- "```\n",
- "from smdebug.mxnet.hook import SessionHook\n",
- "from smdebug.mxnet import SaveConfig, Collection\n",
- "```\n",
- "\n",
- "### Instantiate and initialize hook\n",
- "\n",
- "```\n",
- " # Create SaveConfig that instructs engine to log graph tensors every 10 steps.\n",
- " save_config = SaveConfig(save_interval=10)\n",
- " # Create a hook that logs tensors of weights, biases and gradients while training the model.\n",
- " hook = SessionHook(save_config=save_config)\n",
- "```\n",
- "\n",
-    "### Register the Tornasole hook with the model before training starts\n",
- "\n",
- "### NOTE: The hook can only be registered to Gluon Non-hybrid models.\n",
- "\n",
- "After creating or loading the desired model, users can register the hook with the model as shown below.\n",
- "\n",
- "```\n",
- "net = create_gluon_model()\n",
-    "# Apply the hook to the model (i.e. instruct the engine to recognize the hook configuration\n",
-    "# and enable the mode in which the engine logs graph tensors).\n",
- "hook.register_hook(net)\n",
- "```\n",
- "\n",
- "#### Set the mode\n",
- "Tornasole has the concept of modes (TRAIN, EVAL, PREDICT) to separate out different modes of the jobs.\n",
- "Set the mode you are running in your job. Every time the mode changes in your job, please set the current mode. This helps you group steps by mode, for easier analysis. Setting the mode is optional but recommended. If you do not specify this, Tornasole saves all steps under a `GLOBAL` mode. \n",
- "```\n",
- "hook.set_mode(smd.modes.TRAIN)\n",
- "```\n",
- "\n",
-    "Refer to [DeveloperGuide_MXNet.md](../../DeveloperGuide_MXNet.md) for more details on the APIs Tornasole provides to help you save tensors.\n",
- "\n",
- "\n",
- "### Docker Images with Tornasole\n",
- "\n",
-    "We have built SageMaker MXNet containers with smdebug. You can pull them from ECR for use in SageMaker. The image names are given below. Please use the image from the region in which you want your jobs to run.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "import sagemaker\n",
- "import boto3\n",
- "from sagemaker.mxnet import MXNet\n",
- "\n",
-    "# Use the region in which this notebook is running\n",
- "REGION = boto3.Session().region_name\n",
- "TAG='latest'\n",
- "\n",
- "cpu_docker_image_name= '072677473360.dkr.ecr.{}.amazonaws.com/tornasole-preprod-mxnet-1.4.1-cpu:{}'.format(REGION, TAG)\n",
- "gpu_docker_image_name= '072677473360.dkr.ecr.{}.amazonaws.com/tornasole-preprod-mxnet-1.4.1-gpu:{}'.format(REGION, TAG)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Configuring the inputs for the training job\n",
- "\n",
-    "Now we'll call the SageMaker MXNet Estimator to kick off a training job along with the VanishingGradient 1P rule to monitor the job.\n",
- "\n",
- "The 'entry_point_script' points to the MXNet training script that has the SessionHook integrated.\n",
- "\n",
- "The 'hyperparameters' are the parameters that will be passed to the training script.\n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [],
- "source": [
- "entry_point_script = '../scripts/mnist_gluon_vg_demo.py'\n",
- "bad_hyperparameters = {'random_seed' : True, 'num_steps': 33, 'save_frequency' : 30}"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [],
- "source": [
- "sagemaker_simple_estimator = MXNet(role=sagemaker.get_execution_role(),\n",
- " base_job_name='mxnet-tornasole-simple-demo',\n",
- " train_instance_count=1,\n",
- " train_instance_type='ml.m4.xlarge',\n",
- " image_name=cpu_docker_image_name,\n",
- " entry_point=entry_point_script,\n",
- " hyperparameters=bad_hyperparameters,\n",
- " framework_version='1.4.1',\n",
- " debug=True,\n",
- " py_version='py3',\n",
- " # These are Tornasole specific parameters, \n",
- " # debug= True means rule specified in rules_specification \n",
- " # will run as rule job. \n",
- " # Below, we specify to run the first party rule VanishingGradient\n",
- " # on a ml.c5.4xlarge instance\n",
- " rules_specification=[\n",
- " {\n",
- " \"RuleName\": \"VanishingGradient\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " \"VolumeSizeInGB\": 10,\n",
- " \"RuntimeConfigurations\": {\n",
- " \"end-step\": \"33\"\n",
- " }\n",
- " }\n",
- " ])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [],
- "source": [
- "sagemaker_simple_estimator.fit(wait=False)\n",
- "# This is a fire and forget event. By setting wait=False, we just submit the job to run in the background.\n",
- "# In the background SageMaker will spin off 1 training job and 1 rule job for you.\n",
- "# Please follow this notebook to see status of the training job and the rule job"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "### Result\n",
- "\n",
-    "As a result of the above command, SageMaker will spin off 1 training job and 1 rule job for you. The first produces the tensors to be analyzed, and the second analyzes those tensors to check whether any of them demonstrate the vanishing gradient issue during training. You will see that the VanishingGradient rule is triggered. Please note that a triggered rule is represented as a custom Tornasole Python exception: whenever the rule condition evaluates to True, a custom Python exception is thrown to signify that.\n",
- "\n",
- "### Describing the training job\n",
- "\n",
- "We can check the status of the training job by running the following command:\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Below command will give the status of training job\n",
- "# Note: In the output of below command you will see DebugConfig parameter \n",
- "\n",
- "job_name = sagemaker_simple_estimator.latest_training_job.name\n",
- "\n",
- "client = sagemaker_simple_estimator.sagemaker_session.sagemaker_client\n",
- "\n",
- "description = client.describe_training_job(TrainingJobName=job_name)\n",
- "\n",
- "# uncomment next line to see full details of training job \n",
- "# description\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "u'InProgress'"
- ]
- },
- "execution_count": 7,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# The status of the training job can be seen below\n",
- "description['TrainingJobStatus']"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "Once your training job has started, SageMaker will spin up a rule execution job to run the VanishingGradient rule.\n",
- "\n",
- "### Tornasole specific parameters in the description\n",
- "\n",
-    "The DebugConfig parameter has details about the Tornasole-related configuration. The key parameters to look for below are:\n",
-    "\n",
-    "*S3OutputPath* : This is the path where the output tensors from Tornasole are saved.\n",
-    "\n",
-    "*RuleConfig* : This parameter holds the rule configuration that was passed when creating the training job. Here you should be able to see the details of the rule that ran for the training job.\n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "{u'DebugHookConfig': {u'DebugHookSpecificationList': [],\n",
- " u'LocalPath': u'/opt/ml/output/tensors',\n",
- " u'S3OutputPath': u's3://sagemaker-ca-central-1-072677473360/tensors-mxnet-tornasole-simple-demo-2019-08-30-04-24-08-626'},\n",
- " u'RuleConfig': {u'RuleSpecificationList': [{u'InstanceType': u'ml.c5.4xlarge',\n",
- " u'RuleEvaluatorImage': u'453379255795.dkr.ecr.ca-central-1.amazonaws.com/script-rule-executor:latest',\n",
- " u'RuleName': u'VanishingGradient',\n",
- " u'RuntimeConfigurations': {u'end-step': u'33'},\n",
- " u'VolumeSizeInGB': 10}]}}"
- ]
- },
- "execution_count": 8,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "description['DebugConfig']"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "### Check the status of the Rule Execution Job\n",
- "\n",
-    "To get the rule execution job that SageMaker started for you, run the command below. It shows you the `RuleName`, `RuleStatus`, `FailureReason` if any, and `RuleExecutionJobArn`. If the tensors meet a rule evaluation condition, the rule execution job throws a client error with `FailureReason: RuleEvaluationConditionMet`. These details are also available as part of the response description above under `description['RuleMonitoringStatuses']`.\n",
- "\n",
- "The logs of the training job are available in the `Cloudwatch Logstream` /aws/sagemaker/TrainingJobs with `RuleExecutionJobArn`.\n",
- "\n",
-    "You will see that once the rule execution job starts, it identifies the vanishing gradient situation in the training job, raises the `RuleEvaluationConditionMet` exception, and ends the job.\n",
- "\n",
- "**Note: The next cell blocks till the rule execution job ends. Once it says RuleStatus is Started, and shows the RuleExecutionJobArn, you can look at the status of the rule being monitored. At that point, we can also look at the logs as shown in the next cell**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Wait to get status for Rule Execution Jobs...\n",
- "=============================================\n",
- "RuleName: VanishingGradient\n",
- "RuleStatus: NotStarted\n",
- "=============================================\n",
- "Wait to get status for Rule Execution Jobs...\n",
- "=============================================\n",
- "RuleName: VanishingGradient\n",
- "RuleStatus: NotStarted\n",
- "=============================================\n"
- ]
- },
- {
- "ename": "KeyboardInterrupt",
- "evalue": "",
- "output_type": "error",
- "traceback": [
- "\u001b[0;31m\u001b[0m",
- "\u001b[0;31mKeyboardInterrupt\u001b[0mTraceback (most recent call last)",
- "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mstatuses\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msagemaker_simple_estimator\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdescribe_rule_execution_jobs\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
- "\u001b[0;32m/home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages/sagemaker/estimator.pyc\u001b[0m in \u001b[0;36mdescribe_rule_execution_jobs\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 524\u001b[0m \u001b[0mruleMonitoringStatuses\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mjob_details\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"RuleMonitoringStatuses\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 525\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mall_rules_completed\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 526\u001b[0;31m \u001b[0mtime\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msleep\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m60\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 527\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 528\u001b[0m \u001b[0;32mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"RuleMonitoringStatuses not found in this training job\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;31mKeyboardInterrupt\u001b[0m: "
- ]
- }
- ],
- "source": [
- "statuses = sagemaker_simple_estimator.describe_rule_execution_jobs()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Check the logs of the Rule Execution Job\n",
- "\n",
- "If you want to access the logs of a particular rule job name, you can do the following.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "rule_job_name = statuses[0].get('RuleExecutionJobName')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now we can attach to this job to see its logs"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from sagemaker.estimator import Estimator\n",
- "vanishing_gradient_tensor = Estimator.attach(rule_job_name)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Receive a CloudWatch Event for Rules\n",
-    "When the status of a training job or rule execution job changes (e.g. starting, failed), TrainingJobStatus [CloudWatch events](https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html) are emitted. For more details, see [below](#CloudWatch-Event-Integration-for-Rules). "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Making this a good run\n",
- "\n",
-    "In the above example, we saw how a VanishingGradient rule was run, which analyzed the tensors while training was running and produced an alert in the form of a CloudWatch event.\n",
-    "\n",
-    "You can create the estimator with the following *entry_point_script* and *good_hyperparameters*, and start a new training job. You will see that the VanishingGradient rule is not fired in that case, as no tensors demonstrate the vanishing gradient issue.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "entry_point_script = '../scripts/mnist_gluon_basic_hook_demo.py'\n",
- "good_hyperparameters = {'random_seed' : True, 'num_steps': 6}"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "simple_estimator = MXNet(role=sagemaker.get_execution_role(),\n",
- " base_job_name='mxnet-trsl-test-nb',\n",
- " train_instance_count=1,\n",
- " train_instance_type='ml.m4.xlarge',\n",
- " image_name=cpu_docker_image_name,\n",
- " entry_point=entry_point_script,\n",
- " hyperparameters=good_hyperparameters,\n",
- " framework_version='1.4.1',\n",
- " debug=True,\n",
- " py_version='py3',\n",
- " rules_specification=[\n",
- " {\n",
- " \"RuleName\": \"VanishingGradient\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " \"VolumeSizeInGB\": 10,\n",
- " \"RuntimeConfigurations\": {\n",
- " \"start-step\" : \"1\",\n",
- " \"end-step\": \"5\"\n",
- " }\n",
- " }\n",
- " ])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "simple_estimator.fit(wait=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "simple_estimator.describe_rule_execution_jobs()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Enabling Tornasole with SageMaker\n",
- "#### Storage\n",
-    "The tensors saved by Tornasole are, by default, stored in the S3 output path of the training job, under the folder **`/tensors-`**. This ensures that the tensors from one training job are not accidentally overwritten by those from another. Rule evaluation requires the tensor paths to be separate in order to work correctly.\n",
- "\n",
- "If you don't provide an S3 output path to the estimator, SageMaker creates one for you as: **`s3://sagemaker--/`**\n",
- "\n",
- "This path is used to create a Tornasole Trial taken by Rules (see below).\n",
- "\n",
- "#### New Parameters \n",
-    "The new parameters in the SageMaker Estimator to look out for are:\n",
- "\n",
- "- `debug` :(bool)\n",
- "This indicates that debugging should be enabled for the training job. \n",
-    "Setting this to `True` makes Tornasole available for use with the job.\n",
- "\n",
- "- `rules_specification`: (list[*dict*])\n",
- "You can specify any number of rules to monitor your SageMaker training job. This parameter takes a list of python dictionaries, one for each rule you want to enable. Each `dict` is of the following form:\n",
- "```\n",
- "{\n",
- " \"RuleName\": \n",
- " # The name of the class implementing the Tornasole Rule interface. (required)\n",
- "\n",
- " \"SourceS3Uri\": \n",
- " # S3 URI of the rule script containing the class in 'RuleName'. \n",
- " # This is not required if you want to use one of the\n",
- " # First Party rules provided to you by Amazon. \n",
- " # In such a case you can leave it empty or not pass it. If you want to run a custom rule \n",
- " # defined by you, you will need to define the custom rule class in a python \n",
- " # file and provide it to SageMaker as a S3 URI. \n",
- " # SageMaker will fetch this file and try to look for the rule class \n",
- " # identified by RuleName in this file.\n",
- " \n",
- " \"InstanceType\": \n",
- " # The ML instance type which should be used to run the rule evaluation job\n",
- " \n",
- " \"VolumeSizeInGB\": \n",
- " # The volume size to store the runtime artifacts from the rule evaluation \n",
- " \n",
- " \"RuntimeConfigurations\": {\n",
- " # Map defining the parameters required to instantiate the Rule class and\n",
-    "        # parameters regarding invocation of the rule (start-step and end-step)\n",
- " # This can be any parameter taken by the rule. \n",
- " # Every value here needs to be a string. \n",
- " # So when you write custom rules, ensure that you can parse each argument from a string.\n",
- " #\n",
- " # PARAMS CAN BE\n",
- " #\n",
- " # STANDARD PARAMS FOR RULE EXECUTION\n",
- " # \"start-step\": \n",
- " # \"end-step\": \n",
- " # \"other-trials-paths\": (';' separated list of s3 paths as a string)\n",
- " # \"logging-level\": (can be one of \"CRITICAL\", \"FATAL\", \"ERROR\", \n",
- " # \"WARNING\", \"WARN\", \"DEBUG\", \"NOTSET\")\n",
- " #\n",
- " # ANY OTHER PARAMETER TAKEN BY THE RULE\n",
- " # \"parameter\" : \n",
- " # : \n",
- " }\n",
- "}\n",
- "```\n",
- "\n",
- "### Inputs\n",
- "Just a quick reminder if you are not familiar with script mode in SageMaker. You can pass command line arguments taken by your training script with a hyperparameter dictionary which gets passed to the SageMaker Estimator class. You can see this in the examples below.\n",
- "\n",
- "\n",
- "### Rules\n",
-    "Rules are the medium by which Tornasole executes a certain piece of code regularly on different steps of the job.\n",
-    "They can be used to assert certain conditions during training and raise CloudWatch Events based on them, which you can\n",
-    "process in any way you like. \n",
- "\n",
- "Tornasole comes with a set of **First Party rules** (1P rules).\n",
- "You can also write your own rules looking at these 1P rules for inspiration. \n",
-    "Refer to [DeveloperGuide_Rules.md](../../../../rules/DeveloperGuide_Rules.md) for more on the APIs you can use to write your own rules as well as descriptions of the 1P rules that we provide. \n",
-    " \n",
-    "Here we will talk about how to use SageMaker to evaluate these rules on the training jobs.\n",
- "\n",
- "\n",
- "##### 1P Rule \n",
-    "To use a 1P rule, specify the RuleName field with the 1P RuleName, and the rule will be applied automatically. You can pass any parameters accepted by the rule as part of the RuntimeConfigurations dictionary. A rule's constructor takes a trial as a parameter. \n",
- "A Trial in Tornasole's context refers to a training job. It is identified by the path where the saved tensors for the job are stored. \n",
- "A rule takes a `base_trial` which refers to the job whose run invokes the rule execution. \n",
- "\n",
- "**Note:** A rule can be written to compare & analyze tensors across training jobs. A rule which needs to compare tensors across trials can be run by passing the argument `other_trials`. The argument `base_trial` will automatically be set by SageMaker when executing the rule. The parameter `other_trials` (if taken by the rule) can be passed by passing `other-trials-paths` in the RuntimeConfigurations dictionary. The value for this argument should be `;` separated list of S3 output paths where the tensors for those trials are stored.\n",
- "\n",
-    "Here's an example of a complex configuration for the SimilarAcrossRuns rule (which accepts one other trial and a regex pattern) where we ask for the rule to be invoked for the steps between 10 and 100.\n",
- "\n",
- "``` \n",
- "rules_specification = [ \n",
- " {\n",
- " \"RuleName\": \"SimilarAcrossRuns\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " \"VolumeSizeInGB\": 10,\n",
- " \"RuntimeConfigurations\": {\n",
-    "            \"other-trials-paths\": \"s3://sagemaker--/past-job\",\n",
- " \"include_regex\": \".*\",\n",
- " \"start-step\": \"10\",\n",
- " \"end-step\": \"100\"\n",
- " }\n",
- " }\n",
- "]\n",
- "```\n",
-    "The list of 1P rules and details about them can be found in the *First party rules* section of [DeveloperGuide_Rules.md](../../../../rules/DeveloperGuide_Rules.md). \n",
- "\n",
- "\n",
- "##### Custom rule\n",
-    "In this case you need to define a custom rule class which inherits from the `smdebug.rules.Rule` class.\n",
-    "You need to provide SageMaker with the S3 location of the file which defines your custom rule classes as the value for the field `SourceS3Uri`. Again, you can pass any arguments taken by this rule through the RuntimeConfigurations dictionary. Note that custom rules can only have arguments which expect a string as the value, except for the two arguments specifying trials to the Rule. Refer to the section *Writing a rule* in [DeveloperGuide_Rules.md](../../../../rules/DeveloperGuide_Rules.md) for more details.\n",
- "\n",
- "Here's an example:\n",
- "```\n",
- "rules_specification = [\n",
- " {\n",
- " \"RuleName\": \"CustomRule\",\n",
- " \"SourceS3Uri\": \"s3://weiyou-tornasole-test/rule-script/custom_rule.py\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " \"VolumeSizeInGB\": 10,\n",
- " \"RuntimeConfigurations\": {\n",
- " \"threshold\" : \"0.5\"\n",
- " }\n",
- " }\n",
- "]\n",
- "```\n",
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### CloudWatch Event Integration for Rules\n",
-    "When the status of a training job or rule execution job changes (e.g. starting, failed), TrainingJobStatus [CloudWatch events](https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html) are emitted.\n",
- "\n",
- "After GA, you can configure a CloudWatch event rule to receive and process these events by setting up a target (Lambda function, SNS) as follows:\n",
- "\n",
- "- The SageMaker TrainingJobStatus CW event (https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#sagemaker_event_types) will include rule job statuses associated with the training job\n",
- "- A CW event will be emitted when a RuleStatus changes\n",
-    "- The customer can create a CloudWatch event rule that monitors the training job they started\n",
-    "- The customer can set a target (Lambda function, SQS) for the CloudWatch event rule that processes the event and triggers an alarm for the customer based on the RuleStatus. \n",
-    "\n",
-    "Refer to [this page](https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html) for more details. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "conda_amazonei_mxnet_p27",
- "language": "python",
- "name": "conda_amazonei_mxnet_p27"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 2
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython2",
- "version": "2.7.15"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/examples/pytorch/README.md b/examples/pytorch/README.md
new file mode 100644
index 000000000..9818d371a
--- /dev/null
+++ b/examples/pytorch/README.md
@@ -0,0 +1,2 @@
+## Example Notebooks
+Please refer to the example notebooks in the [Amazon SageMaker Examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger).
diff --git a/examples/pytorch/notebooks/PyTorch-SimpleInteractiveAnalysis.ipynb b/examples/pytorch/notebooks/PyTorch-SimpleInteractiveAnalysis.ipynb
deleted file mode 100644
index 520295125..000000000
--- a/examples/pytorch/notebooks/PyTorch-SimpleInteractiveAnalysis.ipynb
+++ /dev/null
@@ -1,1559 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Simple Interactive Analysis in Tornasole\n",
-    "This notebook will demonstrate the simplest kind of interactive analysis that can be run in smdebug. It will focus on the [vanishing/exploding gradient](https://medium.com/learn-love-ai/the-curious-case-of-the-vanishing-exploding-gradient-bf58ec6822eb) problems on a simple MNIST digit recognition task."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "First of all, we will import some basic libraries for deep learning and plotting."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "%load_ext autoreload\n",
- "%autoreload 2"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "import torch\n",
- "import torch.utils.data\n",
- "from torch import nn\n",
- "import matplotlib.pyplot as plt\n",
- "import torch.nn.functional as F\n",
- "import torch.optim as optim\n",
- "from torchvision import datasets, transforms\n",
- "from torch.autograd import Variable"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "Let's copy the Tornasole libraries to this instance. This step has to be executed only once. Please make sure that the AWS account you are using can access the tornasole-wheels-alpha bucket.\n",
- "\n",
- "To do so you'll need the appropriate AWS credentials. There are several ways of doing this:\n",
- "\n",
- "inject temporary credentials\n",
- "if running on EC2, use EC2 roles that can access all S3 buckets\n",
- "(preferred) run this notebook on a SageMaker notebook instance\n",
- "The code below downloads the necessary .whl files and installs them in the current environment. Only run the first time!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [],
- "source": [
- "#WARNING - uncomment this code only if you haven't done this before\n",
- "#!aws s3 sync s3://tornasole-external-preview-use1/sdk/ts-binaries/tornasole_pytorch/py3/latest/ tornasole_pytorch/\n",
- "#!pip install tornasole_pytorch/*\n",
- "\n",
- "# If you run into a version conflict with boto, run the following\n",
- "# !pip uninstall -y botocore boto3 aioboto3 aiobotocore && pip install botocore==1.12.91 boto3==1.9.91 aiobotocore==0.10.2 aioboto3==6.4.1"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "Let's import the Tornasole libraries. All we need is a `SessionHook` to use as a callback, as well as some ancillary data structures."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [],
- "source": [
- "from smdebug.pytorch.hook import *\n",
- "from smdebug.core.save_config import SaveConfig\n",
- "\n",
- "import logging\n",
- "logging.getLogger(\"tornasole\").setLevel(logging.ERROR)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can define a simple network - it doesn't really matter what it is.\n",
-    "Importantly, we **add the Tornasole Hook**. This hook will be run at every batch and will save selected tensors (in this case, all of them) to the desired directory (in this case, `'./ts_output/{run_id}'`).\n",
- "\n",
- "See the documentation for more details."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [],
- "source": [
- "class Net(nn.Module):\n",
- " def __init__(self):\n",
- " super(Net, self).__init__()\n",
- "\n",
- " # self.conv1 = nn.Conv2d(1, 20, 5, 1)\n",
- " self.add_module('conv1', nn.Conv2d(1, 20, 5, 1))\n",
- " self.add_module('conv2', nn.Conv2d(20, 50, 5, 1))\n",
- " self.add_module('fc1', nn.Linear(4*4*50, 500))\n",
- " self.add_module('fc2', nn.Linear(500, 10))\n",
- "\n",
- " def forward(self, x):\n",
- " x = F.relu(self.conv1(x))\n",
- " x = F.max_pool2d(x, 2, 2)\n",
- " x = F.relu(self.conv2(x))\n",
- " x = F.max_pool2d(x, 2, 2)\n",
- " x = x.view(-1, 4*4*50)\n",
- " x = F.relu(self.fc1(x))\n",
- " x = self.fc2(x)\n",
- " return F.log_softmax(x, dim=1)\n",
- "\n",
- "def create_net(tornasole_save_interval, base_loc, run_id):\n",
- " model = Net()\n",
- " # Create and add the hook. Arguments:\n",
-    "    # - save data in './{base_loc}/{run_id}' - Note: s3 is also supported\n",
- " # - save every 100 batches\n",
- " # - save every tensor: inputs/outputs to each layer, as well as gradients\n",
- " hook = SessionHook(out_dir=base_loc + \"/\" + run_id, save_config=SaveConfig(save_interval=100), save_all=True)\n",
- " hook.register_hook(model)\n",
- " return model"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "And we create a simple training script. There is no Tornasole-specific code here; this is a slightly modified version of the [digit recognition](https://github.com/pytorch/examples/blob/master/mnist/main.py) example on the PyTorch GitHub."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [],
- "source": [
- "def transformer(data, label):\n",
- " data = data.reshape((-1,)).astype(np.float32)/255\n",
- " return data, label\n",
- "\n",
- "def test(model, device, test_loader):\n",
- " model.eval()\n",
- " test_loss = 0\n",
- " correct = 0\n",
- " with torch.no_grad():\n",
- " for data, target in test_loader:\n",
- " data, target = data.to(device), target.to(device)\n",
- " output = model(data)\n",
- " test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss\n",
- " pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability\n",
- " correct += pred.eq(target.view_as(pred)).sum().item()\n",
- "\n",
- " test_loss /= len(test_loader.dataset)\n",
- "\n",
- " print('\\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\\n'.format(\n",
- " test_loss, correct, len(test_loader.dataset),\n",
- " 100. * correct / len(test_loader.dataset)))\n",
- "\n",
- "\n",
- "\n",
- "\n",
- "def train(model, epochs, learning_rate, momentum, batch_size, device):\n",
- " train_loader = torch.utils.data.DataLoader(\n",
- " datasets.MNIST('./data', train=True, download=True,\n",
- " transform=transforms.Compose([\n",
- " transforms.ToTensor(),\n",
- " transforms.Normalize((0.1307,), (0.3081,))\n",
- " ])),\n",
- " batch_size=batch_size, shuffle=True)\n",
- "\n",
- " val_data = torch.utils.data.DataLoader(\n",
- " datasets.MNIST('./data', train=False, download=True,\n",
- " transform=transforms.Compose([\n",
- " transforms.ToTensor(),\n",
- " transforms.Normalize((0.1307,), (0.3081,))\n",
- " ])),\n",
- " batch_size=batch_size, shuffle=False)\n",
- " \n",
- " # Collect all parameters from net and its children, then initialize them.\n",
- " optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)\n",
- " model = model.to(device)\n",
- " total_step = len(train_loader)\n",
- " for epoch in range(epochs):\n",
- " model.train()\n",
- " count = 0\n",
- " for batch_idx, (data, target) in enumerate(train_loader):\n",
- " data, target = data.to(device), target.to(device)\n",
- " optimizer.zero_grad()\n",
- " output = model(Variable(data, requires_grad = True))\n",
- " loss = F.nll_loss(output, target)\n",
- " loss.backward()\n",
- " count += 1\n",
- "\n",
- " optimizer.step()\n",
- " if batch_idx % 10 == 0:\n",
- " print('Train Epoch: {} [{}/{} ({:.0f}%)]\\tLoss: {:.6f}'.format(\n",
- " epoch, batch_idx * len(data), len(train_loader.dataset),\n",
- " 100. * batch_idx / len(train_loader), loss.item()))\n",
- " \n",
- "# torch.save(model.state_dict(),\"mnist_params.pt\")\n",
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "To clean up from previous runs, we remove old data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [],
- "source": [
- "!rm -rf ./ts_output/"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "At this point we are ready to train. For the purposes of this example, we will name this run as `'good'` because we know it will converge to a good solution. \n",
- "\n",
-    "If you have a GPU on your machine, you can change the device line appropriately, e.g. for an NVIDIA GPU, it would be `device = torch.device(\"cuda\")`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Train Epoch: 0 [0/60000 (0%)]\tLoss: 2.295261\n",
- "Train Epoch: 0 [640/60000 (1%)]\tLoss: 1.748892\n",
- "Train Epoch: 0 [1280/60000 (2%)]\tLoss: 0.695785\n",
- "Train Epoch: 0 [1920/60000 (3%)]\tLoss: 0.629404\n",
- "Train Epoch: 0 [2560/60000 (4%)]\tLoss: 0.355879\n",
- "Train Epoch: 0 [3200/60000 (5%)]\tLoss: 0.461338\n",
- "Train Epoch: 0 [3840/60000 (6%)]\tLoss: 0.465871\n",
- "Train Epoch: 0 [4480/60000 (7%)]\tLoss: 0.286747\n",
- "Train Epoch: 0 [5120/60000 (9%)]\tLoss: 0.262511\n",
- "Train Epoch: 0 [5760/60000 (10%)]\tLoss: 0.276488\n",
- "Train Epoch: 0 [6400/60000 (11%)]\tLoss: 0.234418\n",
- "Train Epoch: 0 [7040/60000 (12%)]\tLoss: 0.523879\n",
- "Train Epoch: 0 [7680/60000 (13%)]\tLoss: 0.582521\n",
- "Train Epoch: 0 [8320/60000 (14%)]\tLoss: 0.257838\n",
- "Train Epoch: 0 [8960/60000 (15%)]\tLoss: 0.162327\n",
- "Train Epoch: 0 [9600/60000 (16%)]\tLoss: 0.202325\n",
- "Train Epoch: 0 [10240/60000 (17%)]\tLoss: 0.204970\n",
- "Train Epoch: 0 [10880/60000 (18%)]\tLoss: 0.517194\n",
- "Train Epoch: 0 [11520/60000 (19%)]\tLoss: 0.138784\n",
- "Train Epoch: 0 [12160/60000 (20%)]\tLoss: 0.174099\n",
- "Train Epoch: 0 [12800/60000 (21%)]\tLoss: 0.358449\n",
- "Train Epoch: 0 [13440/60000 (22%)]\tLoss: 0.234432\n",
- "Train Epoch: 0 [14080/60000 (23%)]\tLoss: 0.239579\n",
- "Train Epoch: 0 [14720/60000 (25%)]\tLoss: 0.143086\n",
- "Train Epoch: 0 [15360/60000 (26%)]\tLoss: 0.464434\n",
- "Train Epoch: 0 [16000/60000 (27%)]\tLoss: 0.213992\n",
- "Train Epoch: 0 [16640/60000 (28%)]\tLoss: 0.198865\n",
- "Train Epoch: 0 [17280/60000 (29%)]\tLoss: 0.426876\n",
- "Train Epoch: 0 [17920/60000 (30%)]\tLoss: 0.236948\n",
- "Train Epoch: 0 [18560/60000 (31%)]\tLoss: 0.019676\n",
- "Train Epoch: 0 [19200/60000 (32%)]\tLoss: 0.365207\n",
- "Train Epoch: 0 [19840/60000 (33%)]\tLoss: 0.177355\n",
- "Train Epoch: 0 [20480/60000 (34%)]\tLoss: 0.101089\n",
- "Train Epoch: 0 [21120/60000 (35%)]\tLoss: 0.217447\n",
- "Train Epoch: 0 [21760/60000 (36%)]\tLoss: 0.122373\n",
- "Train Epoch: 0 [22400/60000 (37%)]\tLoss: 0.083280\n",
- "Train Epoch: 0 [23040/60000 (38%)]\tLoss: 0.044131\n",
- "Train Epoch: 0 [23680/60000 (39%)]\tLoss: 0.164748\n",
- "Train Epoch: 0 [24320/60000 (41%)]\tLoss: 0.217344\n",
- "Train Epoch: 0 [24960/60000 (42%)]\tLoss: 0.168444\n",
- "Train Epoch: 0 [25600/60000 (43%)]\tLoss: 0.094563\n",
- "Train Epoch: 0 [26240/60000 (44%)]\tLoss: 0.160949\n",
- "Train Epoch: 0 [26880/60000 (45%)]\tLoss: 0.190556\n",
- "Train Epoch: 0 [27520/60000 (46%)]\tLoss: 0.049234\n",
- "Train Epoch: 0 [28160/60000 (47%)]\tLoss: 0.106162\n",
- "Train Epoch: 0 [28800/60000 (48%)]\tLoss: 0.104469\n",
- "Train Epoch: 0 [29440/60000 (49%)]\tLoss: 0.053504\n",
- "Train Epoch: 0 [30080/60000 (50%)]\tLoss: 0.110556\n",
- "Train Epoch: 0 [30720/60000 (51%)]\tLoss: 0.322309\n",
- "Train Epoch: 0 [31360/60000 (52%)]\tLoss: 0.086979\n",
- "Train Epoch: 0 [32000/60000 (53%)]\tLoss: 0.134221\n",
- "Train Epoch: 0 [32640/60000 (54%)]\tLoss: 0.201668\n",
- "Train Epoch: 0 [33280/60000 (55%)]\tLoss: 0.062239\n",
- "Train Epoch: 0 [33920/60000 (57%)]\tLoss: 0.107864\n",
- "Train Epoch: 0 [34560/60000 (58%)]\tLoss: 0.082699\n",
- "Train Epoch: 0 [35200/60000 (59%)]\tLoss: 0.187130\n",
- "Train Epoch: 0 [35840/60000 (60%)]\tLoss: 0.436204\n",
- "Train Epoch: 0 [36480/60000 (61%)]\tLoss: 0.216804\n",
- "Train Epoch: 0 [37120/60000 (62%)]\tLoss: 0.295533\n",
- "Train Epoch: 0 [37760/60000 (63%)]\tLoss: 0.151546\n",
- "Train Epoch: 0 [38400/60000 (64%)]\tLoss: 0.109190\n",
- "Train Epoch: 0 [39040/60000 (65%)]\tLoss: 0.089795\n",
- "Train Epoch: 0 [39680/60000 (66%)]\tLoss: 0.111167\n",
- "Train Epoch: 0 [40320/60000 (67%)]\tLoss: 0.148806\n",
- "Train Epoch: 0 [40960/60000 (68%)]\tLoss: 0.086549\n",
- "Train Epoch: 0 [41600/60000 (69%)]\tLoss: 0.166614\n",
- "Train Epoch: 0 [42240/60000 (70%)]\tLoss: 0.076532\n",
- "Train Epoch: 0 [42880/60000 (71%)]\tLoss: 0.207414\n",
- "Train Epoch: 0 [43520/60000 (72%)]\tLoss: 0.057692\n",
- "Train Epoch: 0 [44160/60000 (74%)]\tLoss: 0.135699\n",
- "Train Epoch: 0 [44800/60000 (75%)]\tLoss: 0.070328\n",
- "Train Epoch: 0 [45440/60000 (76%)]\tLoss: 0.287908\n",
- "Train Epoch: 0 [46080/60000 (77%)]\tLoss: 0.181923\n",
- "Train Epoch: 0 [46720/60000 (78%)]\tLoss: 0.109931\n",
- "Train Epoch: 0 [47360/60000 (79%)]\tLoss: 0.082871\n",
- "Train Epoch: 0 [48000/60000 (80%)]\tLoss: 0.336507\n",
- "Train Epoch: 0 [48640/60000 (81%)]\tLoss: 0.132857\n",
- "Train Epoch: 0 [49280/60000 (82%)]\tLoss: 0.124299\n",
- "Train Epoch: 0 [49920/60000 (83%)]\tLoss: 0.064722\n",
- "Train Epoch: 0 [50560/60000 (84%)]\tLoss: 0.102338\n",
- "Train Epoch: 0 [51200/60000 (85%)]\tLoss: 0.081316\n",
- "Train Epoch: 0 [51840/60000 (86%)]\tLoss: 0.023367\n",
- "Train Epoch: 0 [52480/60000 (87%)]\tLoss: 0.009987\n",
- "Train Epoch: 0 [53120/60000 (88%)]\tLoss: 0.021232\n",
- "Train Epoch: 0 [53760/60000 (90%)]\tLoss: 0.234437\n",
- "Train Epoch: 0 [54400/60000 (91%)]\tLoss: 0.076132\n",
- "Train Epoch: 0 [55040/60000 (92%)]\tLoss: 0.099620\n",
- "Train Epoch: 0 [55680/60000 (93%)]\tLoss: 0.153108\n",
- "Train Epoch: 0 [56320/60000 (94%)]\tLoss: 0.008535\n",
- "Train Epoch: 0 [56960/60000 (95%)]\tLoss: 0.069565\n",
- "Train Epoch: 0 [57600/60000 (96%)]\tLoss: 0.345571\n",
- "Train Epoch: 0 [58240/60000 (97%)]\tLoss: 0.502152\n",
- "Train Epoch: 0 [58880/60000 (98%)]\tLoss: 0.153598\n",
- "Train Epoch: 0 [59520/60000 (99%)]\tLoss: 0.009579\n",
- "Train Epoch: 1 [0/60000 (0%)]\tLoss: 0.290723\n",
- "Train Epoch: 1 [640/60000 (1%)]\tLoss: 0.145595\n",
- "Train Epoch: 1 [1280/60000 (2%)]\tLoss: 0.069823\n",
- "Train Epoch: 1 [1920/60000 (3%)]\tLoss: 0.064840\n",
- "Train Epoch: 1 [2560/60000 (4%)]\tLoss: 0.006950\n",
- "Train Epoch: 1 [3200/60000 (5%)]\tLoss: 0.000492\n",
- "Train Epoch: 1 [3840/60000 (6%)]\tLoss: 0.194620\n",
- "Train Epoch: 1 [4480/60000 (7%)]\tLoss: 0.240464\n",
- "Train Epoch: 1 [5120/60000 (9%)]\tLoss: 0.073014\n",
- "Train Epoch: 1 [5760/60000 (10%)]\tLoss: 0.177253\n",
- "Train Epoch: 1 [6400/60000 (11%)]\tLoss: 0.048263\n",
- "Train Epoch: 1 [7040/60000 (12%)]\tLoss: 0.036351\n",
- "Train Epoch: 1 [7680/60000 (13%)]\tLoss: 0.011458\n",
- "Train Epoch: 1 [8320/60000 (14%)]\tLoss: 0.341599\n",
- "Train Epoch: 1 [8960/60000 (15%)]\tLoss: 0.373527\n",
- "Train Epoch: 1 [9600/60000 (16%)]\tLoss: 0.047152\n",
- "Train Epoch: 1 [10240/60000 (17%)]\tLoss: 0.124424\n",
- "Train Epoch: 1 [10880/60000 (18%)]\tLoss: 0.138054\n",
- "Train Epoch: 1 [11520/60000 (19%)]\tLoss: 0.314980\n",
- "Train Epoch: 1 [12160/60000 (20%)]\tLoss: 0.244121\n",
- "Train Epoch: 1 [12800/60000 (21%)]\tLoss: 0.194258\n",
- "Train Epoch: 1 [13440/60000 (22%)]\tLoss: 0.092362\n",
- "Train Epoch: 1 [14080/60000 (23%)]\tLoss: 0.205115\n",
- "Train Epoch: 1 [14720/60000 (25%)]\tLoss: 0.202674\n",
- "Train Epoch: 1 [15360/60000 (26%)]\tLoss: 0.189018\n",
- "Train Epoch: 1 [16000/60000 (27%)]\tLoss: 0.168465\n",
- "Train Epoch: 1 [16640/60000 (28%)]\tLoss: 0.075228\n",
- "Train Epoch: 1 [17280/60000 (29%)]\tLoss: 0.024219\n",
- "Train Epoch: 1 [17920/60000 (30%)]\tLoss: 0.249284\n",
- "Train Epoch: 1 [18560/60000 (31%)]\tLoss: 0.055043\n",
- "Train Epoch: 1 [19200/60000 (32%)]\tLoss: 0.199740\n",
- "Train Epoch: 1 [19840/60000 (33%)]\tLoss: 0.264624\n",
- "Train Epoch: 1 [20480/60000 (34%)]\tLoss: 0.145213\n",
- "Train Epoch: 1 [21120/60000 (35%)]\tLoss: 0.182477\n",
- "Train Epoch: 1 [21760/60000 (36%)]\tLoss: 0.181954\n",
- "Train Epoch: 1 [22400/60000 (37%)]\tLoss: 0.041947\n",
- "Train Epoch: 1 [23040/60000 (38%)]\tLoss: 0.165648\n",
- "Train Epoch: 1 [23680/60000 (39%)]\tLoss: 0.075048\n",
- "Train Epoch: 1 [24320/60000 (41%)]\tLoss: 0.091085\n",
- "Train Epoch: 1 [24960/60000 (42%)]\tLoss: 0.267341\n",
- "Train Epoch: 1 [25600/60000 (43%)]\tLoss: 0.419169\n",
- "Train Epoch: 1 [26240/60000 (44%)]\tLoss: 0.397417\n",
- "Train Epoch: 1 [26880/60000 (45%)]\tLoss: 0.059258\n",
- "Train Epoch: 1 [27520/60000 (46%)]\tLoss: 0.678994\n",
- "Train Epoch: 1 [28160/60000 (47%)]\tLoss: 0.097712\n",
- "Train Epoch: 1 [28800/60000 (48%)]\tLoss: 0.078830\n",
- "Train Epoch: 1 [29440/60000 (49%)]\tLoss: 0.083803\n",
- "Train Epoch: 1 [30080/60000 (50%)]\tLoss: 0.373137\n",
- "Train Epoch: 1 [30720/60000 (51%)]\tLoss: 0.317618\n",
- "Train Epoch: 1 [31360/60000 (52%)]\tLoss: 0.076827\n",
- "Train Epoch: 1 [32000/60000 (53%)]\tLoss: 0.125064\n",
- "Train Epoch: 1 [32640/60000 (54%)]\tLoss: 0.057970\n",
- "Train Epoch: 1 [33280/60000 (55%)]\tLoss: 0.167010\n",
- "Train Epoch: 1 [33920/60000 (57%)]\tLoss: 0.026072\n",
- "Train Epoch: 1 [34560/60000 (58%)]\tLoss: 0.160082\n",
- "Train Epoch: 1 [35200/60000 (59%)]\tLoss: 0.046618\n",
- "Train Epoch: 1 [35840/60000 (60%)]\tLoss: 0.050997\n",
- "Train Epoch: 1 [36480/60000 (61%)]\tLoss: 0.370405\n",
- "Train Epoch: 1 [37120/60000 (62%)]\tLoss: 0.106518\n",
- "Train Epoch: 1 [37760/60000 (63%)]\tLoss: 0.101690\n",
- "Train Epoch: 1 [38400/60000 (64%)]\tLoss: 0.064859\n",
- "Train Epoch: 1 [39040/60000 (65%)]\tLoss: 0.079881\n",
- "Train Epoch: 1 [39680/60000 (66%)]\tLoss: 0.110059\n",
- "Train Epoch: 1 [40320/60000 (67%)]\tLoss: 0.067634\n",
- "Train Epoch: 1 [40960/60000 (68%)]\tLoss: 0.208821\n",
- "Train Epoch: 1 [41600/60000 (69%)]\tLoss: 0.088838\n",
- "Train Epoch: 1 [42240/60000 (70%)]\tLoss: 0.079848\n",
- "Train Epoch: 1 [42880/60000 (71%)]\tLoss: 0.193431\n",
- "Train Epoch: 1 [43520/60000 (72%)]\tLoss: 0.171546\n",
- "Train Epoch: 1 [44160/60000 (74%)]\tLoss: 0.178438\n",
- "Train Epoch: 1 [44800/60000 (75%)]\tLoss: 0.169155\n",
- "Train Epoch: 1 [45440/60000 (76%)]\tLoss: 0.055189\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Train Epoch: 1 [46080/60000 (77%)]\tLoss: 0.028144\n",
- "Train Epoch: 1 [46720/60000 (78%)]\tLoss: 0.021224\n",
- "Train Epoch: 1 [47360/60000 (79%)]\tLoss: 0.071242\n",
- "Train Epoch: 1 [48000/60000 (80%)]\tLoss: 0.010806\n",
- "Train Epoch: 1 [48640/60000 (81%)]\tLoss: 0.110852\n",
- "Train Epoch: 1 [49280/60000 (82%)]\tLoss: 0.010513\n",
- "Train Epoch: 1 [49920/60000 (83%)]\tLoss: 0.364887\n",
- "Train Epoch: 1 [50560/60000 (84%)]\tLoss: 0.013409\n",
- "Train Epoch: 1 [51200/60000 (85%)]\tLoss: 0.419877\n",
- "Train Epoch: 1 [51840/60000 (86%)]\tLoss: 0.067204\n",
- "Train Epoch: 1 [52480/60000 (87%)]\tLoss: 0.067301\n",
- "Train Epoch: 1 [53120/60000 (88%)]\tLoss: 0.179633\n",
- "Train Epoch: 1 [53760/60000 (90%)]\tLoss: 0.168005\n",
- "Train Epoch: 1 [54400/60000 (91%)]\tLoss: 0.053020\n",
- "Train Epoch: 1 [55040/60000 (92%)]\tLoss: 0.127693\n",
- "Train Epoch: 1 [55680/60000 (93%)]\tLoss: 0.000790\n",
- "Train Epoch: 1 [56320/60000 (94%)]\tLoss: 0.638172\n",
- "Train Epoch: 1 [56960/60000 (95%)]\tLoss: 0.190752\n",
- "Train Epoch: 1 [57600/60000 (96%)]\tLoss: 0.001666\n",
- "Train Epoch: 1 [58240/60000 (97%)]\tLoss: 0.040600\n",
- "Train Epoch: 1 [58880/60000 (98%)]\tLoss: 0.218874\n",
- "Train Epoch: 1 [59520/60000 (99%)]\tLoss: 0.056282\n",
- "Train Epoch: 2 [0/60000 (0%)]\tLoss: 0.076416\n",
- "Train Epoch: 2 [640/60000 (1%)]\tLoss: 0.112750\n",
- "Train Epoch: 2 [1280/60000 (2%)]\tLoss: 0.171557\n",
- "Train Epoch: 2 [1920/60000 (3%)]\tLoss: 0.014248\n",
- "Train Epoch: 2 [2560/60000 (4%)]\tLoss: 0.062175\n",
- "Train Epoch: 2 [3200/60000 (5%)]\tLoss: 0.182906\n",
- "Train Epoch: 2 [3840/60000 (6%)]\tLoss: 0.135498\n",
- "Train Epoch: 2 [4480/60000 (7%)]\tLoss: 0.123534\n",
- "Train Epoch: 2 [5120/60000 (9%)]\tLoss: 0.122674\n",
- "Train Epoch: 2 [5760/60000 (10%)]\tLoss: 0.111532\n",
- "Train Epoch: 2 [6400/60000 (11%)]\tLoss: 0.099646\n",
- "Train Epoch: 2 [7040/60000 (12%)]\tLoss: 0.051113\n",
- "Train Epoch: 2 [7680/60000 (13%)]\tLoss: 0.227051\n",
- "Train Epoch: 2 [8320/60000 (14%)]\tLoss: 0.138824\n",
- "Train Epoch: 2 [8960/60000 (15%)]\tLoss: 0.088158\n",
- "Train Epoch: 2 [9600/60000 (16%)]\tLoss: 0.103052\n",
- "Train Epoch: 2 [10240/60000 (17%)]\tLoss: 0.369061\n",
- "Train Epoch: 2 [10880/60000 (18%)]\tLoss: 0.165350\n",
- "Train Epoch: 2 [11520/60000 (19%)]\tLoss: 0.142054\n",
- "Train Epoch: 2 [12160/60000 (20%)]\tLoss: 0.034043\n",
- "Train Epoch: 2 [12800/60000 (21%)]\tLoss: 0.093324\n",
- "Train Epoch: 2 [13440/60000 (22%)]\tLoss: 0.129838\n",
- "Train Epoch: 2 [14080/60000 (23%)]\tLoss: 0.023088\n",
- "Train Epoch: 2 [14720/60000 (25%)]\tLoss: 0.110030\n",
- "Train Epoch: 2 [15360/60000 (26%)]\tLoss: 0.355520\n",
- "Train Epoch: 2 [16000/60000 (27%)]\tLoss: 0.072964\n",
- "Train Epoch: 2 [16640/60000 (28%)]\tLoss: 0.002617\n",
- "Train Epoch: 2 [17280/60000 (29%)]\tLoss: 0.415827\n",
- "Train Epoch: 2 [17920/60000 (30%)]\tLoss: 0.061368\n",
- "Train Epoch: 2 [18560/60000 (31%)]\tLoss: 0.059896\n",
- "Train Epoch: 2 [19200/60000 (32%)]\tLoss: 0.006614\n",
- "Train Epoch: 2 [19840/60000 (33%)]\tLoss: 0.072400\n",
- "Train Epoch: 2 [20480/60000 (34%)]\tLoss: 0.084101\n",
- "Train Epoch: 2 [21120/60000 (35%)]\tLoss: 0.060527\n",
- "Train Epoch: 2 [21760/60000 (36%)]\tLoss: 0.245168\n",
- "Train Epoch: 2 [22400/60000 (37%)]\tLoss: 0.104240\n",
- "Train Epoch: 2 [23040/60000 (38%)]\tLoss: 0.039879\n",
- "Train Epoch: 2 [23680/60000 (39%)]\tLoss: 0.127886\n",
- "Train Epoch: 2 [24320/60000 (41%)]\tLoss: 0.151734\n",
- "Train Epoch: 2 [24960/60000 (42%)]\tLoss: 0.156671\n",
- "Train Epoch: 2 [25600/60000 (43%)]\tLoss: 0.109427\n",
- "Train Epoch: 2 [26240/60000 (44%)]\tLoss: 0.080561\n",
- "Train Epoch: 2 [26880/60000 (45%)]\tLoss: 0.092277\n",
- "Train Epoch: 2 [27520/60000 (46%)]\tLoss: 0.210802\n",
- "Train Epoch: 2 [28160/60000 (47%)]\tLoss: 0.057986\n",
- "Train Epoch: 2 [28800/60000 (48%)]\tLoss: 0.124804\n",
- "Train Epoch: 2 [29440/60000 (49%)]\tLoss: 0.119243\n",
- "Train Epoch: 2 [30080/60000 (50%)]\tLoss: 0.279113\n",
- "Train Epoch: 2 [30720/60000 (51%)]\tLoss: 0.214091\n",
- "Train Epoch: 2 [31360/60000 (52%)]\tLoss: 0.106701\n",
- "Train Epoch: 2 [32000/60000 (53%)]\tLoss: 0.547162\n",
- "Train Epoch: 2 [32640/60000 (54%)]\tLoss: 0.377766\n",
- "Train Epoch: 2 [33280/60000 (55%)]\tLoss: 0.128663\n",
- "Train Epoch: 2 [33920/60000 (57%)]\tLoss: 0.078428\n",
- "Train Epoch: 2 [34560/60000 (58%)]\tLoss: 0.096952\n",
- "Train Epoch: 2 [35200/60000 (59%)]\tLoss: 0.047050\n",
- "Train Epoch: 2 [35840/60000 (60%)]\tLoss: 0.106311\n",
- "Train Epoch: 2 [36480/60000 (61%)]\tLoss: 0.092369\n",
- "Train Epoch: 2 [37120/60000 (62%)]\tLoss: 0.038745\n",
- "Train Epoch: 2 [37760/60000 (63%)]\tLoss: 0.474230\n",
- "Train Epoch: 2 [38400/60000 (64%)]\tLoss: 0.213040\n",
- "Train Epoch: 2 [39040/60000 (65%)]\tLoss: 0.665591\n",
- "Train Epoch: 2 [39680/60000 (66%)]\tLoss: 0.068594\n",
- "Train Epoch: 2 [40320/60000 (67%)]\tLoss: 0.036250\n",
- "Train Epoch: 2 [40960/60000 (68%)]\tLoss: 0.144957\n",
- "Train Epoch: 2 [41600/60000 (69%)]\tLoss: 0.355639\n",
- "Train Epoch: 2 [42240/60000 (70%)]\tLoss: 0.198450\n",
- "Train Epoch: 2 [42880/60000 (71%)]\tLoss: 0.221584\n",
- "Train Epoch: 2 [43520/60000 (72%)]\tLoss: 0.043087\n",
- "Train Epoch: 2 [44160/60000 (74%)]\tLoss: 0.053449\n",
- "Train Epoch: 2 [44800/60000 (75%)]\tLoss: 0.244004\n",
- "Train Epoch: 2 [45440/60000 (76%)]\tLoss: 0.051597\n",
- "Train Epoch: 2 [46080/60000 (77%)]\tLoss: 0.018794\n",
- "Train Epoch: 2 [46720/60000 (78%)]\tLoss: 0.047302\n",
- "Train Epoch: 2 [47360/60000 (79%)]\tLoss: 0.233751\n",
- "Train Epoch: 2 [48000/60000 (80%)]\tLoss: 0.523653\n",
- "Train Epoch: 2 [48640/60000 (81%)]\tLoss: 0.011048\n",
- "Train Epoch: 2 [49280/60000 (82%)]\tLoss: 0.185908\n",
- "Train Epoch: 2 [49920/60000 (83%)]\tLoss: 0.085652\n",
- "Train Epoch: 2 [50560/60000 (84%)]\tLoss: 0.065321\n",
- "Train Epoch: 2 [51200/60000 (85%)]\tLoss: 0.174393\n",
- "Train Epoch: 2 [51840/60000 (86%)]\tLoss: 0.031607\n",
- "Train Epoch: 2 [52480/60000 (87%)]\tLoss: 0.174475\n",
- "Train Epoch: 2 [53120/60000 (88%)]\tLoss: 0.217395\n",
- "Train Epoch: 2 [53760/60000 (90%)]\tLoss: 0.061645\n",
- "Train Epoch: 2 [54400/60000 (91%)]\tLoss: 0.141715\n",
- "Train Epoch: 2 [55040/60000 (92%)]\tLoss: 0.198288\n",
- "Train Epoch: 2 [55680/60000 (93%)]\tLoss: 0.254158\n",
- "Train Epoch: 2 [56320/60000 (94%)]\tLoss: 0.110041\n",
- "Train Epoch: 2 [56960/60000 (95%)]\tLoss: 0.270937\n",
- "Train Epoch: 2 [57600/60000 (96%)]\tLoss: 0.070328\n",
- "Train Epoch: 2 [58240/60000 (97%)]\tLoss: 0.024610\n",
- "Train Epoch: 2 [58880/60000 (98%)]\tLoss: 0.236358\n",
- "Train Epoch: 2 [59520/60000 (99%)]\tLoss: 0.117915\n",
- "Train Epoch: 3 [0/60000 (0%)]\tLoss: 0.146749\n",
- "Train Epoch: 3 [640/60000 (1%)]\tLoss: 0.039942\n",
- "Train Epoch: 3 [1280/60000 (2%)]\tLoss: 0.005945\n",
- "Train Epoch: 3 [1920/60000 (3%)]\tLoss: 0.118340\n",
- "Train Epoch: 3 [2560/60000 (4%)]\tLoss: 0.212263\n",
- "Train Epoch: 3 [3200/60000 (5%)]\tLoss: 0.108361\n",
- "Train Epoch: 3 [3840/60000 (6%)]\tLoss: 0.123859\n",
- "Train Epoch: 3 [4480/60000 (7%)]\tLoss: 0.151609\n",
- "Train Epoch: 3 [5120/60000 (9%)]\tLoss: 0.190431\n",
- "Train Epoch: 3 [5760/60000 (10%)]\tLoss: 0.044887\n",
- "Train Epoch: 3 [6400/60000 (11%)]\tLoss: 0.118531\n",
- "Train Epoch: 3 [7040/60000 (12%)]\tLoss: 0.175035\n",
- "Train Epoch: 3 [7680/60000 (13%)]\tLoss: 0.116727\n",
- "Train Epoch: 3 [8320/60000 (14%)]\tLoss: 0.233833\n",
- "Train Epoch: 3 [8960/60000 (15%)]\tLoss: 0.088643\n",
- "Train Epoch: 3 [9600/60000 (16%)]\tLoss: 0.384525\n",
- "Train Epoch: 3 [10240/60000 (17%)]\tLoss: 0.043991\n",
- "Train Epoch: 3 [10880/60000 (18%)]\tLoss: 0.851141\n",
- "Train Epoch: 3 [11520/60000 (19%)]\tLoss: 0.094810\n",
- "Train Epoch: 3 [12160/60000 (20%)]\tLoss: 0.083756\n",
- "Train Epoch: 3 [12800/60000 (21%)]\tLoss: 0.222417\n",
- "Train Epoch: 3 [13440/60000 (22%)]\tLoss: 0.448306\n",
- "Train Epoch: 3 [14080/60000 (23%)]\tLoss: 0.037722\n",
- "Train Epoch: 3 [14720/60000 (25%)]\tLoss: 0.318071\n",
- "Train Epoch: 3 [15360/60000 (26%)]\tLoss: 0.241556\n",
- "Train Epoch: 3 [16000/60000 (27%)]\tLoss: 0.034820\n",
- "Train Epoch: 3 [16640/60000 (28%)]\tLoss: 0.154392\n",
- "Train Epoch: 3 [17280/60000 (29%)]\tLoss: 0.103088\n",
- "Train Epoch: 3 [17920/60000 (30%)]\tLoss: 0.309260\n",
- "Train Epoch: 3 [18560/60000 (31%)]\tLoss: 0.129015\n",
- "Train Epoch: 3 [19200/60000 (32%)]\tLoss: 0.081739\n",
- "Train Epoch: 3 [19840/60000 (33%)]\tLoss: 0.194573\n",
- "Train Epoch: 3 [20480/60000 (34%)]\tLoss: 0.100797\n",
- "Train Epoch: 3 [21120/60000 (35%)]\tLoss: 0.130121\n",
- "Train Epoch: 3 [21760/60000 (36%)]\tLoss: 0.148123\n",
- "Train Epoch: 3 [22400/60000 (37%)]\tLoss: 0.107668\n",
- "Train Epoch: 3 [23040/60000 (38%)]\tLoss: 0.118747\n",
- "Train Epoch: 3 [23680/60000 (39%)]\tLoss: 0.145568\n",
- "Train Epoch: 3 [24320/60000 (41%)]\tLoss: 0.228613\n",
- "Train Epoch: 3 [24960/60000 (42%)]\tLoss: 0.125414\n",
- "Train Epoch: 3 [25600/60000 (43%)]\tLoss: 0.083142\n",
- "Train Epoch: 3 [26240/60000 (44%)]\tLoss: 0.394818\n",
- "Train Epoch: 3 [26880/60000 (45%)]\tLoss: 0.045244\n",
- "Train Epoch: 3 [27520/60000 (46%)]\tLoss: 0.005072\n",
- "Train Epoch: 3 [28160/60000 (47%)]\tLoss: 0.115797\n",
- "Train Epoch: 3 [28800/60000 (48%)]\tLoss: 0.095257\n",
- "Train Epoch: 3 [29440/60000 (49%)]\tLoss: 0.005111\n",
- "Train Epoch: 3 [30080/60000 (50%)]\tLoss: 0.110229\n",
- "Train Epoch: 3 [30720/60000 (51%)]\tLoss: 0.082010\n",
- "Train Epoch: 3 [31360/60000 (52%)]\tLoss: 0.055340\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Train Epoch: 3 [32000/60000 (53%)]\tLoss: 0.195543\n",
- "Train Epoch: 3 [32640/60000 (54%)]\tLoss: 0.150267\n",
- "Train Epoch: 3 [33280/60000 (55%)]\tLoss: 0.177324\n",
- "Train Epoch: 3 [33920/60000 (57%)]\tLoss: 0.038098\n",
- "Train Epoch: 3 [34560/60000 (58%)]\tLoss: 0.036462\n",
- "Train Epoch: 3 [35200/60000 (59%)]\tLoss: 0.076419\n",
- "Train Epoch: 3 [35840/60000 (60%)]\tLoss: 0.094155\n",
- "Train Epoch: 3 [36480/60000 (61%)]\tLoss: 0.276453\n",
- "Train Epoch: 3 [37120/60000 (62%)]\tLoss: 0.013989\n",
- "Train Epoch: 3 [37760/60000 (63%)]\tLoss: 0.033511\n",
- "Train Epoch: 3 [38400/60000 (64%)]\tLoss: 0.053062\n",
- "Train Epoch: 3 [39040/60000 (65%)]\tLoss: 0.002972\n",
- "Train Epoch: 3 [39680/60000 (66%)]\tLoss: 0.044364\n",
- "Train Epoch: 3 [40320/60000 (67%)]\tLoss: 0.201063\n",
- "Train Epoch: 3 [40960/60000 (68%)]\tLoss: 0.239112\n",
- "Train Epoch: 3 [41600/60000 (69%)]\tLoss: 0.301890\n",
- "Train Epoch: 3 [42240/60000 (70%)]\tLoss: 0.209023\n",
- "Train Epoch: 3 [42880/60000 (71%)]\tLoss: 0.340197\n",
- "Train Epoch: 3 [43520/60000 (72%)]\tLoss: 0.051514\n",
- "Train Epoch: 3 [44160/60000 (74%)]\tLoss: 0.163148\n",
- "Train Epoch: 3 [44800/60000 (75%)]\tLoss: 0.035544\n",
- "Train Epoch: 3 [45440/60000 (76%)]\tLoss: 0.105758\n",
- "Train Epoch: 3 [46080/60000 (77%)]\tLoss: 0.091835\n",
- "Train Epoch: 3 [46720/60000 (78%)]\tLoss: 0.218505\n",
- "Train Epoch: 3 [47360/60000 (79%)]\tLoss: 0.212545\n",
- "Train Epoch: 3 [48000/60000 (80%)]\tLoss: 0.001972\n",
- "Train Epoch: 3 [48640/60000 (81%)]\tLoss: 0.165325\n",
- "Train Epoch: 3 [49280/60000 (82%)]\tLoss: 0.099900\n",
- "Train Epoch: 3 [49920/60000 (83%)]\tLoss: 0.475469\n",
- "Train Epoch: 3 [50560/60000 (84%)]\tLoss: 0.102674\n",
- "Train Epoch: 3 [51200/60000 (85%)]\tLoss: 0.067554\n",
- "Train Epoch: 3 [51840/60000 (86%)]\tLoss: 0.376874\n",
- "Train Epoch: 3 [52480/60000 (87%)]\tLoss: 0.133132\n",
- "Train Epoch: 3 [53120/60000 (88%)]\tLoss: 0.042010\n",
- "Train Epoch: 3 [53760/60000 (90%)]\tLoss: 0.008966\n",
- "Train Epoch: 3 [54400/60000 (91%)]\tLoss: 0.073707\n",
- "Train Epoch: 3 [55040/60000 (92%)]\tLoss: 0.128305\n",
- "Train Epoch: 3 [55680/60000 (93%)]\tLoss: 0.039086\n",
- "Train Epoch: 3 [56320/60000 (94%)]\tLoss: 0.176628\n",
- "Train Epoch: 3 [56960/60000 (95%)]\tLoss: 0.025344\n",
- "Train Epoch: 3 [57600/60000 (96%)]\tLoss: 0.137705\n",
- "Train Epoch: 3 [58240/60000 (97%)]\tLoss: 0.035565\n",
- "Train Epoch: 3 [58880/60000 (98%)]\tLoss: 0.165229\n",
- "Train Epoch: 3 [59520/60000 (99%)]\tLoss: 0.070512\n"
- ]
- }
- ],
- "source": [
- "model = create_net(tornasole_save_interval=100, base_loc='./ts_output', run_id='good')\n",
- "train(model=model, epochs=4, learning_rate=0.1, momentum=0.9, batch_size=64, device=torch.device(\"cpu\"))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Data Analysis\n",
- "Now that we have trained the system, we can analyze the data. Note that this notebook focuses on after-the-fact analysis. Tornasole also provides a collection of tools for automatic analysis while the training run is progressing, which will be covered in a different notebook."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We import a basic analysis library, which defines the concept of a `Trial`. A `Trial` is a single training run, which deposits values in a local directory (`LocalTrial`) or in S3 (`S3Trial`). In this case we are using a `LocalTrial`; if you wish, you can change the output from `./ts_output` to `s3://mybucket/myprefix` and use `S3Trial` instead of `LocalTrial`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [],
- "source": [
- "from smdebug.trials import LocalTrial"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Next, we read the data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [],
- "source": [
- "good_trial = LocalTrial('myrun', './ts_output/good')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can list all the tensors we know something about. Each one of these names is the name of a tensor - the name is a combination of the layer name (which, in these cases, is auto-assigned by PyTorch) and whether it's an input/output/weight/bias/gradient."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "['Net_conv1.weight',\n",
- " 'Net_conv1.bias',\n",
- " 'Net_conv2.weight',\n",
- " 'Net_conv2.bias',\n",
- " 'Net_fc1.weight',\n",
- " 'Net_fc1.bias',\n",
- " 'Net_fc2.weight',\n",
- " 'Net_fc2.bias',\n",
- " 'conv1_input_0',\n",
- " 'conv1_output0',\n",
- " 'conv2_input_0',\n",
- " 'conv2_output0',\n",
- " 'fc1_input_0',\n",
- " 'fc1_output0',\n",
- " 'fc2_input_0',\n",
- " 'fc2_output0',\n",
- " 'Net_input_0',\n",
- " 'Net_output0',\n",
- " 'gradient/Net_fc2.bias',\n",
- " 'gradient/Net_fc2.weight',\n",
- " 'gradient/Net_fc1.bias',\n",
- " 'gradient/Net_fc1.weight',\n",
- " 'gradient/Net_conv2.weight',\n",
- " 'gradient/Net_conv2.bias',\n",
- " 'gradient/Net_conv1.weight',\n",
- " 'gradient/Net_conv1.bias']"
- ]
- },
- "execution_count": 12,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "good_trial.tensor_names()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For each tensor we can ask for which steps we have data - in this case, every 100 steps"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[0,\n",
- " 100,\n",
- " 200,\n",
- " 300,\n",
- " 400,\n",
- " 500,\n",
- " 600,\n",
- " 700,\n",
- " 800,\n",
- " 900,\n",
- " 1000,\n",
- " 1100,\n",
- " 1200,\n",
- " 1300,\n",
- " 1400,\n",
- " 1500,\n",
- " 1600,\n",
- " 1700,\n",
- " 1800,\n",
- " 1900,\n",
- " 2000,\n",
- " 2100,\n",
- " 2200,\n",
- " 2300,\n",
- " 2400,\n",
- " 2500,\n",
- " 2600,\n",
- " 2700,\n",
- " 2800,\n",
- " 2900,\n",
- " 3000,\n",
- " 3100,\n",
- " 3200,\n",
- " 3300,\n",
- " 3400,\n",
- " 3500,\n",
- " 3600,\n",
- " 3700]"
- ]
- },
- "execution_count": 13,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "good_trial.tensor('gradient/Net_fc1.weight').steps()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can obtain each tensor at each step as a `numpy` array"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "numpy.ndarray"
- ]
- },
- "execution_count": 14,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "type(good_trial.tensor('gradient/Net_fc1.weight').value(300))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Gradient Analysis"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can also create a simple function that prints the `np.mean` of the `np.abs` of each gradient. We expect each gradient to get smaller over time, as the system converges to a good solution. Now, remember that this is an interactive analysis - we are showing these tensors to give an idea of the data. \n",
- "\n",
- "Later on in this notebook we will run an automated analysis."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Define a function that, for the given tensor name, walks through all \n",
- "# the batches for which we have data and computes mean(abs(tensor)).\n",
- "# Returns the set of steps and the values\n",
- "\n",
- "def get_data(trial, tname):\n",
- "    tensor = trial.tensor(tname)\n",
- "    steps = tensor.steps()\n",
- "    vals = []\n",
- "    for s in steps:\n",
- "        val = tensor.value(s)\n",
- "        val = np.mean(np.abs(val))\n",
- "        vals.append(val)\n",
- "    return steps, vals"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {},
- "outputs": [],
- "source": [
- "def plot_gradients(lt):\n",
- "    for tname in lt.tensor_names():\n",
- "        if 'gradient' not in tname:\n",
- "            continue\n",
- "        steps, data = get_data(lt, tname)\n",
- "        plt.plot(steps, data, label=tname)\n",
- "    plt.legend()\n",
- "    plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can plot these gradients. Notice how they are (mostly!) decreasing. We should investigate the spikes!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plot_gradients(good_trial)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can also print inputs and outputs from the model. For instance, let's print the 42nd sample of the 2700th batch, as seen by the network. \n",
- "\n",
- "Notice that we have "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAPsAAAD4CAYAAAAq5pAIAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAALkklEQVR4nO3dX4hc9RnG8eeJ1RsjmlS6xBir1dxItbEsoVCpFlHSgEYRxCAlpdL1QkEhFw32wkApSKnWXgkrBpNilYCKQaWaBmnam5JVkpg/1aQaMcsmqQQ1ubK6by/mpKxx58xmzjlzJn2/H1hm5vfOzHk5+uT8m5mfI0IA/v/Na7sBAINB2IEkCDuQBGEHkiDsQBLfGOTCbHPqH2hYRHi28UpbdtsrbL9r+6DtdVXeC0Cz3O91dtvnSHpP0s2SDkvaIWl1ROwreQ1bdqBhTWzZl0s6GBHvR8Tnkp6XtKrC+wFoUJWwL5b00YzHh4uxr7A9ZnvC9kSFZQGoqPETdBExLmlcYjceaFOVLfukpCUzHl9ajAEYQlXCvkPSUttX2D5P0t2SttTTFoC69b0bHxFf2H5A0uuSzpG0ISL21tYZgFr1femtr4VxzA40rpEP1QA4exB2IAnCDiRB2IEkCDuQBGEHkiDsQBKEHUiCsANJEHYgCcIOJEHYgSQIO5AEYQeSIOxAEoQdSIKwA0kQdiAJwg4kQdiBJAg7kARhB5Ig7EAShB1IgrADSRB2IAnCDiRB2IEkCDuQRN9TNuPscMkll5TWX3311dL6tddeW1p/4oknSutr164trWNwKoXd9iFJJyR9KemLiBitoykA9atjy/7jiPi4hvcB0CCO2YEkqoY9JL1h+y3bY7M9wfaY7QnbExWXBaCCqrvx10fEpO1vSdpq+58RsX3mEyJiXNK4JNmOissD0KdKW/aImCxuj0l6SdLyOpoCUL++w277fNsXnLov6RZJe+pqDEC9quzGj0h6yfap9/lTRPy5lq5Qm8suu6y0fs0115TWI8qPvHrVMTz6DntEvC/pezX2AqBBXHoDkiDsQBKEHUiCsANJEHYgCcIOJEHYgSQIO5AEYQeSIOxAEoQdSIKwA0kQdiAJwg4kQdiBJAg7kARhB5Ig7EAShB1IgrADSRB2IAmmbEYlN9xwQ2n9wgsv7Fr79NNP624HJdiyA0kQdiAJwg4kQdiBJAg7kARhB5Ig7EASXGdPrphyu2+jo6Ol9fnz53etcZ19sHpu2W1vsH3M9p4ZYwttb7V9oLhd0GybAKqay278M5JWnDa2TtK2iFgqaVvxGMAQ6xn2iNgu6fhpw6skbSzub5R0e819AahZv8fsIxExVdw/Immk2xNtj0ka63M5AGpS+QRdRITtKKmPSxqXpLLnAWhWv5fejtpeJEnF7bH6WgLQhH7DvkXSmuL+Gkkv19MOgKb03I23/ZykGyVdbPuwpEckPSpps+17JX0o6a4mm0RzIqodWU1PT9fUCZrWM+wRsbpL6aaaewHQID4uCyRB2IEkCDuQBGEHkiDsQBKEHUiCsANJEHYgCcIOJEHYgSQIO5AEYQeSIOxAEoQdSIKwA0kQdiAJwg4kQdiBJAg7kARhB5Ig7EAShB1IgrADSRB2IAnCDiRB2IEkCDuQBGEHkiDsQBI9Z3HF2e3IkSOl9YMHD5bWly5dWlqfN4/txdmi538p2xtsH7O9Z8bYetuTtncWfyubbRNAVXP5Z/kZSStmGf99RCwr/l6rty0AdesZ9ojYLun4AHoB0KAqB1wP2N5d7OYv6PYk22O2J2xPVFgWgIr6DfuTkq6UtEzSlKTHuj0xIsYjYjQiRvtcFoAa9BX2iDgaEV9GxLSkpyQtr7ctAHXrK+y2F814eIekPd2eC2A49LzObvs5STdKutj2YUmPSLrR9jJJIemQpPsa7BEVHDp0qLS+a9eu0vpVV11VWp+enj7TltCSnmGPiNWz
DD/dQC8AGsTHn4AkCDuQBGEHkiDsQBKEHUiCsANJEHYgCcIOJEHYgSQIO5AEYQeSIOxAEoQdSIKfkkajLrrooq61ycnJAXYCtuxAEoQdSIKwA0kQdiAJwg4kQdiBJAg7kIQjYnALswe3MMzJPffcU1rfuHFjad12aX3z5s1da6tXz/bDxagqImb9j8KWHUiCsANJEHYgCcIOJEHYgSQIO5AEYQeS4PvsyfW6Tt6rPm9e+fai1+sxOD237LaX2H7T9j7be20/WIwvtL3V9oHidkHz7QLo11x247+QtDYirpb0A0n3275a0jpJ2yJiqaRtxWMAQ6pn2CNiKiLeLu6fkLRf0mJJqySd+izlRkm3N9UkgOrO6Jjd9uWSrpP0D0kjETFVlI5IGunymjFJY/23CKAOcz4bb3u+pBckPRQRn82sRefbNLN+ySUixiNiNCJGK3UKoJI5hd32ueoE/dmIeLEYPmp7UVFfJOlYMy0CqEPP3Xh3rp08LWl/RDw+o7RF0hpJjxa3LzfSIRp16623ltZ7fQV6enq60usxOHM5Zv+hpJ9Kesf2zmLsYXVCvtn2vZI+lHRXMy0CqEPPsEfE3yV1+2TETfW2A6ApfFwWSIKwA0kQdiAJwg4kQdiBJAg7kARhB5Ig7EAShB1IgrADSRB2IAnCDiRB2IEk+Cnp5DZt2lRav/POOyu9/yeffFLp9agPW3YgCcIOJEHYgSQIO5AEYQeSIOxAEoQdSMKD/F1v2/yI+JBZsmRJaf2DDz4orW/fvr20ftttt3WtnTx5svS16E9EzPpr0GzZgSQIO5AEYQeSIOxAEoQdSIKwA0kQdiCJntfZbS+RtEnSiKSQNB4Rf7C9XtIvJP27eOrDEfFaj/fiOjvQsG7X2ecS9kWSFkXE27YvkPSWpNvVmY/9ZET8bq5NEHaged3CPpf52ackTRX3T9jeL2lxve0BaNoZHbPbvlzSdZL+UQw9YHu37Q22F3R5zZjtCdsTlToFUMmcPxtve76kv0r6TUS8aHtE0sfqHMf/Wp1d/Z/3eA9244GG9X3MLkm2z5X0iqTXI+LxWeqXS3olIr7b430IO9Cwvr8IY9uSnpa0f2bQixN3p9whaU/VJgE0Zy5n46+X9DdJ70iaLoYflrRa0jJ1duMPSbqvOJlX9l5s2YGGVdqNrwthB5rH99mB5Ag7kARhB5Ig7EAShB1IgrADSRB2IAnCDiRB2IEkCDuQBGEHkiDsQBKEHUiCsANJ9PzByZp9LOnDGY8vLsaG0bD2Nqx9SfTWrzp7+3a3wkC/z/61hdsTETHaWgMlhrW3Ye1Lord+Dao3duOBJAg7kETbYR9vefllhrW3Ye1Lord+DaS3Vo/ZAQxO21t2AANC2IEkWgm77RW237V90Pa6NnroxvYh2+/Y3tn2/HTFHHrHbO+ZMbbQ9lbbB4rbWefYa6m39bYni3W30/bKlnpbYvtN2/ts77X9YDHe6ror6Wsg623gx+y2z5H0nqSbJR2WtEPS6ojYN9BGurB9SNJoRLT+AQzbP5J0UtKmU1Nr2f6tpOMR8WjxD+WCiPjlkPS2Xmc4jXdDvXWbZvxnanHd1Tn9eT/a2LIvl3QwIt6PiM8lPS9pVQt9DL2I2C7p+GnDqyRtLO5vVOd/loHr0ttQiIipiHi7uH9C0qlpxltddyV9DUQbYV8s6aMZjw9ruOZ7D0lv2H7L9ljbzcxiZMY0W0ckjbTZzCx6TuM9SKdNMz40666f6c+r4gTd110fEd+X9BNJ9xe7q0MpOsdgw3Tt9ElJV6ozB+CUpMfabKaYZvwFSQ9FxGcza22uu1n6Gsh6ayPsk5KWzHh8aTE2FCJisrg9JukldQ47hsnRUzPoFrfHWu7nfyLiaER8GRHTkp5Si+uumGb8BUnPRsSLxXDr6262vga13toI+w5JS21fYfs8SXdL2tJCH19j+/zixIlsny/pFg3fVNRbJK0p7q+R9HKLvXzFsEzj3W2acbW87lqf/jwi
Bv4naaU6Z+T/JelXbfTQpa/vSNpV/O1tuzdJz6mzW/cfdc5t3Cvpm5K2STog6S+SFg5Rb39UZ2rv3eoEa1FLvV2vzi76bkk7i7+Vba+7kr4Gst74uCyQBCfogCQIO5AEYQeSIOxAEoQdSIKwA0kQdiCJ/wIrTKNeLgYTHwAAAABJRU5ErkJggg==\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "# The raw tensor\n",
- "raw_t = good_trial.tensor('Net_input_0').value(2700)[42]\n",
- "# We have to undo the transformations in 'transformer' above. First of all, multiply by 255\n",
- "raw_t = raw_t * 255\n",
- "# Then reshape from a 784-long vector to a 28x28 square.\n",
- "input_image = raw_t.reshape(28,28)\n",
- "plt.imshow(input_image, cmap=plt.get_cmap('gray'))\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can also plot the relative values emitted by the network. Notice that the last layer is of type `nn.Linear(500, 10)`: it will emit 10 separate confidences, one for each 0-9 digit. The one with the highest output is the predicted value.\n",
- "\n",
- "We can capture and plot the network output for the same sample."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "The network predicted the value: 1\n"
- ]
- }
- ],
- "source": [
- "plt.plot(good_trial.tensor('Net_output0').value(2700)[42], 'bo')\n",
- "plt.show()\n",
- "print('The network predicted the value: {}'.format(np.argmax(good_trial.tensor('Net_output0').value(2700)[42])))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Vanishing Gradient"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We have now worked through some of the basics. Let's pretend we are debugging a real problem: the vanishing gradient. When training a network, a `learning_rate` that is too high can drive the gradients to zero, so the network stops learning. Let's set `learning_rate=1`.\n",
- "\n",
- "Notice how the loss stays near 2.3, roughly ln(10), which is what guessing uniformly among the 10 classes gives, so the accuracy remains at around 10% - no better than random."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Train Epoch: 0 [0/60000 (0%)]\tLoss: 2.330259\n",
- "Train Epoch: 0 [640/60000 (1%)]\tLoss: 2.459480\n",
- "Train Epoch: 0 [1280/60000 (2%)]\tLoss: 2.342018\n",
- "Train Epoch: 0 [1920/60000 (3%)]\tLoss: 2.370183\n",
- "Train Epoch: 0 [2560/60000 (4%)]\tLoss: 2.306495\n",
- "Train Epoch: 0 [3200/60000 (5%)]\tLoss: 2.287591\n",
- "Train Epoch: 0 [3840/60000 (6%)]\tLoss: 2.284006\n",
- "Train Epoch: 0 [4480/60000 (7%)]\tLoss: 2.360036\n",
- "Train Epoch: 0 [5120/60000 (9%)]\tLoss: 2.298006\n",
- "Train Epoch: 0 [5760/60000 (10%)]\tLoss: 2.341758\n",
- "Train Epoch: 0 [6400/60000 (11%)]\tLoss: 2.304522\n",
- "Train Epoch: 0 [7040/60000 (12%)]\tLoss: 2.286000\n",
- "Train Epoch: 0 [7680/60000 (13%)]\tLoss: 2.293531\n",
- "Train Epoch: 0 [8320/60000 (14%)]\tLoss: 2.358820\n",
- "Train Epoch: 0 [8960/60000 (15%)]\tLoss: 2.307475\n",
- "Train Epoch: 0 [9600/60000 (16%)]\tLoss: 2.310409\n",
- "Train Epoch: 0 [10240/60000 (17%)]\tLoss: 2.324923\n",
- "Train Epoch: 0 [10880/60000 (18%)]\tLoss: 2.341324\n",
- "Train Epoch: 0 [11520/60000 (19%)]\tLoss: 2.346567\n",
- "Train Epoch: 0 [12160/60000 (20%)]\tLoss: 2.398468\n",
- "Train Epoch: 0 [12800/60000 (21%)]\tLoss: 2.327744\n",
- "Train Epoch: 0 [13440/60000 (22%)]\tLoss: 2.355408\n",
- "Train Epoch: 0 [14080/60000 (23%)]\tLoss: 2.342527\n",
- "Train Epoch: 0 [14720/60000 (25%)]\tLoss: 2.381174\n",
- "Train Epoch: 0 [15360/60000 (26%)]\tLoss: 2.327620\n",
- "Train Epoch: 0 [16000/60000 (27%)]\tLoss: 2.312094\n",
- "Train Epoch: 0 [16640/60000 (28%)]\tLoss: 2.349320\n",
- "Train Epoch: 0 [17280/60000 (29%)]\tLoss: 2.325104\n",
- "Train Epoch: 0 [17920/60000 (30%)]\tLoss: 2.410883\n",
- "Train Epoch: 0 [18560/60000 (31%)]\tLoss: 2.312054\n",
- "Train Epoch: 0 [19200/60000 (32%)]\tLoss: 2.328432\n",
- "Train Epoch: 0 [19840/60000 (33%)]\tLoss: 2.326895\n",
- "Train Epoch: 0 [20480/60000 (34%)]\tLoss: 2.294943\n",
- "Train Epoch: 0 [21120/60000 (35%)]\tLoss: 2.310558\n",
- "Train Epoch: 0 [21760/60000 (36%)]\tLoss: 2.353846\n",
- "Train Epoch: 0 [22400/60000 (37%)]\tLoss: 2.319908\n",
- "Train Epoch: 0 [23040/60000 (38%)]\tLoss: 2.286135\n",
- "Train Epoch: 0 [23680/60000 (39%)]\tLoss: 2.337246\n",
- "Train Epoch: 0 [24320/60000 (41%)]\tLoss: 2.360455\n",
- "Train Epoch: 0 [24960/60000 (42%)]\tLoss: 2.375340\n",
- "Train Epoch: 0 [25600/60000 (43%)]\tLoss: 2.329563\n",
- "Train Epoch: 0 [26240/60000 (44%)]\tLoss: 2.395788\n",
- "Train Epoch: 0 [26880/60000 (45%)]\tLoss: 2.309679\n",
- "Train Epoch: 0 [27520/60000 (46%)]\tLoss: 2.316720\n",
- "Train Epoch: 0 [28160/60000 (47%)]\tLoss: 2.346542\n",
- "Train Epoch: 0 [28800/60000 (48%)]\tLoss: 2.307389\n",
- "Train Epoch: 0 [29440/60000 (49%)]\tLoss: 2.317679\n",
- "Train Epoch: 0 [30080/60000 (50%)]\tLoss: 2.395548\n",
- "Train Epoch: 0 [30720/60000 (51%)]\tLoss: 2.491510\n",
- "Train Epoch: 0 [31360/60000 (52%)]\tLoss: 2.282305\n",
- "Train Epoch: 0 [32000/60000 (53%)]\tLoss: 2.254564\n",
- "Train Epoch: 0 [32640/60000 (54%)]\tLoss: 2.333660\n",
- "Train Epoch: 0 [33280/60000 (55%)]\tLoss: 2.362551\n",
- "Train Epoch: 0 [33920/60000 (57%)]\tLoss: 2.384102\n",
- "Train Epoch: 0 [34560/60000 (58%)]\tLoss: 2.326107\n",
- "Train Epoch: 0 [35200/60000 (59%)]\tLoss: 2.320798\n",
- "Train Epoch: 0 [35840/60000 (60%)]\tLoss: 2.464493\n",
- "Train Epoch: 0 [36480/60000 (61%)]\tLoss: 2.331449\n",
- "Train Epoch: 0 [37120/60000 (62%)]\tLoss: 2.415666\n",
- "Train Epoch: 0 [37760/60000 (63%)]\tLoss: 2.473129\n",
- "Train Epoch: 0 [38400/60000 (64%)]\tLoss: 2.397358\n",
- "Train Epoch: 0 [39040/60000 (65%)]\tLoss: 2.407380\n",
- "Train Epoch: 0 [39680/60000 (66%)]\tLoss: 2.462022\n",
- "Train Epoch: 0 [40320/60000 (67%)]\tLoss: 2.399495\n",
- "Train Epoch: 0 [40960/60000 (68%)]\tLoss: 2.306419\n",
- "Train Epoch: 0 [41600/60000 (69%)]\tLoss: 2.341526\n",
- "Train Epoch: 0 [42240/60000 (70%)]\tLoss: 2.294206\n",
- "Train Epoch: 0 [42880/60000 (71%)]\tLoss: 2.290251\n",
- "Train Epoch: 0 [43520/60000 (72%)]\tLoss: 2.317634\n",
- "Train Epoch: 0 [44160/60000 (74%)]\tLoss: 2.358360\n",
- "Train Epoch: 0 [44800/60000 (75%)]\tLoss: 2.332942\n",
- "Train Epoch: 0 [45440/60000 (76%)]\tLoss: 2.312563\n",
- "Train Epoch: 0 [46080/60000 (77%)]\tLoss: 2.328177\n",
- "Train Epoch: 0 [46720/60000 (78%)]\tLoss: 2.319674\n",
- "Train Epoch: 0 [47360/60000 (79%)]\tLoss: 2.349114\n",
- "Train Epoch: 0 [48000/60000 (80%)]\tLoss: 2.294027\n",
- "Train Epoch: 0 [48640/60000 (81%)]\tLoss: 2.344978\n",
- "Train Epoch: 0 [49280/60000 (82%)]\tLoss: 2.322978\n",
- "Train Epoch: 0 [49920/60000 (83%)]\tLoss: 2.308064\n",
- "Train Epoch: 0 [50560/60000 (84%)]\tLoss: 2.344061\n",
- "Train Epoch: 0 [51200/60000 (85%)]\tLoss: 2.403563\n",
- "Train Epoch: 0 [51840/60000 (86%)]\tLoss: 2.312183\n",
- "Train Epoch: 0 [52480/60000 (87%)]\tLoss: 2.287836\n",
- "Train Epoch: 0 [53120/60000 (88%)]\tLoss: 2.333920\n",
- "Train Epoch: 0 [53760/60000 (90%)]\tLoss: 2.312499\n",
- "Train Epoch: 0 [54400/60000 (91%)]\tLoss: 2.359949\n",
- "Train Epoch: 0 [55040/60000 (92%)]\tLoss: 2.363973\n",
- "Train Epoch: 0 [55680/60000 (93%)]\tLoss: 2.304453\n",
- "Train Epoch: 0 [56320/60000 (94%)]\tLoss: 2.454728\n",
- "Train Epoch: 0 [56960/60000 (95%)]\tLoss: 2.338699\n",
- "Train Epoch: 0 [57600/60000 (96%)]\tLoss: 2.289698\n",
- "Train Epoch: 0 [58240/60000 (97%)]\tLoss: 2.325969\n",
- "Train Epoch: 0 [58880/60000 (98%)]\tLoss: 2.334784\n",
- "Train Epoch: 0 [59520/60000 (99%)]\tLoss: 2.363783\n",
- "Train Epoch: 1 [0/60000 (0%)]\tLoss: 2.304468\n",
- "Train Epoch: 1 [640/60000 (1%)]\tLoss: 2.237942\n",
- "Train Epoch: 1 [1280/60000 (2%)]\tLoss: 2.379854\n",
- "Train Epoch: 1 [1920/60000 (3%)]\tLoss: 2.321807\n",
- "Train Epoch: 1 [2560/60000 (4%)]\tLoss: 2.326054\n",
- "Train Epoch: 1 [3200/60000 (5%)]\tLoss: 2.366278\n",
- "Train Epoch: 1 [3840/60000 (6%)]\tLoss: 2.262960\n",
- "Train Epoch: 1 [4480/60000 (7%)]\tLoss: 2.338786\n",
- "Train Epoch: 1 [5120/60000 (9%)]\tLoss: 2.378443\n",
- "Train Epoch: 1 [5760/60000 (10%)]\tLoss: 2.333674\n",
- "Train Epoch: 1 [6400/60000 (11%)]\tLoss: 2.328306\n",
- "Train Epoch: 1 [7040/60000 (12%)]\tLoss: 2.338466\n",
- "Train Epoch: 1 [7680/60000 (13%)]\tLoss: 2.356887\n",
- "Train Epoch: 1 [8320/60000 (14%)]\tLoss: 2.377309\n",
- "Train Epoch: 1 [8960/60000 (15%)]\tLoss: 2.312181\n",
- "Train Epoch: 1 [9600/60000 (16%)]\tLoss: 2.397353\n",
- "Train Epoch: 1 [10240/60000 (17%)]\tLoss: 2.364430\n",
- "Train Epoch: 1 [10880/60000 (18%)]\tLoss: 2.379686\n",
- "Train Epoch: 1 [11520/60000 (19%)]\tLoss: 2.351562\n",
- "Train Epoch: 1 [12160/60000 (20%)]\tLoss: 2.350115\n",
- "Train Epoch: 1 [12800/60000 (21%)]\tLoss: 2.244029\n",
- "Train Epoch: 1 [13440/60000 (22%)]\tLoss: 2.360412\n",
- "Train Epoch: 1 [14080/60000 (23%)]\tLoss: 2.315639\n",
- "Train Epoch: 1 [14720/60000 (25%)]\tLoss: 2.389025\n",
- "Train Epoch: 1 [15360/60000 (26%)]\tLoss: 2.397625\n",
- "Train Epoch: 1 [16000/60000 (27%)]\tLoss: 2.324974\n",
- "Train Epoch: 1 [16640/60000 (28%)]\tLoss: 2.326982\n",
- "Train Epoch: 1 [17280/60000 (29%)]\tLoss: 2.397022\n",
- "Train Epoch: 1 [17920/60000 (30%)]\tLoss: 2.341864\n",
- "Train Epoch: 1 [18560/60000 (31%)]\tLoss: 2.316780\n",
- "Train Epoch: 1 [19200/60000 (32%)]\tLoss: 2.290725\n",
- "Train Epoch: 1 [19840/60000 (33%)]\tLoss: 2.302054\n",
- "Train Epoch: 1 [20480/60000 (34%)]\tLoss: 2.341123\n",
- "Train Epoch: 1 [21120/60000 (35%)]\tLoss: 2.367768\n",
- "Train Epoch: 1 [21760/60000 (36%)]\tLoss: 2.341992\n",
- "Train Epoch: 1 [22400/60000 (37%)]\tLoss: 2.338322\n",
- "Train Epoch: 1 [23040/60000 (38%)]\tLoss: 2.355606\n",
- "Train Epoch: 1 [23680/60000 (39%)]\tLoss: 2.284112\n",
- "Train Epoch: 1 [24320/60000 (41%)]\tLoss: 2.374856\n",
- "Train Epoch: 1 [24960/60000 (42%)]\tLoss: 2.331543\n",
- "Train Epoch: 1 [25600/60000 (43%)]\tLoss: 2.321192\n",
- "Train Epoch: 1 [26240/60000 (44%)]\tLoss: 2.265647\n",
- "Train Epoch: 1 [26880/60000 (45%)]\tLoss: 2.298278\n",
- "Train Epoch: 1 [27520/60000 (46%)]\tLoss: 2.317490\n",
- "Train Epoch: 1 [28160/60000 (47%)]\tLoss: 2.272723\n",
- "Train Epoch: 1 [28800/60000 (48%)]\tLoss: 2.400963\n",
- "Train Epoch: 1 [29440/60000 (49%)]\tLoss: 2.507440\n",
- "Train Epoch: 1 [30080/60000 (50%)]\tLoss: 2.393094\n",
- "Train Epoch: 1 [30720/60000 (51%)]\tLoss: 2.419714\n",
- "Train Epoch: 1 [31360/60000 (52%)]\tLoss: 2.351777\n",
- "Train Epoch: 1 [32000/60000 (53%)]\tLoss: 2.419491\n",
- "Train Epoch: 1 [32640/60000 (54%)]\tLoss: 2.429312\n",
- "Train Epoch: 1 [33280/60000 (55%)]\tLoss: 2.297789\n",
- "Train Epoch: 1 [33920/60000 (57%)]\tLoss: 2.351755\n",
- "Train Epoch: 1 [34560/60000 (58%)]\tLoss: 2.342575\n",
- "Train Epoch: 1 [35200/60000 (59%)]\tLoss: 2.359362\n",
- "Train Epoch: 1 [35840/60000 (60%)]\tLoss: 2.323574\n",
- "Train Epoch: 1 [36480/60000 (61%)]\tLoss: 2.405147\n",
- "Train Epoch: 1 [37120/60000 (62%)]\tLoss: 2.372452\n",
- "Train Epoch: 1 [37760/60000 (63%)]\tLoss: 2.360568\n",
- "Train Epoch: 1 [38400/60000 (64%)]\tLoss: 2.419126\n",
- "Train Epoch: 1 [39040/60000 (65%)]\tLoss: 2.283723\n",
- "Train Epoch: 1 [39680/60000 (66%)]\tLoss: 2.336538\n",
- "Train Epoch: 1 [40320/60000 (67%)]\tLoss: 2.346513\n",
- "Train Epoch: 1 [40960/60000 (68%)]\tLoss: 2.304324\n",
- "Train Epoch: 1 [41600/60000 (69%)]\tLoss: 2.341439\n",
- "Train Epoch: 1 [42240/60000 (70%)]\tLoss: 2.361294\n",
- "Train Epoch: 1 [42880/60000 (71%)]\tLoss: 2.406241\n",
- "Train Epoch: 1 [43520/60000 (72%)]\tLoss: 2.300334\n",
- "Train Epoch: 1 [44160/60000 (74%)]\tLoss: 2.309165\n",
- "Train Epoch: 1 [44800/60000 (75%)]\tLoss: 2.361495\n",
- "Train Epoch: 1 [45440/60000 (76%)]\tLoss: 2.443631\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Train Epoch: 1 [46080/60000 (77%)]\tLoss: 2.314088\n",
- "Train Epoch: 1 [46720/60000 (78%)]\tLoss: 2.392107\n",
- "Train Epoch: 1 [47360/60000 (79%)]\tLoss: 2.323653\n",
- "Train Epoch: 1 [48000/60000 (80%)]\tLoss: 2.294007\n",
- "Train Epoch: 1 [48640/60000 (81%)]\tLoss: 2.314728\n",
- "Train Epoch: 1 [49280/60000 (82%)]\tLoss: 2.355071\n",
- "Train Epoch: 1 [49920/60000 (83%)]\tLoss: 2.349204\n",
- "Train Epoch: 1 [50560/60000 (84%)]\tLoss: 2.307742\n",
- "Train Epoch: 1 [51200/60000 (85%)]\tLoss: 2.347462\n",
- "Train Epoch: 1 [51840/60000 (86%)]\tLoss: 2.405621\n",
- "Train Epoch: 1 [52480/60000 (87%)]\tLoss: 2.423558\n",
- "Train Epoch: 1 [53120/60000 (88%)]\tLoss: 2.382326\n",
- "Train Epoch: 1 [53760/60000 (90%)]\tLoss: 2.320254\n",
- "Train Epoch: 1 [54400/60000 (91%)]\tLoss: 2.369054\n",
- "Train Epoch: 1 [55040/60000 (92%)]\tLoss: 2.321925\n",
- "Train Epoch: 1 [55680/60000 (93%)]\tLoss: 2.321786\n",
- "Train Epoch: 1 [56320/60000 (94%)]\tLoss: 2.339195\n",
- "Train Epoch: 1 [56960/60000 (95%)]\tLoss: 2.323470\n",
- "Train Epoch: 1 [57600/60000 (96%)]\tLoss: 2.421056\n",
- "Train Epoch: 1 [58240/60000 (97%)]\tLoss: 2.329478\n",
- "Train Epoch: 1 [58880/60000 (98%)]\tLoss: 2.315077\n",
- "Train Epoch: 1 [59520/60000 (99%)]\tLoss: 2.340677\n",
- "Train Epoch: 2 [0/60000 (0%)]\tLoss: 2.308716\n",
- "Train Epoch: 2 [640/60000 (1%)]\tLoss: 2.318739\n",
- "Train Epoch: 2 [1280/60000 (2%)]\tLoss: 2.294240\n",
- "Train Epoch: 2 [1920/60000 (3%)]\tLoss: 2.417459\n",
- "Train Epoch: 2 [2560/60000 (4%)]\tLoss: 2.342914\n",
- "Train Epoch: 2 [3200/60000 (5%)]\tLoss: 2.373566\n",
- "Train Epoch: 2 [3840/60000 (6%)]\tLoss: 2.307191\n",
- "Train Epoch: 2 [4480/60000 (7%)]\tLoss: 2.340860\n",
- "Train Epoch: 2 [5120/60000 (9%)]\tLoss: 2.305294\n",
- "Train Epoch: 2 [5760/60000 (10%)]\tLoss: 2.383138\n",
- "Train Epoch: 2 [6400/60000 (11%)]\tLoss: 2.337879\n",
- "Train Epoch: 2 [7040/60000 (12%)]\tLoss: 2.336192\n",
- "Train Epoch: 2 [7680/60000 (13%)]\tLoss: 2.339699\n",
- "Train Epoch: 2 [8320/60000 (14%)]\tLoss: 2.323756\n",
- "Train Epoch: 2 [8960/60000 (15%)]\tLoss: 2.305490\n",
- "Train Epoch: 2 [9600/60000 (16%)]\tLoss: 2.325570\n",
- "Train Epoch: 2 [10240/60000 (17%)]\tLoss: 2.288280\n",
- "Train Epoch: 2 [10880/60000 (18%)]\tLoss: 2.306230\n",
- "Train Epoch: 2 [11520/60000 (19%)]\tLoss: 2.342124\n",
- "Train Epoch: 2 [12160/60000 (20%)]\tLoss: 2.346761\n",
- "Train Epoch: 2 [12800/60000 (21%)]\tLoss: 2.428949\n",
- "Train Epoch: 2 [13440/60000 (22%)]\tLoss: 2.404235\n",
- "Train Epoch: 2 [14080/60000 (23%)]\tLoss: 2.278017\n",
- "Train Epoch: 2 [14720/60000 (25%)]\tLoss: 2.326802\n",
- "Train Epoch: 2 [15360/60000 (26%)]\tLoss: 2.358422\n",
- "Train Epoch: 2 [16000/60000 (27%)]\tLoss: 2.343786\n",
- "Train Epoch: 2 [16640/60000 (28%)]\tLoss: 2.293986\n",
- "Train Epoch: 2 [17280/60000 (29%)]\tLoss: 2.336337\n",
- "Train Epoch: 2 [17920/60000 (30%)]\tLoss: 2.321667\n",
- "Train Epoch: 2 [18560/60000 (31%)]\tLoss: 2.371925\n",
- "Train Epoch: 2 [19200/60000 (32%)]\tLoss: 2.374017\n",
- "Train Epoch: 2 [19840/60000 (33%)]\tLoss: 2.320786\n",
- "Train Epoch: 2 [20480/60000 (34%)]\tLoss: 2.342391\n",
- "Train Epoch: 2 [21120/60000 (35%)]\tLoss: 2.308513\n",
- "Train Epoch: 2 [21760/60000 (36%)]\tLoss: 2.271694\n",
- "Train Epoch: 2 [22400/60000 (37%)]\tLoss: 2.332782\n",
- "Train Epoch: 2 [23040/60000 (38%)]\tLoss: 2.360431\n",
- "Train Epoch: 2 [23680/60000 (39%)]\tLoss: 2.289818\n",
- "Train Epoch: 2 [24320/60000 (41%)]\tLoss: 2.305624\n",
- "Train Epoch: 2 [24960/60000 (42%)]\tLoss: 2.311587\n",
- "Train Epoch: 2 [25600/60000 (43%)]\tLoss: 2.331149\n",
- "Train Epoch: 2 [26240/60000 (44%)]\tLoss: 2.313762\n",
- "Train Epoch: 2 [26880/60000 (45%)]\tLoss: 2.349113\n",
- "Train Epoch: 2 [27520/60000 (46%)]\tLoss: 2.355408\n",
- "Train Epoch: 2 [28160/60000 (47%)]\tLoss: 2.304258\n",
- "Train Epoch: 2 [28800/60000 (48%)]\tLoss: 2.377938\n",
- "Train Epoch: 2 [29440/60000 (49%)]\tLoss: 2.321165\n",
- "Train Epoch: 2 [30080/60000 (50%)]\tLoss: 2.364525\n",
- "Train Epoch: 2 [30720/60000 (51%)]\tLoss: 2.406883\n",
- "Train Epoch: 2 [31360/60000 (52%)]\tLoss: 2.400862\n",
- "Train Epoch: 2 [32000/60000 (53%)]\tLoss: 2.334538\n",
- "Train Epoch: 2 [32640/60000 (54%)]\tLoss: 2.282245\n",
- "Train Epoch: 2 [33280/60000 (55%)]\tLoss: 2.300971\n",
- "Train Epoch: 2 [33920/60000 (57%)]\tLoss: 2.308848\n",
- "Train Epoch: 2 [34560/60000 (58%)]\tLoss: 2.333123\n",
- "Train Epoch: 2 [35200/60000 (59%)]\tLoss: 2.333816\n",
- "Train Epoch: 2 [35840/60000 (60%)]\tLoss: 2.313128\n",
- "Train Epoch: 2 [36480/60000 (61%)]\tLoss: 2.320728\n",
- "Train Epoch: 2 [37120/60000 (62%)]\tLoss: 2.311455\n",
- "Train Epoch: 2 [37760/60000 (63%)]\tLoss: 2.312425\n",
- "Train Epoch: 2 [38400/60000 (64%)]\tLoss: 2.301049\n",
- "Train Epoch: 2 [39040/60000 (65%)]\tLoss: 2.287769\n",
- "Train Epoch: 2 [39680/60000 (66%)]\tLoss: 2.368213\n",
- "Train Epoch: 2 [40320/60000 (67%)]\tLoss: 2.329561\n",
- "Train Epoch: 2 [40960/60000 (68%)]\tLoss: 2.296645\n",
- "Train Epoch: 2 [41600/60000 (69%)]\tLoss: 2.339840\n",
- "Train Epoch: 2 [42240/60000 (70%)]\tLoss: 2.400887\n",
- "Train Epoch: 2 [42880/60000 (71%)]\tLoss: 2.366787\n",
- "Train Epoch: 2 [43520/60000 (72%)]\tLoss: 2.371027\n",
- "Train Epoch: 2 [44160/60000 (74%)]\tLoss: 2.338437\n",
- "Train Epoch: 2 [44800/60000 (75%)]\tLoss: 2.389745\n",
- "Train Epoch: 2 [45440/60000 (76%)]\tLoss: 2.362866\n",
- "Train Epoch: 2 [46080/60000 (77%)]\tLoss: 2.440138\n",
- "Train Epoch: 2 [46720/60000 (78%)]\tLoss: 2.340149\n",
- "Train Epoch: 2 [47360/60000 (79%)]\tLoss: 2.426742\n",
- "Train Epoch: 2 [48000/60000 (80%)]\tLoss: 2.357159\n",
- "Train Epoch: 2 [48640/60000 (81%)]\tLoss: 2.400013\n",
- "Train Epoch: 2 [49280/60000 (82%)]\tLoss: 2.337224\n",
- "Train Epoch: 2 [49920/60000 (83%)]\tLoss: 2.369920\n",
- "Train Epoch: 2 [50560/60000 (84%)]\tLoss: 2.327389\n",
- "Train Epoch: 2 [51200/60000 (85%)]\tLoss: 2.318965\n",
- "Train Epoch: 2 [51840/60000 (86%)]\tLoss: 2.357245\n",
- "Train Epoch: 2 [52480/60000 (87%)]\tLoss: 2.421128\n",
- "Train Epoch: 2 [53120/60000 (88%)]\tLoss: 2.365572\n",
- "Train Epoch: 2 [53760/60000 (90%)]\tLoss: 2.359708\n",
- "Train Epoch: 2 [54400/60000 (91%)]\tLoss: 2.317222\n",
- "Train Epoch: 2 [55040/60000 (92%)]\tLoss: 2.371051\n",
- "Train Epoch: 2 [55680/60000 (93%)]\tLoss: 2.360173\n",
- "Train Epoch: 2 [56320/60000 (94%)]\tLoss: 2.345640\n",
- "Train Epoch: 2 [56960/60000 (95%)]\tLoss: 2.355781\n",
- "Train Epoch: 2 [57600/60000 (96%)]\tLoss: 2.335961\n",
- "Train Epoch: 2 [58240/60000 (97%)]\tLoss: 2.336265\n",
- "Train Epoch: 2 [58880/60000 (98%)]\tLoss: 2.383019\n",
- "Train Epoch: 2 [59520/60000 (99%)]\tLoss: 2.294914\n",
- "Train Epoch: 3 [0/60000 (0%)]\tLoss: 2.302218\n",
- "Train Epoch: 3 [640/60000 (1%)]\tLoss: 2.321162\n",
- "Train Epoch: 3 [1280/60000 (2%)]\tLoss: 2.301874\n",
- "Train Epoch: 3 [1920/60000 (3%)]\tLoss: 2.406926\n",
- "Train Epoch: 3 [2560/60000 (4%)]\tLoss: 2.365343\n",
- "Train Epoch: 3 [3200/60000 (5%)]\tLoss: 2.323746\n",
- "Train Epoch: 3 [3840/60000 (6%)]\tLoss: 2.344622\n",
- "Train Epoch: 3 [4480/60000 (7%)]\tLoss: 2.351114\n",
- "Train Epoch: 3 [5120/60000 (9%)]\tLoss: 2.407657\n",
- "Train Epoch: 3 [5760/60000 (10%)]\tLoss: 2.418502\n",
- "Train Epoch: 3 [6400/60000 (11%)]\tLoss: 2.337087\n",
- "Train Epoch: 3 [7040/60000 (12%)]\tLoss: 2.303796\n",
- "Train Epoch: 3 [7680/60000 (13%)]\tLoss: 2.401513\n",
- "Train Epoch: 3 [8320/60000 (14%)]\tLoss: 2.337463\n",
- "Train Epoch: 3 [8960/60000 (15%)]\tLoss: 2.324577\n",
- "Train Epoch: 3 [9600/60000 (16%)]\tLoss: 2.335718\n",
- "Train Epoch: 3 [10240/60000 (17%)]\tLoss: 2.384667\n",
- "Train Epoch: 3 [10880/60000 (18%)]\tLoss: 2.267396\n",
- "Train Epoch: 3 [11520/60000 (19%)]\tLoss: 2.306527\n",
- "Train Epoch: 3 [12160/60000 (20%)]\tLoss: 2.367751\n",
- "Train Epoch: 3 [12800/60000 (21%)]\tLoss: 2.309073\n",
- "Train Epoch: 3 [13440/60000 (22%)]\tLoss: 2.315047\n",
- "Train Epoch: 3 [14080/60000 (23%)]\tLoss: 2.347873\n",
- "Train Epoch: 3 [14720/60000 (25%)]\tLoss: 2.268999\n",
- "Train Epoch: 3 [15360/60000 (26%)]\tLoss: 2.333838\n",
- "Train Epoch: 3 [16000/60000 (27%)]\tLoss: 2.349008\n",
- "Train Epoch: 3 [16640/60000 (28%)]\tLoss: 2.375700\n",
- "Train Epoch: 3 [17280/60000 (29%)]\tLoss: 2.331388\n",
- "Train Epoch: 3 [17920/60000 (30%)]\tLoss: 2.335067\n",
- "Train Epoch: 3 [18560/60000 (31%)]\tLoss: 2.332542\n",
- "Train Epoch: 3 [19200/60000 (32%)]\tLoss: 2.345043\n",
- "Train Epoch: 3 [19840/60000 (33%)]\tLoss: 2.300745\n",
- "Train Epoch: 3 [20480/60000 (34%)]\tLoss: 2.416367\n",
- "Train Epoch: 3 [21120/60000 (35%)]\tLoss: 2.282617\n",
- "Train Epoch: 3 [21760/60000 (36%)]\tLoss: 2.317955\n",
- "Train Epoch: 3 [22400/60000 (37%)]\tLoss: 2.329546\n",
- "Train Epoch: 3 [23040/60000 (38%)]\tLoss: 2.333439\n",
- "Train Epoch: 3 [23680/60000 (39%)]\tLoss: 2.432110\n",
- "Train Epoch: 3 [24320/60000 (41%)]\tLoss: 2.389215\n",
- "Train Epoch: 3 [24960/60000 (42%)]\tLoss: 2.317299\n",
- "Train Epoch: 3 [25600/60000 (43%)]\tLoss: 2.398170\n",
- "Train Epoch: 3 [26240/60000 (44%)]\tLoss: 2.354642\n",
- "Train Epoch: 3 [26880/60000 (45%)]\tLoss: 2.310941\n",
- "Train Epoch: 3 [27520/60000 (46%)]\tLoss: 2.352980\n",
- "Train Epoch: 3 [28160/60000 (47%)]\tLoss: 2.370045\n",
- "Train Epoch: 3 [28800/60000 (48%)]\tLoss: 2.332853\n",
- "Train Epoch: 3 [29440/60000 (49%)]\tLoss: 2.328536\n",
- "Train Epoch: 3 [30080/60000 (50%)]\tLoss: 2.410731\n",
- "Train Epoch: 3 [30720/60000 (51%)]\tLoss: 2.315743\n",
- "Train Epoch: 3 [31360/60000 (52%)]\tLoss: 2.362804\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Train Epoch: 3 [32000/60000 (53%)]\tLoss: 2.268909\n",
- "Train Epoch: 3 [32640/60000 (54%)]\tLoss: 2.324456\n",
- "Train Epoch: 3 [33280/60000 (55%)]\tLoss: 2.313516\n",
- "Train Epoch: 3 [33920/60000 (57%)]\tLoss: 2.345426\n",
- "Train Epoch: 3 [34560/60000 (58%)]\tLoss: 2.328141\n",
- "Train Epoch: 3 [35200/60000 (59%)]\tLoss: 2.392569\n",
- "Train Epoch: 3 [35840/60000 (60%)]\tLoss: 2.333704\n",
- "Train Epoch: 3 [36480/60000 (61%)]\tLoss: 2.352234\n",
- "Train Epoch: 3 [37120/60000 (62%)]\tLoss: 2.323742\n",
- "Train Epoch: 3 [37760/60000 (63%)]\tLoss: 2.318627\n",
- "Train Epoch: 3 [38400/60000 (64%)]\tLoss: 2.340932\n",
- "Train Epoch: 3 [39040/60000 (65%)]\tLoss: 2.401247\n",
- "Train Epoch: 3 [39680/60000 (66%)]\tLoss: 2.390721\n",
- "Train Epoch: 3 [40320/60000 (67%)]\tLoss: 2.372447\n",
- "Train Epoch: 3 [40960/60000 (68%)]\tLoss: 2.381556\n",
- "Train Epoch: 3 [41600/60000 (69%)]\tLoss: 2.370942\n",
- "Train Epoch: 3 [42240/60000 (70%)]\tLoss: 2.382010\n",
- "Train Epoch: 3 [42880/60000 (71%)]\tLoss: 2.337266\n",
- "Train Epoch: 3 [43520/60000 (72%)]\tLoss: 2.327033\n",
- "Train Epoch: 3 [44160/60000 (74%)]\tLoss: 2.379825\n",
- "Train Epoch: 3 [44800/60000 (75%)]\tLoss: 2.323223\n",
- "Train Epoch: 3 [45440/60000 (76%)]\tLoss: 2.283228\n",
- "Train Epoch: 3 [46080/60000 (77%)]\tLoss: 2.334064\n",
- "Train Epoch: 3 [46720/60000 (78%)]\tLoss: 2.381998\n",
- "Train Epoch: 3 [47360/60000 (79%)]\tLoss: 2.324826\n",
- "Train Epoch: 3 [48000/60000 (80%)]\tLoss: 2.344363\n",
- "Train Epoch: 3 [48640/60000 (81%)]\tLoss: 2.407687\n",
- "Train Epoch: 3 [49280/60000 (82%)]\tLoss: 2.405679\n",
- "Train Epoch: 3 [49920/60000 (83%)]\tLoss: 2.347231\n",
- "Train Epoch: 3 [50560/60000 (84%)]\tLoss: 2.381284\n",
- "Train Epoch: 3 [51200/60000 (85%)]\tLoss: 2.320855\n",
- "Train Epoch: 3 [51840/60000 (86%)]\tLoss: 2.332896\n",
- "Train Epoch: 3 [52480/60000 (87%)]\tLoss: 2.331153\n",
- "Train Epoch: 3 [53120/60000 (88%)]\tLoss: 2.318925\n",
- "Train Epoch: 3 [53760/60000 (90%)]\tLoss: 2.324926\n",
- "Train Epoch: 3 [54400/60000 (91%)]\tLoss: 2.312320\n",
- "Train Epoch: 3 [55040/60000 (92%)]\tLoss: 2.316199\n",
- "Train Epoch: 3 [55680/60000 (93%)]\tLoss: 2.314889\n",
- "Train Epoch: 3 [56320/60000 (94%)]\tLoss: 2.341814\n",
- "Train Epoch: 3 [56960/60000 (95%)]\tLoss: 2.343977\n",
- "Train Epoch: 3 [57600/60000 (96%)]\tLoss: 2.324932\n",
- "Train Epoch: 3 [58240/60000 (97%)]\tLoss: 2.346128\n",
- "Train Epoch: 3 [58880/60000 (98%)]\tLoss: 2.328274\n",
- "Train Epoch: 3 [59520/60000 (99%)]\tLoss: 2.302693\n"
- ]
- }
- ],
- "source": [
- "model = create_net(tornasole_save_interval=100, base_loc='./ts_output', run_id='bad')\n",
- "train(model=model, epochs=4, learning_rate=1.0, momentum=0.9, batch_size=64, device=torch.device(\"cpu\"))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 21,
- "metadata": {},
- "outputs": [],
- "source": [
- "bad_trial = LocalTrial('myrun', './ts_output/bad/')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can plot the gradients - notice how every single one of them (apart from one) goes to zero and stays there!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 22,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plot_gradients(bad_trial)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The `VanishingGradient` rule provided by Tornasole detects this automatically."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "input_image = (bad_trial.tensor('Net_input_0').value(2700)[42]*255).reshape(28,28)\n",
- "plt.imshow(input_image, cmap=plt.get_cmap('gray'))\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXwAAAD4CAYAAADvsV2wAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAPCElEQVR4nO3dTYilV53H8e+v0zpaMWMLaRiSTncFIkpo4kQKiQYcsLOIbUjDrCKlELOoGfAlI0IwUwtXPZuIKEQMRdSNhS4yCSPS0e6gDswiwcoLMUnHENTuvInlQkesRabxP4t7Q6rbe6u7+rlVT+We7weKW8+5p+7510PXL0/OPfc8qSokSdNvV98FSJK2h4EvSY0w8CWpEQa+JDXCwJekRuzuu4CNXH755TU7O9t3GZL0lvH444//oar2jnpuRwf+7OwsKysrfZchSW8ZSU6Ne84pHUlqhIEvSY0w8CWpEQa+JDXCwJekRnQK/CT3JHk+ydNJHkqyZ0y/m5P8KsmLSb7cZcy3iuVlmJ2FXbsGj8vLfVckqXVdr/BPAAer6jrgBeDuczskuQT4JvBx4Frgk0mu7Tjujra8DAsLcOoUVA0eFxYMfUn96hT4VXW8qs4MDx8F9o3o9iHgxar6dVW9DvwAONJl3J1ucRHW1s5uW1sbtEtSXyY5h38H8PCI9iuBl9YdvzxsGynJQpKVJCurq6sTLG/7nD69uXZJ2g7nDfwkjyR5ZsTXkXV9FoEzQOdJi6paqqq5qprbu3fkp4N3vP37N9cuSdvhvFsrVNVNGz2f5HbgFuBQjb591ivAVeuO9w3bptbRo4M5+/XTOjMzg3ZJ6kvXVTo3A3cBt1bV2phuvwDem+TqJG8HbgN+2GXcnW5+HpaW4MABSAaPS0uDdknqS9fN0+4F/g44kQTg0ar61yRXAPdX1eGqOpPkc8BPgEuA71TVsx3H3fHm5w14STtLp8CvqmvGtL8KHF53fAw41mUsSVI3ftJWkhph4EtSIwx8SWqEgS9JjTDwJakRBr4kNcLAl6RGGPiS1AgDX5IaYeBLUiMMfElqhIEvSY0w8CWpEQa+JDXCwJekRhj4ktQIA1+SGmHgS1IjDHxJaoSBL6lXy8swOwu7dg0el5f7rmh6dbqJuSR1sbwMCwuwtjY4PnVqcAwwP99fXdPKK3xJvVlcfDPs37C2NmjX5Bn4knpz+vTm2tWNgS+pN/v3b65d3Rj4knpz9CjMzJzdNjMzaNfkGfiSejM/D0tLcOAAJIPHpSXfsN0qrtKR1Kv5eQN+u3iFL0mNMPAlqREGviQ1wsCXpEYY+JLUCANfUvNa2cDNZZmSmtbSBm5e4UtqWksbuHUK/CT3JHk+ydNJHkqyZ0Sfq5L8LMlzSZ5NcmeXMSVpklrawK3rFf4J4GBVXQe8ANw9os8Z4EtVdS1wA/DZJNd2HFeSJqKlDdw6BX5VHa+qM8PDR4F9I/q8VlVPDL//M3ASuLLLuJI0KS1t4DbJOfw7gIc36pBkFrgeeGyDPgtJVpKsrK6uTrA8ta6VlRjanJY2cEtVbdwheQT4hxFPLVbVfw37LAJzwD/XmBdM8i7gv4GjVfXghRQ3NzdXKysrF9JV2tC5KzFgcBU3rX/YaleSx6tqbuRz5wv8C3jx24F/AQ5V1dqYPm8DfgT8pKq+dqGvbeBrUmZnB8vtznXgAPz2t9tdjbR1Ngr8Tuvwk9wM3AX80wZhH+DbwMnNhL00SS2txJDG6TqHfy9wGXAiyVNJ7gNIckWSY8M+NwKfBj427PNUksMdx5U2paWVGNI4na7wq+qaMe2vAoeH3/8PkC7jSF0dPTp6Dn8aV2JI4/hJWzWhpZUY0jjupaNmeCs9tc4rfElqhIEvSY0w8CWpEQa+JDXCwJekRhj42nJuWibtDC7L1JZq6fZx0k7nFb62VEu3j5N2OgNfW8pNy6Sdw8DXlnLTMmnnMPC1pVq6
fZzU1VYvcDDwtaXctEy6MG8scDh1CqreXOAwydDvfMerreQdryS1YlJ3Zdvojlde4UvSDrAdCxwMfEnaAbZjgYOBL0k7wHYscDDwJWkH2I4FDm6tIEk7xFbflc0rfElqhIEvSY0w8CWpEQb+FHMfeknr+abtlHIfeknn8gp/SrkPvaRzGfhTyn3oJZ3LwJ9S7kMv6VwG/pRyH3pJ5zLwp5T70Es6l6t0pthWf0xb0luLV/iS1AgDX5IaYeBLUiMMfElqRKfAT3JPkueTPJ3koSR7Nuh7SZInk/yoy5iSpIvT9Qr/BHCwqq4DXgDu3qDvncDJjuNJki5Sp8CvquNVdWZ4+Ciwb1S/JPuATwD3dxlPknTxJjmHfwfw8Jjnvg7cBfz1fC+SZCHJSpKV1dXVCZYnSW07b+AneSTJMyO+jqzrswicAf5mx/UktwC/r6rHL6Sgqlqqqrmqmtu7d+8mfhVJ0kbO+0nbqrppo+eT3A7cAhyqqhrR5Ubg1iSHgXcAf5/ke1X1qYuoV5J0kbqu0rmZwVTNrVW1NqpPVd1dVfuqaha4DfipYS9J26/rHP69wGXAiSRPJbkPIMkVSY51rk6SNDGdNk+rqmvGtL8KHB7R/nPg513GlCRdHD9pK0mNMPAlqREGviQ1wsCXpEYY+JLUCANf2mbLyzA7C7t2DR6X/+bz6dLW8J620jZaXoaFBVgbfkzx1KnBMXj/YW09r/ClbbS4+GbYv2FtbdAubTUDX9pGp09vrl2aJANf2kb792+uXZokA1/aRkePwszM2W0zM4N2aatNXeC7AkI72fw8LC3BgQOQDB6XlnzDVttjqlbpuAJCbwXz8/57VD+m6grfFRCSNN5UBb4rICRpvKkKfFdASNJ4UxX4roCQpPGmKvBdASFJ403VKh1wBYQkjTNVV/iSpPEMfElqhIEvSY0w8CWpEQa+JDXCwJekRhj4ktQIA1+SGmHgS1IjDHxJaoSBL0mNMPAlqREGviQ1wsCXpEYY+JLUCANfkhrRKfCT3JPk+SRPJ3koyZ4x/fYkeWDY92SSD3cZV5K0eV2v8E8AB6vqOuAF4O4x/b4B/Liq3g98ADjZcVxJ0iZ1CvyqOl5VZ4aHjwL7zu2T5N3AR4FvD3/m9ar6Y5dxJUmbN8k5/DuAh0e0Xw2sAt9N8mSS+5NcOsFxJUkX4LyBn+SRJM+M+Dqyrs8icAZYHvESu4EPAt+qquuBvwBf3mC8hSQrSVZWV1c3/QtJkkbbfb4OVXXTRs8nuR24BThUVTWiy8vAy1X12PD4ATYI/KpaApYA5ubmRr2eJOkidF2lczNwF3BrVa2N6lNVvwNeSvK+YdMh4Lku40qSNq/rHP69wGXAiSRPJbkPIMkVSY6t6/d5YDnJ08A/Av/RcVxJ0iadd0pnI1V1zZj2V4HD646fAua6jCVJ6sZP2kpSIwx8SWqEgS9JjTDwJakRBr4kNcLAl6RGGPiS1AgDX5IaYeBLUiMMfElqhIEvSY0w8CWpEQa+JDXCwJekRhj4ktQIA1+SGmHgS1IjDHxJaoSBL0mNMPAlqREGviQ1wsCXpEYY+JLUCANfkhph4EtSIwx8SWqEgS9JjTDwJakRBr4kNcLAl6RGGPiS1AgDX5IaYeBLUiMMfElqhIEvSY3oFPhJ7knyfJKnkzyUZM+Yfl9M8mySZ5J8P8k7uowrSdq8rlf4J4CDVXUd8AJw97kdklwJfAGYq6qDwCXAbR3HlSRtUqfAr6rjVXVmePgosG9M193AO5PsBmaAV7uMK0navEnO4d8BPHxuY1W9AnwVOA28Bvypqo5PcFxJ0gU4b+AneWQ4937u15F1fRaBM8DyiJ9/D3AEuBq4Arg0yac2GG8hyUqSldXV1Yv5nSRJI+w+X4eqummj55PcDtwCHKqqGtHlJuA3VbU67P8g8BHge2PGWwKWAObm5ka9niTpInRdpXMzcBdwa1Wtjel2GrghyUySAIeAk13GlSRtXtc5/HuBy4AT
SZ5Kch9AkiuSHAOoqseAB4AngF8Ox1zqOK4kaZPOO6Wzkaq6Zkz7q8DhdcdfAb7SZSxJUjd+0laSGmHgS1IjDHxJaoSBL0mNMPAlqREGviQ1wsCXpEYY+JLUCANfkhph4EtSIwx8SWqEgS9JjTDwJakRBr7UqOVlmJ2FXbsGj8t/c786TZtO2yNLemtaXoaFBVgb3rbo1KnBMcD8fH91aWt5hS81aHHxzbB/w9raoF3Ty8CXGnT69ObaNR0MfKlB+/dvrl3TwcCXGnT0KMzMnN02MzNo1/Qy8KUGzc/D0hIcOADJ4HFpyTdsp52rdKRGzc8b8K3xCl+SGmHgS1IjDHxJaoSBL0mNMPAlqRGpqr5rGCvJKnDqIn/8cuAPEyznrcxzcTbPx9k8H2+ahnNxoKr2jnpiRwd+F0lWqmqu7zp2As/F2TwfZ/N8vGnaz4VTOpLUCANfkhoxzYG/1HcBO4jn4myej7N5Pt401ediaufwJUlnm+YrfEnSOga+JDVi6gI/yc1JfpXkxSRf7ruePiW5KsnPkjyX5Nkkd/ZdU9+SXJLkySQ/6ruWviXZk+SBJM8nOZnkw33X1KckXxz+nTyT5PtJ3tF3TZM2VYGf5BLgm8DHgWuBTya5tt+qenUG+FJVXQvcAHy28fMBcCdwsu8idohvAD+uqvcDH6Dh85LkSuALwFxVHQQuAW7rt6rJm6rABz4EvFhVv66q14EfAEd6rqk3VfVaVT0x/P7PDP6gr+y3qv4k2Qd8Ari/71r6luTdwEeBbwNU1etV9cd+q+rdbuCdSXYDM8CrPdczcdMW+FcCL607fpmGA269JLPA9cBj/VbSq68DdwF/7buQHeBqYBX47nCK6/4kl/ZdVF+q6hXgq8Bp4DXgT1V1vN+qJm/aAl8jJHkX8J/Av1XV//ZdTx+S3AL8vqoe77uWHWI38EHgW1V1PfAXoNn3vJK8h8FswNXAFcClST7Vb1WTN22B/wpw1brjfcO2ZiV5G4OwX66qB/uup0c3Arcm+S2Dqb6PJflevyX16mXg5ap64//4HmDwH4BW3QT8pqpWq+r/gAeBj/Rc08RNW+D/AnhvkquTvJ3Bmy4/7Lmm3iQJgznak1X1tb7r6VNV3V1V+6pqlsG/i59W1dRdwV2oqvod8FKS9w2bDgHP9VhS304DNySZGf7dHGIK38SeqpuYV9WZJJ8DfsLgXfbvVNWzPZfVpxuBTwO/TPLUsO3fq+pYjzVp5/g8sDy8OPo18Jme6+lNVT2W5AHgCQar255kCrdZcGsFSWrEtE3pSJLGMPAlqREGviQ1wsCXpEYY+JLUCANfkhph4EtSI/4fCqIpvyP0+eIAAAAASUVORK5CYII=\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "The network predicted the value: 1\n"
- ]
- }
- ],
- "source": [
- "plt.plot(bad_trial.tensor('Net_output0').value(2700)[42], 'bo')\n",
- "plt.show()\n",
- "print('The network predicted the value: {}'.format(np.argmax(bad_trial.tensor('Net_output0').value(2700)[42])))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "This concludes this notebook. For more information see the APIs at\n",
- "- https://github.com/awslabs/tornasole_core"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.3"
- },
- "pycharm": {
- "stem_cell": {
- "cell_type": "raw",
- "source": [],
- "metadata": {
- "collapsed": false
- }
- }
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/examples/pytorch/sagemaker-notebooks/pytorch.ipynb b/examples/pytorch/sagemaker-notebooks/pytorch.ipynb
deleted file mode 100644
index 4a40d7a9a..000000000
--- a/examples/pytorch/sagemaker-notebooks/pytorch.ipynb
+++ /dev/null
@@ -1,449 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Debugging SageMaker Training Jobs with Tornasole"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Overview"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Tornasole is a new capability of Amazon SageMaker that allows debugging machine learning training. \n",
- "It lets you go beyond just looking at scalars like losses and accuracies during training and gives \n",
- "you full visibility into all tensors 'flowing through the graph' during training.\n",
- "\n",
- "Using Tornasole is a two step process: Saving tensors and Analysis. Let's look at each one of them closely.\n",
- "\n",
- "### Saving tensors\n",
- "\n",
- "Tensors define the state of the training job at any particular instant in its lifecycle. Tornasole exposes a library which allows you to capture these tensors and save them for analysis. Tornasole is highly customizable, letting you save the tensors you want at different frequencies. Refer to [DeveloperGuide_PyTorch](../../DeveloperGuide_PyTorch.md) for details on how to choose which tensors to save.\n",
- "\n",
- "### Analysis\n",
- "\n",
- "Analysis of the emitted tensors is captured by the Tornasole concept called ***Rules***. At a very broad level, \n",
- "a Rule is Python code used to detect certain conditions during training. Conditions that a data scientist training a deep learning model may care about include gradients getting too large or too small, overfitting, and so on.\n",
- "Tornasole comes pre-packaged with certain rules. Users can also write their own rules using the Tornasole APIs.\n",
- "You can also analyze raw tensor data outside of the Rules construct in, say, a SageMaker notebook, using Tornasole's full set of APIs. \n",
- "Please refer to [DeveloperGuide_Rules](../../../rules/DeveloperGuide_Rules.md) for more details about analysis.\n",
- "\n",
- "This example guides you through installation of the required components for emitting tensors in a \n",
- "SageMaker training job and applying a rule over the tensors to monitor the live status of the job. \n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Setup\n",
- "\n",
- "As a first step, we'll install the required tools, which allow emission of tensors (saving tensors) and application of rules to analyze them."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!aws s3 sync s3://tornasole-external-preview-use1/ ~/tornasole-preview\n",
- "!pip -q install ~/tornasole-preview/sdk/sagemaker-tornasole-latest.tar.gz\n",
- "!aws configure add-model --service-model file://`echo ~/tornasole-preview/sdk/sagemaker-smdebug.json` --service-name sagemaker"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now that we've completed the setup, we're ready to spin off a training job with debugging enabled. \n",
- "\n",
- "## Enable Tornasole in the training script\n",
- "\n",
- "### Import the hook package\n",
- "Import the SessionHook class along with other helper classes in your training script as shown below\n",
- "\n",
- "```\n",
- "from smdebug.pytorch import SaveConfig, SessionHook\n",
- "```\n",
- "\n",
- "### Instantiate and initialize hook\n",
- "\n",
- "```\n",
- " # Create SaveConfig that instructs engine to log graph tensors every 10 steps.\n",
- " save_config = SaveConfig(save_interval=10)\n",
- " \n",
- " # Create a hook that logs tensors of weights, biases and gradients while training the model.\n",
- " \n",
- " hook = SessionHook(save_config=save_config)\n",
- "```\n",
- "\n",
- "For additional details on SessionHook, SaveConfig and Collection please refer to the [API documentation](api.md)\n",
- "\n",
- "### Register the Tornasole hook with the model before training starts\n",
- "\n",
- "\n",
- "After creating or loading your desired model, you can register the hook with the model as shown below.\n",
- "\n",
- "```\n",
- "net = create_model()\n",
- "# Apply hook to the model\n",
- "# and enable mode in which engine will log graph tensors\n",
- "hook.register_hook(net)\n",
- "```\n",
- "\n",
- "#### Set the mode\n",
- "Tornasole has the concept of modes (TRAIN, EVAL, PREDICT) to separate the different phases of a job. Set the mode your job is currently running in, and set it again every time the mode changes. This helps you group steps by mode, for easier analysis. Setting the mode is optional but recommended. If you do not specify a mode, Tornasole saves all steps under a GLOBAL mode.\n",
- "\n",
- "\n",
- "```\n",
- "hook.set_mode(smd.modes.TRAIN)\n",
- "```\n",
- "\n",
- "Refer to [DeveloperGuide_PyTorch.md](../../DeveloperGuide_PyTorch.md) for more details on the APIs Tornasole provides to help you save the tensors in different forms at the frequency you want.\n",
- "\n",
- "#### Note\n",
- "Tornasole currently works only for single-process training. We will support distributed training very soon. \n",
- "\n",
- "## Start SageMaker training with Tornasole enabled\n",
- "\n",
- "We'll be training a simple PyTorch model using the script [simple.py](../scripts/simple.py).\n",
- "This will be done using the SageMaker PyTorch 1.1.0 container in Script Mode.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import sagemaker\n",
- "from sagemaker.pytorch import PyTorch"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "### Inputs\n",
- "Configuring the inputs for the training job. The command line arguments taken by the script\n",
- "can be passed using the hyperparameters dictionary below.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "entry_point_script = '../scripts/simple.py'\n",
- "docker_image_name= '072677473360.dkr.ecr.us-west-2.amazonaws.com/tornasole-preprod-pytorch-1.1.0-cpu:latest'\n",
- "hyperparameters = {'epochs': 2, 'lr' : 0.01, 'momentum' : 0.9, 'save-frequency' : 3, 'steps' : 10, 'hook-type' : 'saveall', 'random-seed' : True }\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Storage\n",
- "The tensors saved by Tornasole are, by default, stored in the S3 output path of the training job, \n",
- "under the folder **`/tensors-`**. This is done to ensure that we don't end up accidentally \n",
- "overwriting the tensors from one training job with those from another. Rule evaluation requires the \n",
- "tensor paths to be kept separate in order to evaluate each job correctly.\n",
- "\n",
- "If you don't provide an S3 output path to the estimator, SageMaker creates one for you as:\n",
- "**`s3://sagemaker--/`**\n",
- "\n",
- "\n",
- "This path is used to create a Tornasole Trial taken by Rules (see below). \n",
- "\n",
- "#### New Parameters\n",
- "The new parameters in the SageMaker Estimator to look out for are:\n",
- "\n",
- "##### `debug` (bool)\n",
- "This indicates that debugging should be enabled for the training job. \n",
- "Setting this to `True` makes Tornasole available for use with the job.\n",
- "\n",
- "##### `rules_specification` (list[*dict*])\n",
- "This is a list of python dictionaries, where each `dict` is of the following form:\n",
- "```\n",
- "{\n",
- " \"RuleName\": # The name of the class implementing the Tornasole Rule interface. (required)\n",
- " \"SourceS3Uri\": # S3 URI of the rule script containing the class in 'RuleName'. \n",
- " If left empty, it would look for the class in one of the First Party rules already provided to you by Amazon. \n",
- " If not, SageMaker will try to look for the rule class in the script\n",
- " \"InstanceType\": # The ml instance type in which the rule evaluation should run\n",
- " \"VolumeSizeInGB\": # The volume size to store the runtime artifacts from the rule evaluation\n",
- " \"RuntimeConfigurations\": {\n",
- " # Map defining the parameters required to instantiate the Rule class and\n",
- " # parameters regarding invocation of the rule (start-step and end-step)\n",
- " # This can be any parameter taken by the rule\n",
- " : \n",
- " }\n",
- "}\n",
- "```"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Rules\n",
- "Rules are the medium by which Tornasole executes a certain piece of code regularly on different steps of the job.\n",
- "They can be used to assert certain conditions during training, and raise CloudWatch Events based on them, which you can\n",
- "then process in any way you like. \n",
- "\n",
- "A Trial in Tornasole's context\n",
- "refers to a training job. It is identified by the path where the saved tensors for the job are stored. \n",
- "A rule takes a `base_trial` which refers to the job whose run invokes the rule execution.\n",
- "A rule can optionally look at other jobs as well, passed using the argument `other_trials`. \n",
- "\n",
- "Tornasole comes with a set of first party rules (1P rules).\n",
- "You can also write your own rules looking at these 1P rules for inspiration. \n",
- "Refer to [DeveloperGuide_Rules.md](../../../rules/DeveloperGuide_Rules.md) for more.\n",
- " \n",
- "Here we will talk about how to use SageMaker to evaluate these rules on the training jobs.\n",
- "##### 1P Rule \n",
- "To use a 1P rule, specify the RuleName field with the 1P rule's name, \n",
- "and the rule will be automatically applied. You can pass any parameters accepted by the \n",
- "rule as part of the RuntimeConfigurations dictionary. The arguments `base_trial` (and `other_trials` if \n",
- "taken by the rule) can be passed as the S3 path where the tensors for \n",
- "the trial are stored in the RuntimeConfigurations dictionary above.\n",
- "\n",
- "Here's an example of a complex configuration for the SimilarAcrossRuns rule (which accepts another trial and a regex pattern) \n",
- "where we ask for the rule to be invoked for the steps between 10 and 100.\n",
- "\n",
- "``` \n",
- "rules_specification = [\n",
- " {\n",
- " \"RuleName\": \"SimilarAcrossRuns\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " \"VolumeSizeInGB\": 10,\n",
- " \"RuntimeConfigurations\": {\n",
- " \"other_trials\": \"s3://sagemaker--/past-job\",\n",
- " \"include_regex\": \".*\",\n",
- " \"start-step\": \"10\",\n",
- " \"end-step\": \"100\"\n",
- " }\n",
- " }\n",
- "]\n",
- "```\n",
- "\n",
- "##### Custom rule\n",
- "In this case you need to define a custom rule class which inherits from the `smdebug.rules.Rule` class.\n",
- "You need to provide SageMaker with the S3 location of the file which defines your custom rule classes as the value for the field `SourceS3Uri`.\n",
- "Again, you can pass any arguments taken by this rule through the RuntimeConfigurations dictionary. \n",
- "Note that custom rules can only have arguments which expect a string as the value, except for the two arguments \n",
- "specifying trials to the Rule. Refer to [DeveloperGuide_Rules.md](../../../rules/DeveloperGuide_Rules.md) for more.\n",
- "\n",
- "Here's an example:\n",
- "```\n",
- "rules_specification = [\n",
- " {\n",
- " \"RuleName\": \"CustomRule\",\n",
- " \"SourceS3Uri\": \"s3://weiyou-tornasole-test/rule-script/custom_rule.py\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " \"VolumeSizeInGB\": 10,\n",
- " \"RuntimeConfigurations\": {\n",
- " \"threshold\" : \"0.5\"\n",
- " }\n",
- " }\n",
- "]\n",
- "```\n",
- "\n",
- "### Estimator\n",
- "Now we'll call the SageMaker PyTorch Estimator to kick off a training job along with a rule to monitor the job.\n",
- "\n",
- "For the purposes of this demonstration let us use the simple.py script with the above hyperparameters dictionary.\n",
- "These good hyperparameters do not produce vanishing gradients, so you will see that the rule doesn't get fired.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Training Example Without Vanishing Gradients "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [
- {
- "ename": "NameError",
- "evalue": "name 'sagemaker' is not defined",
- "output_type": "error",
- "traceback": [
- "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
- "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
- "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0msagemaker_execution_role\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msagemaker\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_execution_role\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0;31m#sagemaker_execution_role = 'AmazonSageMaker-ExecutionRole-20190614T145575'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m estimator = PyTorch(role=sagemaker_execution_role,\n\u001b[1;32m 4\u001b[0m \u001b[0mbase_job_name\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'pytorch-good-example'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mtrain_instance_count\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;31mNameError\u001b[0m: name 'sagemaker' is not defined"
- ]
- }
- ],
- "source": [
- "sagemaker_execution_role = sagemaker.get_execution_role()\n",
- "#sagemaker_execution_role = 'AmazonSageMaker-ExecutionRole-20190614T145575'\n",
- "estimator = PyTorch(role=sagemaker_execution_role,\n",
- " base_job_name='pytorch-good-example',\n",
- " train_instance_count=1,\n",
- " train_instance_type='ml.m4.xlarge',\n",
- " image_name=docker_image_name,\n",
- " entry_point=entry_point_script,\n",
- " framework_version='1.1.0',\n",
- " hyperparameters=hyperparameters,\n",
- " py_version='py3',\n",
- " debug=True,\n",
- " rules_specification=[\n",
- " {\n",
- " \"RuleName\": \"VanishingGradient\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " \"VolumeSizeInGB\": 10,\n",
- " \"RuntimeConfigurations\": {\n",
- " \"start-step\": \"1\",\n",
- " \"end-step\": \"50\"\n",
- " }\n",
- " }\n",
- " ])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "estimator.fit()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Result\n",
- "\n",
- "As a result of the above command, SageMaker will spin off two training jobs for you: the first is the job which produces the tensors to be analyzed, and the second evaluates the rule you specified in `rules_specification`.\n",
- "\n",
- "### Check the status of the Rule Execution Job\n",
- "To get the rule execution job that SageMaker started for you, run the command below. It shows the `RuleName`, `RuleStatus`, `FailureReason` (if any), and `RuleExecutionJobArn`. If the tensors meet a rule evaluation condition, the rule execution job throws a client error with `FailureReason: RuleEvaluationConditionMet`. You can check the CloudWatch log stream `/aws/sagemaker/TrainingJobs` with `RuleExecutionJobArn`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "estimator.describe_rule_execution_jobs()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Receive CloudWatch Event For your Jobs\n",
- "When the status of a training job or rule execution job changes (e.g. starting, failed), TrainingJobStatus CloudWatch events are emitted: https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html. You can configure a CloudWatch event rule to receive and process these events by setting up a target (Lambda function, SNS). \n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Training Example With Vanishing Gradients \n",
- "\n",
- "Now let us change the hyperparameters dictionary to the bad set of hyperparameters below, which produces vanishing gradients. \n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "entry_point_script = '../scripts/simple.py'\n",
- "docker_image_name= '072677473360.dkr.ecr.us-west-2.amazonaws.com/tornasole-preprod-pytorch-1.1.0-cpu:latest'\n",
- "bad_hyperparameters = {'epochs': 2, 'lr' : 1.0, 'momentum' : 0.9, 'save-frequency' : 3, 'steps' : 10, 'hook-type' : 'saveall', 'random-seed' : True }\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "sagemaker_execution_role = sagemaker.get_execution_role()\n",
- "#sagemaker_execution_role = 'AmazonSageMaker-ExecutionRole-20190614T145575'\n",
- "estimator = PyTorch(role=sagemaker_execution_role,\n",
- " base_job_name='pytorch-bad-example',\n",
- " train_instance_count=1,\n",
- " train_instance_type='ml.m4.xlarge',\n",
- " image_name=docker_image_name,\n",
- " entry_point=entry_point_script,\n",
- " framework_version='1.1.0',\n",
- " hyperparameters=bad_hyperparameters,\n",
- " py_version='py3',\n",
- " debug=True,\n",
- " rules_specification=[\n",
- " {\n",
- " \"RuleName\": \"VanishingGradient\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " \"VolumeSizeInGB\": 10,\n",
- " \"RuntimeConfigurations\": {\n",
- " \"start-step\": \"1\",\n",
- " \"end-step\": \"10\"\n",
- " }\n",
- " }\n",
- " ])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "estimator.fit()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "estimator.describe_rule_execution_jobs()"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.2"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/examples/rules/sagemaker-notebooks/BringYourOwnRule.ipynb b/examples/rules/sagemaker-notebooks/BringYourOwnRule.ipynb
deleted file mode 100644
index e84a64901..000000000
--- a/examples/rules/sagemaker-notebooks/BringYourOwnRule.ipynb
+++ /dev/null
@@ -1,630 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Debugging SageMaker Training Jobs with Tornasole \n",
- "## Writing Custom Rules"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Overview\n",
- "Tornasole is a new capability of Amazon SageMaker that allows debugging machine learning training. \n",
- "It lets you go beyond just looking at scalars like losses and accuracies during training and gives \n",
- "you full visibility into all tensors 'flowing through the graph' during training. Tornasole helps you monitor your training in near real time using rules, and alerts you once it detects an inconsistency in the training flow. \n",
- "\n",
- "Using Tornasole is a two step process: Saving tensors and Analysis. Let's look at each one of them closely.\n",
- "\n",
- "### Saving tensors\n",
- "Tensors define the state of the training job at any particular instant in its lifecycle. Tornasole exposes a library which allows you to capture these tensors and save them for analysis. Tornasole is highly customizable, letting you save the tensors you want at different frequencies. Refer to [DeveloperGuide_TensorFlow](../../DeveloperGuide_TF.md) for details on how to choose which tensors to save.\n",
- "\n",
- "### Analysis\n",
- "\n",
- "Analysis of the emitted tensors is captured by the Tornasole concept called ***Rules***. At a very broad level, \n",
- "a rule is Python code used to detect certain conditions during training. Conditions that a data scientist training a deep learning model may care about include gradients getting too large or too small, overfitting, and so on.\n",
- "Tornasole comes pre-packaged with certain rules. Users can also write their own rules using the Tornasole APIs.\n",
- "You can also analyze raw tensor data outside of the Rules construct in, say, a SageMaker notebook, using Tornasole's full set of APIs. \n",
- "Please refer to the [Developer Guide for Rules](../../../../rules/DeveloperGuide_Rules.md) for more details about analysis.\n",
- "\n",
- "This example guides you through installation of the required components for emitting tensors in a \n",
- "SageMaker training job and applying a rule over the tensors to monitor the live status of the job. \n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Setup\n",
- "\n",
- "As a first step, we'll install the required tools, which allow emission of tensors (saving tensors) and application of rules to analyze them. This is only for the purposes of this private beta. Once we do this, we will be ready to use smdebug."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {
- "scrolled": true
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Installing requirements...\n",
- "\u001b[33mYou are using pip version 10.0.1, however version 19.2.3 is available.\n",
- "You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n",
- "Installation completed!\n"
- ]
- }
- ],
- "source": [
- "! aws s3 sync s3://tornasole-external-preview-use1/sdk/ ~/SageMaker/tornasole-preview-sdk/\n",
- "! chmod +x ~/SageMaker/tornasole-preview-sdk/installer.sh && ~/SageMaker/tornasole-preview-sdk/installer.sh"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Using custom Tornasole rules with SageMaker \n",
- "\n",
- "This notebook assumes that you have gone through at least one notebook demonstrating training models\n",
- "in SageMaker with Tornasole with your framework of choice. That notebook would demonstrate the \n",
- "changes you need to make in your training script to enable Tornasole, starting a training job \n",
- "along with a rule execution job, and looking at the status of these jobs.\n",
- "\n",
- "In this notebook we will focus on how to write a custom Tornasole rule, and how to \n",
- "execute this custom rule in SageMaker. To make this notebook runnable, we are picking a TensorFlow script as the training job.\n",
- "Whatever framework or script you use, rule behavior would be similar. \n",
- "\n",
- "### Start training with a custom rule\n",
- "\n",
- "#### Configuring the inputs for the training job\n",
- "Set the docker image to the SageMaker TensorFlow container that we have built with Tornasole pre-installed, for the region you are in. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {
- "pycharm": {
- "name": "#%%\n"
- }
- },
- "outputs": [],
- "source": [
- "import sagemaker\n",
- "import boto3\n",
- "from sagemaker.tensorflow import TensorFlow\n",
- "\n",
- "REGION = boto3.Session().region_name\n",
- "TAG='latest'\n",
- "\n",
- "docker_image_name= '072677473360.dkr.ecr.{}.amazonaws.com/tornasole-preprod-tf-1.13.1-cpu:{}'.format(REGION, TAG)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "Let us now set `entry_point_script` to the simple TensorFlow training script that has SessionHook integrated.\n",
- "The 'hyperparameters' below are the parameters that will be passed to the training script as command line arguments in SageMaker's script mode."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {
- "pycharm": {
- "name": "#%%\n"
- }
- },
- "outputs": [],
- "source": [
- "entry_point_script = '../../frameworks/tensorflow/examples/scripts/simple.py'\n",
- "hyperparameters = { 'steps': 1000000, 'save_frequency': 50 }"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Configuring custom rule\n",
- "We have written an example custom rule `CustomGradientRule`, available [here](../scripts/my_custom_rule.py). We need to upload this to a bucket in the same region where we want to run the job. We have chosen a default bucket below. Please change it to the bucket you want. We will now create this bucket if it does not exist, and upload this file. \n",
- "We will then specify this path when starting the job as `SourceS3Uri`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "\n",
- "ACCOUNT_ID = boto3.client('sts').get_caller_identity().get('Account')\n",
- "BUCKET = f'tornasole-resources-{ACCOUNT_ID}-{REGION}'\n",
- "\n",
- "CUSTOM_RULE_PATH = '../scripts/my_custom_rule.py'\n",
- "\n",
- "PREFIX = os.path.join('rules', os.path.basename(CUSTOM_RULE_PATH))\n",
- "s3 = boto3.resource('s3')\n",
- "bucket = s3.Bucket(BUCKET)\n",
- "if not bucket.creation_date:\n",
- " s3.create_bucket(Bucket=BUCKET, CreateBucketConfiguration={'LocationConstraint': REGION})\n",
- "s3.Object(BUCKET, PREFIX).put(Body=open(CUSTOM_RULE_PATH, 'rb'))\n",
- "SOURCE_S3_URI = f's3://{BUCKET}/{PREFIX}'"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Keep in mind that for SageMaker to be able to evaluate your rule, the rule class **will need** to have a signature conforming to the spec defined by smdebug. \n",
- "\n",
- "This custom rule that we have written takes the arguments `self`, `base_trial` and `threshold`. \n",
- "In order to initialize a custom rule class, you'll need to pass down values for everything except `self` and `base_trial`. \n",
- "This is done through putting the parameters and their values as a string-to-string map in `RuntimeConfigurations` in the `rules_specification` parameter to the SageMaker Estimator.\n",
- "\n",
- "After we run this example, in this notebook we will look at these concepts in more detail."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {
- "pycharm": {
- "name": "#%%\n"
- }
- },
- "outputs": [],
- "source": [
- "estimator = TensorFlow(role=sagemaker.get_execution_role(),\n",
- " base_job_name='tensorflow-custom-rule-tornasole',\n",
- " train_instance_count=1,\n",
- " train_instance_type='ml.m4.xlarge',\n",
- " image_name=docker_image_name,\n",
- " entry_point=entry_point_script,\n",
- " hyperparameters=hyperparameters,\n",
- " framework_version='1.13.1',\n",
- " debug=True,\n",
- " py_version='py3',\n",
- " rules_specification=[\n",
- " {\n",
- " \"RuleName\": \"CustomGradientRule\",\n",
- " \"SourceS3Uri\": SOURCE_S3_URI,\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " \"RuntimeConfigurations\": {\n",
- " \"threshold\" : \"0.5\"\n",
- " }\n",
- " }\n",
- " ])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "To kick off the job, we call the `fit()` method on the SageMaker TensorFlow estimator"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {
- "pycharm": {
- "name": "#%%\n"
- }
- },
- "outputs": [],
- "source": [
- "# setting wait as True will cause the logs to be streamed in the notebook directly,\n",
- "# in order to proceed to further cells you'll need to stop cell execution. So, \n",
- "# we set wait to False for demonstration purposes.\n",
- "estimator.fit(wait=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Result\n",
- "As a result of the above command, SageMaker will spin off two jobs for you - the first one being the training job which produces the tensors to be analyzed and the second one, which evaluates or analyzes the rule you asked it to in `rules_specification`.\n",
- "#### Check the status of the Rule Execution Job\n",
- "To get the rule execution job that SageMaker started for you, run the command below and it shows you the `RuleName`, `RuleStatus`, `FailureReason` if rule job started, the `RuleJobName` and `RuleExecutionJobArn`. \n",
- "If the tensors meets a rule evaluation condition, the rule execution job throws a client error with `FailureReason: RuleEvaluationConditionMet`. \n",
- "You can check the Cloudwatch Logstream `/aws/sagemaker/TrainingJobs` with `RuleExecutionJobArn`.\n",
- "\n",
- "Depending on how your tensors are emitted and how your custom rule reacts to the script, your rule evaluation job will either fail or succeed. \n",
- "You can get the rule evaluation statuses of the jobs through the following mechanism. This function will continue to poll till the rule execution jobs end. To proceed with the notebook, please stop the cell after RuleStatus changes to InProgress. At this point, you should see RuleExecutionJobName. This will be needed to execute the next cell of code where we attach to the rule execution job to see its logs."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {
- "pycharm": {
- "name": "#%%\n"
- }
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Wait to get status for Rule Execution Jobs...\n",
- "=============================================\n",
- "RuleName: CustomGradientRule\n",
- "RuleStatus: RuleExecutionError\n",
- "FailureReason: ClientError: RuleEvaluationConditionMet: Evaluation of the rule CustomGradientRule at step 0 resulted in the condition being met\n",
- "Traceback (most recent call last):\n",
- " File \"train.py\", line 214, in execute\n",
- " exec(_SYMBOLIC_INVOKE_RULE.format(self.start_step, self.end_step), globals(), exec_local)\n",
- " File \"\", line 2, in \n",
- " File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule_invoker.py\", line 84, in invoke_rule\n",
- " raise e\n",
- " File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule_invoker.py\", line 79, in invoke_rule\n",
- " rule_obj.invoke(step)\n",
- " File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule.py\", line 56, in invoke\n",
- " raise RuleEvaluationConditionMet(self.rule_name, step)\n",
- "smdebug.exceptions.RuleEvaluationConditionMet: Evaluation of the rule CustomGradientRule at step 0 resulted in the condition being met\n",
- "\n",
- "\n",
- "RuleExecutionJobName: CustomGradientRule-75136a67d770aaea32adbb833fb2ee4f\n",
- "RuleExecutionJobArn: arn:aws:sagemaker:us-west-2:072677473360:training-job/customgradientrule-75136a67d770aaea32adbb833fb2ee4f\n",
- "=============================================\n"
- ]
- }
- ],
- "source": [
- "rule_description = estimator.describe_rule_execution_jobs()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Check the logs of the Rule Execution Job\n",
- "If you want to access the logs of a particular rule job name, you can do the following. First, you need to get the rule job name (`RuleExecutionJobArn` field from the training job description). Note that this is only available after the rule job reaches Started stage. Hence the next cell waits till the job name is available"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now we can attach to this job to see its logs"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "metadata": {
- "pycharm": {
- "name": "#%%\n"
- }
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "2019-08-29 22:40:05 Starting - Preparing the instances for training\n",
- "2019-08-29 22:40:05 Downloading - Downloading input data\n",
- "2019-08-29 22:40:05 Training - Training image download completed. Training in progress.\n",
- "2019-08-29 22:40:05 Uploading - Uploading generated training model\n",
- "2019-08-29 22:40:05 Failed - Training job failed\u001b[31m[2019-08-29 22:39:24.434 ip-10-0-174-72.us-west-2.compute.internal:1 INFO s3_trial.py:27] Loading trial base-trial at path s3://sagemaker-us-west-2-072677473360/tensors-tensorflow-custom-rule-tornasole-2019-08-29-22-32-10-697\u001b[0m\n",
- "\u001b[31m[2019-08-29 22:39:56.413 ip-10-0-174-72.us-west-2.compute.internal:1 INFO rule_invoker.py:76] Started execution of rule CustomGradientRule at step 0\u001b[0m\n",
- "\u001b[31mException during rule execution: Customer Error: RuleEvaluationConditionMet: Evaluation of the rule CustomGradientRule at step 0 resulted in the condition being met\u001b[0m\n",
- "\u001b[31mTraceback (most recent call last):\n",
- " File \"train.py\", line 214, in execute\n",
- " exec(_SYMBOLIC_INVOKE_RULE.format(self.start_step, self.end_step), globals(), exec_local)\n",
- " File \"\", line 2, in \n",
- " File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule_invoker.py\", line 84, in invoke_rule\n",
- " raise e\n",
- " File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule_invoker.py\", line 79, in invoke_rule\n",
- " rule_obj.invoke(step)\n",
- " File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule.py\", line 56, in invoke\n",
- " raise RuleEvaluationConditionMet(self.rule_name, step)\u001b[0m\n",
- "\u001b[31msmdebug.exceptions.RuleEvaluationConditionMet: Evaluation of the rule CustomGradientRule at step 0 resulted in the condition being met\n",
- "\n",
- "\u001b[0m\n"
- ]
- },
- {
- "ename": "UnexpectedStatusException",
- "evalue": "Error for Training job CustomGradientRule-75136a67d770aaea32adbb833fb2ee4f: Failed. Reason: ClientError: RuleEvaluationConditionMet: Evaluation of the rule CustomGradientRule at step 0 resulted in the condition being met\nTraceback (most recent call last):\n File \"train.py\", line 214, in execute\n exec(_SYMBOLIC_INVOKE_RULE.format(self.start_step, self.end_step), globals(), exec_local)\n File \"\", line 2, in \n File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule_invoker.py\", line 84, in invoke_rule\n raise e\n File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule_invoker.py\", line 79, in invoke_rule\n rule_obj.invoke(step)\n File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule.py\", line 56, in invoke\n raise RuleEvaluationConditionMet(self.rule_name, step)\nsmdebug.exceptions.RuleEvaluationConditionMet: Evaluation of the rule CustomGradientRule at step 0 resulted in the condition being met\n\n",
- "output_type": "error",
- "traceback": [
- "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
- "\u001b[0;31mUnexpectedStatusException\u001b[0m Traceback (most recent call last)",
- "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msagemaker\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mestimator\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mEstimator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0mrule_job_name\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrule_description\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'RuleExecutionJobName'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mexploding_tensor\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mEstimator\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mattach\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrule_job_name\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
- "\u001b[0;32m~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/estimator.py\u001b[0m in \u001b[0;36mattach\u001b[0;34m(cls, training_job_name, sagemaker_session, model_channel_name)\u001b[0m\n\u001b[1;32m 460\u001b[0m )\n\u001b[1;32m 461\u001b[0m \u001b[0mestimator\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_current_job_name\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mestimator\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlatest_training_job\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 462\u001b[0;31m \u001b[0mestimator\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlatest_training_job\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwait\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 463\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mestimator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 464\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/estimator.py\u001b[0m in \u001b[0;36mwait\u001b[0;34m(self, logs)\u001b[0m\n\u001b[1;32m 1012\u001b[0m \"\"\"\n\u001b[1;32m 1013\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mlogs\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1014\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msagemaker_session\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlogs_for_job\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjob_name\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwait\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1015\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1016\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msagemaker_session\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwait_for_job\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjob_name\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py\u001b[0m in \u001b[0;36mlogs_for_job\u001b[0;34m(self, job_name, wait, poll)\u001b[0m\n\u001b[1;32m 1479\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1480\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mwait\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1481\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_check_job_status\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mjob_name\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdescription\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"TrainingJobStatus\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1482\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mdot\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1483\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py\u001b[0m in \u001b[0;36m_check_job_status\u001b[0;34m(self, job, desc, status_key_name)\u001b[0m\n\u001b[1;32m 1092\u001b[0m ),\n\u001b[1;32m 1093\u001b[0m \u001b[0mallowed_statuses\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"Completed\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"Stopped\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1094\u001b[0;31m \u001b[0mactual_status\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1095\u001b[0m )\n\u001b[1;32m 1096\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;31mUnexpectedStatusException\u001b[0m: Error for Training job CustomGradientRule-75136a67d770aaea32adbb833fb2ee4f: Failed. Reason: ClientError: RuleEvaluationConditionMet: Evaluation of the rule CustomGradientRule at step 0 resulted in the condition being met\nTraceback (most recent call last):\n File \"train.py\", line 214, in execute\n exec(_SYMBOLIC_INVOKE_RULE.format(self.start_step, self.end_step), globals(), exec_local)\n File \"\", line 2, in \n File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule_invoker.py\", line 84, in invoke_rule\n raise e\n File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule_invoker.py\", line 79, in invoke_rule\n rule_obj.invoke(step)\n File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule.py\", line 56, in invoke\n raise RuleEvaluationConditionMet(self.rule_name, step)\nsmdebug.exceptions.RuleEvaluationConditionMet: Evaluation of the rule CustomGradientRule at step 0 resulted in the condition being met\n\n"
- ]
- }
- ],
- "source": [
- "from sagemaker.estimator import Estimator\n",
- "rule_job_name = rule_description[0]['RuleExecutionJobName']\n",
- "exploding_tensor = Estimator.attach(rule_job_name)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "pycharm": {
- "name": "#%% md\n"
- }
- },
- "source": [
- "## Tornasole Rules Explained in depth\n",
- "Let us now walk through some of Tornasole's concepts which will be helpful to understand how rules are executed in SageMaker\n",
- "and how custom rules work. \n",
- "\n",
- "### Trial\n",
- "A Trial in Tornasole's context refers to a training job. \n",
- "It is identified by the path where the saved tensors for the job are stored. \n",
- "\n",
- "### Rules\n",
- "Rules are the medium by which Tornasole executes a certain piece of code regularly on different steps of the job.\n",
- "They can be used to assert certain conditions during training, and raise Cloudwatch Events based on them that you can\n",
- "use to process in any way you like. \n",
- "\n",
- "These are defined by the class `smdebug.rules.Rule`. A rule takes a `base_trial` which refers to the job whose run invokes the rule execution. \n",
- "A rule can optionally look at other jobs as well, passed using the argument `other_trials`.\n",
- "\n",
- "Tornasole comes with a set of **First Party rules** (1P rules).\n",
- "You can also write your own rules looking at these 1P rules for inspiration.\n",
- "Refer [Developer Guide for Rules.md](../../DeveloperGuide_Rules.md) for more on the \n",
- "APIs you can use to write your own rules as well as descriptions for the 1P rules that we provide.\n",
- "\n",
- "### Storage\n",
- "The tensors saved by Tornasole are, by default, stored in the S3 output path of the training job, under the folder **`/tensors-`**. \n",
- "This is done to ensure that we don't end up accidentally overwriting the tensors from a training job with the others. \n",
- "Rules evaluation require separation of the tensors paths to be evaluated correctly.\n",
- "If you don't provide an S3 output path to the estimator, SageMaker creates one for you as: **`s3://sagemaker--/`**\n",
- "\n",
- "### Using Tornasole Rules in SageMaker \n",
- "Here we will talk about how to use SageMaker to evaluate these rules on the training jobs. \n",
- "The new parameters in Sagemaker Estimator to look out for are\n",
- "\n",
- "- `debug` :(bool)\n",
- "This indicates that debugging should be enabled for the training job. \n",
- "Setting this as `True` would make Tornasole available for use with the job\n",
- "\n",
- "- `rules_specification`: (list[*dict*])\n",
- "You can specify any number of rules to monitor your SageMaker training job. \n",
- "This parameter takes a list of python dictionaries, one for each rule you want to enable. \n",
- "Each `dict` is of the following form:\n",
- "```\n",
- "{\n",
- " \"RuleName\": \n",
- " # The name of the class implementing the Tornasole Rule interface. (required)\n",
- "\n",
- " \"SourceS3Uri\": \n",
- " # S3 URI of the rule script containing the class in 'RuleName'. \n",
- " # This is not required if you want to use one of the\n",
- " # First Party rules provided to you by Amazon. \n",
- " # In such a case you can leave it empty or not pass it. If you want to run a custom rule \n",
- " # defined by you, you will need to define the custom rule class in a python \n",
- " # file and provide it to SageMaker as a S3 URI. \n",
- " # SageMaker will fetch this file and try to look for the rule class \n",
- " # identified by RuleName in this file.\n",
- " \n",
- " \"InstanceType\": \n",
- " # The ML instance type which should be used to run the rule evaluation job\n",
- " \n",
- " \"VolumeSizeInGB\": \n",
- " # The volume size to store the runtime artifacts from the rule evaluation \n",
- " \n",
- " \"RuntimeConfigurations\": {\n",
- " # Map defining the parameters required to instantiate the Rule class and\n",
- " # parameters regarding invokation of the rule (start-step and end-step)\n",
- " # This can be any parameter taken by the rule. \n",
- " # Every value here needs to be a string. \n",
- " # So when you write custom rules, ensure that you can parse each argument from a string.\n",
- " #\n",
- " # PARAMS CAN BE\n",
- " #\n",
- " # STANDARD PARAMS FOR RULE EXECUTION\n",
- " # \"start-step\": \n",
- " # \"end-step\": \n",
- " # \"other-trials-paths\": (';' separated list of s3 paths as a string)\n",
- " # \"logging-level\": (can be one of \"CRITICAL\", \"FATAL\", \"ERROR\", \n",
- " # \"WARNING\", \"WARN\", \"DEBUG\", \"NOTSET\")\n",
- " #\n",
- " # ANY PARAMETER TAKEN BY THE RULE other than `base_trial` and `other_trials` \n",
- " # \"parameter\" : \"value\"\n",
- " # : \n",
- " }\n",
- "}\n",
- "```\n",
- "\n",
- "\n",
- "### CloudWatch Event Integration for Rules\n",
- "When the status of training job or rule execution job change (i.e. starting, failed), TrainingJobStatus [CloudWatch events](https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html) are emitted. \n",
- "\n",
- "You can configure a CloudWatch event rule to receive and process these events by setting up a target (Lambda function, SNS) as follows:\n",
- "\n",
- "- Configure the [SageMaker TrainingJobStatus CW event](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#sagemaker_event_types) to include rule job statuses associated with the training job\n",
- "- Configure the CW event to be emitted when a RuleStatus changes\n",
- "- Create a CloudWatch event rule that monitors the Training Job customer started\n",
- "- Set a Target (Lambda funtion, SQS) for the CloudWatch event rule that processes the event, and triggers an alarm for the customer based on the RuleStatus. \n",
- "\n",
- "Refer [this page](https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html) for more details. \n",
- "\n",
- "### Writing a custom rule\n",
- "\n",
- "Implementing a custom rule involves implementing the Rule interface that Tornasole provides.\n",
- "Let us go through the exercise of writing a rule which checks whether gradients are very high.\n",
- "\n",
- "#### Constructor\n",
- "Creating a rule involves first inheriting from the base Rule class Tornasole provides: `smdebug.rules.Rule`\n",
- "\n",
- "Every rule is required to take the argument `base_trial` which represents the Trial object for the job whose execution \n",
- "invokes this rule. In addition to this you might want to pass `other_trials` which represents\n",
- "list of Trial objects for other jobs if you want your custom rule to look at other jobs for some comparision. \n",
- "For this rule here we do not need to look at any other trials, so we set `other_trials` to None.\n",
- "\n",
- "```python\n",
- "from smdebug.rules import Rule\n",
- "\n",
- "class CustomGradientRule(Rule):\n",
- " def __init__(self, base_trial, threshold=10.0):\n",
- " super().__init__(base_trial, other_trials=None)\n",
- " self.threshold = float(threshold)\n",
- "```\n",
- "\n",
- "Please note that apart from `base_trial` and `other_trials` (if required), we require all \n",
- "arguments of the rule constructor to take a string as value. You can parse them to the type\n",
- "that you want from the string. This means if you want to pass\n",
- "a list of strings, you might want to pass them as a comma separated string. This restriction is\n",
- "being enforced so as to let you create and invoke rules from json using Sagemaker's APIs. \n",
- "\n",
- "#### Function to invoke at a given step\n",
- "When a rule is executed, it is invoked at each step. We need to now define what to do when the rule is invoked at a given step, `step`.\n",
- "In this function you can implement the core logic of what you want to do with your selection of tensors. If your custom rule \n",
- "has access to other trials, you can access tensors from other trials as well.\n",
- "\n",
- "This function should return a boolean value `True` or `False`. When `True` is returned,\n",
- "SageMaker will raise the exception `RuleEvaluationConditionMet`. This will also create a CloudWatch Event which can be used to configure your chosen action. \n",
- "\n",
- "The invoke function for `CustomGradientRule` to check whether tensors have large gradients can look like below:\n",
- "```python\n",
- " def invoke_at_step(self, step):\n",
- " for tensor in self.base_trial.tensors_in_collection('gradients'):\n",
- " abs_mean = tensor.reduction_value(step, 'mean', abs=True)\n",
- " if abs_mean > self.threshold:\n",
- " return True\n",
- " return False\n",
- "```\n",
- "Here, we can access the names of tensors in `gradients` collection by using the method `tensors_in_collection`. \n",
- "You can see the full API that Trial provides to get tensors in our [Developer Guide For Rules](../../DeveloperGuide_Rules.md).\n",
- "\n",
- "#### Optional: RequiredTensors\n",
- "RequiredTensors is an optional construct that allows Tornasole to bulk-fetch all tensors that you need to \n",
- "execute the rule. This helps the rule invocation be more performant so it does not fetch tensor values from S3 one by one. \n",
- "\n",
- "##### RequiredTensors API \n",
- "This is a class whose object is provided as a member of the rule class, so you can access it as `self.req_tensors`. \n",
- "Its full API is described in our [Developer Guide For Rules](../../DeveloperGuide_Rules.md). \n",
- "In short, it has the following methods:\n",
- "```python\n",
- "# Add name of required tensor for a particular trial at given steps \n",
- "self.req_tensors.add(name=tname, steps=[step_num], trial=None, should_match_regex=False)\n",
- "\n",
- "# If required tensors were added inside `set_required_tensors`, during rule invocation it is\n",
- "# automatically used to fetch all tensors at once by calling `req_tensors.fetch()`\n",
- "# If required tensors were added elsewhere, or later, you can call the `req_tensors.fetch()` method \n",
- "# yourself to fetch all tensors at once.\n",
- "self.req_tensors.fetch()\n",
- "\n",
- "# This method returns the names of the required tensors for a given trial\n",
- "self.req_tensors.get_names(trial=None)\n",
- "\n",
- "# This method returns the steps for which the tensor is required to execute the rule at this step.\n",
- "self.req_tensors.get_tensor_steps(trial=None)\n",
- "\n",
- "# This method returns the list of required tensors for a given trial as `Tensor` objects\n",
- "self.req_tensors.get(trial=None)\n",
- "``` \n",
- "\n",
- "##### Declare required tensors\n",
- "To use this construct, you need to implement a method which lets Tornasole know what tensors you are interested in for invocation at a given step. \n",
- "This is the `set_required_tensors` method.\n",
- "\n",
- "```python\n",
- "def set_required_tensors(self, step):\n",
- " for tname in self.base_trial.tensors_in_collection('gradients'):\n",
- " self.req_tensors.add(tname, steps=[step])\n",
- "```\n",
- "##### Accessing required tensors\n",
- "Since we defined required tensors in the `set_required_tensors` method, these will have been\n",
- "pre-fetched when invoking the rule at a given step. You can continue to access the tensors as before.\n",
- "\n",
- "If you do not want to determine which tensors you want to process again, you can also just call\n",
- "self.req_tensors.get() to get them. In that case, the function would look as below: \n",
- "\n",
- "```python\n",
- "def invoke_at_step(self, step):\n",
- " for tensor in self.req_tensors.get():\n",
- " abs_mean = tensor.reduction_value(step, 'mean', abs=True)\n",
- " if abs_mean > self.threshold:\n",
- " return True\n",
- " return False\n",
- "```\n",
- "\n",
- "### Executing the custom rule\n",
- "\n",
- "You need to now provide Sagemaker the S3 location of the file which defines your custom rule classes as the value for the field `SourceS3Uri`. \n",
- "\n",
- "From above, our rule constructor takes the arguments `base_trial` and `threshold`. The `base_trial` argument will automatically be passed by SageMaker Rule Executor. The other arguments need to be passed through the RuntimeConfigurations dictionary as a mapping from string to string. \n",
- "\n",
- "If the custom rule took `other_trials`, which represents list of Trial objects for other jobs that the rule is interested in, that can be passed by passing the argument `other-trials-paths` which needs to be in the form of `s3_path_other_trial_1;s3_path_other_trial_2`.\n",
- "\n",
- "Note that the custom rules can only have arguments which expect a string as the value except the two arguments specifying trials to the Rule (`base_trial` and `other_trials`). \n",
- "\n",
- "Here's an example:\n",
- "```\n",
- "rules_specification = [\n",
- " {\n",
- " \"RuleName\": \"CustomGradientRule\",\n",
- " \"SourceS3Uri\": \"s3://tornasole-external-preview-use1/rules/scripts/my_custom_rule.py\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " \"VolumeSizeInGB\": 10,\n",
- " \"RuntimeConfigurations\": {\n",
- " \"threshold\" : \"20.0\"\n",
- " }\n",
- " }\n",
- "]\n",
- "```"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "conda_tensorflow_p36",
- "language": "python",
- "name": "conda_tensorflow_p36"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.5"
- },
- "pycharm": {
- "stem_cell": {
- "cell_type": "raw",
- "metadata": {
- "collapsed": false
- },
- "source": []
- }
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/examples/rules/scripts/my_custom_rule.py b/examples/rules/scripts/my_custom_rule.py
deleted file mode 100644
index 4c570a46f..000000000
--- a/examples/rules/scripts/my_custom_rule.py
+++ /dev/null
@@ -1,19 +0,0 @@
-# First Party
-from smdebug.rules.rule import Rule
-
-
-class CustomGradientRule(Rule):
-    def __init__(self, base_trial, threshold=10.0):
-        super().__init__(base_trial)
-        self.threshold = float(threshold)
-
-    def set_required_tensors(self, step):
-        for tname in self.base_trial.tensor_names(collection="gradients"):
-            self.req_tensors.add(tname, steps=[step])
-
-    def invoke_at_step(self, step):
-        for t in self.req_tensors.get():
-            abs_mean = t.reduction_value(step, "mean", abs=True)
-            if abs_mean > self.threshold:
-                return True
-        return False
diff --git a/examples/tensorflow/README.md b/examples/tensorflow/README.md
new file mode 100644
index 000000000..6a93b72df
--- /dev/null
+++ b/examples/tensorflow/README.md
@@ -0,0 +1,22 @@
+# Examples
+## Example notebooks
+Please refer to the example notebooks in the [Amazon SageMaker Examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger).
+
+## Example scripts
+The above notebooks come with example scripts which can be used through SageMaker. More example scripts are available in [scripts/](scripts/).
+
+## Example configurations for saving tensors through SageMaker pySDK
+Example configurations for saving tensors through the hook are available at [docs/sagemaker.md](../docs/sagemaker.md)
+
+## Example configurations for running rules through SageMaker pySDK
+Example configurations for running rules are available at [docs/sagemaker.md](../docs/sagemaker.md)
+
+## Example for running a rule locally
+
+```
+from smdebug.rules import invoke_rule
+from smdebug.trials import create_trial
+trial = create_trial('s3://bucket/prefix')  # location of the tensors saved by the training job
+rule_obj = CustomRule(trial, param=value)  # CustomRule is a placeholder for your own Rule subclass
+invoke_rule(rule_obj, start_step=0, end_step=10)
+```
diff --git a/examples/tensorflow/local/mnist.py b/examples/tensorflow/local/mnist.py
new file mode 100644
index 000000000..00bb26185
--- /dev/null
+++ b/examples/tensorflow/local/mnist.py
@@ -0,0 +1,153 @@
+"""
+This script is a simple MNIST training script which uses TensorFlow's Estimator interface.
+It has been instrumented with the SageMaker Debugger hook to allow saving tensors during training.
+Here, the hook has been created using its constructor to allow running this locally for your experimentation.
+When you want to run this script in SageMaker, it is recommended to create the hook from a JSON file.
+Please see scripts in either 'sagemaker_byoc' or 'sagemaker_official_container' folder based on your use case.
+"""
+
+# Standard Library
+import argparse
+import logging
+import random
+
+# Third Party
+import numpy as np
+import tensorflow as tf
+
+# First Party
+import smdebug.tensorflow as smd
+
+logging.getLogger().setLevel(logging.INFO)
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--lr", type=float, default=0.001)
+    parser.add_argument("--random_seed", type=lambda v: str(v).lower() in ("yes", "true", "t", "y", "1"), default=False)
+    parser.add_argument("--out_dir", type=str)
+    parser.add_argument("--save_interval", type=int, default=500)
+    parser.add_argument("--num_epochs", type=int, default=5, help="Number of epochs to train for")
+    parser.add_argument(
+        "--num_steps",
+        type=int,
+        help="Number of steps to train for. If this is passed, it overrides num_epochs",
+    )
+    parser.add_argument(
+        "--num_eval_steps",
+        type=int,
+        help="Number of steps to evaluate for. If this "
+        "is passed, it doesn't evaluate over the full eval set",
+    )
+    parser.add_argument("--model_dir", type=str, default="/tmp/mnist_model")
+    args = parser.parse_args()
+
+    if args.random_seed:
+        tf.set_random_seed(2)
+        np.random.seed(2)
+        random.seed(12)
+
+    hook = smd.EstimatorHook(
+        out_dir=args.out_dir,
+        include_collections=["weights", "gradients"],
+        save_config=smd.SaveConfig(save_interval=args.save_interval),
+    )
+
+    def cnn_model_fn(features, labels, mode):
+        """Model function for CNN."""
+        # Input Layer
+        input_layer = tf.reshape(features["x"], [-1, 28, 28, 1])
+
+        # Convolutional Layer #1
+        conv1 = tf.layers.conv2d(
+            inputs=input_layer,
+            filters=32,
+            kernel_size=[5, 5],
+            padding="same",
+            activation=tf.nn.relu,
+        )
+
+        # Pooling Layer #1
+        pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)
+
+        # Convolutional Layer #2 and Pooling Layer #2
+        conv2 = tf.layers.conv2d(
+            inputs=pool1, filters=64, kernel_size=[5, 5], padding="same", activation=tf.nn.relu
+        )
+        pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)
+
+        # Dense Layer
+        pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 64])
+        dense = tf.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.relu)
+        dropout = tf.layers.dropout(
+            inputs=dense, rate=0.4, training=mode == tf.estimator.ModeKeys.TRAIN
+        )
+
+        # Logits Layer
+        logits = tf.layers.dense(inputs=dropout, units=10)
+
+        predictions = {
+            # Generate predictions (for PREDICT and EVAL mode)
+            "classes": tf.argmax(input=logits, axis=1),
+            # Add `softmax_tensor` to the graph. It is used for PREDICT and by the
+            # `logging_hook`.
+            "probabilities": tf.nn.softmax(logits, name="softmax_tensor"),
+        }
+
+        if mode == tf.estimator.ModeKeys.PREDICT:
+            return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
+
+        # Calculate Loss (for both TRAIN and EVAL modes)
+        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
+
+        # Configure the Training Op (for TRAIN mode)
+        if mode == tf.estimator.ModeKeys.TRAIN:
+            optimizer = tf.train.GradientDescentOptimizer(learning_rate=args.lr)
+
+            # SMD: Wrap your optimizer as follows to help SageMaker Debugger identify gradients.
+            # This does not change your optimization logic; it returns the same optimizer.
+            optimizer = hook.wrap_optimizer(optimizer)
+
+            train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
+            return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
+
+        # Add evaluation metrics (for EVAL mode)
+        eval_metric_ops = {
+            "accuracy": tf.metrics.accuracy(labels=labels, predictions=predictions["classes"])
+        }
+        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)
+
+    # Load training and eval data
+    ((train_data, train_labels), (eval_data, eval_labels)) = tf.keras.datasets.mnist.load_data()
+
+    train_data = train_data / np.float32(255)
+    train_labels = train_labels.astype(np.int32)  # not required
+
+    eval_data = eval_data / np.float32(255)
+    eval_labels = eval_labels.astype(np.int32)  # not required
+
+    mnist_classifier = tf.estimator.Estimator(model_fn=cnn_model_fn, model_dir=args.model_dir)
+
+    train_input_fn = tf.estimator.inputs.numpy_input_fn(
+        x={"x": train_data},
+        y=train_labels,
+        batch_size=128,
+        num_epochs=args.num_epochs,
+        shuffle=True,
+    )
+
+    eval_input_fn = tf.estimator.inputs.numpy_input_fn(
+        x={"x": eval_data}, y=eval_labels, num_epochs=1, shuffle=False
+    )
+
+    # Set training mode so SMDebug can classify the steps into training mode
+    hook.set_mode(smd.modes.TRAIN)
+    mnist_classifier.train(input_fn=train_input_fn, steps=args.num_steps, hooks=[hook])
+
+    # Set eval mode so SMDebug can classify the steps into eval mode
+    hook.set_mode(smd.modes.EVAL)
+    mnist_classifier.evaluate(input_fn=eval_input_fn, steps=args.num_eval_steps, hooks=[hook])
+
+
+if __name__ == "__main__":
+    main()
diff --git a/examples/tensorflow/local/simple.py b/examples/tensorflow/local/simple.py
new file mode 100644
index 000000000..c0ca691e1
--- /dev/null
+++ b/examples/tensorflow/local/simple.py
@@ -0,0 +1,97 @@
+"""
+This script is a simple training script which uses TensorFlow's MonitoredSession interface.
+It has been instrumented with a SageMaker Debugger hook to save tensors during training.
+Here, the hook is created through its constructor so that you can run the script locally for experimentation.
+When you run this script in SageMaker, it is recommended to create the hook from a JSON file instead.
+Please see the scripts in either the 'sagemaker_byoc' or 'sagemaker_official_container' folder based on your use case.
+"""
+
+# Standard Library
+import argparse
+import random
+
+# Third Party
+import numpy as np
+import tensorflow as tf
+
+# First Party
+import smdebug.tensorflow as smd
+
+
+def str2bool(v):
+    if isinstance(v, bool):
+        return v
+    if v.lower() in ("yes", "true", "t", "y", "1"):
+        return True
+    elif v.lower() in ("no", "false", "f", "n", "0"):
+        return False
+    else:
+        raise argparse.ArgumentTypeError("Boolean value expected.")
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--model_dir", type=str, help="S3 path for the model")
+    parser.add_argument("--lr", type=float, help="Learning Rate", default=0.001)
+    parser.add_argument("--steps", type=int, help="Number of steps to run", default=100)
+    parser.add_argument("--scale", type=float, help="Scaling factor for inputs", default=1.0)
+    parser.add_argument("--random_seed", type=str2bool, default=False)
+    parser.add_argument("--out_dir", type=str)
+    parser.add_argument("--save_interval", type=int, default=500)
+    args = parser.parse_args()
+
+    # These random seeds are intended only for testing.
+    # For now, the seeds 2, 2 and 12 keep the test assertions from failing.
+    # If you change them, note that tensor values at certain steps may vary.
+    if args.random_seed:
+        tf.set_random_seed(2)
+        np.random.seed(2)
+        random.seed(12)
+
+    hook = smd.EstimatorHook(
+        out_dir=args.out_dir,
+        include_collections=["weights", "gradients"],
+        save_config=smd.SaveConfig(save_interval=args.save_interval),
+    )
+
+    # Network definition
+    # Note the use of name scopes
+    with tf.name_scope("foobar"):
+        x = tf.placeholder(shape=(None, 2), dtype=tf.float32)
+        w = tf.Variable(initial_value=[[10.0], [10.0]], name="weight1")
+    with tf.name_scope("foobaz"):
+        w0 = [[1], [1.0]]
+        y = tf.matmul(x, w0)
+    loss = tf.reduce_mean((tf.matmul(x, w) - y) ** 2, name="loss")
+
+    hook.add_to_collection("losses", loss)
+
+    global_step = tf.Variable(17, name="global_step", trainable=False)
+    increment_global_step_op = tf.assign(global_step, global_step + 1)
+
+    optimizer = tf.train.AdamOptimizer(args.lr)
+
+    # Wrap the optimizer with wrap_optimizer so smdebug can find gradients to save
+    optimizer = hook.wrap_optimizer(optimizer)
+
+    # Use this wrapped optimizer to minimize the loss
+    optimizer_op = optimizer.minimize(loss, global_step=increment_global_step_op)
+
+    # Pass the hook to the hooks parameter of the monitored session
+    sess = tf.train.MonitoredSession(hooks=[hook])
+
+    # Use this session for running the TensorFlow model
+    hook.set_mode(smd.modes.TRAIN)
+    for i in range(args.steps):
+        x_ = np.random.random((10, 2)) * args.scale
+        _loss, opt, gstep = sess.run([loss, optimizer_op, increment_global_step_op], {x: x_})
+        print(f"Step={i}, Loss={_loss}")
+
+    hook.set_mode(smd.modes.EVAL)
+    for i in range(args.steps):
+        x_ = np.random.random((10, 2)) * args.scale
+        sess.run([loss, increment_global_step_op], {x: x_})
+
+
+if __name__ == "__main__":
+    main()
diff --git a/examples/tensorflow/local/tf_keras_resnet.py b/examples/tensorflow/local/tf_keras_resnet.py
new file mode 100644
index 000000000..1a8b0f0d0
--- /dev/null
+++ b/examples/tensorflow/local/tf_keras_resnet.py
@@ -0,0 +1,76 @@
+"""
+This script is a ResNet training script which uses TensorFlow's Keras interface.
+It has been instrumented with a SageMaker Debugger hook to save tensors during training.
+Here, the hook is created through its constructor so that you can run the script locally for experimentation.
+When you run this script in SageMaker, it is recommended to create the hook from a JSON file instead.
+Please see the scripts in either the 'sagemaker_byoc' or 'sagemaker_official_container' folder based on your use case.
+"""
+
+# Standard Library
+import argparse
+
+# Third Party
+import numpy as np
+import tensorflow as tf
+from tensorflow.keras.applications.resnet50 import ResNet50
+from tensorflow.keras.datasets import cifar10
+from tensorflow.keras.utils import to_categorical
+
+# First Party
+import smdebug.tensorflow as smd
+
+
+def train(batch_size, epoch, model, hook):
+    (X_train, y_train), (X_valid, y_valid) = cifar10.load_data()
+
+    Y_train = to_categorical(y_train, 10)
+    Y_valid = to_categorical(y_valid, 10)
+
+    X_train = X_train.astype("float32")
+    X_valid = X_valid.astype("float32")
+
+    mean_image = np.mean(X_train, axis=0)
+    X_train -= mean_image
+    X_valid -= mean_image
+    X_train /= 128.0
+    X_valid /= 128.0
+
+    model.fit(
+        X_train,
+        Y_train,
+        batch_size=batch_size,
+        epochs=epoch,
+        validation_data=(X_valid, Y_valid),
+        shuffle=True,
+        callbacks=[hook],
+    )
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Train resnet50 cifar10")
+    parser.add_argument("--batch_size", type=int, default=32)
+    parser.add_argument("--epoch", type=int, default=3)
+    parser.add_argument("--model_dir", type=str, default="./model_keras_resnet")
+    parser.add_argument("--out_dir", type=str)
+    parser.add_argument("--save_interval", type=int, default=500)
+    opt = parser.parse_args()
+
+    model = ResNet50(weights=None, input_shape=(32, 32, 3), classes=10)
+
+    hook = smd.KerasHook(
+        out_dir=opt.out_dir,
+        include_collections=["weights", "gradients", "losses"],
+        save_config=smd.SaveConfig(save_interval=opt.save_interval),
+    )
+
+    optimizer = tf.keras.optimizers.Adam()
+    # Wrap the optimizer so the hook can identify the gradients
+    optimizer = hook.wrap_optimizer(optimizer)
+    model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
+
+    # Start the training.
+    train(opt.batch_size, opt.epoch, model, hook)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/examples/tensorflow/notebooks/keras-sentiment/Loss_Accuracy.ipynb b/examples/tensorflow/notebooks/keras-sentiment/Loss_Accuracy.ipynb
deleted file mode 100644
index d52c648c3..000000000
--- a/examples/tensorflow/notebooks/keras-sentiment/Loss_Accuracy.ipynb
+++ /dev/null
@@ -1,199 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "%load_ext autoreload\n",
- "%autoreload 2"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "Using TensorFlow backend.\n"
- ]
- }
- ],
- "source": [
- "import matplotlib.pyplot as plt\n",
- "import numpy as np\n",
- "from keras.datasets import imdb\n",
- "from smdebug.trials import LocalTrial"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "WARNING: Logging before flag parsing goes to stderr.\n",
- "I0730 09:44:29.710710 4704150976 local_trial.py:20] Loading trial sentiment at path ts_output/\n",
- "I0730 09:44:29.720427 4704150976 local_trial.py:58] Loaded 4 collections\n"
- ]
- }
- ],
- "source": [
- "lt = LocalTrial( 'sentiment', 'ts_output/', parallel=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "['batch',\n",
- " 'size',\n",
- " 'loss',\n",
- " 'acc',\n",
- " 'mean_squared_error',\n",
- " 'embedding_1',\n",
- " 'conv1d_1_0',\n",
- " 'conv1d_1_1',\n",
- " 'dense_1_0',\n",
- " 'dense_1_1',\n",
- " 'dense_2_0',\n",
- " 'dense_2_1',\n",
- " 'val_loss',\n",
- " 'val_acc',\n",
- " 'val_mean_squared_error']"
- ]
- },
- "execution_count": 5,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "lt.tensors()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [],
- "source": [
- "def tplt( trial, tname ): \n",
- " t = lt.tensor(tname)\n",
- " steps = t.steps()\n",
- " _t = [t.value(s) for s in steps]\n",
- " plt.plot( steps, _t, label=tname)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "tplt( lt, 'acc')\n",
- "tplt( lt, 'val_acc')\n",
- "plt.legend()\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "tplt( lt, 'loss')\n",
- "tplt( lt, 'val_loss')\n",
- "plt.legend()\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "tplt( lt, 'mean_squared_error')\n",
- "tplt( lt, 'val_mean_squared_error')\n",
- "plt.legend()\n",
- "plt.show()"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.4"
- },
- "pycharm": {
- "stem_cell": {
- "cell_type": "raw",
- "source": [],
- "metadata": {
- "collapsed": false
- }
- }
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/examples/tensorflow/notebooks/keras-sentiment/sentiment-analysis.ipynb b/examples/tensorflow/notebooks/keras-sentiment/sentiment-analysis.ipynb
deleted file mode 100644
index 91a3a18ad..000000000
--- a/examples/tensorflow/notebooks/keras-sentiment/sentiment-analysis.ipynb
+++ /dev/null
@@ -1,974 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Sentiment Analysis with TensorFlow\n",
- "\n",
- "A Convolutional Neural Net (CNN) is sometimes used in text classification tasks such as sentiment analysis. We'll use a CNN built with TensorFlow to perform sentiment analysis in Amazon SageMaker on the IMDB dataset, which consists of movie reviews labeled as having positive or negative sentiment. Three aspects of Amazon SageMaker will be demonstrated:\n",
- "\n",
- "- How to use Script Mode with a prebuilt TensorFlow container, along with a training script similar to one you would use outside SageMaker. \n",
- "- Local Mode training, which allows you to test your code on your notebook instance before creating a full scale training job.\n",
- "- Batch Transform for offline, asynchronous predictions on large batches of data. \n",
- "\n",
- "# Prepare Dataset\n",
- "\n",
- "We'll begin by loading the reviews dataset, and padding the reviews so all reviews have the same length. Each review is represented as an array of numbers, where each number represents an indexed word. Training data for both Local Mode and Hosted Training must be saved as files, so we'll also save the transformed data to files."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "Using TensorFlow backend.\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "25000 train sequences\n",
- "25000 test sequences\n",
- "x_train shape: (25000, 400)\n",
- "x_test shape: (25000, 400)\n"
- ]
- }
- ],
- "source": [
- "import os\n",
- "from keras.preprocessing import sequence\n",
- "from keras.datasets import imdb\n",
- "\n",
- "max_features = 20000\n",
- "maxlen = 400\n",
- "\n",
- "(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)\n",
- "print(len(x_train), 'train sequences')\n",
- "print(len(x_test), 'test sequences')\n",
- "\n",
- "x_train = sequence.pad_sequences(x_train, maxlen=maxlen)\n",
- "x_test = sequence.pad_sequences(x_test, maxlen=maxlen)\n",
- "print('x_train shape:', x_train.shape)\n",
- "print('x_test shape:', x_test.shape)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "data_dir = os.path.join(os.getcwd(), 'data')\n",
- "os.makedirs(data_dir, exist_ok=True)\n",
- "\n",
- "train_dir = os.path.join(os.getcwd(), 'data/train')\n",
- "os.makedirs(train_dir, exist_ok=True)\n",
- "\n",
- "test_dir = os.path.join(os.getcwd(), 'data/test')\n",
- "os.makedirs(test_dir, exist_ok=True)\n",
- "\n",
- "csv_test_dir = os.path.join(os.getcwd(), 'data/csv-test')\n",
- "os.makedirs(csv_test_dir, exist_ok=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "\n",
- "np.save(os.path.join(train_dir, 'x_train.npy'), x_train)\n",
- "np.save(os.path.join(train_dir, 'y_train.npy'), y_train)\n",
- "np.save(os.path.join(test_dir, 'x_test.npy'), x_test)\n",
- "np.save(os.path.join(test_dir, 'y_test.npy'), y_test)\n",
- "np.savetxt(os.path.join(csv_test_dir, 'csv-test.csv'), np.array(x_test[:100], dtype=np.int32), fmt='%d', delimiter=\",\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Local Mode Training\n",
- "\n",
- "Amazon SageMaker’s Local Mode training feature is a convenient way to make sure your code is working as expected before moving on to full scale, hosted training. With Local Mode, you can run quick tests with just a sample of training data, and/or a small number of epochs (passes over the full training set), while avoiding the time and expense of attempting full scale hosted training using possibly buggy code. \n",
- "\n",
- "To train in Local Mode, it is necessary to have docker-compose or nvidia-docker-compose (for GPU) installed in the notebook instance. Running following script will install docker-compose or nvidia-docker-compose and configure the notebook environment for you."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "/bin/bash: ./setup.sh: No such file or directory\r\n"
- ]
- }
- ],
- "source": [
- "!/bin/bash ./setup.sh"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The next step is to set up a TensorFlow Estimator for Local Mode training. A key parameters for the Estimator is the `train_instance_type`, which is the kind of hardware on which training will run. In the case of Local Mode, we simply set this parameter to `local_gpu` to invoke Local Mode training on the GPU, or to `local` if the instance has a CPU. Other parameters of note are the algorithm’s hyperparameters, which are passed in as a dictionary, and a Boolean parameter indicating that we are using Script Mode."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "WARNING: Logging before flag parsing goes to stderr.\n",
- "W0729 09:01:18.666472 4639487424 session.py:1106] Couldn't call 'get_role' to get Role ARN from role name olg to get Role path.\n"
- ]
- },
- {
- "ename": "ValueError",
- "evalue": "The current AWS identity is not a role: arn:aws:iam::722321484884:user/olg, therefore it cannot be used as a SageMaker execution role",
- "output_type": "error",
- "traceback": [
- "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
- "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
- "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0mtrain_instance_count\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 13\u001b[0m \u001b[0mhyperparameters\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mhyperparameters\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 14\u001b[0;31m \u001b[0mrole\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msagemaker\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_execution_role\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 15\u001b[0m \u001b[0mbase_job_name\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'tf-keras-sentiment'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 16\u001b[0m \u001b[0mframework_version\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'1.13.1'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m~/anaconda3/lib/python3.6/site-packages/sagemaker/session.py\u001b[0m in \u001b[0;36mget_execution_role\u001b[0;34m(sagemaker_session)\u001b[0m\n\u001b[1;32m 1310\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0marn\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1311\u001b[0m \u001b[0mmessage\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'The current AWS identity is not a role: {}, therefore it cannot be used as a SageMaker execution role'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1312\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmessage\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marn\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1313\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1314\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;31mValueError\u001b[0m: The current AWS identity is not a role: arn:aws:iam::722321484884:user/olg, therefore it cannot be used as a SageMaker execution role"
- ]
- }
- ],
- "source": [
- "import sagemaker\n",
- "from sagemaker.tensorflow import TensorFlow\n",
- "\n",
- "model_dir = '/opt/ml/model'\n",
- "train_instance_type = 'local'\n",
- "tornasole_s3 = 's3://' + sagemaker.Session().default_bucket() + \"/tornasole-parameters/\"\n",
- "hyperparameters = {'epochs': 1, 'batch_size': 128, \n",
- " 'tornasole-save-interval': 100, 'tornasole_outdir' : tornasole_s3 }\n",
- "local_estimator = TensorFlow(entry_point='sentiment_keras.py',\n",
- " model_dir=model_dir,\n",
- " train_instance_type=train_instance_type,\n",
- " train_instance_count=1,\n",
- " hyperparameters=hyperparameters,\n",
- " role=sagemaker.get_execution_role(),\n",
- " base_job_name='tf-keras-sentiment',\n",
- " framework_version='1.13.1',\n",
- " py_version='py3',\n",
- " image_name='072677473360.dkr.ecr.us-east-1.amazonaws.com/tornasole-preprod-tf-1.13.1-cpu:latest',\n",
- " script_mode=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now we'll briefly train the model in Local Mode. Since this is just to make sure the code is working, we'll train for only one epoch. (Note that on a CPU-based notebook instance, this one epoch will take at least 3 or 4 minutes.) As you'll see from the logs below the cell when training is complete, even when trained for only one epoch, the accuracy of the model on training data is already at almost 80%. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 30,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Creating tmpsw39_nhj_algo-1-zwl3k_1 ... \n",
- "\u001b[1BAttaching to tmpsw39_nhj_algo-1-zwl3k_12mdone\u001b[0m\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m 2019-07-16 15:20:30,596 sagemaker-containers INFO Imported framework sagemaker_tensorflow_container.training\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m 2019-07-16 15:20:30,603 sagemaker-containers INFO No GPUs detected (normal if no gpus installed)\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m 2019-07-16 15:20:30,917 sagemaker-containers INFO No GPUs detected (normal if no gpus installed)\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m 2019-07-16 15:20:30,939 sagemaker-containers INFO No GPUs detected (normal if no gpus installed)\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m 2019-07-16 15:20:30,961 sagemaker-containers INFO No GPUs detected (normal if no gpus installed)\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m 2019-07-16 15:20:30,975 sagemaker-containers INFO Invoking user script\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Training Env:\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m {\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"additional_framework_parameters\": {},\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"channel_input_dirs\": {\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"train\": \"/opt/ml/input/data/train\",\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"test\": \"/opt/ml/input/data/test\"\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m },\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"current_host\": \"algo-1-zwl3k\",\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"framework_module\": \"sagemaker_tensorflow_container.training:main\",\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"hosts\": [\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"algo-1-zwl3k\"\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m ],\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"hyperparameters\": {\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"epochs\": 1,\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"batch_size\": 128,\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"tornasole-save-interval\": 100,\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"tornasole_outdir\": \"s3://sagemaker-us-east-1-072677473360/tornasole-parameters/\",\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"model_dir\": \"/opt/ml/model\"\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m },\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"input_config_dir\": \"/opt/ml/input/config\",\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"input_data_config\": {\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"train\": {\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"TrainingInputMode\": \"File\"\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m },\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"test\": {\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"TrainingInputMode\": \"File\"\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m }\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m },\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"input_dir\": \"/opt/ml/input\",\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"is_master\": true,\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"job_name\": \"tf-keras-sentiment-2019-07-16-15-20-27-160\",\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"log_level\": 20,\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"master_hostname\": \"algo-1-zwl3k\",\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"model_dir\": \"/opt/ml/model\",\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"module_dir\": \"s3://sagemaker-us-east-1-072677473360/tf-keras-sentiment-2019-07-16-15-20-27-160/source/sourcedir.tar.gz\",\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"module_name\": \"sentiment_keras\",\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"network_interface_name\": \"eth0\",\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"num_cpus\": 4,\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"num_gpus\": 0,\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"output_data_dir\": \"/opt/ml/output/data\",\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"output_dir\": \"/opt/ml/output\",\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"output_intermediate_dir\": \"/opt/ml/output/intermediate\",\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"resource_config\": {\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"current_host\": \"algo-1-zwl3k\",\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"hosts\": [\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"algo-1-zwl3k\"\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m ]\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m },\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \"user_entry_point\": \"sentiment_keras.py\"\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m }\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Environment variables:\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_HOSTS=[\"algo-1-zwl3k\"]\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_NETWORK_INTERFACE_NAME=eth0\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_HPS={\"batch_size\":128,\"epochs\":1,\"model_dir\":\"/opt/ml/model\",\"tornasole-save-interval\":100,\"tornasole_outdir\":\"s3://sagemaker-us-east-1-072677473360/tornasole-parameters/\"}\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_USER_ENTRY_POINT=sentiment_keras.py\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_FRAMEWORK_PARAMS={}\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_RESOURCE_CONFIG={\"current_host\":\"algo-1-zwl3k\",\"hosts\":[\"algo-1-zwl3k\"]}\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_INPUT_DATA_CONFIG={\"test\":{\"TrainingInputMode\":\"File\"},\"train\":{\"TrainingInputMode\":\"File\"}}\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_OUTPUT_DATA_DIR=/opt/ml/output/data\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_CHANNELS=[\"test\",\"train\"]\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_CURRENT_HOST=algo-1-zwl3k\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_MODULE_NAME=sentiment_keras\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_LOG_LEVEL=20\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_FRAMEWORK_MODULE=sagemaker_tensorflow_container.training:main\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_INPUT_DIR=/opt/ml/input\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_INPUT_CONFIG_DIR=/opt/ml/input/config\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_OUTPUT_DIR=/opt/ml/output\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_NUM_CPUS=4\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_NUM_GPUS=0\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_MODEL_DIR=/opt/ml/model\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_MODULE_DIR=s3://sagemaker-us-east-1-072677473360/tf-keras-sentiment-2019-07-16-15-20-27-160/source/sourcedir.tar.gz\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_TRAINING_ENV={\"additional_framework_parameters\":{},\"channel_input_dirs\":{\"test\":\"/opt/ml/input/data/test\",\"train\":\"/opt/ml/input/data/train\"},\"current_host\":\"algo-1-zwl3k\",\"framework_module\":\"sagemaker_tensorflow_container.training:main\",\"hosts\":[\"algo-1-zwl3k\"],\"hyperparameters\":{\"batch_size\":128,\"epochs\":1,\"model_dir\":\"/opt/ml/model\",\"tornasole-save-interval\":100,\"tornasole_outdir\":\"s3://sagemaker-us-east-1-072677473360/tornasole-parameters/\"},\"input_config_dir\":\"/opt/ml/input/config\",\"input_data_config\":{\"test\":{\"TrainingInputMode\":\"File\"},\"train\":{\"TrainingInputMode\":\"File\"}},\"input_dir\":\"/opt/ml/input\",\"is_master\":true,\"job_name\":\"tf-keras-sentiment-2019-07-16-15-20-27-160\",\"log_level\":20,\"master_hostname\":\"algo-1-zwl3k\",\"model_dir\":\"/opt/ml/model\",\"module_dir\":\"s3://sagemaker-us-east-1-072677473360/tf-keras-sentiment-2019-07-16-15-20-27-160/source/sourcedir.tar.gz\",\"module_name\":\"sentiment_keras\",\"network_interface_name\":\"eth0\",\"num_cpus\":4,\"num_gpus\":0,\"output_data_dir\":\"/opt/ml/output/data\",\"output_dir\":\"/opt/ml/output\",\"output_intermediate_dir\":\"/opt/ml/output/intermediate\",\"resource_config\":{\"current_host\":\"algo-1-zwl3k\",\"hosts\":[\"algo-1-zwl3k\"]},\"user_entry_point\":\"sentiment_keras.py\"}\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_USER_ARGS=[\"--batch_size\",\"128\",\"--epochs\",\"1\",\"--model_dir\",\"/opt/ml/model\",\"--tornasole-save-interval\",\"100\",\"--tornasole_outdir\",\"s3://sagemaker-us-east-1-072677473360/tornasole-parameters/\"]\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_CHANNEL_TRAIN=/opt/ml/input/data/train\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_CHANNEL_TEST=/opt/ml/input/data/test\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_HP_EPOCHS=1\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_HP_BATCH_SIZE=128\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_HP_TORNASOLE-SAVE-INTERVAL=100\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_HP_TORNASOLE_OUTDIR=s3://sagemaker-us-east-1-072677473360/tornasole-parameters/\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m SM_HP_MODEL_DIR=/opt/ml/model\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/lib/python36.zip:/usr/lib/python3.6:/usr/lib/python3.6/lib-dynload:/usr/local/lib/python3.6/dist-packages:/usr/lib/python3/dist-packages\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Invoking script with the following command:\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m /usr/local/bin/python sentiment_keras.py --batch_size 128 --epochs 1 --model_dir /opt/ml/model --tornasole-save-interval 100 --tornasole_outdir s3://sagemaker-us-east-1-072677473360/tornasole-parameters/\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m \n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Using TensorFlow backend.\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m x train (25000, 400) y train (25000,)\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m [[ 0 0 0 ... 19 178 32]\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m [ 0 0 0 ... 16 145 95]\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m [ 0 0 0 ... 7 129 113]\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m ...\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m [595 13 258 ... 72 33 32]\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m [ 0 0 0 ... 28 126 110]\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m [ 0 0 0 ... 7 43 50]]\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m [1 0 0 1 0 0 1 0 1 0]\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m x test (25000, 400) y test (25000,)\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Instructions for updating:\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Colocations handled automatically by placer.\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Instructions for updating:\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Colocations handled automatically by placer.\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Instructions for updating:\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Instructions for updating:\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Instructions for updating:\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Use tf.cast instead.\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Instructions for updating:\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Use tf.cast instead.\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:102: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Instructions for updating:\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Deprecated in favor of operator or tf.math.divide.\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:102: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Instructions for updating:\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Deprecated in favor of operator or tf.math.divide.\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Train on 25000 samples, validate on 25000 samples\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m Epoch 1/1\n",
- "25000/25000 [==============================] - 390s 16ms/step - loss: 0.4266 - acc: 0.7852 - val_loss: 0.2631 - val_acc: 0.8911\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m 2019-07-16 15:27:04,144 sagemaker_tensorflow_container.training WARNING Your model will NOT be servable with SageMaker TensorFlow Serving container.The model artifact was not saved in the TensorFlow SavedModel directory structure:\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m https://www.tensorflow.org/guide/saved_model#structure_of_a_savedmodel_directory\n",
- "\u001b[36malgo-1-zwl3k_1 |\u001b[0m 2019-07-16 15:27:04,144 sagemaker-containers INFO Reporting training SUCCESS\n",
- "\u001b[36mtmpsw39_nhj_algo-1-zwl3k_1 exited with code 0\n",
- "\u001b[0mAborting on container exit...\n",
- "===== Job Complete =====\n"
- ]
- }
- ],
- "source": [
- "inputs = {'train': f'file://{train_dir}',\n",
- " 'test': f'file://{test_dir}'}\n",
- "\n",
- "local_estimator.fit(inputs)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Hosted Training\n",
- "\n",
- "After confirming with Local Mode training that our code works, we can move on to SageMaker's hosted training, which uses compute resources separate from your notebook instance. Hosted training spins up a cluster of one or more instances for training, and then tears the cluster down when training is complete. In general, hosted training is preferred for actual training, especially large-scale, distributed training. Before starting hosted training, the data must be uploaded to S3."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "{'train': 's3://sagemaker-us-east-1-072677473360/sagemaker-us-east-1-072677473360/data/train', 'test': 's3://sagemaker-us-east-1-072677473360/sagemaker-us-east-1-072677473360/data/test'}\n"
- ]
- }
- ],
- "source": [
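- "# default_bucket() returns the bucket name; it is reused here as the key prefix,\n",
- "# so the data lands under s3://<bucket>/<bucket>/data/... (see the printed output).\n",
- "s3_prefix = sagemaker.Session().default_bucket()\n",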
- "\n",
- "traindata_s3_prefix = '{}/data/train'.format(s3_prefix)\n",
- "testdata_s3_prefix = '{}/data/test'.format(s3_prefix)\n",
- "\n",
- "train_s3 = sagemaker.Session().upload_data(path='./data/train/', key_prefix=traindata_s3_prefix)\n",
- "test_s3 = sagemaker.Session().upload_data(path='./data/test/', key_prefix=testdata_s3_prefix)\n",
- "\n",
- "inputs = {'train':train_s3, 'test': test_s3}\n",
- "print(inputs)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "With the training data now in S3, we're ready to set up an Estimator object for hosted training. It is similar to the Local Mode Estimator, except that `train_instance_type` is set to an ML instance type instead of the local type used for Local Mode. For actual training you would normally raise the number of epochs above one, as in the commented-out `hyperparameters` line; here `epochs` stays at 1, and the Tornasole save interval and output location are passed as additional hyperparameters."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {},
- "outputs": [],
- "source": [
- "train_instance_type = 'ml.p3.2xlarge'\n",
- "#hyperparameters = {'epochs': 10, 'batch_size': 128}\n",
- "hyperparameters = {'epochs': 1, 'batch_size': 128, \n",
- " 'tornasole-save-interval': 1, 'tornasole_outdir' : tornasole_s3 }\n",
- "\n",
- "estimator = TensorFlow(entry_point='sentiment_keras.py',\n",
- " model_dir=model_dir,\n",
- " train_instance_type=train_instance_type,\n",
- " train_instance_count=1,\n",
- " hyperparameters=hyperparameters,\n",
- " role=sagemaker.get_execution_role(),\n",
- " base_job_name='tf-keras-sentiment',\n",
- " framework_version='1.13.1',\n",
- " py_version='py3',\n",
- " image_name='072677473360.dkr.ecr.us-east-1.amazonaws.com/tornasole-preprod-tf-1.13.1-cpu:latest',\n",
- " script_mode=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "With the training instance type changed, we simply call `fit` to start hosted training. At the end of hosted training, the logs below the cell report accuracy on the training set and a validation accuracy of around 90%. When training runs for more epochs, the model may start overfitting (becoming less able to generalize to data it has not yet seen), even though we employ dropout as a regularization technique. In a production situation, further investigation would be necessary."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "2019-07-16 00:36:04 Starting - Starting the training job...\n",
- "2019-07-16 00:36:09 Starting - Launching requested ML instances......\n",
- "2019-07-16 00:37:17 Starting - Preparing the instances for training......\n",
- "2019-07-16 00:38:15 Downloading - Downloading input data......\n",
- "2019-07-16 00:39:27 Training - Downloading the training image......\n",
- "2019-07-16 00:40:17 Training - Training image download completed. Training in progress.\n",
- "\u001b[31m2019-07-16 00:40:20,820 sagemaker-containers INFO Imported framework sagemaker_tensorflow_container.training\u001b[0m\n",
- "\u001b[31m2019-07-16 00:40:21,423 sagemaker-containers INFO Invoking user script\n",
- "\u001b[0m\n",
- "\u001b[31mTraining Env:\n",
- "\u001b[0m\n",
- "\u001b[31m{\n",
- " \"additional_framework_parameters\": {},\n",
- " \"channel_input_dirs\": {\n",
- " \"test\": \"/opt/ml/input/data/test\",\n",
- " \"train\": \"/opt/ml/input/data/train\"\n",
- " },\n",
- " \"current_host\": \"algo-1\",\n",
- " \"framework_module\": \"sagemaker_tensorflow_container.training:main\",\n",
- " \"hosts\": [\n",
- " \"algo-1\"\n",
- " ],\n",
- " \"hyperparameters\": {\n",
- " \"batch_size\": 128,\n",
- " \"tornasole_outdir\": \"s3://sagemaker-us-east-1-072677473360/tornasole-parameters/\",\n",
- " \"model_dir\": \"/opt/ml/model\",\n",
- " \"epochs\": 1,\n",
- " \"tornasole-save-interval\": 1\n",
- " },\n",
- " \"input_config_dir\": \"/opt/ml/input/config\",\n",
- " \"input_data_config\": {\n",
- " \"test\": {\n",
- " \"TrainingInputMode\": \"File\",\n",
- " \"S3DistributionType\": \"FullyReplicated\",\n",
- " \"RecordWrapperType\": \"None\"\n",
- " },\n",
- " \"train\": {\n",
- " \"TrainingInputMode\": \"File\",\n",
- " \"S3DistributionType\": \"FullyReplicated\",\n",
- " \"RecordWrapperType\": \"None\"\n",
- " }\n",
- " },\n",
- " \"input_dir\": \"/opt/ml/input\",\n",
- " \"is_master\": true,\n",
- " \"job_name\": \"tf-keras-sentiment-2019-07-16-00-36-04-131\",\n",
- " \"log_level\": 20,\n",
- " \"master_hostname\": \"algo-1\",\n",
- " \"model_dir\": \"/opt/ml/model\",\n",
- " \"module_dir\": \"s3://sagemaker-us-east-1-072677473360/tf-keras-sentiment-2019-07-16-00-36-04-131/source/sourcedir.tar.gz\",\n",
- " \"module_name\": \"sentiment_keras\",\n",
- " \"network_interface_name\": \"eth0\",\n",
- " \"num_cpus\": 8,\n",
- " \"num_gpus\": 1,\n",
- " \"output_data_dir\": \"/opt/ml/output/data\",\n",
- " \"output_dir\": \"/opt/ml/output\",\n",
- " \"output_intermediate_dir\": \"/opt/ml/output/intermediate\",\n",
- " \"resource_config\": {\n",
- " \"current_host\": \"algo-1\",\n",
- " \"hosts\": [\n",
- " \"algo-1\"\n",
- " ],\n",
- " \"network_interface_name\": \"eth0\"\n",
- " },\n",
- " \"user_entry_point\": \"sentiment_keras.py\"\u001b[0m\n",
- "\u001b[31m}\n",
- "\u001b[0m\n",
- "\u001b[31mEnvironment variables:\n",
- "\u001b[0m\n",
- "\u001b[31mSM_HOSTS=[\"algo-1\"]\u001b[0m\n",
- "\u001b[31mSM_NETWORK_INTERFACE_NAME=eth0\u001b[0m\n",
- "\u001b[31mSM_HPS={\"batch_size\":128,\"epochs\":1,\"model_dir\":\"/opt/ml/model\",\"tornasole-save-interval\":1,\"tornasole_outdir\":\"s3://sagemaker-us-east-1-072677473360/tornasole-parameters/\"}\u001b[0m\n",
- "\u001b[31mSM_USER_ENTRY_POINT=sentiment_keras.py\u001b[0m\n",
- "\u001b[31mSM_FRAMEWORK_PARAMS={}\u001b[0m\n",
- "\u001b[31mSM_RESOURCE_CONFIG={\"current_host\":\"algo-1\",\"hosts\":[\"algo-1\"],\"network_interface_name\":\"eth0\"}\u001b[0m\n",
- "\u001b[31mSM_INPUT_DATA_CONFIG={\"test\":{\"RecordWrapperType\":\"None\",\"S3DistributionType\":\"FullyReplicated\",\"TrainingInputMode\":\"File\"},\"train\":{\"RecordWrapperType\":\"None\",\"S3DistributionType\":\"FullyReplicated\",\"TrainingInputMode\":\"File\"}}\u001b[0m\n",
- "\u001b[31mSM_OUTPUT_DATA_DIR=/opt/ml/output/data\u001b[0m\n",
- "\u001b[31mSM_CHANNELS=[\"test\",\"train\"]\u001b[0m\n",
- "\u001b[31mSM_CURRENT_HOST=algo-1\u001b[0m\n",
- "\u001b[31mSM_MODULE_NAME=sentiment_keras\u001b[0m\n",
- "\u001b[31mSM_LOG_LEVEL=20\u001b[0m\n",
- "\u001b[31mSM_FRAMEWORK_MODULE=sagemaker_tensorflow_container.training:main\u001b[0m\n",
- "\u001b[31mSM_INPUT_DIR=/opt/ml/input\u001b[0m\n",
- "\u001b[31mSM_INPUT_CONFIG_DIR=/opt/ml/input/config\u001b[0m\n",
- "\u001b[31mSM_OUTPUT_DIR=/opt/ml/output\u001b[0m\n",
- "\u001b[31mSM_NUM_CPUS=8\u001b[0m\n",
- "\u001b[31mSM_NUM_GPUS=1\u001b[0m\n",
- "\u001b[31mSM_MODEL_DIR=/opt/ml/model\u001b[0m\n",
- "\u001b[31mSM_MODULE_DIR=s3://sagemaker-us-east-1-072677473360/tf-keras-sentiment-2019-07-16-00-36-04-131/source/sourcedir.tar.gz\u001b[0m\n",
- "\u001b[31mSM_TRAINING_ENV={\"additional_framework_parameters\":{},\"channel_input_dirs\":{\"test\":\"/opt/ml/input/data/test\",\"train\":\"/opt/ml/input/data/train\"},\"current_host\":\"algo-1\",\"framework_module\":\"sagemaker_tensorflow_container.training:main\",\"hosts\":[\"algo-1\"],\"hyperparameters\":{\"batch_size\":128,\"epochs\":1,\"model_dir\":\"/opt/ml/model\",\"tornasole-save-interval\":1,\"tornasole_outdir\":\"s3://sagemaker-us-east-1-072677473360/tornasole-parameters/\"},\"input_config_dir\":\"/opt/ml/input/config\",\"input_data_config\":{\"test\":{\"RecordWrapperType\":\"None\",\"S3DistributionType\":\"FullyReplicated\",\"TrainingInputMode\":\"File\"},\"train\":{\"RecordWrapperType\":\"None\",\"S3DistributionType\":\"FullyReplicated\",\"TrainingInputMode\":\"File\"}},\"input_dir\":\"/opt/ml/input\",\"is_master\":true,\"job_name\":\"tf-keras-sentiment-2019-07-16-00-36-04-131\",\"log_level\":20,\"master_hostname\":\"algo-1\",\"model_dir\":\"/opt/ml/model\",\"module_dir\":\"s3://sagemaker-us-east-1-072677473360/tf-keras-sentiment-2019-07-16-00-36-04-131/source/sourcedir.tar.gz\",\"module_name\":\"sentiment_keras\",\"network_interface_name\":\"eth0\",\"num_cpus\":8,\"num_gpus\":1,\"output_data_dir\":\"/opt/ml/output/data\",\"output_dir\":\"/opt/ml/output\",\"output_intermediate_dir\":\"/opt/ml/output/intermediate\",\"resource_config\":{\"current_host\":\"algo-1\",\"hosts\":[\"algo-1\"],\"network_interface_name\":\"eth0\"},\"user_entry_point\":\"sentiment_keras.py\"}\u001b[0m\n",
- "\u001b[31mSM_USER_ARGS=[\"--batch_size\",\"128\",\"--epochs\",\"1\",\"--model_dir\",\"/opt/ml/model\",\"--tornasole-save-interval\",\"1\",\"--tornasole_outdir\",\"s3://sagemaker-us-east-1-072677473360/tornasole-parameters/\"]\u001b[0m\n",
- "\u001b[31mSM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate\u001b[0m\n",
- "\u001b[31mSM_CHANNEL_TEST=/opt/ml/input/data/test\u001b[0m\n",
- "\u001b[31mSM_CHANNEL_TRAIN=/opt/ml/input/data/train\u001b[0m\n",
- "\u001b[31mSM_HP_BATCH_SIZE=128\u001b[0m\n",
- "\u001b[31mSM_HP_TORNASOLE_OUTDIR=s3://sagemaker-us-east-1-072677473360/tornasole-parameters/\u001b[0m\n",
- "\u001b[31mSM_HP_MODEL_DIR=/opt/ml/model\u001b[0m\n",
- "\u001b[31mSM_HP_EPOCHS=1\u001b[0m\n",
- "\u001b[31mSM_HP_TORNASOLE-SAVE-INTERVAL=1\u001b[0m\n",
- "\u001b[31mPYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/lib/python36.zip:/usr/lib/python3.6:/usr/lib/python3.6/lib-dynload:/usr/local/lib/python3.6/dist-packages:/usr/lib/python3/dist-packages\n",
- "\u001b[0m\n",
- "\u001b[31mInvoking script with the following command:\n",
- "\u001b[0m\n",
- "\u001b[31m/usr/local/bin/python sentiment_keras.py --batch_size 128 --epochs 1 --model_dir /opt/ml/model --tornasole-save-interval 1 --tornasole_outdir s3://sagemaker-us-east-1-072677473360/tornasole-parameters/\n",
- "\n",
- "\u001b[0m\n",
- "\u001b[31mUsing TensorFlow backend.\u001b[0m\n",
- "\u001b[31mx train (25000, 400) y train (25000,)\u001b[0m\n",
- "\u001b[31m[[ 0 0 0 ... 19 178 32]\n",
- " [ 0 0 0 ... 16 145 95]\n",
- " [ 0 0 0 ... 7 129 113]\n",
- " ...\n",
- " [595 13 258 ... 72 33 32]\n",
- " [ 0 0 0 ... 28 126 110]\n",
- " [ 0 0 0 ... 7 43 50]]\u001b[0m\n",
- "\u001b[31m[1 0 0 1 0 0 1 0 1 0]\u001b[0m\n",
- "\u001b[31mx test (25000, 400) y test (25000,)\u001b[0m\n",
- "\u001b[31mWARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\u001b[0m\n",
- "\u001b[31mInstructions for updating:\u001b[0m\n",
- "\u001b[31mColocations handled automatically by placer.\u001b[0m\n",
- "\u001b[31mWARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\u001b[0m\n",
- "\u001b[31mInstructions for updating:\u001b[0m\n",
- "\u001b[31mColocations handled automatically by placer.\u001b[0m\n",
- "\u001b[31mWARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.\u001b[0m\n",
- "\u001b[31mInstructions for updating:\u001b[0m\n",
- "\u001b[31mPlease use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.\u001b[0m\n",
- "\u001b[31mWARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.\u001b[0m\n",
- "\u001b[31mInstructions for updating:\u001b[0m\n",
- "\u001b[31mPlease use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.\u001b[0m\n",
- "\u001b[31mWARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\u001b[0m\n",
- "\u001b[31mInstructions for updating:\u001b[0m\n",
- "\u001b[31mUse tf.cast instead.\u001b[0m\n",
- "\u001b[31mWARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\u001b[0m\n",
- "\u001b[31mInstructions for updating:\u001b[0m\n",
- "\u001b[31mUse tf.cast instead.\u001b[0m\n",
- "\u001b[31mWARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:102: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\u001b[0m\n",
- "\u001b[31mInstructions for updating:\u001b[0m\n",
- "\u001b[31mDeprecated in favor of operator or tf.math.divide.\u001b[0m\n",
- "\u001b[31mWARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:102: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\u001b[0m\n",
- "\u001b[31mInstructions for updating:\u001b[0m\n",
- "\u001b[31mDeprecated in favor of operator or tf.math.divide.\u001b[0m\n",
- "\u001b[31mTrain on 25000 samples, validate on 25000 samples\u001b[0m\n",
- "\u001b[31mEpoch 1/1\u001b[0m\n",
- "\u001b[31m 128/25000 [..............................] - ETA: 5:29 - loss: 0.6979 - acc: 0.4531\u001b[0m\n",
- "\u001b[31m 256/25000 [..............................] - ETA: 5:21 - loss: 0.6944 - acc: 0.5000\n",
- " 384/25000 [..............................] - ETA: 4:22 - loss: 0.7009 - acc: 0.4922\u001b[0m\n",
- "\u001b[31m 512/25000 [..............................] - ETA: 3:52 - loss: 0.7005 - acc: 0.4922\u001b[0m\n",
- "\u001b[31m 640/25000 [..............................] - ETA: 3:34 - loss: 0.6990 - acc: 0.4875\u001b[0m\n",
- "\u001b[31m 768/25000 [..............................] - ETA: 3:22 - loss: 0.6979 - acc: 0.4935\u001b[0m\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\u001b[31m 896/25000 [>.............................] - ETA: 3:13 - loss: 0.6988 - acc: 0.4922\n",
- " 1024/25000 [>.............................] - ETA: 3:06 - loss: 0.6980 - acc: 0.4961\u001b[0m\n",
- "\u001b[31m 1152/25000 [>.............................] - ETA: 3:00 - loss: 0.6998 - acc: 0.4931\u001b[0m\n",
- "\u001b[31m 1280/25000 [>.............................] - ETA: 2:56 - loss: 0.7008 - acc: 0.4883\n",
- " 1408/25000 [>.............................] - ETA: 3:06 - loss: 0.7003 - acc: 0.4879\u001b[0m\n",
- "\u001b[31m 1536/25000 [>.............................] - ETA: 3:01 - loss: 0.6993 - acc: 0.4961\u001b[0m\n",
- "\u001b[31m 1664/25000 [>.............................] - ETA: 2:57 - loss: 0.6984 - acc: 0.5072\u001b[0m\n",
- "\u001b[31m 1792/25000 [=>............................] - ETA: 2:53 - loss: 0.6981 - acc: 0.5056\u001b[0m\n",
- "\u001b[31m 1920/25000 [=>............................] - ETA: 2:50 - loss: 0.6969 - acc: 0.5120\n",
- " 2048/25000 [=>............................] - ETA: 2:48 - loss: 0.6969 - acc: 0.5112\u001b[0m\n",
- "\u001b[31m 2176/25000 [=>............................] - ETA: 2:45 - loss: 0.6969 - acc: 0.5078\u001b[0m\n",
- "\u001b[31m 2304/25000 [=>............................] - ETA: 2:43 - loss: 0.6966 - acc: 0.5074\u001b[0m\n",
- "\u001b[31m 2432/25000 [=>............................] - ETA: 2:40 - loss: 0.6965 - acc: 0.5053\u001b[0m\n",
- "\u001b[31m 2560/25000 [==>...........................] - ETA: 2:45 - loss: 0.6960 - acc: 0.5059\u001b[0m\n",
- "\u001b[31m 2688/25000 [==>...........................] - ETA: 2:43 - loss: 0.6956 - acc: 0.5089\u001b[0m\n",
- "\u001b[31m 2816/25000 [==>...........................] - ETA: 2:40 - loss: 0.6950 - acc: 0.5142\n",
- " 2944/25000 [==>...........................] - ETA: 2:38 - loss: 0.6947 - acc: 0.5166\u001b[0m\n",
- "\u001b[31m 3072/25000 [==>...........................] - ETA: 2:36 - loss: 0.6940 - acc: 0.5208\u001b[0m\n",
- "\u001b[31m 3200/25000 [==>...........................] - ETA: 2:34 - loss: 0.6936 - acc: 0.5209\u001b[0m\n",
- "\u001b[31m 3328/25000 [==>...........................] - ETA: 2:33 - loss: 0.6932 - acc: 0.5216\n",
- " 3456/25000 [===>..........................] - ETA: 2:31 - loss: 0.6930 - acc: 0.5214\u001b[0m\n",
- "\u001b[31m 3584/25000 [===>..........................] - ETA: 2:29 - loss: 0.6928 - acc: 0.5209\u001b[0m\n",
- "\u001b[31m 3712/25000 [===>..........................] - ETA: 2:33 - loss: 0.6923 - acc: 0.5229\u001b[0m\n",
- "\u001b[31m 3840/25000 [===>..........................] - ETA: 2:31 - loss: 0.6915 - acc: 0.5289\n",
- " 3968/25000 [===>..........................] - ETA: 2:30 - loss: 0.6906 - acc: 0.5358\u001b[0m\n",
- "\u001b[31m 4096/25000 [===>..........................] - ETA: 2:28 - loss: 0.6901 - acc: 0.5386\u001b[0m\n",
- "\u001b[31m 4224/25000 [====>.........................] - ETA: 2:27 - loss: 0.6895 - acc: 0.5400\u001b[0m\n",
- "\u001b[31m 4352/25000 [====>.........................] - ETA: 2:25 - loss: 0.6892 - acc: 0.5402\u001b[0m\n",
- "\u001b[31m 4480/25000 [====>.........................] - ETA: 2:24 - loss: 0.6883 - acc: 0.5442\n",
- " 4608/25000 [====>.........................] - ETA: 2:22 - loss: 0.6875 - acc: 0.5493\u001b[0m\n",
- "\u001b[31m 4736/25000 [====>.........................] - ETA: 2:21 - loss: 0.6866 - acc: 0.5536\u001b[0m\n",
- "\u001b[31m 4864/25000 [====>.........................] - ETA: 2:22 - loss: 0.6855 - acc: 0.5588\n",
- " 4992/25000 [====>.........................] - ETA: 2:21 - loss: 0.6845 - acc: 0.5623\u001b[0m\n",
- "\u001b[31m 5120/25000 [=====>........................] - ETA: 2:19 - loss: 0.6836 - acc: 0.5652\u001b[0m\n",
- "\u001b[31m 5248/25000 [=====>........................] - ETA: 2:18 - loss: 0.6827 - acc: 0.5665\u001b[0m\n",
- "\u001b[31m 5376/25000 [=====>........................] - ETA: 2:17 - loss: 0.6825 - acc: 0.5664\n",
- " 5504/25000 [=====>........................] - ETA: 2:15 - loss: 0.6816 - acc: 0.5672\u001b[0m\n",
- "\u001b[31m 5632/25000 [=====>........................] - ETA: 2:14 - loss: 0.6806 - acc: 0.5701\u001b[0m\n",
- "\u001b[31m 5760/25000 [=====>........................] - ETA: 2:13 - loss: 0.6795 - acc: 0.5724\u001b[0m\n",
- "\u001b[31m 5888/25000 [======>.......................] - ETA: 2:11 - loss: 0.6785 - acc: 0.5727\u001b[0m\n",
- "\u001b[31m 6016/25000 [======>.......................] - ETA: 2:12 - loss: 0.6762 - acc: 0.5751\u001b[0m\n",
- "\u001b[31m 6144/25000 [======>.......................] - ETA: 2:11 - loss: 0.6753 - acc: 0.5778\u001b[0m\n",
- "\u001b[31m 6272/25000 [======>.......................] - ETA: 2:10 - loss: 0.6736 - acc: 0.5818\u001b[0m\n",
- "\u001b[31m 6400/25000 [======>.......................] - ETA: 2:09 - loss: 0.6719 - acc: 0.5850\n",
- " 6528/25000 [======>.......................] - ETA: 2:07 - loss: 0.6705 - acc: 0.5879\u001b[0m\n",
- "\u001b[31m 6656/25000 [======>.......................] - ETA: 2:06 - loss: 0.6688 - acc: 0.5898\u001b[0m\n",
- "\u001b[31m 6784/25000 [=======>......................] - ETA: 2:05 - loss: 0.6678 - acc: 0.5909\u001b[0m\n",
- "\u001b[31m 6912/25000 [=======>......................] - ETA: 2:04 - loss: 0.6661 - acc: 0.5926\u001b[0m\n",
- "\u001b[31m 7040/25000 [=======>......................] - ETA: 2:03 - loss: 0.6641 - acc: 0.5938\u001b[0m\n",
- "\u001b[31m 7168/25000 [=======>......................] - ETA: 2:03 - loss: 0.6620 - acc: 0.5958\u001b[0m\n",
- "\u001b[31m 7296/25000 [=======>......................] - ETA: 2:02 - loss: 0.6612 - acc: 0.5973\n",
- " 7424/25000 [=======>......................] - ETA: 2:01 - loss: 0.6589 - acc: 0.6001\u001b[0m\n",
- "\u001b[31m 7552/25000 [========>.....................] - ETA: 2:00 - loss: 0.6573 - acc: 0.6029\u001b[0m\n",
- "\u001b[31m 7680/25000 [========>.....................] - ETA: 1:59 - loss: 0.6552 - acc: 0.6065\u001b[0m\n",
- "\u001b[31m 7808/25000 [========>.....................] - ETA: 1:58 - loss: 0.6532 - acc: 0.6092\n",
- " 7936/25000 [========>.....................] - ETA: 1:57 - loss: 0.6502 - acc: 0.6129\u001b[0m\n",
- "\u001b[31m 8064/25000 [========>.....................] - ETA: 1:56 - loss: 0.6481 - acc: 0.6152\u001b[0m\n",
- "\u001b[31m 8192/25000 [========>.....................] - ETA: 1:55 - loss: 0.6467 - acc: 0.6161\u001b[0m\n",
- "\u001b[31m 8320/25000 [========>.....................] - ETA: 1:55 - loss: 0.6445 - acc: 0.6178\n",
- " 8448/25000 [=========>....................] - ETA: 1:54 - loss: 0.6423 - acc: 0.6197\u001b[0m\n",
- "\u001b[31m 8576/25000 [=========>....................] - ETA: 1:53 - loss: 0.6404 - acc: 0.6212\u001b[0m\n",
- "\u001b[31m 8704/25000 [=========>....................] - ETA: 1:52 - loss: 0.6386 - acc: 0.6229\u001b[0m\n",
- "\u001b[31m 8832/25000 [=========>....................] - ETA: 1:51 - loss: 0.6363 - acc: 0.6248\n",
- " 8960/25000 [=========>....................] - ETA: 1:50 - loss: 0.6334 - acc: 0.6282\u001b[0m\n",
- "\u001b[31m 9088/25000 [=========>....................] - ETA: 1:49 - loss: 0.6303 - acc: 0.6316\u001b[0m\n",
- "\u001b[31m 9216/25000 [==========>...................] - ETA: 1:47 - loss: 0.6284 - acc: 0.6331\u001b[0m\n",
- "\u001b[31m 9344/25000 [==========>...................] - ETA: 1:46 - loss: 0.6252 - acc: 0.6360\n",
- " 9472/25000 [==========>...................] - ETA: 1:47 - loss: 0.6222 - acc: 0.6388\u001b[0m\n",
- "\u001b[31m 9600/25000 [==========>...................] - ETA: 1:46 - loss: 0.6206 - acc: 0.6408\u001b[0m\n",
- "\u001b[31m 9728/25000 [==========>...................] - ETA: 1:45 - loss: 0.6181 - acc: 0.6430\u001b[0m\n",
- "\u001b[31m 9856/25000 [==========>...................] - ETA: 1:44 - loss: 0.6151 - acc: 0.6452\n",
- " 9984/25000 [==========>...................] - ETA: 1:42 - loss: 0.6125 - acc: 0.6473\u001b[0m\n",
- "\u001b[31m10112/25000 [===========>..................] - ETA: 1:41 - loss: 0.6102 - acc: 0.6494\u001b[0m\n",
- "\u001b[31m10240/25000 [===========>..................] - ETA: 1:40 - loss: 0.6082 - acc: 0.6511\u001b[0m\n",
- "\u001b[31m10368/25000 [===========>..................] - ETA: 1:39 - loss: 0.6056 - acc: 0.6529\u001b[0m\n",
- "\u001b[31m10496/25000 [===========>..................] - ETA: 1:38 - loss: 0.6038 - acc: 0.6548\u001b[0m\n",
- "\u001b[31m10624/25000 [===========>..................] - ETA: 1:47 - loss: 0.6021 - acc: 0.6558\u001b[0m\n",
- "\u001b[31m10752/25000 [===========>..................] - ETA: 1:46 - loss: 0.6010 - acc: 0.6566\u001b[0m\n",
- "\u001b[31m10880/25000 [============>.................] - ETA: 1:44 - loss: 0.5986 - acc: 0.6590\u001b[0m\n",
- "\u001b[31m11008/25000 [============>.................] - ETA: 1:43 - loss: 0.5961 - acc: 0.6611\u001b[0m\n",
- "\u001b[31m11136/25000 [============>.................] - ETA: 1:42 - loss: 0.5938 - acc: 0.6629\u001b[0m\n",
- "\u001b[31m11264/25000 [============>.................] - ETA: 1:41 - loss: 0.5906 - acc: 0.6652\u001b[0m\n",
- "\u001b[31m11392/25000 [============>.................] - ETA: 1:40 - loss: 0.5890 - acc: 0.6664\u001b[0m\n",
- "\u001b[31m11520/25000 [============>.................] - ETA: 1:39 - loss: 0.5871 - acc: 0.6682\u001b[0m\n",
- "\u001b[31m11648/25000 [============>.................] - ETA: 1:38 - loss: 0.5853 - acc: 0.6696\u001b[0m\n",
- "\u001b[31m11776/25000 [=============>................] - ETA: 1:37 - loss: 0.5827 - acc: 0.6713\u001b[0m\n",
- "\u001b[31m11904/25000 [=============>................] - ETA: 1:36 - loss: 0.5808 - acc: 0.6728\u001b[0m\n",
- "\u001b[31m12032/25000 [=============>................] - ETA: 1:35 - loss: 0.5778 - acc: 0.6750\u001b[0m\n",
- "\u001b[31m12160/25000 [=============>................] - ETA: 1:34 - loss: 0.5753 - acc: 0.6766\u001b[0m\n",
- "\u001b[31m12288/25000 [=============>................] - ETA: 1:33 - loss: 0.5731 - acc: 0.6785\u001b[0m\n",
- "\u001b[31m12416/25000 [=============>................] - ETA: 1:32 - loss: 0.5699 - acc: 0.6807\u001b[0m\n",
- "\u001b[31m12544/25000 [==============>...............] - ETA: 1:31 - loss: 0.5679 - acc: 0.6818\u001b[0m\n",
- "\u001b[31m12672/25000 [==============>...............] - ETA: 1:29 - loss: 0.5656 - acc: 0.6836\u001b[0m\n",
- "\u001b[31m12800/25000 [==============>...............] - ETA: 1:28 - loss: 0.5631 - acc: 0.6855\u001b[0m\n",
- "\u001b[31m12928/25000 [==============>...............] - ETA: 1:28 - loss: 0.5616 - acc: 0.6863\u001b[0m\n",
- "\u001b[31m13056/25000 [==============>...............] - ETA: 1:27 - loss: 0.5592 - acc: 0.6882\u001b[0m\n",
- "\u001b[31m13184/25000 [==============>...............] - ETA: 1:26 - loss: 0.5569 - acc: 0.6899\u001b[0m\n",
- "\u001b[31m13312/25000 [==============>...............] - ETA: 1:25 - loss: 0.5541 - acc: 0.6919\u001b[0m\n",
- "\u001b[31m13440/25000 [===============>..............] - ETA: 1:24 - loss: 0.5518 - acc: 0.6936\u001b[0m\n",
- "\u001b[31m13568/25000 [===============>..............] - ETA: 1:23 - loss: 0.5508 - acc: 0.6944\u001b[0m\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\u001b[31m13696/25000 [===============>..............] - ETA: 1:22 - loss: 0.5482 - acc: 0.6962\u001b[0m\n",
- "\u001b[31m13824/25000 [===============>..............] - ETA: 1:21 - loss: 0.5464 - acc: 0.6970\u001b[0m\n",
- "\u001b[31m13952/25000 [===============>..............] - ETA: 1:20 - loss: 0.5452 - acc: 0.6978\u001b[0m\n",
- "\u001b[31m14080/25000 [===============>..............] - ETA: 1:20 - loss: 0.5433 - acc: 0.6993\u001b[0m\n",
- "\u001b[31m14208/25000 [================>.............] - ETA: 1:18 - loss: 0.5414 - acc: 0.7010\u001b[0m\n",
- "\u001b[31m14336/25000 [================>.............] - ETA: 1:17 - loss: 0.5393 - acc: 0.7026\u001b[0m\n",
- "\u001b[31m14464/25000 [================>.............] - ETA: 1:16 - loss: 0.5380 - acc: 0.7038\u001b[0m\n",
- "\u001b[31m14592/25000 [================>.............] - ETA: 1:15 - loss: 0.5368 - acc: 0.7047\u001b[0m\n",
- "\u001b[31m14720/25000 [================>.............] - ETA: 1:14 - loss: 0.5347 - acc: 0.7064\u001b[0m\n",
- "\u001b[31m14848/25000 [================>.............] - ETA: 1:13 - loss: 0.5331 - acc: 0.7076\u001b[0m\n",
- "\u001b[31m14976/25000 [================>.............] - ETA: 1:12 - loss: 0.5311 - acc: 0.7091\u001b[0m\n",
- "\u001b[31m15104/25000 [=================>............] - ETA: 1:11 - loss: 0.5295 - acc: 0.7103\u001b[0m\n",
- "\u001b[31m15232/25000 [=================>............] - ETA: 1:11 - loss: 0.5276 - acc: 0.7119\u001b[0m\n",
- "\u001b[31m15360/25000 [=================>............] - ETA: 1:10 - loss: 0.5262 - acc: 0.7130\u001b[0m\n",
- "\u001b[31m15488/25000 [=================>............] - ETA: 1:09 - loss: 0.5244 - acc: 0.7143\u001b[0m\n",
- "\u001b[31m15616/25000 [=================>............] - ETA: 1:08 - loss: 0.5229 - acc: 0.7154\u001b[0m\n",
- "\u001b[31m15744/25000 [=================>............] - ETA: 1:07 - loss: 0.5214 - acc: 0.7167\u001b[0m\n",
- "\u001b[31m15872/25000 [==================>...........] - ETA: 1:06 - loss: 0.5203 - acc: 0.7176\u001b[0m\n",
- "\u001b[31m16000/25000 [==================>...........] - ETA: 1:05 - loss: 0.5199 - acc: 0.7180\u001b[0m\n",
- "\u001b[31m16128/25000 [==================>...........] - ETA: 1:04 - loss: 0.5182 - acc: 0.7193\u001b[0m\n",
- "\u001b[31m16256/25000 [==================>...........] - ETA: 1:03 - loss: 0.5165 - acc: 0.7204\u001b[0m\n",
- "\u001b[31m16384/25000 [==================>...........] - ETA: 1:02 - loss: 0.5148 - acc: 0.7215\u001b[0m\n",
- "\u001b[31m16512/25000 [==================>...........] - ETA: 1:01 - loss: 0.5137 - acc: 0.7224\u001b[0m\n",
- "\u001b[31m16640/25000 [==================>...........] - ETA: 1:00 - loss: 0.5131 - acc: 0.7231\u001b[0m\n",
- "\u001b[31m16768/25000 [===================>..........] - ETA: 59s - loss: 0.5115 - acc: 0.7245 \u001b[0m\n",
- "\u001b[31m16896/25000 [===================>..........] - ETA: 58s - loss: 0.5100 - acc: 0.7253\u001b[0m\n",
- "\u001b[31m17024/25000 [===================>..........] - ETA: 57s - loss: 0.5084 - acc: 0.7263\u001b[0m\n",
- "\u001b[31m17152/25000 [===================>..........] - ETA: 56s - loss: 0.5066 - acc: 0.7275\u001b[0m\n",
- "\u001b[31m17280/25000 [===================>..........] - ETA: 55s - loss: 0.5051 - acc: 0.7284\u001b[0m\n",
- "\u001b[31m17408/25000 [===================>..........] - ETA: 54s - loss: 0.5039 - acc: 0.7293\u001b[0m\n",
- "\u001b[31m17536/25000 [====================>.........] - ETA: 54s - loss: 0.5020 - acc: 0.7304\u001b[0m\n",
- "\u001b[31m17664/25000 [====================>.........] - ETA: 53s - loss: 0.5007 - acc: 0.7312\u001b[0m\n",
- "\u001b[31m17792/25000 [====================>.........] - ETA: 52s - loss: 0.4996 - acc: 0.7321\u001b[0m\n",
- "\u001b[31m17920/25000 [====================>.........] - ETA: 51s - loss: 0.4980 - acc: 0.7333\u001b[0m\n",
- "\u001b[31m18048/25000 [====================>.........] - ETA: 50s - loss: 0.4969 - acc: 0.7340\u001b[0m\n",
- "\u001b[31m18176/25000 [====================>.........] - ETA: 49s - loss: 0.4952 - acc: 0.7351\u001b[0m\n",
- "\u001b[31m18304/25000 [====================>.........] - ETA: 48s - loss: 0.4939 - acc: 0.7360\u001b[0m\n",
- "\u001b[31m18432/25000 [=====================>........] - ETA: 47s - loss: 0.4935 - acc: 0.7368\u001b[0m\n",
- "\u001b[31m18560/25000 [=====================>........] - ETA: 46s - loss: 0.4916 - acc: 0.7379\u001b[0m\n",
- "\u001b[31m18688/25000 [=====================>........] - ETA: 45s - loss: 0.4902 - acc: 0.7388\u001b[0m\n",
- "\u001b[31m18816/25000 [=====================>........] - ETA: 44s - loss: 0.4889 - acc: 0.7396\u001b[0m\n",
- "\u001b[31m18944/25000 [=====================>........] - ETA: 43s - loss: 0.4881 - acc: 0.7401\u001b[0m\n",
- "\u001b[31m19072/25000 [=====================>........] - ETA: 42s - loss: 0.4864 - acc: 0.7412\u001b[0m\n",
- "\u001b[31m19200/25000 [======================>.......] - ETA: 41s - loss: 0.4855 - acc: 0.7419\u001b[0m\n",
- "\u001b[31m19328/25000 [======================>.......] - ETA: 40s - loss: 0.4839 - acc: 0.7429\u001b[0m\n",
- "\u001b[31m19456/25000 [======================>.......] - ETA: 39s - loss: 0.4825 - acc: 0.7438\u001b[0m\n",
- "\u001b[31m19584/25000 [======================>.......] - ETA: 38s - loss: 0.4815 - acc: 0.7446\u001b[0m\n",
- "\u001b[31m19712/25000 [======================>.......] - ETA: 37s - loss: 0.4803 - acc: 0.7455\u001b[0m\n",
- "\u001b[31m19840/25000 [======================>.......] - ETA: 36s - loss: 0.4792 - acc: 0.7463\u001b[0m\n",
- "\u001b[31m19968/25000 [======================>.......] - ETA: 36s - loss: 0.4777 - acc: 0.7475\u001b[0m\n",
- "\u001b[31m20096/25000 [=======================>......] - ETA: 35s - loss: 0.4773 - acc: 0.7479\u001b[0m\n",
- "\u001b[31m20224/25000 [=======================>......] - ETA: 34s - loss: 0.4757 - acc: 0.7490\u001b[0m\n",
- "\u001b[31m20352/25000 [=======================>......] - ETA: 33s - loss: 0.4742 - acc: 0.7500\u001b[0m\n",
- "\u001b[31m20480/25000 [=======================>......] - ETA: 32s - loss: 0.4730 - acc: 0.7506\u001b[0m\n",
- "\u001b[31m20608/25000 [=======================>......] - ETA: 31s - loss: 0.4729 - acc: 0.7511\u001b[0m\n",
- "\u001b[31m20736/25000 [=======================>......] - ETA: 30s - loss: 0.4719 - acc: 0.7517\u001b[0m\n",
- "\u001b[31m20864/25000 [========================>.....] - ETA: 29s - loss: 0.4710 - acc: 0.7523\u001b[0m\n",
- "\u001b[31m20992/25000 [========================>.....] - ETA: 29s - loss: 0.4697 - acc: 0.7534\u001b[0m\n",
- "\u001b[31m21120/25000 [========================>.....] - ETA: 28s - loss: 0.4686 - acc: 0.7541\u001b[0m\n",
- "\u001b[31m21248/25000 [========================>.....] - ETA: 27s - loss: 0.4673 - acc: 0.7548\u001b[0m\n",
- "\u001b[31m21376/25000 [========================>.....] - ETA: 26s - loss: 0.4664 - acc: 0.7554\u001b[0m\n",
- "\u001b[31m21504/25000 [========================>.....] - ETA: 25s - loss: 0.4654 - acc: 0.7560\u001b[0m\n",
- "\u001b[31m21632/25000 [========================>.....] - ETA: 24s - loss: 0.4644 - acc: 0.7571\u001b[0m\n",
- "\u001b[31m21760/25000 [=========================>....] - ETA: 23s - loss: 0.4632 - acc: 0.7579\u001b[0m\n",
- "\u001b[31m21888/25000 [=========================>....] - ETA: 22s - loss: 0.4619 - acc: 0.7589\u001b[0m\n",
- "\u001b[31m22016/25000 [=========================>....] - ETA: 21s - loss: 0.4608 - acc: 0.7597\u001b[0m\n",
- "\u001b[31m22144/25000 [=========================>....] - ETA: 21s - loss: 0.4598 - acc: 0.7603\u001b[0m\n",
- "\u001b[31m22272/25000 [=========================>....] - ETA: 20s - loss: 0.4589 - acc: 0.7608\u001b[0m\n",
- "\u001b[31m22400/25000 [=========================>....] - ETA: 19s - loss: 0.4581 - acc: 0.7615\u001b[0m\n",
- "\u001b[31m22528/25000 [==========================>...] - ETA: 18s - loss: 0.4567 - acc: 0.7625\u001b[0m\n",
- "\u001b[31m22656/25000 [==========================>...] - ETA: 17s - loss: 0.4562 - acc: 0.7629\u001b[0m\n",
- "\u001b[31m22784/25000 [==========================>...] - ETA: 16s - loss: 0.4553 - acc: 0.7636\u001b[0m\n",
- "\u001b[31m22912/25000 [==========================>...] - ETA: 15s - loss: 0.4540 - acc: 0.7645\u001b[0m\n",
- "\u001b[31m23040/25000 [==========================>...] - ETA: 14s - loss: 0.4530 - acc: 0.7653\u001b[0m\n",
- "\u001b[31m23168/25000 [==========================>...] - ETA: 13s - loss: 0.4524 - acc: 0.7659\u001b[0m\n",
- "\u001b[31m23296/25000 [==========================>...] - ETA: 12s - loss: 0.4512 - acc: 0.7667\u001b[0m\n",
- "\u001b[31m23424/25000 [===========================>..] - ETA: 11s - loss: 0.4500 - acc: 0.7672\u001b[0m\n",
- "\u001b[31m23552/25000 [===========================>..] - ETA: 10s - loss: 0.4489 - acc: 0.7678\u001b[0m\n",
- "\u001b[31m23680/25000 [===========================>..] - ETA: 9s - loss: 0.4477 - acc: 0.7686 \u001b[0m\n",
- "\u001b[31m23808/25000 [===========================>..] - ETA: 8s - loss: 0.4464 - acc: 0.7695\u001b[0m\n",
- "\u001b[31m23936/25000 [===========================>..] - ETA: 7s - loss: 0.4454 - acc: 0.7703\u001b[0m\n",
- "\u001b[31m24064/25000 [===========================>..] - ETA: 6s - loss: 0.4448 - acc: 0.7707\u001b[0m\n",
- "\u001b[31m24192/25000 [============================>.] - ETA: 5s - loss: 0.4441 - acc: 0.7710\u001b[0m\n",
- "\u001b[31m24320/25000 [============================>.] - ETA: 5s - loss: 0.4437 - acc: 0.7714\u001b[0m\n",
- "\u001b[31m24448/25000 [============================>.] - ETA: 4s - loss: 0.4430 - acc: 0.7719\u001b[0m\n",
- "\u001b[31m24576/25000 [============================>.] - ETA: 3s - loss: 0.4420 - acc: 0.7726\u001b[0m\n",
- "\u001b[31m24704/25000 [============================>.] - ETA: 2s - loss: 0.4411 - acc: 0.7733\u001b[0m\n",
- "\u001b[31m24832/25000 [============================>.] - ETA: 1s - loss: 0.4403 - acc: 0.7739\u001b[0m\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\u001b[31m24960/25000 [============================>.] - ETA: 0s - loss: 0.4393 - acc: 0.7745\u001b[0m\n",
- "\u001b[31m25000/25000 [==============================] - 233s 9ms/step - loss: 0.4387 - acc: 0.7748 - val_loss: 0.2636 - val_acc: 0.8898\u001b[0m\n",
- "\u001b[31m2019-07-16 00:44:17,528 sagemaker_tensorflow_container.training WARNING Your model will NOT be servable with SageMaker TensorFlow Serving container.The model artifact was not saved in the TensorFlow SavedModel directory structure:\u001b[0m\n",
- "\u001b[31mhttps://www.tensorflow.org/guide/saved_model#structure_of_a_savedmodel_directory\u001b[0m\n",
- "\u001b[31m2019-07-16 00:44:17,528 sagemaker-containers INFO Reporting training SUCCESS\u001b[0m\n",
- "\n",
- "2019-07-16 00:45:32 Uploading - Uploading generated training model\n",
- "2019-07-16 00:45:32 Completed - Training job completed\n",
- "Billable seconds: 437\n"
- ]
- }
- ],
- "source": [
- "estimator.fit(inputs)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Batch Prediction\n",
- "\n",
- "\n",
- "If our use case requires individual predictions in near real-time, SageMaker hosted endpoints can be created. Hosted endpoints also can be used for pseudo-batch prediction, but the process is more involved than simply using SageMaker's Batch Transform feature, which is designed for large-scale, asynchronous batch inference.\n",
- "\n",
- "To use Batch Transform, first we must upload to Amazon S3 some test data in CSV format to be transformed."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "csvtestdata_s3_prefix = '{}/data/csv-test'.format(s3_prefix)\n",
- "csvtest_s3 = sagemaker.Session().upload_data(path='./data/csv-test/', key_prefix=csvtestdata_s3_prefix)\n",
- "print(csvtest_s3)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "A Transformer object must be set up to describe the Batch Transform job, including the amount and type of inference hardware to be used. Then the actual transform job itself is started with a call to the `transform` method of the Transformer."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "transformer = estimator.transformer(instance_count=1, instance_type='ml.m5.xlarge')\n",
- "transformer.transform(csvtest_s3, content_type='text/csv')\n",
- "print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)\n",
- "transformer.wait()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can now download the batch predictions from S3 to the local filesystem on the notebook instance; the predictions are contained in a file with a .out extension, and are embedded in JSON. Next we'll load the JSON and examine the predictions, which are confidence scores from 0.0 to 1.0 where numbers close to 1.0 indicate positive sentiment, while numbers close to 0.0 indicate negative sentiment."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import json\n",
- "\n",
- "batch_output = transformer.output_path\n",
- "!mkdir -p batch_data/output\n",
- "!aws s3 cp --recursive $batch_output/ batch_data/output/\n",
- "\n",
- "with open('batch_data/output/csv-test.csv.out', 'r') as f:\n",
- " jstr = json.load(f)\n",
- " results = [float('%.3f'%(item)) for sublist in jstr['predictions'] for item in sublist]\n",
- " print(results)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now let's look at the text of some actual reviews to see the predictions in action. First, we have to convert the integers representing the words back to the words themselves by using a reversed dictionary. Next we can decode the reviews, taking into account that the first 3 indices were reserved for \"padding\", \"start of sequence\", and \"unknown\", and removing a string of unknown tokens from the start of the review."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import re\n",
- "\n",
- "regex = re.compile(r'^[\\?\\s]+')\n",
- "\n",
- "word_index = imdb.get_word_index()\n",
- "reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])\n",
- "first_decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in x_test[3]])\n",
- "regex.sub('', first_decoded_review)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Overall, this review looks fairly negative. Let's compare the actual label with the prediction:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "def get_sentiment(score):\n",
- " return 'positive' if score > 0.5 else 'negative' \n",
- "\n",
- "print('Labeled sentiment for this review is {}, predicted sentiment is {}'.format(get_sentiment(y_test[3]), \n",
- " get_sentiment(results[3])))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Our negative sentiment prediction agrees with the label for this review. Let's now examine another review:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "second_decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in x_test[10]])\n",
- "regex.sub('', second_decoded_review)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "print('Labeled sentiment for this review is {}, predicted sentiment is {}'.format(get_sentiment(y_test[10]), \n",
- " get_sentiment(results[10])))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Again, the prediction agreed with the label for the test data. Note that there is no need to clean up any Batch Transform resources: after the transform job is complete, the cluster used to make inferences is torn down.\n",
- "\n",
- "Now that we've reviewed some sample predictions as a sanity check, we're finished. Of course, in a typical production situation, the data science project lifecycle is iterative, with repeated cycles of refining the model using a tool such as Amazon SageMaker's Automatic Model Tuning feature, and gathering more data. "
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.4"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/examples/tensorflow/notebooks/keras-sentiment/sentiment_keras.py b/examples/tensorflow/notebooks/keras-sentiment/sentiment_keras.py
deleted file mode 100644
index 61d9ce0a6..000000000
--- a/examples/tensorflow/notebooks/keras-sentiment/sentiment_keras.py
+++ /dev/null
@@ -1,106 +0,0 @@
-# Standard Library
-import argparse
-import os
-
-# Third Party
-import keras
-import numpy as np
-
-# First Party
-from smdebug import SaveConfig
-from smdebug.tensorflow.keras import SessionHook
-
-max_features = 20000
-maxlen = 400
-embedding_dims = 300
-filters = 250
-kernel_size = 3
-hidden_dims = 250
-
-
-def parse_args():
-
- parser = argparse.ArgumentParser()
-
- # hyperparameters sent by the client are passed as command-line arguments to the script
- parser.add_argument("--epochs", type=int, default=5)
- parser.add_argument("--batch_size", type=int, default=64)
-
- parser.add_argument("--tornasole_outdir", type=str, required=True)
- parser.add_argument("--tornasole_save_interval", type=int, default=10)
-
- # data directories
- parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
- parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
-
- # model directory: we will use the default set by SageMaker, /opt/ml/model
- parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
-
- return parser.parse_known_args()
-
-
-def get_train_data(train_dir):
-
- x_train = np.load(os.path.join(train_dir, "x_train.npy"))
- y_train = np.load(os.path.join(train_dir, "y_train.npy"))
- print("x train", x_train.shape, "y train", y_train.shape)
- print(x_train[:10])
- print(y_train[:10])
-
- return x_train, y_train
-
-
-def get_test_data(test_dir):
-
- x_test = np.load(os.path.join(test_dir, "x_test.npy"))
- y_test = np.load(os.path.join(test_dir, "y_test.npy"))
- print("x test", x_test.shape, "y test", y_test.shape)
-
- return x_test, y_test
-
-
-def get_model():
-
- embedding_layer = keras.layers.Embedding(max_features, embedding_dims, input_length=maxlen)
-
- sequence_input = keras.Input(shape=(maxlen,), dtype="int32")
- embedded_sequences = embedding_layer(sequence_input)
- x = keras.layers.Dropout(0.2)(embedded_sequences)
- x = keras.layers.Conv1D(filters, kernel_size, padding="valid", activation="relu", strides=1)(x)
- x = keras.layers.MaxPooling1D()(x)
- x = keras.layers.GlobalMaxPooling1D()(x)
- x = keras.layers.Dense(hidden_dims, activation="relu")(x)
- x = keras.layers.Dropout(0.2)(x)
- preds = keras.layers.Dense(1, activation="sigmoid")(x)
-
- return keras.Model(sequence_input, preds)
-
-
-if __name__ == "__main__":
-
- args, _ = parse_args()
-
- hook = SessionHook(
- out_dir=args.tornasole_outdir,
- save_config=SaveConfig(save_interval=args.tornasole_save_interval),
- )
-
- x_train, y_train = get_train_data(args.train)
- x_test, y_test = get_test_data(args.test)
-
- model = get_model()
-
- model.compile(
- loss="binary_crossentropy", optimizer="adam", metrics=["accuracy", "mean_squared_error"]
- )
-
- model.fit(
- x_train,
- y_train,
- batch_size=args.batch_size,
- epochs=args.epochs,
- validation_data=(x_test, y_test),
- callbacks=[hook],
- )
-
- model.save(os.path.join(args.model_dir, "sentiment_keras.h5"))
diff --git a/examples/tensorflow/sagemaker-notebooks/tensorflow.ipynb b/examples/tensorflow/sagemaker-notebooks/tensorflow.ipynb
deleted file mode 100644
index bda7f67d5..000000000
--- a/examples/tensorflow/sagemaker-notebooks/tensorflow.ipynb
+++ /dev/null
@@ -1,881 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Debugging SageMaker TensorFlow Training Jobs with Tornasole"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Overview\n",
- "Tornasole is a new capability of Amazon SageMaker that allows debugging machine learning training. \n",
- "It lets you go beyond just looking at scalars like losses and accuracies during training and gives \n",
- "you full visibility into all tensors 'flowing through the graph' during training. Tornasole helps you monitor your training in near real time using rules and alerts you once it detects an inconsistency in the training flow. \n",
- "\n",
- "Using Tornasole is a two-step process: saving tensors and analysis. Let's look at each of them closely.\n",
- "\n",
- "### Saving tensors\n",
- "Tensors define the state of the training job at any particular instant in its lifecycle. Tornasole exposes a library which allows you to capture these tensors and save them for analysis. Tornasole is highly customizable and can save the tensors you want at different frequencies. Refer to [DeveloperGuide_TensorFlow](../../DeveloperGuide_TF.md) for details on how to save the tensors you want.\n",
- "\n",
- "### Analysis\n",
- "\n",
- "Analysis of the tensors emitted is captured by the Tornasole concept called ***Rules***. On a very broad level, \n",
- "a rule is Python code used to detect certain conditions during training. Some of the conditions that a data scientist training a deep learning model may care about are monitoring for gradients getting too large or too small, detecting overfitting, and so on.\n",
- "Tornasole will come pre-packaged with certain rules. Users can write their own rules using the Tornasole APIs.\n",
- "You can also analyze raw tensor data outside of the Rules construct in, say, a SageMaker notebook, using Tornasole's full set of APIs. \n",
- "Please refer to [DeveloperGuide_Rules](../../../../rules/DeveloperGuide_Rules.md) for more details about analysis.\n",
- "\n",
- "This example guides you through installation of the required components for emitting tensors in a \n",
- "SageMaker training job and applying a rule over the tensors to monitor the live status of the job. \n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Setup\n",
- "\n",
- "As a first step, we'll install the required tools, which allow emitting (saving) tensors and applying rules to analyze them. This is only for the purposes of this private beta. Once we do this, we will be ready to use smdebug."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "download: s3://tornasole-external-preview-use1/sdk/sagemaker-1.35.2.dev0.tar.gz to ../../../../../tornasole-preview-sdk/sagemaker-1.35.2.dev0.tar.gz\n",
- "download: s3://tornasole-external-preview-use1/sdk/sagemaker-tornasole-latest.tar.gz to ../../../../../tornasole-preview-sdk/sagemaker-tornasole-latest.tar.gz\n",
- "Installing requirements...\n",
- "\u001b[33mYou are using pip version 10.0.1, however version 19.2.3 is available.\n",
- "You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n",
- "Installation completed!\n"
- ]
- }
- ],
- "source": [
- "! aws s3 sync s3://tornasole-external-preview-use1/sdk/ ~/SageMaker/tornasole-preview-sdk/\n",
- "! chmod +x ~/SageMaker/tornasole-preview-sdk/installer.sh && ~/SageMaker/tornasole-preview-sdk/installer.sh"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Training TensorFlow models in SageMaker with Tornasole\n",
- "\n",
- "We'll train a few TensorFlow models in this notebook with Tornasole enabled and monitor the training jobs with Tornasole Rules. This will be done using SageMaker TensorFlow 1.13.1 Container in Script Mode. \n",
- "\n",
- "Let us first train a simple example training script [simple.py](../scripts/simple.py) with Tornasole enabled in SageMaker using the SageMaker Estimator API, along with an ExplodingTensor Rule to monitor the training job in real time. A Tornasole Rule is essentially Python code which analyzes tensors saved by Tornasole and validates some condition. The ExplodingTensor rule is a first party (1P) rule provided by smdebug. During training, Tornasole will capture tensors as specified in its configuration, and the ExplodingTensor Rule job will monitor whether any tensor exploded (i.e. a value became not a number). The rule will emit a CloudWatch event if it finds an exploding tensor during training.\n",
- "\n",
- "### Enabling Tornasole in the script\n",
- "You can see in the script that we have made a couple of simple changes to enable smdebug. We created a SessionHook which we pass when creating a monitored session, and use this monitored session for training. We passed save_all=True, telling the hook to save all tensors in the graph. Note that Tornasole is highly configurable: you can choose exactly what to save. The changes are described in a bit more detail below after we train this example, and in even more detail in our [Developer Guide for TensorFlow](../../DeveloperGuide_TF.md). \n",
- "\n",
- "We also wrapped the optimizer with Tornasole's optimizer and used this to minimize the loss. This helps Tornasole identify the gradients during training, and changes nothing in the loss minimization process.\n",
- "\n",
- "```python\n",
- "import smdebug.tensorflow as smd\n",
- "# Ask TORNASOLE to save all tensors. Note: SessionHook is highly configurable\n",
- "hook = smd.SessionHook(save_all=True) \n",
- "# Wrap the optimizer with Tornasole optimizer to identify gradients\n",
- "optimizer = hook.wrap_optimizer(optimizer) \n",
- "# pass the hook to hooks parameter of monitored session\n",
- "sess = tf.train.MonitoredSession(hooks=[hook])\n",
- "```\n",
- "\n",
- "### Docker Images with Tornasole\n",
- "We have built SageMaker TensorFlow containers with smdebug. You can use them from ECR from SageMaker. Here are the links to the images. Please use the image from the appropriate region in which you want your jobs to run."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "import sagemaker\n",
- "import boto3\n",
- "from sagemaker.tensorflow import TensorFlow\n",
- "\n",
- "# Use the region in which this notebook is running\n",
- "REGION = boto3.Session().region_name\n",
- "\n",
- "TAG='latest'\n",
- "\n",
- "gpu_docker_image_name = '072677473360.dkr.ecr.{}.amazonaws.com/tornasole-preprod-tf-1.13.1-gpu:{}'.format(REGION, TAG)\n",
- "cpu_docker_image_name = '072677473360.dkr.ecr.{}.amazonaws.com/tornasole-preprod-tf-1.13.1-cpu:{}'.format(REGION, TAG)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Start training a simple Session based example\n",
- "For the purposes of this demonstration, let us use bad hyperparameters so that a NaN is produced during training for the rule to catch."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {
- "pycharm": {
- "name": "#%%\n"
- }
- },
- "outputs": [],
- "source": [
- "simple_entry_point_script = '../scripts/simple.py'\n",
- "simple_hyperparameters = { 'steps': 1000000, 'save_frequency': 50 }\n",
- "\n",
- "# copy dict\n",
- "bad_simple_hyperparameters = dict(simple_hyperparameters)\n",
- "\n",
- "# These parameters are consumed by simple.py to produce an exploding tensor problem\n",
- "bad_simple_hyperparameters.update({ 'lr': 100, 'scale': 100000000000})"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [],
- "source": [
- "sagemaker_simple_estimator = TensorFlow(\n",
- " role=sagemaker.get_execution_role(),\n",
- " base_job_name='tornasole-simple-demo',\n",
- " train_instance_count=1,\n",
- " train_instance_type='ml.m4.xlarge',\n",
- " image_name=cpu_docker_image_name,\n",
- " entry_point=simple_entry_point_script,\n",
- " framework_version='1.13.1',\n",
- " py_version='py3',\n",
- " script_mode=True,\n",
- " hyperparameters=bad_simple_hyperparameters,\n",
- " train_max_run=1800,\n",
- " \n",
- " # These are Tornasole-specific parameters. \n",
- " # debug=True means the rules specified in rules_specification \n",
- " # will run as rule jobs. \n",
- " # Below, we specify to run the first party rule ExplodingTensor\n",
- " # on an ml.c5.4xlarge instance\n",
- " debug=True,\n",
- " rules_specification=[\n",
- " {\n",
- " \"RuleName\": \"ExplodingTensor\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " }\n",
- " ])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [],
- "source": [
- "sagemaker_simple_estimator.fit(wait=False)\n",
- "# This is a fire and forget event. By setting wait=False, we just submit the job to run in the background.\n",
- "# In the background SageMaker will spin off 1 training job and 1 rule job for you.\n",
- "# Please follow this notebook to see status of the training job and the rule job"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Result\n",
- "As a result of the above command, SageMaker will spin off 1 training job and 1 rule job for you. The first is the job which produces the tensors to be analyzed, and the second analyzes those tensors to check whether there are any exploding tensors during training.\n",
- "\n",
- "### Describing the training job\n",
- "We can check the status of the training job by running the following command:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Below command will give the status of training job\n",
- "# Note: In the output of below command you will see DebugConfig parameter \n",
- "\n",
- "job_name = sagemaker_simple_estimator.latest_training_job.name\n",
- "\n",
- "client = sagemaker_simple_estimator.sagemaker_session.sagemaker_client\n",
- "\n",
- "description = client.describe_training_job(TrainingJobName=job_name)\n",
- "\n",
- "# uncomment next line to see full details of training job \n",
- "# description"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "'InProgress'"
- ]
- },
- "execution_count": 7,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# The status of the training job can be seen below\n",
- "description['TrainingJobStatus']"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Once your training job has started, SageMaker will spin up a rule execution job to run the ExplodingTensor rule.\n",
- "\n",
- "### Tornasole specific parameters in the description\n",
- "The **DebugConfig** parameter has details about the Tornasole-related configuration. The key parameters to look for below are:\n",
- "\n",
- "*S3OutputPath* : This is the path where output tensors from Tornasole are saved. \n",
- "*RuleConfig* : This parameter holds the rule configuration that was passed when creating the training job. In it you should be able to see details of the rule that ran for training. \n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "{'DebugHookConfig': {'LocalPath': '/opt/ml/output/tensors',\n",
- " 'S3OutputPath': 's3://sagemaker-ca-central-1-072677473360/tensors-tornasole-simple-demo-2019-08-30-04-11-54-514',\n",
- " 'DebugHookSpecificationList': []},\n",
- " 'RuleConfig': {'RuleSpecificationList': [{'RuleName': 'ExplodingTensor',\n",
- " 'RuleEvaluatorImage': '453379255795.dkr.ecr.ca-central-1.amazonaws.com/script-rule-executor:latest',\n",
- " 'InstanceType': 'ml.c5.4xlarge',\n",
- " 'VolumeSizeInGB': 100}]}}"
- ]
- },
- "execution_count": 8,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "description['DebugConfig']"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Check the status of the Rule Execution Job\n",
- "To get the rule execution job that SageMaker started for you, run the command below. It shows you the `RuleName`, `RuleStatus`, `FailureReason` (if any), and `RuleExecutionJobArn`. If the tensors meet a rule evaluation condition, the rule execution job throws a client error with `FailureReason: RuleEvaluationConditionMet`. These details are also available as part of the response `description` above under `description['RuleMonitoringStatuses']`.\n",
- "\n",
- "\n",
- "The logs of the training job are available in the CloudWatch Logstream `/aws/sagemaker/TrainingJobs` with `RuleExecutionJobArn`. \n",
- "\n",
- "You will see that once the rule execution job starts, it identifies the exploding tensor situation in the training job, raises the `RuleEvaluationConditionMet` exception, and ends the job. \n",
- "\n",
- "**Note that the next cell blocks till the rule execution job ends. Once it says RuleStatus is Started, and shows the `RuleExecutionJobArn`, you can look at the status of the rule being monitored. At that point, we can also look at the logs as shown in the next cell**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Wait to get status for Rule Execution Jobs...\n",
- "=============================================\n",
- "RuleName: ExplodingTensor\n",
- "RuleStatus: NotStarted\n",
- "=============================================\n",
- "Wait to get status for Rule Execution Jobs...\n",
- "=============================================\n",
- "RuleName: ExplodingTensor\n",
- "RuleStatus: NotStarted\n",
- "=============================================\n",
- "Wait to get status for Rule Execution Jobs...\n",
- "=============================================\n",
- "RuleName: ExplodingTensor\n",
- "RuleStatus: NotStarted\n",
- "=============================================\n",
- "Wait to get status for Rule Execution Jobs...\n",
- "=============================================\n",
- "RuleName: ExplodingTensor\n",
- "RuleStatus: NotStarted\n",
- "=============================================\n",
- "Wait to get status for Rule Execution Jobs...\n",
- "=============================================\n",
- "RuleName: ExplodingTensor\n",
- "RuleStatus: InProgress\n",
- "RuleExecutionJobName: ExplodingTensor-8a2f1f7334ff9ef67c8de51cf7b2a7dd\n",
- "RuleExecutionJobArn: arn:aws:sagemaker:ca-central-1:072677473360:training-job/explodingtensor-8a2f1f7334ff9ef67c8de51cf7b2a7dd\n",
- "=============================================\n",
- "Wait to get status for Rule Execution Jobs...\n",
- "=============================================\n",
- "RuleName: ExplodingTensor\n",
- "RuleStatus: InProgress\n",
- "RuleExecutionJobName: ExplodingTensor-8a2f1f7334ff9ef67c8de51cf7b2a7dd\n",
- "RuleExecutionJobArn: arn:aws:sagemaker:ca-central-1:072677473360:training-job/explodingtensor-8a2f1f7334ff9ef67c8de51cf7b2a7dd\n",
- "=============================================\n",
- "Wait to get status for Rule Execution Jobs...\n",
- "=============================================\n",
- "RuleName: ExplodingTensor\n",
- "RuleStatus: RuleExecutionError\n",
- "FailureReason: ClientError: RuleEvaluationConditionMet: Evaluation of the rule ExplodingTensor at step 0 resulted in the condition being met\n",
- "Traceback (most recent call last):\n",
- " File \"train.py\", line 214, in execute\n",
- " exec(_SYMBOLIC_INVOKE_RULE.format(self.start_step, self.end_step), globals(), exec_local)\n",
- " File \"\", line 2, in \n",
- " File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule_invoker.py\", line 84, in invoke_rule\n",
- " raise e\n",
- " File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule_invoker.py\", line 79, in invoke_rule\n",
- " rule_obj.invoke(step)\n",
- " File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule.py\", line 56, in invoke\n",
- " raise RuleEvaluationConditionMet(self.rule_name, step)\n",
- "smdebug.exceptions.RuleEvaluationConditionMet: Evaluation of the rule ExplodingTensor at step 0 resulted in the condition being met\n",
- "\n",
- "\n",
- "RuleExecutionJobName: ExplodingTensor-8a2f1f7334ff9ef67c8de51cf7b2a7dd\n",
- "RuleExecutionJobArn: arn:aws:sagemaker:ca-central-1:072677473360:training-job/explodingtensor-8a2f1f7334ff9ef67c8de51cf7b2a7dd\n",
- "=============================================\n"
- ]
- }
- ],
- "source": [
- "statuses = sagemaker_simple_estimator.describe_rule_execution_jobs()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Check logs of the rule execution jobs\n",
- "\n",
- "If you want to access the logs of a particular rule job name, you can do the following:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [],
- "source": [
- "rule_job_name = statuses[0].get('RuleExecutionJobName')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now we can attach to this job to see its logs"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "2019-08-30 04:17:01 Starting - Preparing the instances for training\n",
- "2019-08-30 04:17:01 Downloading - Downloading input data\n",
- "2019-08-30 04:17:01 Training - Training image download completed. Training in progress.\n",
- "2019-08-30 04:17:01 Uploading - Uploading generated training model\n",
- "2019-08-30 04:17:01 Failed - Training job failed\u001b[31m[2019-08-30 04:16:45.369 ip-10-0-100-119.ca-central-1.compute.internal:1 INFO s3_trial.py:27] Loading trial base-trial at path s3://sagemaker-ca-central-1-072677473360/tensors-tornasole-simple-demo-2019-08-30-04-11-54-514\u001b[0m\n",
- "\u001b[31m[2019-08-30 04:16:56.457 ip-10-0-100-119.ca-central-1.compute.internal:1 INFO exploding_tensor.py:24] ExplodingTensor rule created. Monitoring tensors for nans or infinities.\u001b[0m\n",
- "\u001b[31m[2019-08-30 04:16:56.457 ip-10-0-100-119.ca-central-1.compute.internal:1 INFO rule_invoker.py:76] Started execution of rule ExplodingTensor at step 0\u001b[0m\n",
- "\u001b[31m[2019-08-30 04:16:57.653 ip-10-0-100-119.ca-central-1.compute.internal:1 INFO exploding_tensor.py:45] Step 0 had 1 tensors with non finite values\u001b[0m\n",
- "\u001b[31mException during rule execution: Customer Error: RuleEvaluationConditionMet: Evaluation of the rule ExplodingTensor at step 0 resulted in the condition being met\u001b[0m\n",
- "\u001b[31mTraceback (most recent call last):\n",
- " File \"train.py\", line 214, in execute\n",
- " exec(_SYMBOLIC_INVOKE_RULE.format(self.start_step, self.end_step), globals(), exec_local)\n",
- " File \"\", line 2, in \n",
- " File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule_invoker.py\", line 84, in invoke_rule\n",
- " raise e\n",
- " File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule_invoker.py\", line 79, in invoke_rule\n",
- " rule_obj.invoke(step)\n",
- " File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule.py\", line 56, in invoke\n",
- " raise RuleEvaluationConditionMet(self.rule_name, step)\u001b[0m\n",
- "\u001b[31msmdebug.exceptions.RuleEvaluationConditionMet: Evaluation of the rule ExplodingTensor at step 0 resulted in the condition being met\n",
- "\n",
- "\u001b[0m\n"
- ]
- },
- {
- "ename": "UnexpectedStatusException",
- "evalue": "Error for Training job ExplodingTensor-8a2f1f7334ff9ef67c8de51cf7b2a7dd: Failed. Reason: ClientError: RuleEvaluationConditionMet: Evaluation of the rule ExplodingTensor at step 0 resulted in the condition being met\nTraceback (most recent call last):\n File \"train.py\", line 214, in execute\n exec(_SYMBOLIC_INVOKE_RULE.format(self.start_step, self.end_step), globals(), exec_local)\n File \"\", line 2, in \n File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule_invoker.py\", line 84, in invoke_rule\n raise e\n File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule_invoker.py\", line 79, in invoke_rule\n rule_obj.invoke(step)\n File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule.py\", line 56, in invoke\n raise RuleEvaluationConditionMet(self.rule_name, step)\nsmdebug.exceptions.RuleEvaluationConditionMet: Evaluation of the rule ExplodingTensor at step 0 resulted in the condition being met\n\n",
- "output_type": "error",
- "traceback": [
- "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
- "\u001b[0;31mUnexpectedStatusException\u001b[0m Traceback (most recent call last)",
- "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msagemaker\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mestimator\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mEstimator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mexploding_tensor\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mEstimator\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mattach\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrule_job_name\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
- "\u001b[0;32m~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/estimator.py\u001b[0m in \u001b[0;36mattach\u001b[0;34m(cls, training_job_name, sagemaker_session, model_channel_name)\u001b[0m\n\u001b[1;32m 460\u001b[0m )\n\u001b[1;32m 461\u001b[0m \u001b[0mestimator\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_current_job_name\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mestimator\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlatest_training_job\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 462\u001b[0;31m \u001b[0mestimator\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlatest_training_job\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwait\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 463\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mestimator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 464\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/estimator.py\u001b[0m in \u001b[0;36mwait\u001b[0;34m(self, logs)\u001b[0m\n\u001b[1;32m 1012\u001b[0m \"\"\"\n\u001b[1;32m 1013\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mlogs\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1014\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msagemaker_session\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlogs_for_job\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjob_name\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwait\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1015\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1016\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msagemaker_session\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwait_for_job\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjob_name\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py\u001b[0m in \u001b[0;36mlogs_for_job\u001b[0;34m(self, job_name, wait, poll)\u001b[0m\n\u001b[1;32m 1479\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1480\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mwait\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1481\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_check_job_status\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mjob_name\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdescription\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"TrainingJobStatus\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1482\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mdot\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1483\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py\u001b[0m in \u001b[0;36m_check_job_status\u001b[0;34m(self, job, desc, status_key_name)\u001b[0m\n\u001b[1;32m 1092\u001b[0m ),\n\u001b[1;32m 1093\u001b[0m \u001b[0mallowed_statuses\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"Completed\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"Stopped\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1094\u001b[0;31m \u001b[0mactual_status\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1095\u001b[0m )\n\u001b[1;32m 1096\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;31mUnexpectedStatusException\u001b[0m: Error for Training job ExplodingTensor-8a2f1f7334ff9ef67c8de51cf7b2a7dd: Failed. Reason: ClientError: RuleEvaluationConditionMet: Evaluation of the rule ExplodingTensor at step 0 resulted in the condition being met\nTraceback (most recent call last):\n File \"train.py\", line 214, in execute\n exec(_SYMBOLIC_INVOKE_RULE.format(self.start_step, self.end_step), globals(), exec_local)\n File \"\", line 2, in \n File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule_invoker.py\", line 84, in invoke_rule\n raise e\n File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule_invoker.py\", line 79, in invoke_rule\n rule_obj.invoke(step)\n File \"/usr/local/lib/python3.7/site-packages/smdebug/rules/rule.py\", line 56, in invoke\n raise RuleEvaluationConditionMet(self.rule_name, step)\nsmdebug.exceptions.RuleEvaluationConditionMet: Evaluation of the rule ExplodingTensor at step 0 resulted in the condition being met\n\n"
- ]
- }
- ],
- "source": [
- "from sagemaker.estimator import Estimator\n",
- "exploding_tensor = Estimator.attach(rule_job_name)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Receive a CloudWatch Event for Rules\n",
- "When the status of a training job or rule execution job changes (e.g. starting, failed), TrainingJobStatus [CloudWatch events](https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html) are emitted. For more details, see [below](#CloudWatch-Event-Integration-for-Rules). "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Making this a good run\n",
- "\n",
- "In the above example, we saw how an ExplodingTensor rule was run, which analyzed the tensors while training was running and produced an alert in the form of a CloudWatch event.\n",
- "\n",
- "You can go back and change the hyperparameters passed to the estimator to `simple_hyperparameters` and start a new training job. You will see that the ExplodingTensor rule is not fired in that case, as no tensors go to `nan` with the default good hyperparameters.\n",
- "\n",
- "We have 2 more real-life examples in the final section of this notebook for you to try: a GPU example which trains ResNet50, and a CPU example which trains MNIST. Before moving further, let's take a more detailed look at Tornasole, parts of which were touched upon above."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Enabling Tornasole in the training script\n",
- "\n",
- "The first step to using Tornasole is to save tensors from the training job. The containers we provide in SageMaker come with the Tornasole library installed, which you need to use to enable Tornasole in your training script. We currently support two interfaces for training in TensorFlow: `tf.Session` and `tf.Estimator`. \n",
- "\n",
- "Please note: **Keras** support is Work in Progress. Please stay tuned! We will also support **Eager** mode in the future. Tornasole also currently only works for single process training. We will support distributed training very soon. \n",
- "\n",
- "### TF Session based training\n",
- "When training using this interface you need to create a [MonitoredSession](https://www.tensorflow.org/api_docs/python/tf/train/MonitoredSession) to use for the job which is configured with SessionHook, a construct Tornasole exposes to save tensors from the job. Here's how you will need to modify your training script.\n",
- "\n",
- "First, you need to import `smdebug.tensorflow`. \n",
- "```\n",
- "import smdebug.tensorflow as smd \n",
- "```\n",
- "Then create the SessionHook by specifying what you want to save and when you want to save them.\n",
- "```\n",
- "hook = smd.SessionHook(include_collections=['weights','gradients'],\n",
- " save_config=smd.SaveConfig(save_interval=50))\n",
- "```\n",
- "Tornasole has the concept of modes (TRAIN, EVAL, PREDICT) to separate out different modes of the jobs.\n",
- "Set the mode you are running in your job. Every time the mode changes in your job, please set the current mode. This helps you group steps by mode, for easier analysis. Setting the mode is optional but recommended. If you do not specify this, Tornasole saves all steps under a `GLOBAL` mode. \n",
- "```\n",
- "hook.set_mode(smd.modes.TRAIN)\n",
- "```\n",
- "Wrap your optimizer with wrap_optimizer so that Tornasole can identify your gradients and automatically provide these tensors as part of the `gradients` collection. Use this new optimizer to minimize your loss during training.\n",
- "```\n",
- "optimizer = hook.wrap_optimizer(optimizer)\n",
- "```\n",
- "Create a monitored session with the above hook, and use this for executing your TensorFlow job.\n",
- "```\n",
- "sess = tf.train.MonitoredSession(hooks=[hook])\n",
- "```\n",
- "\n",
- "We have an example script which shows the above: [scripts/simple.py](../scripts/simple.py). You will be running this script below.\n",
- "\n",
- "Refer to [DeveloperGuide_TensorFlow.md](../../DeveloperGuide_TF.md) for more details on the APIs Tornasole provides to help you save tensors.\n",
- "\n",
- "### TF Estimator based training\n",
- "When training using this interface you need to pass SessionHook, a construct Tornasole exposes to save tensors, to the train, predict or evaluate functions of your [TensorFlow Estimator](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/estimator/Estimator?hl=en). Here's how you will need to modify your training script.\n",
- "\n",
- "First, you need to import `smdebug.tensorflow`. \n",
- "```\n",
- "import smdebug.tensorflow as smd \n",
- "```\n",
- "Then create the SessionHook by specifying what you want to save and when you want to save them.\n",
- "```\n",
- "hook = smd.SessionHook(include_collections=['weights','gradients'],\n",
- " save_config=smd.SaveConfig(save_interval=50))\n",
- "```\n",
- "Tornasole has the concept of modes (TRAIN, EVAL, PREDICT) to separate out different modes of the jobs.\n",
- "Set the mode you are running in your job. Every time the mode changes in your job, please set the current mode. This helps you group steps by mode, for easier analysis. Setting the mode is optional but recommended. If you do not specify this, Tornasole saves all steps under a `GLOBAL` mode. \n",
- "```\n",
- "hook.set_mode(smd.modes.TRAIN)\n",
- "```\n",
- "Wrap your optimizer with wrap_optimizer so that Tornasole can identify your gradients and automatically provide these tensors as part of the `gradients` collection. Use this new optimizer to minimize your loss during training.\n",
- "```\n",
- "opt = hook.wrap_optimizer(opt)\n",
- "```\n",
- "Now pass this hook to the estimator object's train, predict or evaluate methods, whichever ones you want to monitor.\n",
- "```\n",
- "classifier = tf.estimator.Estimator(...)\n",
- "\n",
- "classifier.train(input_fn, hooks=[hook])\n",
- "classifier.predict(input_fn, hooks=[hook])\n",
- "classifier.evaluate(input_fn, hooks=[hook])\n",
- "```\n",
- "\n",
- "Refer to our example scripts for [MNIST](../scripts/mnist.py) or [ResNet50 for ImageNet](../scripts/train_imagenet_resnet_hvd.py) for examples of using Tornasole with the Estimator interface. We will show you how to run these examples in SageMaker below.\n",
- "\n",
- "Refer to [DeveloperGuide_TensorFlow.md](../../DeveloperGuide_TF.md) for more details on the APIs Tornasole provides to help you save tensors."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Enabling Tornasole with SageMaker\n",
- "#### Storage\n",
- "The tensors saved by Tornasole are, by default, stored in the S3 output path of the training job, under the folder **`/tensors-`**. This ensures that the tensors from one training job are not accidentally overwritten by those from another. Rule evaluation requires the tensor paths to be separate in order to work correctly.\n",
- "\n",
- "If you don't provide an S3 output path to the estimator, SageMaker creates one for you as: **`s3://sagemaker--/`**\n",
- "\n",
- "This path is used to create a Tornasole Trial taken by Rules (see below).\n",
- "\n",
- "#### New Parameters \n",
- "The new parameters in the SageMaker Estimator to look out for are:\n",
- "\n",
- "- `debug` :(bool)\n",
- "This indicates that debugging should be enabled for the training job. \n",
- "Setting this as `True` would make Tornasole available for use with the job\n",
- "\n",
- "- `rules_specification`: (list[*dict*])\n",
- "You can specify any number of rules to monitor your SageMaker training job. This parameter takes a list of python dictionaries, one for each rule you want to enable. Each `dict` is of the following form:\n",
- "```\n",
- "{\n",
- " \"RuleName\": \n",
- " # The name of the class implementing the Tornasole Rule interface. (required)\n",
- "\n",
- " \"SourceS3Uri\": \n",
- " # S3 URI of the rule script containing the class in 'RuleName'. \n",
- " # This is not required if you want to use one of the\n",
- " # First Party rules provided to you by Amazon. \n",
- " # In such a case you can leave it empty or not pass it. If you want to run a custom rule \n",
- " # defined by you, you will need to define the custom rule class in a python \n",
- " # file and provide it to SageMaker as a S3 URI. \n",
- " # SageMaker will fetch this file and try to look for the rule class \n",
- " # identified by RuleName in this file.\n",
- " \n",
- " \"InstanceType\": \n",
- " # The ML instance type which should be used to run the rule evaluation job\n",
- " \n",
- " \"VolumeSizeInGB\": \n",
- " # The volume size to store the runtime artifacts from the rule evaluation \n",
- " \n",
- " \"RuntimeConfigurations\": {\n",
- " # Map defining the parameters required to instantiate the Rule class and\n",
- " # parameters regarding invocation of the rule (start-step and end-step)\n",
- " # This can be any parameter taken by the rule. \n",
- " # Every value here needs to be a string. \n",
- " # So when you write custom rules, ensure that you can parse each argument from a string.\n",
- " #\n",
- " # PARAMS CAN BE\n",
- " #\n",
- " # STANDARD PARAMS FOR RULE EXECUTION\n",
- " # \"start-step\": \n",
- " # \"end-step\": \n",
- " # \"other-trials-paths\": (';' separated list of s3 paths as a string)\n",
- " # \"logging-level\": (can be one of \"CRITICAL\", \"FATAL\", \"ERROR\", \n",
- " # \"WARNING\", \"WARN\", \"DEBUG\", \"NOTSET\")\n",
- " #\n",
- " # ANY OTHER PARAMETER TAKEN BY THE RULE\n",
- " # \"parameter\" : \n",
- " # : \n",
- " }\n",
- "}\n",
- "```\n",
- "\n",
- "### Inputs\n",
- "Just a quick reminder if you are not familiar with script mode in SageMaker. You can pass command line arguments taken by your training script with a hyperparameter dictionary which gets passed to the SageMaker Estimator class. You can see this in the examples below."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Rules\n",
- "Rules are the medium by which Tornasole executes a certain piece of code regularly on different steps of the job.\n",
- "They can be used to assert certain conditions during training, and raise CloudWatch Events based on them, which you can\n",
- "process in any way you like. \n",
- "\n",
- "Tornasole comes with a set of **First Party rules** (1P rules).\n",
- "You can also write your own rules looking at these 1P rules for inspiration. \n",
- "Refer to [DeveloperGuide_Rules.md](../../../../rules/DeveloperGuide_Rules.md) for more on the APIs you can use to write your own rules as well as descriptions of the 1P rules that we provide. \n",
- " \n",
- "Here we will talk about how to use SageMaker to evaluate these rules on the training jobs.\n",
- "\n",
- "\n",
- "##### 1P Rule \n",
- "If you want to use a 1P rule, specify the RuleName field with the 1P RuleName, and the rule will be automatically applied. You can pass any parameters accepted by the rule as part of the RuntimeConfigurations dictionary. A rule's constructor takes a trial as a parameter. \n",
- "A Trial in Tornasole's context refers to a training job. It is identified by the path where the saved tensors for the job are stored. \n",
- "A rule takes a `base_trial` which refers to the job whose run invokes the rule execution. \n",
- "\n",
- "**Note:** A rule can be written to compare and analyze tensors across training jobs. A rule which needs to compare tensors across trials can be run by passing the argument `other_trials`. The argument `base_trial` will automatically be set by SageMaker when executing the rule. The parameter `other_trials` (if taken by the rule) can be set by passing `other-trials-paths` in the RuntimeConfigurations dictionary. The value for this argument should be a `;`-separated list of S3 output paths where the tensors for those trials are stored.\n",
- "\n",
- "Here's an example of a complex configuration for the SimilarAcrossRuns rule (which accepts one other trial and a regex pattern) where we ask for the rule to be invoked for the steps between 10 and 100.\n",
- "\n",
- "``` \n",
- "rules_specification = [ \n",
- " {\n",
- " \"RuleName\": \"SimilarAcrossRuns\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " \"VolumeSizeInGB\": 10,\n",
- " \"RuntimeConfigurations\": {\n",
- " \"other-trials-paths\": \"s3://sagemaker--/past-job\",\n",
- " \"include_regex\": \".*\",\n",
- " \"start-step\": \"10\",\n",
- " \"end-step\": \"100\"\n",
- " }\n",
- " }\n",
- "]\n",
- "```\n",
- "A list of 1P rules and details about them can be found in the *First party rules* section of [DeveloperGuide_Rules.md](../../../../rules/DeveloperGuide_Rules.md). \n",
- "\n",
- "\n",
- "##### Custom rule\n",
- "In this case you need to define a custom rule class which inherits from the `smdebug.rules.Rule` class.\n",
- "You need to provide SageMaker with the S3 location of the file which defines your custom rule classes as the value for the field `SourceS3Uri`. Again, you can pass any arguments taken by this rule through the RuntimeConfigurations dictionary. Note that custom rules can only have arguments which expect a string as the value, except for the two arguments specifying trials to the Rule. Refer to the section *Writing a rule* in [DeveloperGuide_Rules.md](../../../../rules/DeveloperGuide_Rules.md) for more details.\n",
- "\n",
- "Here's an example:\n",
- "```\n",
- "rules_specification = [\n",
- " {\n",
- " \"RuleName\": \"CustomRule\",\n",
- " \"SourceS3Uri\": \"s3://weiyou-tornasole-test/rule-script/custom_rule.py\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " \"VolumeSizeInGB\": 10,\n",
- " \"RuntimeConfigurations\": {\n",
- " \"threshold\" : \"0.5\"\n",
- " }\n",
- " }\n",
- "]\n",
- "```"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### CloudWatch Event Integration for Rules\n",
- "When the status of a training job or rule execution job changes (e.g. starting, failed), TrainingJobStatus [CloudWatch events](https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html) are emitted.\n",
- "\n",
- "After GA, you can configure a CloudWatch event rule to receive and process these events by setting up a target (Lambda function, SNS) as follows:\n",
- "\n",
- "- The [SageMaker TrainingJobStatus CW event](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#sagemaker_event_types) will include the rule job statuses associated with the training job\n",
- "- A CW event will be emitted when a RuleStatus changes\n",
- "- You can create a CloudWatch event rule that monitors the training job you started\n",
- "- You can set a target (Lambda function, SQS) for the CloudWatch event rule that processes the event and triggers an alarm based on the RuleStatus. \n",
- "\n",
- "Refer to [this page](https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html) for more details. "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Train Resnet50 Estimator Interface, with Tornasole\n",
- "Now let us run a more complicated example: training ResNet50 on a GPU instance. The script, which uses the TensorFlow Estimator interface, is available [here](../scripts/train_imagenet_resnet_hvd.py). It supports various modes of using smdebug. Please refer to [this document](../../sm_resnet50.md), which summarizes the changes made to this script to save weights, gradients, activations of certain layers, etc. You can also save large layers as reductions instead of saving the full tensor. Full details of Tornasole's APIs to save tensors are available in [DeveloperGuide_TensorFlow](../../DeveloperGuide_TF.md).\n",
- "\n",
- "The below hyperparameters initialize the weights of the model badly (to a small constant). This results in training proceeding badly with many gradients vanishing. We can monitor the situation using the VanishingGradient rule."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "resnet_script = '../scripts/train_imagenet_resnet_hvd.py'\n",
- "bad_resnet_hyperparameters = {\n",
- " 'enable_smdebug': True,\n",
- " 'save_weights': True,\n",
- " 'save_gradients': True,\n",
- " 'step_interval' : 100,\n",
- " 'num_epochs': 1,\n",
- " 'constant_initializer': 0.01\n",
- "}"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "sagemaker_resnet_estimator = TensorFlow(role=sagemaker.get_execution_role(),\n",
- " base_job_name='tornasole-demo-resnet',\n",
- " train_instance_count=1,\n",
- " train_instance_type='ml.p3.2xlarge',\n",
- " image_name=gpu_docker_image_name,\n",
- " entry_point=resnet_script,\n",
- " framework_version='1.13.1',\n",
- " py_version='py3',\n",
- " script_mode=True,\n",
- " hyperparameters=bad_resnet_hyperparameters,\n",
- " debug=True,\n",
- " train_max_run=1800,\n",
- " rules_specification=[\n",
- " {\n",
- " \"RuleName\": \"VanishingGradient\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " }\n",
- " ])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "sagemaker_resnet_estimator.fit(wait=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Note:wait=False above, made fit call fire and forget call. \n",
- "# Sagemaker will run the training job and rule job in the background. \n",
- "# To see status of training job:\n",
- "sagemaker_simple_estimator.sagemaker_session.sagemaker_client.describe_training_job(\n",
- " TrainingJobName=sagemaker_simple_estimator.latest_training_job.name\n",
- ")\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# To check status of rule execution job\n",
- "sagemaker_resnet_estimator.describe_rule_execution_jobs()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Train MNIST Estimator interface, with Tornasole\n",
- "If you do not want to use GPUs at this point, but want to run a slightly more complicated script than the simple example you saw above, you can train a model on CPU on the MNIST dataset as below. Let us monitor for VanishingGradient in this job. **We do not expect this rule to be fired for the below hyperparameters.**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "mnist_script = '../scripts/mnist.py'\n",
- "mnist_hyperparameters = {'num_epochs': 5}"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "sagemaker_mnist_estimator = TensorFlow(role=sagemaker.get_execution_role(),\n",
- " base_job_name='tornasole-demo-mnist',\n",
- " train_instance_count=1,\n",
- " train_instance_type='ml.m4.xlarge',\n",
- " image_name=cpu_docker_image_name,\n",
- " entry_point=mnist_script,\n",
- " framework_version='1.13.1',\n",
- " py_version='py3',\n",
- " script_mode=True,\n",
- " hyperparameters=mnist_hyperparameters,\n",
- " debug=True,\n",
- " train_max_run=1800,\n",
- " rules_specification=[\n",
- " {\n",
- " \"RuleName\": \"VanishingGradient\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " }\n",
- " ])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "sagemaker_mnist_estimator.fit(wait=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "sagemaker_mnist_estimator.sagemaker_session.sagemaker_client.describe_training_job(\n",
- " TrainingJobName=sagemaker_mnist_estimator.latest_training_job.name\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# To check status of rule execution job\n",
- "sagemaker_mnist_estimator.describe_rule_execution_jobs()"
- ]
- }
- ],
- "metadata": {
- "celltoolbar": "Raw Cell Format",
- "kernelspec": {
- "display_name": "conda_tensorflow_p36",
- "language": "python",
- "name": "conda_tensorflow_p36"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.5"
- },
- "pycharm": {
- "stem_cell": {
- "cell_type": "raw",
- "metadata": {
- "collapsed": false
- },
- "source": []
- }
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/examples/tensorflow/sagemaker_byoc/mnist.py b/examples/tensorflow/sagemaker_byoc/mnist.py
new file mode 100644
index 000000000..bd354d24c
--- /dev/null
+++ b/examples/tensorflow/sagemaker_byoc/mnist.py
@@ -0,0 +1,139 @@
+"""
+This is a simple MNIST training script which uses TensorFlow's Estimator interface.
+It has been instrumented with SageMaker Debugger hooks to save tensors during training.
+These hooks read the JSON configuration that SageMaker places in the training container.
+Configuration provided to the SageMaker Python SDK when creating a job is passed on to the hook.
+This allows you to use the same script with different configurations across runs.
+If you use an official SageMaker Framework container (i.e. AWS Deep Learning Container), then
+you do not have to instrument your script as below. Hooks are added automatically in those environments.
+For more information, please refer to https://github.com/awslabs/sagemaker-debugger/blob/master/docs/sagemaker.md
+"""
+
+# Standard Library
+import argparse
+import logging
+import random
+
+# Third Party
+import numpy as np
+import tensorflow as tf
+
+# First Party
+import smdebug.tensorflow as smd
+
+logging.getLogger().setLevel(logging.INFO)
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--lr", type=float, default=0.001)
+parser.add_argument("--random_seed", type=bool, default=False)
+parser.add_argument("--num_epochs", type=int, default=5, help="Number of epochs to train for")
+parser.add_argument(
+    "--num_steps",
+    type=int,
+    help="Number of steps to train for. If this is passed, it overrides num_epochs",
+)
+parser.add_argument(
+    "--num_eval_steps",
+    type=int,
+    help="Number of steps to evaluate for. If this "
+    "is passed, it doesn't evaluate over the full eval set",
+)
+parser.add_argument("--model_dir", type=str, default="/tmp/mnist_model")
+args = parser.parse_args()
+
+if args.random_seed:
+    tf.set_random_seed(2)
+    np.random.seed(2)
+    random.seed(12)
+
+# This allows you to create the hook from the configuration you pass to the SageMaker Python SDK
+hook = smd.SessionHook.create_from_json_file()
+
+
+def cnn_model_fn(features, labels, mode):
+    """Model function for CNN."""
+    # Input Layer
+    input_layer = tf.reshape(features["x"], [-1, 28, 28, 1])
+
+    # Convolutional Layer #1
+    conv1 = tf.layers.conv2d(
+        inputs=input_layer, filters=32, kernel_size=[5, 5], padding="same", activation=tf.nn.relu
+    )
+
+    # Pooling Layer #1
+    pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)
+
+    # Convolutional Layer #2 and Pooling Layer #2
+    conv2 = tf.layers.conv2d(
+        inputs=pool1, filters=64, kernel_size=[5, 5], padding="same", activation=tf.nn.relu
+    )
+    pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)
+
+    # Dense Layer
+    pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 64])
+    dense = tf.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.relu)
+    dropout = tf.layers.dropout(
+        inputs=dense, rate=0.4, training=mode == tf.estimator.ModeKeys.TRAIN
+    )
+
+    # Logits Layer
+    logits = tf.layers.dense(inputs=dropout, units=10)
+
+    predictions = {
+        # Generate predictions (for PREDICT and EVAL mode)
+        "classes": tf.argmax(input=logits, axis=1),
+        # Add `softmax_tensor` to the graph. It is used for PREDICT and by the
+        # `logging_hook`.
+        "probabilities": tf.nn.softmax(logits, name="softmax_tensor"),
+    }
+
+    if mode == tf.estimator.ModeKeys.PREDICT:
+        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
+
+    # Calculate Loss (for both TRAIN and EVAL modes)
+    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
+
+    # Configure the Training Op (for TRAIN mode)
+    if mode == tf.estimator.ModeKeys.TRAIN:
+        optimizer = tf.train.GradientDescentOptimizer(learning_rate=args.lr)
+
+        # SMD: Wrap your optimizer as follows to help SageMaker Debugger identify gradients
+        # This does not change your optimization logic, it returns back the same optimizer
+        optimizer = hook.wrap_optimizer(optimizer)
+
+        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
+        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
+
+    # Add evaluation metrics (for EVAL mode)
+    eval_metric_ops = {
+        "accuracy": tf.metrics.accuracy(labels=labels, predictions=predictions["classes"])
+    }
+    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)
+
+
+# Load training and eval data
+((train_data, train_labels), (eval_data, eval_labels)) = tf.keras.datasets.mnist.load_data()
+
+train_data = train_data / np.float32(255)
+train_labels = train_labels.astype(np.int32) # not required
+
+eval_data = eval_data / np.float32(255)
+eval_labels = eval_labels.astype(np.int32) # not required
+
+mnist_classifier = tf.estimator.Estimator(model_fn=cnn_model_fn, model_dir=args.model_dir)
+
+train_input_fn = tf.estimator.inputs.numpy_input_fn(
+    x={"x": train_data}, y=train_labels, batch_size=128, num_epochs=args.num_epochs, shuffle=True
+)
+
+eval_input_fn = tf.estimator.inputs.numpy_input_fn(
+    x={"x": eval_data}, y=eval_labels, num_epochs=1, shuffle=False
+)
+
+# Set training mode so SMDebug can classify the steps into training mode
+hook.set_mode(smd.modes.TRAIN)
+mnist_classifier.train(input_fn=train_input_fn, steps=args.num_steps, hooks=[hook])
+
+# Set eval mode so SMDebug can classify the steps into eval mode
+hook.set_mode(smd.modes.EVAL)
+mnist_classifier.evaluate(input_fn=eval_input_fn, steps=args.num_eval_steps, hooks=[hook])
diff --git a/examples/tensorflow/scripts/simple.py b/examples/tensorflow/sagemaker_byoc/simple.py
similarity index 57%
rename from examples/tensorflow/scripts/simple.py
rename to examples/tensorflow/sagemaker_byoc/simple.py
index b8268d0af..08433dd68 100644
--- a/examples/tensorflow/scripts/simple.py
+++ b/examples/tensorflow/sagemaker_byoc/simple.py
@@ -1,3 +1,14 @@
+"""
+This is a simple training script which uses TensorFlow's MonitoredSession interface.
+It has been instrumented with SageMaker Debugger hooks to save tensors during training.
+These hooks read the JSON configuration that SageMaker places in the training container.
+Configuration provided to the SageMaker Python SDK when creating a job is passed on to the hook.
+This allows you to use the same script with different configurations across runs.
+If you use an official SageMaker Framework container (i.e. AWS Deep Learning Container), then
+you do not have to instrument your script as below. Hooks are added automatically in those environments.
+For more information, please refer to https://github.com/awslabs/sagemaker-debugger/blob/master/docs/
+"""
+
# Standard Library
import argparse
import random
@@ -13,7 +24,6 @@
def str2bool(v):
if isinstance(v, bool):
return v
-
if v.lower() in ("yes", "true", "t", "y", "1"):
return True
elif v.lower() in ("no", "false", "f", "n", "0"):
@@ -23,25 +33,11 @@ def str2bool(v):
parser = argparse.ArgumentParser()
-parser.add_argument("--script-mode", type=str2bool, default=False)
parser.add_argument("--model_dir", type=str, help="S3 path for the model")
parser.add_argument("--lr", type=float, help="Learning Rate", default=0.001)
parser.add_argument("--steps", type=int, help="Number of steps to run", default=100)
parser.add_argument("--scale", type=float, help="Scaling factor for inputs", default=1.0)
-parser.add_argument("--save_all", type=str2bool, default=True)
-parser.add_argument("--smdebug_path", type=str, default="/opt/ml/output/tensors")
-parser.add_argument("--save_frequency", type=int, help="How often to save TS data", default=10)
parser.add_argument("--random_seed", type=bool, default=False)
-feature_parser = parser.add_mutually_exclusive_group(required=False)
-feature_parser.add_argument(
- "--reductions",
- dest="reductions",
- action="store_true",
- help="save reductions of tensors instead of saving full tensors",
-)
-feature_parser.add_argument(
- "--no_reductions", dest="reductions", action="store_false", help="save full tensors"
-)
args = parser.parse_args()
# these random seeds are only intended for test purpose.
@@ -52,27 +48,7 @@ def str2bool(v):
np.random.seed(2)
random.seed(12)
-
-if args.script_mode:
- # save tensors as reductions if necessary
- rdnc = (
- smd.ReductionConfig(reductions=["mean"], abs_reductions=["max"], norms=["l1"])
- if args.reductions
- else None
- )
-
- # create the hook
- # Note that we are saving all tensors here by passing save_all=True
- hook = smd.SessionHook(
- out_dir=args.smdebug_path,
- save_all=args.save_all,
- include_collections=["weights", "gradients", "losses"],
- save_config=smd.SaveConfig(save_interval=args.save_frequency),
- reduction_config=rdnc,
- )
- hooks = [hook]
-else:
- hooks = []
+hook = smd.SessionHook.create_from_json_file()
# Network definition
# Note the use of name scopes
@@ -84,34 +60,30 @@ def str2bool(v):
y = tf.matmul(x, w0)
loss = tf.reduce_mean((tf.matmul(x, w) - y) ** 2, name="loss")
-smd.get_hook("session", create_if_not_exists=True).add_to_collection("losses", loss)
+hook.add_to_collection("losses", loss)
global_step = tf.Variable(17, name="global_step", trainable=False)
increment_global_step_op = tf.assign(global_step, global_step + 1)
optimizer = tf.train.AdamOptimizer(args.lr)
-if args.script_mode:
- # Wrap the optimizer with wrap_optimizer so Tornasole can find gradients and optimizer_variables to save
- optimizer = hook.wrap_optimizer(optimizer)
+# Wrap the optimizer with wrap_optimizer so smdebug can find gradients to save
+optimizer = hook.wrap_optimizer(optimizer)
# use this wrapped optimizer to minimize loss
optimizer_op = optimizer.minimize(loss, global_step=increment_global_step_op)
-if args.script_mode:
- hook.set_mode(smd.modes.TRAIN)
-
# pass the hook to hooks parameter of monitored session
-sess = tf.train.MonitoredSession(hooks=hooks)
+sess = tf.train.MonitoredSession(hooks=[hook])
# use this session for running the tensorflow model
+hook.set_mode(smd.modes.TRAIN)
for i in range(args.steps):
x_ = np.random.random((10, 2)) * args.scale
_loss, opt, gstep = sess.run([loss, optimizer_op, increment_global_step_op], {x: x_})
print(f"Step={i}, Loss={_loss}")
-if args.script_mode:
- hook.set_mode(smd.modes.EVAL)
+hook.set_mode(smd.modes.EVAL)
for i in range(args.steps):
x_ = np.random.random((10, 2)) * args.scale
sess.run([loss, increment_global_step_op], {x: x_})
diff --git a/examples/tensorflow/sagemaker_byoc/tf_keras_resnet.py b/examples/tensorflow/sagemaker_byoc/tf_keras_resnet.py
new file mode 100644
index 000000000..9ce9d2e74
--- /dev/null
+++ b/examples/tensorflow/sagemaker_byoc/tf_keras_resnet.py
@@ -0,0 +1,73 @@
+"""
+This is a ResNet training script which uses TensorFlow's Keras interface.
+It has been instrumented with SageMaker Debugger hooks to save tensors during training.
+These hooks read the JSON configuration that SageMaker places in the training container.
+Configuration provided to the SageMaker Python SDK when creating a job is passed on to the hook.
+This allows you to use the same script with different configurations across runs.
+If you use an official SageMaker Framework container (i.e. AWS Deep Learning Container), then
+you do not have to instrument your script as below. Hooks are added automatically in those environments.
+For more information, please refer to https://github.com/awslabs/sagemaker-debugger/blob/master/docs/sagemaker.md
+"""
+
+# Standard Library
+import argparse
+
+# Third Party
+import numpy as np
+import tensorflow as tf
+from tensorflow.keras.applications.resnet50 import ResNet50
+from tensorflow.keras.datasets import cifar10
+from tensorflow.keras.utils import to_categorical
+
+# First Party
+import smdebug.tensorflow as smd
+
+
+def train(batch_size, epoch, model, hook):
+    (X_train, y_train), (X_valid, y_valid) = cifar10.load_data()
+
+    Y_train = to_categorical(y_train, 10)
+    Y_valid = to_categorical(y_valid, 10)
+
+    X_train = X_train.astype("float32")
+    X_valid = X_valid.astype("float32")
+
+    mean_image = np.mean(X_train, axis=0)
+    X_train -= mean_image
+    X_valid -= mean_image
+    X_train /= 128.0
+    X_valid /= 128.0
+
+    model.fit(
+        X_train,
+        Y_train,
+        batch_size=batch_size,
+        epochs=epoch,
+        validation_data=(X_valid, Y_valid),
+        shuffle=True,
+        callbacks=[hook],
+    )
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Train resnet50 cifar10")
+    parser.add_argument("--batch_size", type=int, default=32)
+    parser.add_argument("--epoch", type=int, default=3)
+    parser.add_argument("--model_dir", type=str, default="./model_keras_resnet")
+    opt = parser.parse_args()
+
+    model = ResNet50(weights=None, input_shape=(32, 32, 3), classes=10)
+
+    # Create hook from the configuration provided through sagemaker python sdk
+    hook = smd.KerasHook.create_from_json_file()
+    optimizer = tf.keras.optimizers.Adam()
+    # wrap the optimizer so the hook can identify the gradients
+    optimizer = hook.wrap_optimizer(optimizer)
+    model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
+
+    # start the training.
+    train(opt.batch_size, opt.epoch, model, hook)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/examples/tensorflow/scripts/mnist.py b/examples/tensorflow/sagemaker_official_container/mnist.py
similarity index 79%
rename from examples/tensorflow/scripts/mnist.py
rename to examples/tensorflow/sagemaker_official_container/mnist.py
index 078b06a3b..2143c5bc9 100644
--- a/examples/tensorflow/scripts/mnist.py
+++ b/examples/tensorflow/sagemaker_official_container/mnist.py
@@ -1,19 +1,24 @@
+"""
+This is a simple MNIST training script which uses TensorFlow's Estimator interface.
+It is designed to be used with SageMaker Debugger in an official SageMaker Framework container (i.e. AWS Deep Learning Container).
+You will notice that this script looks exactly like a normal TensorFlow training script.
+The hook needed by SageMaker Debugger to save tensors during training is added automatically in those environments.
+The hook loads the JSON configuration that SageMaker places in the training container, based on the configuration provided through the SageMaker Python SDK when creating a job.
+For more information, please refer to https://github.com/awslabs/sagemaker-debugger/blob/master/docs/sagemaker.md
+"""
+
# Standard Library
import argparse
+import logging
import random
# Third Party
import numpy as np
import tensorflow as tf
-# First Party
-import smdebug.tensorflow as smd
+logging.getLogger().setLevel(logging.INFO)
parser = argparse.ArgumentParser()
-parser.add_argument("--script-mode", type=bool, default=False)
-parser.add_argument("--smdebug_path", type=str)
-parser.add_argument("--train_frequency", type=int, help="How often to save TS data", default=50)
-parser.add_argument("--eval_frequency", type=int, help="How often to save TS data", default=10)
parser.add_argument("--lr", type=float, default=0.001)
parser.add_argument("--random_seed", type=bool, default=False)
parser.add_argument("--num_epochs", type=int, default=5, help="Number of epochs to train for")
@@ -82,13 +87,10 @@ def cnn_model_fn(features, labels, mode):
# Calculate Loss (for both TRAIN and EVAL modes)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
- tf.summary.scalar("loss", loss)
# Configure the Training Op (for TRAIN mode)
if mode == tf.estimator.ModeKeys.TRAIN:
optimizer = tf.train.GradientDescentOptimizer(learning_rate=args.lr)
- if args.script_mode:
- optimizer = smd.get_hook().wrap_optimizer(optimizer)
train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
@@ -118,25 +120,6 @@ def cnn_model_fn(features, labels, mode):
x={"x": eval_data}, y=eval_labels, num_epochs=1, shuffle=False
)
-if args.script_mode:
- hook = smd.SessionHook(
- out_dir=args.smdebug_path,
- save_config=smd.SaveConfig(
- {
- smd.modes.TRAIN: smd.SaveConfigMode(args.train_frequency),
- smd.modes.EVAL: smd.SaveConfigMode(args.eval_frequency),
- }
- ),
- )
- hooks = [hook]
-else:
- hooks = []
-
-if args.script_mode:
- hook.set_mode(smd.modes.TRAIN)
-# train one step and display the probabilties
-mnist_classifier.train(input_fn=train_input_fn, steps=args.num_steps, hooks=hooks)
-
-if args.script_mode:
- hook.set_mode(smd.modes.EVAL)
-mnist_classifier.evaluate(input_fn=eval_input_fn, steps=args.num_eval_steps, hooks=hooks)
+mnist_classifier.train(input_fn=train_input_fn, steps=args.num_steps)
+
+mnist_classifier.evaluate(input_fn=eval_input_fn, steps=args.num_eval_steps)
diff --git a/examples/tensorflow/sagemaker_official_container/simple.py b/examples/tensorflow/sagemaker_official_container/simple.py
new file mode 100644
index 000000000..a5fb1a3f5
--- /dev/null
+++ b/examples/tensorflow/sagemaker_official_container/simple.py
@@ -0,0 +1,92 @@
+"""
+This is a simple training script which uses TensorFlow's MonitoredSession interface.
+It is designed to be used with SageMaker Debugger in an official SageMaker Framework container (i.e. AWS Deep Learning Container).
+Here we create the hook object, which loads its configuration from the JSON file that SageMaker
+places in the training container, based on the configuration provided through the SageMaker Python SDK when creating a job.
+We use this hook object to add our custom loss to the losses collection and to set the mode.
+For more information, please refer to https://github.com/awslabs/sagemaker-debugger/blob/master/docs/
+"""
+
+# Standard Library
+import argparse
+import random
+
+# Third Party
+import numpy as np
+import tensorflow as tf
+
+# First Party
+import smdebug.tensorflow as smd
+
+
+def str2bool(v):
+    if isinstance(v, bool):
+        return v
+    if v.lower() in ("yes", "true", "t", "y", "1"):
+        return True
+    elif v.lower() in ("no", "false", "f", "n", "0"):
+        return False
+    else:
+        raise argparse.ArgumentTypeError("Boolean value expected.")
+
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--model_dir", type=str, help="S3 path for the model")
+parser.add_argument("--lr", type=float, help="Learning Rate", default=0.001)
+parser.add_argument("--steps", type=int, help="Number of steps to run", default=100)
+parser.add_argument("--scale", type=float, help="Scaling factor for inputs", default=1.0)
+parser.add_argument("--random_seed", type=str2bool, default=False)
+args = parser.parse_args()
+
+# These random seeds are intended only for testing.
+# For now, the seeds 2, 2, 12 avoid assertion failures when running the tests.
+# If you change them, note that tensor values at certain steps may vary.
+if args.random_seed:
+    tf.set_random_seed(2)
+    np.random.seed(2)
+    random.seed(12)
+
+# Network definition
+# Note the use of name scopes
+with tf.name_scope("foobar"):
+    x = tf.placeholder(shape=(None, 2), dtype=tf.float32)
+    w = tf.Variable(initial_value=[[10.0], [10.0]], name="weight1")
+with tf.name_scope("foobaz"):
+    w0 = [[1], [1.0]]
+    y = tf.matmul(x, w0)
+loss = tf.reduce_mean((tf.matmul(x, w) - y) ** 2, name="loss")
+
+hook = smd.SessionHook.create_from_json_file()
+hook.add_to_collection("losses", loss)
+
+global_step = tf.Variable(17, name="global_step", trainable=False)
+increment_global_step_op = tf.assign(global_step, global_step + 1)
+
+optimizer = tf.train.AdamOptimizer(args.lr)
+
+# You do not need to wrap the optimizer in a zero-script-change environment
+# (i.e. SageMaker/AWS Deep Learning Containers),
+# as the framework does that automatically there if the hook exists
+optimizer = hook.wrap_optimizer(optimizer)
+
+# use this wrapped optimizer to minimize loss
+optimizer_op = optimizer.minimize(loss, global_step=increment_global_step_op)
+
+# You do not need to pass the hook to the session in a zero-script-change environment
+# (i.e. SageMaker/AWS Deep Learning Containers),
+# as the framework does that automatically there if the hook exists
+sess = tf.train.MonitoredSession()
+
+# use this session for running the tensorflow model
+hook.set_mode(smd.modes.TRAIN)
+for i in range(args.steps):
+    x_ = np.random.random((10, 2)) * args.scale
+    _loss, opt, gstep = sess.run([loss, optimizer_op, increment_global_step_op], {x: x_})
+    print(f"Step={i}, Loss={_loss}")
+
+# set the mode for monitored session based runs
+# so smdebug can separate out steps by mode
+hook.set_mode(smd.modes.EVAL)
+for i in range(args.steps):
+    x_ = np.random.random((10, 2)) * args.scale
+    sess.run([loss, increment_global_step_op], {x: x_})
diff --git a/examples/tensorflow/sagemaker_official_container/tf_keras_resnet.py b/examples/tensorflow/sagemaker_official_container/tf_keras_resnet.py
new file mode 100644
index 000000000..f3e1930a2
--- /dev/null
+++ b/examples/tensorflow/sagemaker_official_container/tf_keras_resnet.py
@@ -0,0 +1,60 @@
+"""
+This is a ResNet training script which uses TensorFlow's Keras interface.
+It is designed to be used with SageMaker Debugger in an official SageMaker Framework container (i.e. AWS Deep Learning Container).
+You will notice that this script looks exactly like a normal TensorFlow training script.
+The hook needed by SageMaker Debugger to save tensors during training is added automatically in those environments.
+The hook loads the JSON configuration that SageMaker places in the training container, based on the configuration provided through the SageMaker Python SDK when creating a job.
+For more information, please refer to https://github.com/awslabs/sagemaker-debugger/blob/master/docs/sagemaker.md
+"""
+
+# Standard Library
+import argparse
+
+# Third Party
+import numpy as np
+from tensorflow.keras.applications.resnet50 import ResNet50
+from tensorflow.keras.datasets import cifar10
+from tensorflow.keras.utils import to_categorical
+
+
+def train(batch_size, epoch, model):
+    (X_train, y_train), (X_valid, y_valid) = cifar10.load_data()
+
+    Y_train = to_categorical(y_train, 10)
+    Y_valid = to_categorical(y_valid, 10)
+
+    X_train = X_train.astype("float32")
+    X_valid = X_valid.astype("float32")
+
+    mean_image = np.mean(X_train, axis=0)
+    X_train -= mean_image
+    X_valid -= mean_image
+    X_train /= 128.0
+    X_valid /= 128.0
+
+    model.fit(
+        X_train,
+        Y_train,
+        batch_size=batch_size,
+        epochs=epoch,
+        validation_data=(X_valid, Y_valid),
+        shuffle=True,
+    )
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Train resnet50 cifar10")
+    parser.add_argument("--batch_size", type=int, default=128)
+    parser.add_argument("--epoch", type=int, default=3)
+    parser.add_argument("--model_dir", type=str, default="./model_keras_resnet")
+    opt = parser.parse_args()
+
+    model = ResNet50(weights=None, input_shape=(32, 32, 3), classes=10)
+    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
+
+    # start the training.
+    train(opt.batch_size, opt.epoch, model)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/examples/tensorflow/scripts/distributed_training/horovod_mnist_estimator.py b/examples/tensorflow/scripts/distributed_training/horovod_mnist_estimator.py
deleted file mode 100644
index a9ac4488f..000000000
--- a/examples/tensorflow/scripts/distributed_training/horovod_mnist_estimator.py
+++ /dev/null
@@ -1,268 +0,0 @@
-# Copyright 2018 Uber Technologies, Inc. All Rights Reserved.
-# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Convolutional Neural Network Estimator for MNIST, built with tf.layers."""
-
-# Future
-from __future__ import absolute_import, division, print_function
-
-# Standard Library
-import argparse
-import errno
-import os
-
-# Third Party
-import horovod.tensorflow as hvd
-import numpy as np
-import tensorflow as tf
-from tensorflow import keras
-
-# First Party
-import smdebug.tensorflow as smd
-
-tf.logging.set_verbosity(tf.logging.INFO)
-
-
-def cnn_model_fn(features, labels, mode):
- """Model function for CNN."""
- # Input Layer
- # Reshape X to 4-D tensor: [batch_size, width, height, channels]
- # MNIST images are 28x28 pixels, and have one color channel
- input_layer = tf.reshape(features["x"], [-1, 28, 28, 1])
-
- # Convolutional Layer #1
- # Computes 32 features using a 5x5 filter with ReLU activation.
- # Padding is added to preserve width and height.
- # Input Tensor Shape: [batch_size, 28, 28, 1]
- # Output Tensor Shape: [batch_size, 28, 28, 32]
- conv1 = tf.layers.conv2d(
- inputs=input_layer, filters=32, kernel_size=[5, 5], padding="same", activation=tf.nn.relu
- )
-
- # Pooling Layer #1
- # First max pooling layer with a 2x2 filter and stride of 2
- # Input Tensor Shape: [batch_size, 28, 28, 32]
- # Output Tensor Shape: [batch_size, 14, 14, 32]
- pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)
-
- # Convolutional Layer #2
- # Computes 64 features using a 5x5 filter.
- # Padding is added to preserve width and height.
- # Input Tensor Shape: [batch_size, 14, 14, 32]
- # Output Tensor Shape: [batch_size, 14, 14, 64]
- conv2 = tf.layers.conv2d(
- inputs=pool1, filters=64, kernel_size=[5, 5], padding="same", activation=tf.nn.relu
- )
-
- # Pooling Layer #2
- # Second max pooling layer with a 2x2 filter and stride of 2
- # Input Tensor Shape: [batch_size, 14, 14, 64]
- # Output Tensor Shape: [batch_size, 7, 7, 64]
- pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)
-
- # Flatten tensor into a batch of vectors
- # Input Tensor Shape: [batch_size, 7, 7, 64]
- # Output Tensor Shape: [batch_size, 7 * 7 * 64]
- pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 64])
-
- # Dense Layer
- # Densely connected layer with 1024 neurons
- # Input Tensor Shape: [batch_size, 7 * 7 * 64]
- # Output Tensor Shape: [batch_size, 1024]
- dense = tf.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.relu)
-
- # Add dropout operation; 0.6 probability that element will be kept
- dropout = tf.layers.dropout(
- inputs=dense, rate=0.4, training=mode == tf.estimator.ModeKeys.TRAIN
- )
-
- # Logits layer
- # Input Tensor Shape: [batch_size, 1024]
- # Output Tensor Shape: [batch_size, 10]
- logits = tf.layers.dense(inputs=dropout, units=10)
-
- predictions = {
- # Generate predictions (for PREDICT and EVAL mode)
- "classes": tf.argmax(input=logits, axis=1),
- # Add `softmax_tensor` to the graph. It is used for PREDICT and by the
- # `logging_hook`.
- "probabilities": tf.nn.softmax(logits, name="softmax_tensor"),
- }
- if mode == tf.estimator.ModeKeys.PREDICT:
- return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
-
- # Calculate Loss (for both TRAIN and EVAL modes)
- onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=10)
- loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits)
-
- # Configure the Training Op (for TRAIN mode)
- if mode == tf.estimator.ModeKeys.TRAIN:
- # Horovod: scale learning rate by the number of workers.
- optimizer = tf.train.MomentumOptimizer(learning_rate=0.001 * hvd.size(), momentum=0.9)
-
- # Horovod: add Horovod Distributed Optimizer.
- optimizer = hvd.DistributedOptimizer(optimizer)
-
- # Add smdebug Optimizer
- optimizer = smd.get_hook().wrap_optimizer(optimizer)
-
- train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
- return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
-
- # Add evaluation metrics (for EVAL mode)
- eval_metric_ops = {
- "accuracy": tf.metrics.accuracy(labels=labels, predictions=predictions["classes"])
- }
- return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)
-
-
-def str2bool(v):
- if isinstance(v, bool):
- return v
-
- if v.lower() in ("yes", "true", "t", "y", "1"):
- return True
- elif v.lower() in ("no", "false", "f", "n", "0"):
- return False
- else:
- raise argparse.ArgumentTypeError("Boolean value expected.")
-
-
-def add_cli_args():
- cmdline = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
-
- cmdline.add_argument(
- "--steps", type=int, default=20000, help="""Number of training steps to run."""
- )
-
- cmdline.add_argument("--save_all", type=str2bool, default=True)
- cmdline.add_argument("--smdebug_path", type=str, default="/opt/ml/output/tensors")
- cmdline.add_argument("--save_frequency", type=int, help="How often to save TS data", default=10)
- cmdline.add_argument(
- "--reductions",
- type=str2bool,
- dest="reductions",
- default=False,
- help="save reductions of tensors instead of saving full tensors",
- )
-
- return cmdline
-
-
-def main(unused_argv):
- # Get commandline args
-
- cmdline = add_cli_args()
- FLAGS, unknown_args = cmdline.parse_known_args()
-
- # Horovod: initialize Horovod.
- hvd.init()
-
- # Keras automatically creates a cache directory in ~/.keras/datasets for
- # storing the downloaded MNIST data. This creates a race
- # condition among the workers that share the same filesystem. If the
- # directory already exists by the time this worker gets around to creating
- # it, ignore the resulting exception and continue.
- cache_dir = os.path.join(os.path.expanduser("~"), ".keras", "datasets")
- if not os.path.exists(cache_dir):
- try:
- os.mkdir(cache_dir)
- except OSError as e:
- if e.errno == errno.EEXIST and os.path.isdir(cache_dir):
- pass
- else:
- raise
-
- # Download and load MNIST dataset.
- (train_data, train_labels), (eval_data, eval_labels) = keras.datasets.mnist.load_data(
- "/tmp/MNIST-data-%d" % hvd.rank()
- )
-
- # The shape of downloaded data is (-1, 28, 28), hence we need to reshape it
- # into (-1, 784) to feed into our network. Also, need to normalize the
- # features between 0 and 1.
- train_data = np.reshape(train_data, (-1, 784)) / 255.0
- eval_data = np.reshape(eval_data, (-1, 784)) / 255.0
-
- # Horovod: pin GPU to be used to process local rank (one GPU per process)
- config = tf.ConfigProto()
- config.gpu_options.allow_growth = True
- config.gpu_options.visible_device_list = str(hvd.local_rank())
-
- # Horovod: save checkpoints only on worker 0 to prevent other workers from
- # corrupting them.
- model_dir = "./mnist_convnet_model" if hvd.rank() == 0 else None
-
- # Create the Estimator
- mnist_classifier = tf.estimator.Estimator(
- model_fn=cnn_model_fn,
- model_dir=model_dir,
- config=tf.estimator.RunConfig(session_config=config),
- )
-
- # Set up logging for predictions
- # Log the values in the "Softmax" tensor with label "probabilities"
- tensors_to_log = {"probabilities": "softmax_tensor"}
- logging_hook = tf.train.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=500)
-
- # Horovod: BroadcastGlobalVariablesHook broadcasts initial variable states from
- # rank 0 to all other processes. This is necessary to ensure consistent
- # initialization of all workers when training is started with random weights or
- # restored from a checkpoint.
- bcast_hook = hvd.BroadcastGlobalVariablesHook(0)
-
- # Train the model
- train_input_fn = tf.estimator.inputs.numpy_input_fn(
- x={"x": train_data}, y=train_labels, batch_size=100, num_epochs=None, shuffle=True
- )
-
- # Set up the smdebug hook
-
- # save tensors as reductions if necessary
- rdnc = (
- smd.ReductionConfig(reductions=["mean"], abs_reductions=["max"], norms=["l1"])
- if FLAGS.reductions
- else None
- )
-
- ts_hook = smd.SessionHook(
- out_dir=FLAGS.smdebug_path,
- save_all=FLAGS.save_all,
- include_collections=["weights", "gradients", "losses", "biases"],
- save_config=smd.SaveConfig(save_interval=FLAGS.save_frequency),
- reduction_config=rdnc,
- )
-
- ts_hook.set_mode(smd.modes.TRAIN)
-
- # Horovod: adjust number of steps based on number of GPUs.
- mnist_classifier.train(
- input_fn=train_input_fn,
- steps=FLAGS.steps // hvd.size(),
- hooks=[logging_hook, bcast_hook, ts_hook],
- )
-
- # Evaluate the model and print results
- eval_input_fn = tf.estimator.inputs.numpy_input_fn(
- x={"x": eval_data}, y=eval_labels, num_epochs=1, shuffle=False
- )
-
- ts_hook.set_mode(smd.modes.EVAL)
-
- eval_results = mnist_classifier.evaluate(input_fn=eval_input_fn, hooks=[ts_hook])
- print(eval_results)
-
-
-if __name__ == "__main__":
- tf.app.run()
diff --git a/examples/tensorflow/scripts/distributed_training/mirrored_strategy_mnist.py b/examples/tensorflow/scripts/distributed_training/mirrored_strategy_mnist.py
deleted file mode 100644
index e3c48318b..000000000
--- a/examples/tensorflow/scripts/distributed_training/mirrored_strategy_mnist.py
+++ /dev/null
@@ -1,268 +0,0 @@
-# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Convolutional Neural Network Estimator for MNIST, built with tf.layers."""
-
-# Future
-from __future__ import absolute_import, division, print_function
-
-# Standard Library
-import argparse
-
-# Third Party
-import numpy as np
-import tensorflow as tf
-from tensorflow.python.client import device_lib
-
-# First Party
-import smdebug.tensorflow as smd
-
-tf.logging.set_verbosity(tf.logging.INFO)
-
-
-def cnn_model_fn(features, labels, mode):
- """Model function for CNN."""
- # Input Layer
- # Reshape X to 4-D tensor: [batch_size, width, height, channels]
- # MNIST images are 28x28 pixels, and have one color channel
- input_layer = tf.reshape(features["x"], [-1, 28, 28, 1])
-
- # Convolutional Layer #1
- # Computes 32 features using a 5x5 filter with ReLU activation.
- # Padding is added to preserve width and height.
- # Input Tensor Shape: [batch_size, 28, 28, 1]
- # Output Tensor Shape: [batch_size, 28, 28, 32]
- conv1 = tf.layers.conv2d(
- inputs=input_layer, filters=32, kernel_size=[5, 5], padding="same", activation=tf.nn.relu
- )
-
- # Pooling Layer #1
- # First max pooling layer with a 2x2 filter and stride of 2
- # Input Tensor Shape: [batch_size, 28, 28, 32]
- # Output Tensor Shape: [batch_size, 14, 14, 32]
- pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)
-
- # Convolutional Layer #2
- # Computes 64 features using a 5x5 filter.
- # Padding is added to preserve width and height.
- # Input Tensor Shape: [batch_size, 14, 14, 32]
- # Output Tensor Shape: [batch_size, 14, 14, 64]
- conv2 = tf.layers.conv2d(
- inputs=pool1, filters=64, kernel_size=[5, 5], padding="same", activation=tf.nn.relu
- )
-
- # Pooling Layer #2
- # Second max pooling layer with a 2x2 filter and stride of 2
- # Input Tensor Shape: [batch_size, 14, 14, 64]
- # Output Tensor Shape: [batch_size, 7, 7, 64]
- pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)
-
- # Flatten tensor into a batch of vectors
- # Input Tensor Shape: [batch_size, 7, 7, 64]
- # Output Tensor Shape: [batch_size, 7 * 7 * 64]
- pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 64])
-
- # Dense Layer
- # Densely connected layer with 1024 neurons
- # Input Tensor Shape: [batch_size, 7 * 7 * 64]
- # Output Tensor Shape: [batch_size, 1024]
- dense = tf.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.relu)
-
- # Add dropout operation; 0.6 probability that element will be kept
- dropout = tf.layers.dropout(
- inputs=dense, rate=0.4, training=mode == tf.estimator.ModeKeys.TRAIN
- )
-
- # Logits layer
- # Input Tensor Shape: [batch_size, 1024]
- # Output Tensor Shape: [batch_size, 10]
- logits = tf.layers.dense(inputs=dropout, units=10)
-
- predictions = {
- # Generate predictions (for PREDICT and EVAL mode)
- "classes": tf.argmax(input=logits, axis=1),
- # Add `softmax_tensor` to the graph. It is used for PREDICT and by the
- # `logging_hook`.
- "probabilities": tf.nn.softmax(logits, name="softmax_tensor"),
- }
- if mode == tf.estimator.ModeKeys.PREDICT:
- return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
-
- # Calculate Loss (for both TRAIN and EVAL modes)
- loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
-
- # Configure the Training Op (for TRAIN mode)
- if mode == tf.estimator.ModeKeys.TRAIN:
- optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
- optimizer = smd.get_hook().wrap_optimizer(optimizer)
- train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
- return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
-
- # Add evaluation metrics (for EVAL mode)
- eval_metric_ops = {
- "accuracy": tf.metrics.accuracy(labels=labels, predictions=predictions["classes"])
- }
- return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)
-
-
-def per_device_batch_size(batch_size, num_gpus):
- """For multi-gpu, batch-size must be a multiple of the number of GPUs.
- Note that this should eventually be handled by DistributionStrategies
- directly. Multi-GPU support is currently experimental, however,
- so doing the work here until that feature is in place.
- Args:
- batch_size: Global batch size to be divided among devices. This should be
- equal to num_gpus times the single-GPU batch_size for multi-gpu training.
- num_gpus: How many GPUs are used with DistributionStrategies.
- Returns:
- Batch size per device.
- Raises:
- ValueError: if batch_size is not divisible by number of devices
- """
- if num_gpus <= 1:
- return batch_size
-
- remainder = batch_size % num_gpus
- if remainder:
- err = (
- "When running with multiple GPUs, batch size "
- "must be a multiple of the number of available GPUs. Found {} "
- "GPUs with a batch size of {}; try --batch_size={} instead."
- ).format(num_gpus, batch_size, batch_size - remainder)
- raise ValueError(err)
- return int(batch_size / num_gpus)
-
-
-class InputFnProvider:
- def __init__(self, train_batch_size):
- self.train_batch_size = train_batch_size
- self.__load_data()
-
- def __load_data(self):
- # Load training and eval data
- mnist = tf.contrib.learn.datasets.load_dataset("mnist")
- self.train_data = mnist.train.images # Returns np.array
- self.train_labels = np.asarray(mnist.train.labels, dtype=np.int32)
- self.eval_data = mnist.test.images # Returns np.array
- self.eval_labels = np.asarray(mnist.test.labels, dtype=np.int32)
-
- def train_input_fn(self):
- """An input function for training"""
- # Shuffle, repeat, and batch the examples.
- dataset = tf.data.Dataset.from_tensor_slices(({"x": self.train_data}, self.train_labels))
- dataset = dataset.shuffle(1000).repeat().batch(self.train_batch_size)
- return dataset
-
- def eval_input_fn(self):
- """An input function for evaluation or prediction"""
- dataset = tf.data.Dataset.from_tensor_slices(({"x": self.eval_data}, self.eval_labels))
- dataset = dataset.batch(1)
- return dataset
-
-
-def str2bool(v):
- if isinstance(v, bool):
- return v
-
- if v.lower() in ("yes", "true", "t", "y", "1"):
- return True
- elif v.lower() in ("no", "false", "f", "n", "0"):
- return False
- else:
- raise argparse.ArgumentTypeError("Boolean value expected.")
-
-
-def add_cli_args():
- cmdline = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
-
- cmdline.add_argument(
- "--steps", type=int, default=20000, help="""Number of training steps to run."""
- )
-
- cmdline.add_argument("--save_all", type=str2bool, default=True)
- cmdline.add_argument("--smdebug_path", type=str, default="/opt/ml/output/tensors")
- cmdline.add_argument("--save_frequency", type=int, help="How often to save TS data", default=10)
- cmdline.add_argument(
- "--reductions",
- type=str2bool,
- dest="reductions",
- default=False,
- help="save reductions of tensors instead of saving full tensors",
- )
-
- return cmdline
-
-
-def get_available_gpus():
- local_device_protos = device_lib.list_local_devices()
- return len([x.name for x in local_device_protos if x.device_type == "GPU"])
-
-
-def main(unused_argv):
- num_gpus = get_available_gpus()
- batch_size = 10 * num_gpus
-
- cmdline = add_cli_args()
- FLAGS, unknown_args = cmdline.parse_known_args()
-
- # input_fn which serves Dataset
- input_fn_provider = InputFnProvider(per_device_batch_size(batch_size, num_gpus))
-
- # Use multiple GPUs by MirroredStrategy.
- # All available GPUs will be used if `num_gpus` is omitted.
- if num_gpus > 1:
- distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=num_gpus)
- print("### Doing Multi GPU Training")
- else:
- distribution = None
- # Pass to RunConfig
- config = tf.estimator.RunConfig(
- train_distribute=distribution, model_dir="/tmp/mnist_convnet_model"
- )
-
- # save tensors as reductions if necessary
- rdnc = (
- smd.ReductionConfig(reductions=["mean"], abs_reductions=["max"], norms=["l1"])
- if FLAGS.reductions
- else None
- )
-
- ts_hook = smd.SessionHook(
- out_dir=FLAGS.smdebug_path,
- save_all=FLAGS.save_all,
- include_collections=["weights", "gradients", "losses", "biases"],
- save_config=smd.SaveConfig(save_interval=FLAGS.save_frequency),
- reduction_config=rdnc,
- )
-
- ts_hook.set_mode(smd.modes.TRAIN)
-
- # Create the Estimator
- # pass RunConfig
- mnist_classifier = tf.estimator.Estimator(model_fn=cnn_model_fn, config=config)
-
- # Train the model
- mnist_classifier.train(
- input_fn=input_fn_provider.train_input_fn, steps=FLAGS.steps, hooks=[ts_hook]
- )
-
- ts_hook.set_mode(smd.modes.EVAL)
- # Evaluate the model and print results
- eval_results = mnist_classifier.evaluate(
- input_fn=input_fn_provider.eval_input_fn, hooks=[ts_hook]
- )
- print(eval_results)
-
-
-if __name__ == "__main__":
- tf.app.run()
diff --git a/examples/tensorflow/scripts/distributed_training/parameter_server_training/hostfile.txt b/examples/tensorflow/scripts/distributed_training/parameter_server_training/hostfile.txt
deleted file mode 100644
index abe577993..000000000
--- a/examples/tensorflow/scripts/distributed_training/parameter_server_training/hostfile.txt
+++ /dev/null
@@ -1,3 +0,0 @@
-172.31.26.105:6665
-172.31.26.105:6666
-172.31.26.105:6667
diff --git a/examples/tensorflow/scripts/distributed_training/parameter_server_training/parameter_server_mnist.py b/examples/tensorflow/scripts/distributed_training/parameter_server_training/parameter_server_mnist.py
deleted file mode 100644
index a57dfd19a..000000000
--- a/examples/tensorflow/scripts/distributed_training/parameter_server_training/parameter_server_mnist.py
+++ /dev/null
@@ -1,276 +0,0 @@
-# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Convolutional Neural Network Estimator for MNIST, built with tf.layers."""
-
-# Future
-from __future__ import absolute_import, division, print_function
-
-# Standard Library
-import argparse
-import json
-import os
-
-# Third Party
-import numpy as np
-import tensorflow as tf
-from tensorflow.python.client import device_lib
-
-# First Party
-import smdebug.tensorflow as smd
-
-tf.logging.set_verbosity(tf.logging.INFO)
-
-
-def cnn_model_fn(features, labels, mode):
- """Model function for CNN."""
- input_layer = tf.reshape(features["x"], [-1, 28, 28, 1])
-
- conv1 = tf.layers.conv2d(
- inputs=input_layer, filters=32, kernel_size=[5, 5], padding="same", activation=tf.nn.relu
- )
-
- pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)
-
- conv2 = tf.layers.conv2d(
- inputs=pool1, filters=64, kernel_size=[5, 5], padding="same", activation=tf.nn.relu
- )
-
- pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)
-
- pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 64])
-
- dense = tf.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.relu)
-
- dropout = tf.layers.dropout(
- inputs=dense, rate=0.4, training=mode == tf.estimator.ModeKeys.TRAIN
- )
-
- logits = tf.layers.dense(inputs=dropout, units=10)
-
- predictions = {
- # Generate predictions (for PREDICT and EVAL mode)
- "classes": tf.argmax(input=logits, axis=1),
- # Add `softmax_tensor` to the graph. It is used for PREDICT and by the
- # `logging_hook`.
- "probabilities": tf.nn.softmax(logits, name="softmax_tensor"),
- }
- if mode == tf.estimator.ModeKeys.PREDICT:
- return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
-
- # Calculate Loss (for both TRAIN and EVAL modes)
- loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
-
- # Configure the Training Op (for TRAIN mode)
- if mode == tf.estimator.ModeKeys.TRAIN:
- optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
- optimizer = smd.get_hook().wrap_optimizer(optimizer)
- train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
- return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
-
- # Add evaluation metrics (for EVAL mode)
- eval_metric_ops = {
- "accuracy": tf.metrics.accuracy(labels=labels, predictions=predictions["classes"])
- }
- return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)
-
-
-def per_device_batch_size(batch_size, num_gpus):
- """For multi-gpu, batch-size must be a multiple of the number of GPUs.
- Note that this should eventually be handled by DistributionStrategies
- directly. Multi-GPU support is currently experimental, however,
- so doing the work here until that feature is in place.
- Args:
- batch_size: Global batch size to be divided among devices. This should be
- equal to num_gpus times the single-GPU batch_size for multi-gpu training.
- num_gpus: How many GPUs are used with DistributionStrategies.
- Returns:
- Batch size per device.
- Raises:
- ValueError: if batch_size is not divisible by number of devices
- """
- if num_gpus <= 1:
- return batch_size
-
- remainder = batch_size % num_gpus
- if remainder:
- err = (
- "When running with multiple GPUs, batch size "
- "must be a multiple of the number of available GPUs. Found {} "
- "GPUs with a batch size of {}; try --batch_size={} instead."
- ).format(num_gpus, batch_size, batch_size - remainder)
- raise ValueError(err)
- return int(batch_size / num_gpus)
-
-
-class InputFnProvider:
- def __init__(self, train_batch_size):
- self.train_batch_size = train_batch_size
- self.__load_data()
-
- def __load_data(self):
- # Load training and eval data
- mnist = tf.contrib.learn.datasets.load_dataset("mnist")
- self.train_data = mnist.train.images # Returns np.array
- self.train_labels = np.asarray(mnist.train.labels, dtype=np.int32)
- self.eval_data = mnist.test.images # Returns np.array
- self.eval_labels = np.asarray(mnist.test.labels, dtype=np.int32)
-
- def train_input_fn(self):
- """An input function for training"""
- # Shuffle, repeat, and batch the examples.
- dataset = tf.data.Dataset.from_tensor_slices(({"x": self.train_data}, self.train_labels))
- dataset = dataset.shuffle(1000).repeat().batch(self.train_batch_size)
- return dataset
-
- def eval_input_fn(self):
- """An input function for evaluation or prediction"""
- dataset = tf.data.Dataset.from_tensor_slices(({"x": self.eval_data}, self.eval_labels))
- dataset = dataset.batch(1)
- return dataset
-
-
-def str2bool(v):
- if isinstance(v, bool):
- return v
-
- if v.lower() in ("yes", "true", "t", "y", "1"):
- return True
- elif v.lower() in ("no", "false", "f", "n", "0"):
- return False
- else:
- raise argparse.ArgumentTypeError("Boolean value expected.")
-
-
-def add_cli_args():
- cmdline = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
-
- cmdline.add_argument(
- "--steps", type=int, default=20000, help="""Number of training steps to run."""
- )
-
- cmdline.add_argument("--save_all", type=str2bool, default=True)
- cmdline.add_argument("--smdebug_path", type=str, default="/opt/ml/output/tensors")
- cmdline.add_argument("--save_frequency", type=int, help="How often to save TS data", default=10)
- cmdline.add_argument(
- "--reductions",
- type=str2bool,
- dest="reductions",
- default=False,
- help="save reductions of tensors instead of saving full tensors",
- )
-
- cmdline.add_argument(
- "--node_type", type=str, required=True, dest="node_type", help="node type: worker or ps"
- )
-
- cmdline.add_argument(
- "--task_index", type=int, required=True, dest="task_index", help="task index"
- )
-
- cmdline.add_argument(
- "--hostfile",
- default=None,
- type=str,
- required=False,
- dest="hostfile",
- help="Path to hostfile",
- )
-
- return cmdline
-
-
-def get_available_gpus():
- local_device_protos = device_lib.list_local_devices()
- return len([x.name for x in local_device_protos if x.device_type == "GPU"])
-
-
-def main(unused_argv):
- num_gpus = get_available_gpus()
- batch_size = 10 * num_gpus
-
- cmdline = add_cli_args()
- FLAGS, unknown_args = cmdline.parse_known_args()
-
- # input_fn which serves Dataset
- input_fn_provider = InputFnProvider(per_device_batch_size(batch_size, num_gpus))
-
- # Use multiple GPUs by ParameterServerStrategy.
- # All available GPUs will be used if `num_gpus` is omitted.
-
- if num_gpus > 1:
- strategy = tf.distribute.experimental.ParameterServerStrategy()
- if not os.getenv("TF_CONFIG"):
- if FLAGS.hostfile is None:
- raise Exception("--hostfile not provided and TF_CONFIG not set. Please do either.")
- nodes = list()
- try:
- f = open(FLAGS.hostfile)
- for line in f.readlines():
- nodes.append(line.strip())
- except OSError as e:
- print(e.errno)
-
- os.environ["TF_CONFIG"] = json.dumps(
- {
- "cluster": {"worker": [nodes[0], nodes[1]], "ps": [nodes[2]]},
- "task": {"type": FLAGS.node_type, "index": FLAGS.task_index},
- }
- )
-
- print("### Doing Multi GPU Training")
- else:
- strategy = None
- # Pass to RunConfig
- config = tf.estimator.RunConfig(train_distribute=strategy)
-
- # save tensors as reductions if necessary
- rdnc = (
- smd.ReductionConfig(reductions=["mean"], abs_reductions=["max"], norms=["l1"])
- if FLAGS.reductions
- else None
- )
-
- ts_hook = smd.SessionHook(
- out_dir=FLAGS.smdebug_path,
- save_all=FLAGS.save_all,
- include_collections=["weights", "gradients", "losses", "biases"],
- save_config=smd.SaveConfig(save_interval=FLAGS.save_frequency),
- reduction_config=rdnc,
- )
-
- ts_hook.set_mode(smd.modes.TRAIN)
-
- # Create the Estimator
- # pass RunConfig
- mnist_classifier = tf.estimator.Estimator(model_fn=cnn_model_fn, config=config)
-
- hooks = list()
- hooks.append(ts_hook)
-
- train_spec = tf.estimator.TrainSpec(
- input_fn=input_fn_provider.train_input_fn, max_steps=FLAGS.steps, hooks=hooks
- )
- eval_spec = tf.estimator.EvalSpec(
- input_fn=input_fn_provider.eval_input_fn, steps=FLAGS.steps, hooks=hooks
- )
-
- tf.estimator.train_and_evaluate(mnist_classifier, train_spec, eval_spec)
-
- # Evaluate the model and print results
- eval_results = mnist_classifier.evaluate(input_fn=input_fn_provider.eval_input_fn)
- print(eval_results)
-
-
-if __name__ == "__main__":
- tf.app.run()
diff --git a/examples/tensorflow/scripts/keras.py b/examples/tensorflow/scripts/keras.py
deleted file mode 100644
index 0fe13d919..000000000
--- a/examples/tensorflow/scripts/keras.py
+++ /dev/null
@@ -1,84 +0,0 @@
-# Future
-from __future__ import absolute_import, division, print_function, unicode_literals
-
-# Third Party
-import tensorflow as tf
-import tensorflow_datasets as tfds
-
-# First Party
-from smdebug.core.collection import CollectionKeys
-from smdebug.tensorflow import KerasHook
-
-tfds.disable_progress_bar()
-
-
-def train_model():
- print(tf.__version__)
-
- datasets, info = tfds.load(name="mnist", with_info=True, as_supervised=True)
-
- mnist_train, mnist_test = datasets["train"], datasets["test"]
-
- strategy = tf.distribute.MirroredStrategy()
-
- # You can also do info.splits.total_num_examples to get the total
- # number of examples in the dataset.
-
- num_train_examples = info.splits["train"].num_examples
- num_test_examples = info.splits["test"].num_examples
-
- BUFFER_SIZE = 10000
-
- BATCH_SIZE_PER_REPLICA = 64
- BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
-
- def scale(image, label):
- image = tf.cast(image, tf.float32)
- image /= 255
-
- return image, label
-
- train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
- eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)
-
- hook = KerasHook(
- out_dir="/tmp/ts_outputs/",
- include_collections=[
- # CollectionKeys.WEIGHTS,
- # CollectionKeys.GRADIENTS,
- # CollectionKeys.OPTIMIZER_VARIABLES,
- CollectionKeys.DEFAULT,
- # CollectionKeys.METRICS,
- # CollectionKeys.LOSSES,
- # CollectionKeys.OUTPUTS,
- # CollectionKeys.SCALARS,
- ],
- save_all=True,
- )
-
- with strategy.scope():
- model = tf.keras.Sequential(
- [
- tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
- tf.keras.layers.MaxPooling2D(),
- tf.keras.layers.Flatten(),
- tf.keras.layers.Dense(64, activation="relu"),
- tf.keras.layers.Dense(10, activation="softmax"),
- ]
- )
- model.compile(
- loss="sparse_categorical_crossentropy",
- optimizer=hook.wrap_optimizer(tf.keras.optimizers.Adam()),
- metrics=["accuracy"],
- )
-
- # get_collection('default').include('Relu')
-
- callbacks = [
- hook
- # tf.keras.callbacks.TensorBoard(log_dir='/tmp/logs'),
- ]
-
- model.fit(train_dataset, epochs=1, callbacks=callbacks)
- model.predict(eval_dataset, callbacks=callbacks)
- model.fit(train_dataset, epochs=1, callbacks=callbacks)
diff --git a/examples/tensorflow/scripts/train_imagenet_resnet_hvd.py b/examples/tensorflow/scripts/train_imagenet_resnet_hvd.py
deleted file mode 100644
index f1a1ae098..000000000
--- a/examples/tensorflow/scripts/train_imagenet_resnet_hvd.py
+++ /dev/null
@@ -1,1502 +0,0 @@
-#!/usr/bin/env python
-# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
-#
-# Redistribution and use in source and binary forms, with or without
-# modification, are permitted provided that the following conditions
-# are met:
-# * Redistributions of source code must retain the above copyright
-# notice, this list of conditions and the following disclaimer.
-# * Redistributions in binary form must reproduce the above copyright
-# notice, this list of conditions and the following disclaimer in the
-# documentation and/or other materials provided with the distribution.
-# * Neither the name of NVIDIA CORPORATION nor the names of its
-# contributors may be used to endorse or promote products derived
-# from this software without specific prior written permission.
-#
-# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
-# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
-# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
-# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
-# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
-# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
-# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-
-# Future
-from __future__ import print_function
-
-# Standard Library
-import argparse
-import logging
-import math
-import os
-import random
-import re
-import shutil
-import sys
-import time
-from operator import itemgetter
-
-# Third Party
-import horovod.tensorflow as hvd
-import numpy as np
-import tensorflow as tf
-from tensorflow.contrib.image.python.ops import distort_image_ops
-from tensorflow.python.ops import data_flow_ops
-from tensorflow.python.util import nest
-
-# First Party
-import smdebug.tensorflow as smd
-
-try:
- from builtins import range
-except ImportError:
- pass
-
-
-def rank0log(logger, *args, **kwargs):
- if hvd.rank() == 0:
- if logger:
- logger.info("".join([str(x) for x in list(args)]))
- else:
- print(*args, **kwargs)
-
-
-class LayerBuilder(object):
- def __init__(
- self,
- activation=None,
- data_format="channels_last",
- training=False,
- use_batch_norm=False,
- batch_norm_config=None,
- conv_initializer=None,
- adv_bn_init=False,
- ):
- self.activation = activation
- self.data_format = data_format
- self.training = training
- self.use_batch_norm = use_batch_norm
- self.batch_norm_config = batch_norm_config
- self.conv_initializer = conv_initializer
- self.adv_bn_init = adv_bn_init
- if self.batch_norm_config is None:
- self.batch_norm_config = {
- "decay": 0.9,
- "epsilon": 1e-4,
- "scale": True,
- "zero_debias_moving_mean": False,
- }
-
- def _conv2d(self, inputs, activation, *args, **kwargs):
- x = tf.layers.conv2d(
- inputs,
- data_format=self.data_format,
- use_bias=not self.use_batch_norm,
- kernel_initializer=self.conv_initializer,
- activation=None if self.use_batch_norm else activation,
- *args,
- **kwargs
- )
- if self.use_batch_norm:
- x = self.batch_norm(x)
- x = activation(x) if activation is not None else x
- return x
-
- def conv2d_linear_last_bn(self, inputs, *args, **kwargs):
- x = tf.layers.conv2d(
- inputs,
- data_format=self.data_format,
- use_bias=False,
- kernel_initializer=self.conv_initializer,
- activation=None,
- *args,
- **kwargs
- )
- param_initializers = {
- "moving_mean": tf.zeros_initializer(),
- "moving_variance": tf.ones_initializer(),
- "beta": tf.zeros_initializer(),
- }
- if self.adv_bn_init:
- param_initializers["gamma"] = tf.zeros_initializer()
- else:
- param_initializers["gamma"] = tf.ones_initializer()
- x = self.batch_norm(x, param_initializers=param_initializers)
- return x
-
- def conv2d_linear(self, inputs, *args, **kwargs):
- return self._conv2d(inputs, None, *args, **kwargs)
-
- def conv2d(self, inputs, *args, **kwargs):
- return self._conv2d(inputs, self.activation, *args, **kwargs)
-
- def pad2d(self, inputs, begin, end=None):
- if end is None:
- end = begin
- try:
- _ = begin[1]
- except TypeError:
- begin = [begin, begin]
- try:
- _ = end[1]
- except TypeError:
- end = [end, end]
- if self.data_format == "channels_last":
- padding = [[0, 0], [begin[0], end[0]], [begin[1], end[1]], [0, 0]]
- else:
- padding = [[0, 0], [0, 0], [begin[0], end[0]], [begin[1], end[1]]]
- return tf.pad(inputs, padding)
-
- def max_pooling2d(self, inputs, *args, **kwargs):
- return tf.layers.max_pooling2d(inputs, data_format=self.data_format, *args, **kwargs)
-
- def average_pooling2d(self, inputs, *args, **kwargs):
- return tf.layers.average_pooling2d(inputs, data_format=self.data_format, *args, **kwargs)
-
- def dense_linear(self, inputs, units, **kwargs):
- return tf.layers.dense(inputs, units, activation=None)
-
- def dense(self, inputs, units, **kwargs):
- return tf.layers.dense(inputs, units, activation=self.activation)
-
- def activate(self, inputs, activation=None):
- activation = activation or self.activation
- return activation(inputs) if activation is not None else inputs
-
- def batch_norm(self, inputs, **kwargs):
- all_kwargs = dict(self.batch_norm_config)
- all_kwargs.update(kwargs)
- data_format = "NHWC" if self.data_format == "channels_last" else "NCHW"
- return tf.contrib.layers.batch_norm(
- inputs, is_training=self.training, data_format=data_format, fused=True, **all_kwargs
- )
-
- def spatial_average2d(self, inputs):
- shape = inputs.get_shape().as_list()
- if self.data_format == "channels_last":
- n, h, w, c = shape
- else:
- n, c, h, w = shape
- n = -1 if n is None else n
- x = tf.layers.average_pooling2d(inputs, (h, w), (1, 1), data_format=self.data_format)
- return tf.reshape(x, [n, c])
-
- def flatten2d(self, inputs):
- x = inputs
- if self.data_format != "channels_last":
- # Note: This ensures the output order matches that of NHWC networks
- x = tf.transpose(x, [0, 2, 3, 1])
- input_shape = x.get_shape().as_list()
- num_inputs = 1
- for dim in input_shape[1:]:
- num_inputs *= dim
- return tf.reshape(x, [-1, num_inputs], name="flatten")
-
- def residual2d(self, inputs, network, units=None, scale=1.0, activate=False):
- outputs = network(inputs)
- c_axis = -1 if self.data_format == "channels_last" else 1
- h_axis = 1 if self.data_format == "channels_last" else 2
- w_axis = h_axis + 1
- ishape, oshape = [y.get_shape().as_list() for y in [inputs, outputs]]
- ichans, ochans = ishape[c_axis], oshape[c_axis]
- strides = (
- (ishape[h_axis] - 1) // oshape[h_axis] + 1,
- (ishape[w_axis] - 1) // oshape[w_axis] + 1,
- )
- with tf.name_scope("residual"):
- if ochans != ichans or strides[0] != 1 or strides[1] != 1:
- inputs = self.conv2d_linear(inputs, units, 1, strides, "SAME")
- x = inputs + scale * outputs
- if activate:
- x = self.activate(x)
- return x
-
-
-def resnet_bottleneck_v1(builder, inputs, depth, depth_bottleneck, stride, basic=False):
- num_inputs = inputs.get_shape().as_list()[1]
- x = inputs
- with tf.name_scope("resnet_v1"):
- if depth == num_inputs:
- if stride == 1:
- shortcut = x
- else:
- shortcut = builder.max_pooling2d(x, 1, stride)
- else:
- shortcut = builder.conv2d_linear(x, depth, 1, stride, "SAME")
- if basic:
- x = builder.pad2d(x, 1)
- x = builder.conv2d(x, depth_bottleneck, 3, stride, "VALID")
- x = builder.conv2d_linear(x, depth, 3, 1, "SAME")
- else:
- x = builder.conv2d(x, depth_bottleneck, 1, 1, "SAME")
- x = builder.conv2d(x, depth_bottleneck, 3, stride, "SAME")
- # x = builder.conv2d_linear(x, depth, 1, 1, 'SAME')
- x = builder.conv2d_linear_last_bn(x, depth, 1, 1, "SAME")
- x = tf.nn.relu(x + shortcut)
- smd.get_hook().add_to_collection("relu_activations", x)
- return x
-
-
-def inference_resnet_v1_impl(builder, inputs, layer_counts, basic=False):
- x = inputs
- x = builder.pad2d(x, 3)
- x = builder.conv2d(x, 64, 7, 2, "VALID")
- x = builder.max_pooling2d(x, 3, 2, "SAME")
- for i in range(layer_counts[0]):
- x = resnet_bottleneck_v1(builder, x, 256, 64, 1, basic)
- for i in range(layer_counts[1]):
- x = resnet_bottleneck_v1(builder, x, 512, 128, 2 if i == 0 else 1, basic)
- for i in range(layer_counts[2]):
- x = resnet_bottleneck_v1(builder, x, 1024, 256, 2 if i == 0 else 1, basic)
- for i in range(layer_counts[3]):
- x = resnet_bottleneck_v1(builder, x, 2048, 512, 2 if i == 0 else 1, basic)
- return builder.spatial_average2d(x)
-
-
-def inference_resnet_v1(
- inputs,
- nlayer,
- data_format="channels_last",
- training=False,
- conv_initializer=None,
- adv_bn_init=False,
-):
- """Deep Residual Networks family of models
- https://arxiv.org/abs/1512.03385
- """
- builder = LayerBuilder(
- tf.nn.relu,
- data_format,
- training,
- use_batch_norm=True,
- conv_initializer=conv_initializer,
- adv_bn_init=adv_bn_init,
- )
- if nlayer == 18:
- return inference_resnet_v1_impl(builder, inputs, [2, 2, 2, 2], basic=True)
- elif nlayer == 34:
- return inference_resnet_v1_impl(builder, inputs, [3, 4, 6, 3], basic=True)
- elif nlayer == 50:
- return inference_resnet_v1_impl(builder, inputs, [3, 4, 6, 3])
- elif nlayer == 101:
- return inference_resnet_v1_impl(builder, inputs, [3, 4, 23, 3])
- elif nlayer == 152:
- return inference_resnet_v1_impl(builder, inputs, [3, 8, 36, 3])
- else:
- raise ValueError("Invalid nlayer (%i); must be one of: 18,34,50,101,152" % nlayer)
-
-
-def get_model_func(model_name):
- if model_name.startswith("resnet"):
- nlayer = int(model_name[len("resnet") :])
- return lambda images, *args, **kwargs: inference_resnet_v1(images, nlayer, *args, **kwargs)
- else:
- raise ValueError("Invalid model type: %s" % model_name)
-
-
-def deserialize_image_record(record):
- feature_map = {
- "image/encoded": tf.FixedLenFeature([], tf.string, ""),
- "image/class/label": tf.FixedLenFeature([1], tf.int64, -1),
- "image/class/text": tf.FixedLenFeature([], tf.string, ""),
- "image/object/bbox/xmin": tf.VarLenFeature(dtype=tf.float32),
- "image/object/bbox/ymin": tf.VarLenFeature(dtype=tf.float32),
- "image/object/bbox/xmax": tf.VarLenFeature(dtype=tf.float32),
- "image/object/bbox/ymax": tf.VarLenFeature(dtype=tf.float32),
- }
- with tf.name_scope("deserialize_image_record"):
- obj = tf.parse_single_example(record, feature_map)
- imgdata = obj["image/encoded"]
- label = tf.cast(obj["image/class/label"], tf.int32)
- bbox = tf.stack(
- [obj["image/object/bbox/%s" % x].values for x in ["ymin", "xmin", "ymax", "xmax"]]
- )
- bbox = tf.transpose(tf.expand_dims(bbox, 0), [0, 2, 1])
- text = obj["image/class/text"]
- return imgdata, label, bbox, text
-
-
-def decode_jpeg(imgdata, channels=3):
- return tf.image.decode_jpeg(
- imgdata, channels=channels, fancy_upscaling=False, dct_method="INTEGER_FAST"
- )
-
-
-def crop_and_resize_image(image, original_bbox, height, width, distort=False, nsummary=10):
- with tf.name_scope("crop_and_resize"):
- # Evaluation is done on a center-crop of this ratio
- eval_crop_ratio = 0.8
- if distort:
- initial_shape = [
- int(round(height / eval_crop_ratio)),
- int(round(width / eval_crop_ratio)),
- 3,
- ]
- bbox_begin, bbox_size, bbox = tf.image.sample_distorted_bounding_box(
- initial_shape,
- bounding_boxes=tf.constant([0.0, 0.0, 1.0, 1.0], dtype=tf.float32, shape=[1, 1, 4]),
- min_object_covered=0.1,
- aspect_ratio_range=[3.0 / 4.0, 4.0 / 3.0],
- area_range=[0.08, 1.0],
- max_attempts=100,
- seed=11 * hvd.rank(),
- # Need to set for deterministic results
- use_image_if_no_bounding_boxes=True,
- )
- bbox = bbox[0, 0] # Remove batch, box_idx dims
- else:
- # Central crop
- ratio_y = ratio_x = eval_crop_ratio
- bbox = tf.constant(
- [0.5 * (1 - ratio_y), 0.5 * (1 - ratio_x), 0.5 * (1 + ratio_y), 0.5 * (1 + ratio_x)]
- )
- image = tf.image.crop_and_resize(image[None, :, :, :], bbox[None, :], [0], [height, width])[
- 0
- ]
- return image
-
-
-def parse_and_preprocess_image_record(
- record,
- counter,
- height,
- width,
- brightness,
- contrast,
- saturation,
- hue,
- distort=False,
- nsummary=10,
- increased_aug=False,
-):
- imgdata, label, bbox, text = deserialize_image_record(record)
- label -= 1 # Change to 0-based (don't use background class)
- with tf.name_scope("preprocess_train"):
- try:
- image = decode_jpeg(imgdata, channels=3)
- except Exception:
- image = tf.image.decode_png(imgdata, channels=3)
- image = crop_and_resize_image(image, bbox, height, width, distort)
- if distort:
- image = tf.image.random_flip_left_right(image)
- if increased_aug:
- image = tf.image.random_brightness(image, max_delta=brightness)
- image = distort_image_ops.random_hsv_in_yiq(
- image,
- lower_saturation=saturation,
- upper_saturation=2.0 - saturation,
- max_delta_hue=hue * math.pi,
- )
- image = tf.image.random_contrast(image, lower=contrast, upper=2.0 - contrast)
- tf.summary.image("distorted_color_image", tf.expand_dims(image, 0))
- image = tf.clip_by_value(image, 0.0, 255.0)
- image = tf.cast(image, tf.uint8)
- return image, label
-
-
-def make_dataset(
- filenames,
- take_count,
- batch_size,
- height,
- width,
- brightness,
- contrast,
- saturation,
- hue,
- training=False,
- num_threads=10,
- nsummary=10,
- shard=False,
- synthetic=False,
- increased_aug=False,
-):
- if synthetic and training:
- input_shape = [height, width, 3]
- input_element = nest.map_structure(
- lambda s: tf.constant(0.5, tf.float32, s), tf.TensorShape(input_shape)
- )
- label_element = nest.map_structure(
- lambda s: tf.constant(1, tf.int32, s), tf.TensorShape([1])
- )
- element = (input_element, label_element)
- ds = tf.data.Dataset.from_tensors(element).repeat()
- else:
- shuffle_buffer_size = 10000
- num_readers = 1
- if hvd.size() > len(filenames):
- assert (hvd.size() % len(filenames)) == 0
- filenames = filenames * (hvd.size() // len(filenames))
-
- ds = tf.data.Dataset.from_tensor_slices(filenames)
- if shard:
- # split the dataset into parts for each GPU
- ds = ds.shard(hvd.size(), hvd.rank())
-
- if not training:
- # make sure all ranks have the same amount
- ds = ds.take(take_count)
-
- if training:
- ds = ds.shuffle(1000, seed=7 * (1 + hvd.rank()))
-
- ds = ds.interleave(tf.data.TFRecordDataset, cycle_length=num_readers, block_length=1)
- counter = tf.data.Dataset.range(sys.maxsize)
- ds = tf.data.Dataset.zip((ds, counter))
- preproc_func = lambda record, counter_: parse_and_preprocess_image_record(
- record,
- counter_,
- height,
- width,
- brightness,
- contrast,
- saturation,
- hue,
- distort=training,
- nsummary=nsummary if training else 0,
- increased_aug=increased_aug,
- )
- ds = ds.map(preproc_func, num_parallel_calls=num_threads)
- if training:
- ds = ds.apply(
- tf.data.experimental.shuffle_and_repeat(
- shuffle_buffer_size, seed=5 * (1 + hvd.rank())
- )
- )
- ds = ds.batch(batch_size)
- return ds
-
-
-def stage(tensors):
- """Stages the given tensors in a StagingArea for asynchronous put/get.
- """
- stage_area = data_flow_ops.StagingArea(
- dtypes=[tensor.dtype for tensor in tensors],
- shapes=[tensor.get_shape() for tensor in tensors],
- )
- put_op = stage_area.put(tensors)
- get_tensors = stage_area.get()
- tf.add_to_collection("STAGING_AREA_PUTS", put_op)
- return put_op, get_tensors
-
-
-class PrefillStagingAreasHook(tf.train.SessionRunHook):
- def after_create_session(self, session, coord):
- enqueue_ops = tf.get_collection("STAGING_AREA_PUTS")
- for i in range(len(enqueue_ops)):
- session.run(enqueue_ops[: i + 1])
-
-
-class LogSessionRunHook(tf.train.SessionRunHook):
- def __init__(self, global_batch_size, num_records, display_every=10, logger=None):
- self.global_batch_size = global_batch_size
- self.num_records = num_records
- self.display_every = display_every
- self.logger = logger
-
- def after_create_session(self, session, coord):
- rank0log(self.logger, " Step Epoch Speed Loss FinLoss LR")
- self.elapsed_secs = 0.0
- self.count = 0
-
- def before_run(self, run_context):
- self.t0 = time.time()
- return tf.train.SessionRunArgs(
- fetches=[tf.train.get_global_step(), "loss:0", "total_loss:0", "learning_rate:0"]
- )
-
- def after_run(self, run_context, run_values):
- self.elapsed_secs += time.time() - self.t0
- self.count += 1
- global_step, loss, total_loss, lr = run_values.results
- if global_step == 1 or global_step % self.display_every == 0:
- dt = self.elapsed_secs / self.count
- img_per_sec = self.global_batch_size / dt
- epoch = global_step * self.global_batch_size / self.num_records
- self.logger.info(
- "%6i %5.1f %7.1f %6.3f %6.3f %7.5f"
- % (global_step, epoch, img_per_sec, loss, total_loss, lr)
- )
- self.elapsed_secs = 0.0
- self.count = 0
-
-
-def _fp32_trainvar_getter(
- getter, name, shape=None, dtype=None, trainable=True, regularizer=None, *args, **kwargs
-):
- storage_dtype = tf.float32 if trainable else dtype
-
- bn_in_name = False
- for x in ["BatchNorm", "batchnorm", "batch_norm", "Batch_Norm"]:
- if x in name:
- bn_in_name = True
-
- if trainable and not bn_in_name:
- use_regularizer = regularizer
- else:
- use_regularizer = None
-
- variable = getter(
- name,
- shape,
- dtype=storage_dtype,
- trainable=trainable,
- regularizer=use_regularizer,
- *args,
- **kwargs
- )
- if trainable and dtype != tf.float32:
- cast_name = name + "/fp16_cast"
- try:
- cast_variable = tf.get_default_graph().get_tensor_by_name(cast_name + ":0")
- except KeyError:
- cast_variable = tf.cast(variable, dtype, name=cast_name)
- cast_variable._ref = variable._ref
- variable = cast_variable
- return variable
-
-
-def fp32_trainable_vars(name="fp32_vars", *args, **kwargs):
- """A variable scope with a custom variable getter that stores fp16 trainable
- variables in fp32 and casts them to fp16 on use.
- """
- return tf.variable_scope(name, custom_getter=_fp32_trainvar_getter, *args, **kwargs)
-
-
-class MixedPrecisionOptimizer(tf.train.Optimizer):
- """An optimizer that updates trainable variables in fp32."""
-
- def __init__(self, optimizer, scale=None, name="MixedPrecisionOptimizer", use_locking=False):
- super(MixedPrecisionOptimizer, self).__init__(name=name, use_locking=use_locking)
- self._optimizer = optimizer
- self._scale = float(scale) if scale is not None else 1.0
-
- def compute_gradients(self, loss, var_list=None, *args, **kwargs):
- if var_list is None:
- var_list = tf.trainable_variables() + tf.get_collection(
- tf.GraphKeys.TRAINABLE_RESOURCE_VARIABLES
- )
-
- replaced_list = var_list
-
- if self._scale != 1.0:
- loss = tf.scalar_mul(self._scale, loss)
- gradvar = self._optimizer.compute_gradients(loss, replaced_list, *args, **kwargs)
-
- final_gradvar = []
- for orig_var, (grad, var) in zip(var_list, gradvar):
- if var is not orig_var:
- grad = tf.cast(grad, orig_var.dtype)
- if self._scale != 1.0:
- grad = tf.scalar_mul(1.0 / self._scale, grad)
- final_gradvar.append((grad, orig_var))
- return final_gradvar
-
- def apply_gradients(self, *args, **kwargs):
- return self._optimizer.apply_gradients(*args, **kwargs)
-
-
-class LarcOptimizer(tf.train.Optimizer):
- """ LARC implementation
- -------------------
- Parameters:
- - optimizer: the underlying optimizer to wrap,
- for example tf.train.MomentumOptimizer
- - learning_rate: the learning rate used by the underlying optimizer
- - clip: if True, apply LARC; otherwise apply LARS
- - epsilon: local learning rate used when the weight or gradient norm is 0
- - name
- - use_locking
- """
-
- def __init__(
- self,
- optimizer,
- learning_rate,
- eta,
- clip=True,
- epsilon=1.0,
- name="LarcOptimizer",
- use_locking=False,
- ):
- super(LarcOptimizer, self).__init__(name=name, use_locking=use_locking)
- self._optimizer = optimizer
- self._learning_rate = learning_rate
- self._eta = float(eta)
- self._clip = clip
- self._epsilon = float(epsilon)
-
- def compute_gradients(self, *args, **kwargs):
- return self._optimizer.compute_gradients(*args, **kwargs)
-
- def apply_gradients(self, gradvars, *args, **kwargs):
- v_list = [tf.norm(tensor=v, ord=2) for _, v in gradvars]
- g_list = [tf.norm(tensor=g, ord=2) if g is not None else 0.0 for g, _ in gradvars]
- v_norms = tf.stack(v_list)
- g_norms = tf.stack(g_list)
- zeds = tf.zeros_like(v_norms)
- # use epsilon when the weight or gradient norm is 0, to avoid division by zero
- # and to keep biases from getting stuck at their initial value (0.)
- cond = tf.logical_and(tf.not_equal(v_norms, zeds), tf.not_equal(g_norms, zeds))
- true_vals = tf.scalar_mul(self._eta, tf.div(v_norms, g_norms))
- false_vals = tf.fill(tf.shape(v_norms), self._epsilon)
- larc_local_lr = tf.where(cond, true_vals, false_vals)
- if self._clip:
- ones = tf.ones_like(v_norms)
- lr = tf.fill(tf.shape(v_norms), self._learning_rate)
- # The local learning rate depends on the gradients, so compute_gradients
- # of the wrapped optimizer has to be called first, and that optimizer's
- # learning rate is already fixed.
- # We therefore scale the gradients instead of the learning rate.
- larc_local_lr = tf.minimum(tf.div(larc_local_lr, lr), ones)
- gradvars = [
- (tf.multiply(larc_local_lr[i], g), v) if g is not None else (None, v)
- for i, (g, v) in enumerate(gradvars)
- ]
- return self._optimizer.apply_gradients(gradvars, *args, **kwargs)
-
-
-def get_with_default(obj, key, default_value):
- return obj[key] if key in obj and obj[key] is not None else default_value
-
-
-def get_lr(
- lr,
- steps,
- lr_steps,
- warmup_it,
- decay_steps,
- global_step,
- lr_decay_mode,
- cdr_first_decay_ratio,
- cdr_t_mul,
- cdr_m_mul,
- cdr_alpha,
- lc_periods,
- lc_alpha,
- lc_beta,
-):
- if lr_decay_mode == "steps":
- learning_rate = tf.train.piecewise_constant(global_step, steps, lr_steps)
- elif lr_decay_mode == "poly" or lr_decay_mode == "poly_cycle":
- cycle = lr_decay_mode == "poly_cycle"
- learning_rate = tf.train.polynomial_decay(
- lr,
- global_step - warmup_it,
- decay_steps=decay_steps - warmup_it,
- end_learning_rate=0.00001,
- power=2,
- cycle=cycle,
- )
- elif lr_decay_mode == "cosine_decay_restarts":
- learning_rate = tf.train.cosine_decay_restarts(
- lr,
- global_step - warmup_it,
- (decay_steps - warmup_it) * cdr_first_decay_ratio,
- t_mul=cdr_t_mul,
- m_mul=cdr_m_mul,
- alpha=cdr_alpha,
- )
- elif lr_decay_mode == "cosine":
- learning_rate = tf.train.cosine_decay(
- lr, global_step - warmup_it, decay_steps=decay_steps - warmup_it, alpha=0.0
- )
- elif lr_decay_mode == "linear_cosine":
- learning_rate = tf.train.linear_cosine_decay(
- lr,
- global_step - warmup_it,
- decay_steps=decay_steps - warmup_it,
- num_periods=lc_periods,
- # 0.47,
- alpha=lc_alpha, # 0.0,
- beta=lc_beta,
- ) # 0.00001)
- else:
- raise ValueError("Invalid type of lr_decay_mode")
- return learning_rate
-
-
-def warmup_decay(warmup_lr, global_step, warmup_steps, warmup_end_lr):
- from tensorflow.python.ops import math_ops
-
- p = tf.cast(global_step, tf.float32) / tf.cast(warmup_steps, tf.float32)
- diff = math_ops.subtract(warmup_end_lr, warmup_lr)
- res = math_ops.add(warmup_lr, math_ops.multiply(diff, p))
- return res
-
-
-def cnn_model_function(features, labels, mode, params):
- labels = tf.reshape(labels, (-1,)) # Squash unnecessary unary dim
- lr = params["lr"]
- lr_steps = params["lr_steps"]
- steps = params["steps"]
- use_larc = params["use_larc"]
- leta = params["leta"]
- lr_decay_mode = params["lr_decay_mode"]
- decay_steps = params["decay_steps"]
- cdr_first_decay_ratio = params["cdr_first_decay_ratio"]
- cdr_t_mul = params["cdr_t_mul"]
- cdr_m_mul = params["cdr_m_mul"]
- cdr_alpha = params["cdr_alpha"]
- lc_periods = params["lc_periods"]
- lc_alpha = params["lc_alpha"]
- lc_beta = params["lc_beta"]
-
- model_name = params["model"]
- num_classes = params["n_classes"]
- model_dtype = get_with_default(params, "dtype", tf.float32)
- model_format = get_with_default(params, "format", "channels_first")
- device = get_with_default(params, "device", "/gpu:0")
- model_func = get_model_func(model_name)
- inputs = features # TODO: Should be using feature columns?
- is_training = mode == tf.estimator.ModeKeys.TRAIN
- momentum = params["mom"]
- weight_decay = params["wdecay"]
- warmup_lr = params["warmup_lr"]
- warmup_it = params["warmup_it"]
- loss_scale = params["loss_scale"]
-
- adv_bn_init = params["adv_bn_init"]
- conv_init = params["conv_init"]
-
- if mode == tf.estimator.ModeKeys.TRAIN:
- with tf.device("/cpu:0"):
- preload_op, (inputs, labels) = stage([inputs, labels])
- smd.get_hook().add_to_collection("inputs", inputs)
-
- with tf.device(device):
- if mode == tf.estimator.ModeKeys.TRAIN:
- gpucopy_op, (inputs, labels) = stage([inputs, labels])
- inputs = tf.cast(inputs, model_dtype)
- imagenet_mean = np.array([121, 115, 100], dtype=np.float32)
- imagenet_std = np.array([70, 68, 71], dtype=np.float32)
- inputs = tf.subtract(inputs, imagenet_mean)
- inputs = tf.multiply(inputs, 1.0 / imagenet_std)
- if model_format == "channels_first":
- inputs = tf.transpose(inputs, [0, 3, 1, 2])
- with fp32_trainable_vars(regularizer=tf.contrib.layers.l2_regularizer(weight_decay)):
- top_layer = model_func(
- inputs,
- data_format=model_format,
- training=is_training,
- conv_initializer=conv_init,
- adv_bn_init=adv_bn_init,
- )
- logits = tf.layers.dense(
- top_layer, num_classes, kernel_initializer=tf.random_normal_initializer(stddev=0.01)
- )
- predicted_classes = tf.argmax(logits, axis=1, output_type=tf.int32)
- logits = tf.cast(logits, tf.float32)
- if mode == tf.estimator.ModeKeys.PREDICT:
- probabilities = tf.nn.softmax(logits)
- predictions = {
- "class_ids": predicted_classes[:, None],
- "probabilities": probabilities,
- "logits": logits,
- }
- return tf.estimator.EstimatorSpec(mode, predictions=predictions)
- loss = tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=labels)
- loss = tf.identity(loss, name="loss") # For access by logger
-
- if mode == tf.estimator.ModeKeys.EVAL:
- # Allow fallback to CPU if no GPU support for these ops
- with tf.device(None):
- accuracy = tf.metrics.accuracy(labels=labels, predictions=predicted_classes)
- top5acc = tf.metrics.mean(tf.cast(tf.nn.in_top_k(logits, labels, 5), tf.float32))
- newaccuracy = (hvd.allreduce(accuracy[0]), accuracy[1])
- newtop5acc = (hvd.allreduce(top5acc[0]), top5acc[1])
- metrics = {"val-top1acc": newaccuracy, "val-top5acc": newtop5acc}
- return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=metrics)
-
- assert mode == tf.estimator.ModeKeys.TRAIN
- reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
- total_loss = tf.add_n([loss] + reg_losses, name="total_loss")
-
- batch_size = tf.shape(inputs)[0]
-
- global_step = tf.train.get_global_step()
-
- # Allow fallback to CPU if no GPU support for these ops
- with tf.device("/cpu:0"):
- learning_rate = tf.cond(
- global_step < warmup_it,
- lambda: warmup_decay(warmup_lr, global_step, warmup_it, lr),
- lambda: get_lr(
- lr,
- steps,
- lr_steps,
- warmup_it,
- decay_steps,
- global_step,
- lr_decay_mode,
- cdr_first_decay_ratio,
- cdr_t_mul,
- cdr_m_mul,
- cdr_alpha,
- lc_periods,
- lc_alpha,
- lc_beta,
- ),
- )
- learning_rate = tf.identity(learning_rate, "learning_rate")
- tf.summary.scalar("learning_rate", learning_rate)
-
- opt = tf.train.MomentumOptimizer(learning_rate, momentum, use_nesterov=True)
- opt = hvd.DistributedOptimizer(opt)
- if use_larc:
- opt = LarcOptimizer(opt, learning_rate, leta, clip=True)
-
- opt = MixedPrecisionOptimizer(opt, scale=loss_scale)
- opt = smd.get_hook().wrap_optimizer(opt)
-
- update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) or []
- with tf.control_dependencies(update_ops):
- gate_gradients = tf.train.Optimizer.GATE_NONE
- train_op = opt.minimize(
- total_loss, global_step=tf.train.get_global_step(), gate_gradients=gate_gradients
- )
- train_op = tf.group(preload_op, gpucopy_op, train_op)
-
- return tf.estimator.EstimatorSpec(mode, loss=total_loss, train_op=train_op)
-
-
-def get_num_records(filenames):
- def count_records(tf_record_filename):
- count = 0
- for _ in tf.python_io.tf_record_iterator(tf_record_filename):
- count += 1
- return count
-
- nfile = len(filenames)
- return count_records(filenames[0]) * (nfile - 1) + count_records(filenames[-1])
-
-
-def str2bool(v):
- if isinstance(v, bool):
- return v
-
- if v.lower() in ("yes", "true", "t", "y", "1"):
- return True
- elif v.lower() in ("no", "false", "f", "n", "0"):
- return False
- else:
- raise argparse.ArgumentTypeError("Boolean value expected.")
-
-
-def sort_and_load_ckpts(log_dir):
- ckpts = []
- for f in os.listdir(log_dir):
- m = re.match(r"model\.ckpt-([0-9]+)\.index", f)
- if m is None:
- continue
- fullpath = os.path.join(log_dir, f)
- ckpts.append(
- {
- "step": int(m.group(1)),
- "path": os.path.splitext(fullpath)[0],
- "mtime": os.stat(fullpath).st_mtime,
- }
- )
- ckpts.sort(key=itemgetter("step"))
- return ckpts
-
-
-def add_cli_args():
- cmdline = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
- # Basic options
- cmdline.add_argument(
- "-m",
- "--model",
- default="resnet50",
- help="""Name of model to run:
- resnet[18,34,50,101,152]""",
- )
- cmdline.add_argument(
- "--data_dir",
- help="""Path to dataset in TFRecord format
- (aka Example protobufs). Files should be
- named 'train-*' and 'validation-*'.""",
- )
- cmdline.add_argument(
- "--synthetic",
- type=str2bool,
- default=False,
- help="""Whether to use synthetic data for training.
- If data_dir is not given, uses synthetic data
- by default""",
- )
- cmdline.add_argument(
- "-b", "--batch_size", default=128, type=int, help="""Size of each minibatch per GPU"""
- )
- cmdline.add_argument(
- "--num_batches",
- type=int,
- default=1000,
- help="""Number of batches to run.
- Ignored during eval or if num epochs given""",
- )
- cmdline.add_argument(
- "--num_epochs",
- type=int,
- help="""Number of epochs to run.
- Overrides --num_batches. Ignored during eval.""",
- )
- cmdline.add_argument(
- "--log_dir",
- default="tf_logs",
- help="""Directory in which to write training
- summaries and checkpoints. If the log directory
- already contains some checkpoints, it tries
- to resume training from the last saved checkpoint.
- Pass --clear_log if you want to clear all
- checkpoints and start a fresh run""",
- )
- cmdline.add_argument("--model_dir", default=None, type=str)
- cmdline.add_argument("--random_seed", type=str2bool, default=False)
- cmdline.add_argument(
- "--clear_log",
- type=str2bool,
- default=False,
- help="""Clear the log folder passed
- so a fresh run can be started""",
- )
- cmdline.add_argument("--log_name", type=str, default="hvd_train.log")
- cmdline.add_argument(
- "--local_ckpt",
- type=str2bool,
- default=False,
- help="""Performs local checkpoints
- (i.e. one per node)""",
- )
- cmdline.add_argument(
- "--display_every",
- default=20,
- type=int,
- help="""How often (in iterations) to print out
- running information.""",
- )
- cmdline.add_argument(
- "--eval",
- type=str2bool,
- default=False,
- help="""Evaluate the top-1 and top-5 accuracy of
- the latest checkpointed model. If you want to
- evaluate using multiple GPUs ensure that all
- processes have access to all checkpoints.
- This holds if checkpoints were saved using
- --local_ckpt or were saved to a
- shared directory which all processes can access.""",
- )
- cmdline.add_argument(
- "--eval_interval",
- type=int,
- help="""Evaluate accuracy per eval_interval
- number of epochs""",
- )
- cmdline.add_argument(
- "--fp16",
- type=str2bool,
- default=True,
- help="""Train using float16 (half) precision instead
- of float32.""",
- )
- cmdline.add_argument(
- "--num_gpus",
- default=1,
- type=int,
- help="""Specify total number of GPUs used to
- train a checkpointed model during eval.
- Used only to calculate epoch number to
- print during evaluation""",
- )
- cmdline.add_argument("--save_checkpoints_steps", type=int, default=1000)
- cmdline.add_argument("--save_summary_steps", type=int, default=0)
- cmdline.add_argument(
- "--adv_bn_init",
- type=str2bool,
- default=True,
- help="""init gamma of the last BN of
- each ResMod at 0.""",
- )
- cmdline.add_argument(
- "--adv_conv_init", type=str2bool, default=True, help="""init conv with MSRA initializer"""
- )
- cmdline.add_argument("--lr", type=float, help="""Start learning rate""")
- cmdline.add_argument("--mom", default=0.90, type=float, help="""Momentum""")
- cmdline.add_argument("--wdecay", default=0.0001, type=float, help="""Weight decay""")
- cmdline.add_argument("--loss_scale", default=1024.0, type=float, help="""loss scale""")
- cmdline.add_argument(
- "--warmup_lr", default=0.001, type=float, help="""Warmup starting from this learning rate"""
- )
- cmdline.add_argument(
- "--warmup_epochs",
- default=0,
- type=int,
- help="""Number of epochs in which to warmup
- to given lr""",
- )
- cmdline.add_argument(
- "--lr_decay_steps",
- default="30,60,80",
- type=str,
- help="""epoch numbers at which lr is decayed
- by lr_decay_lrs. Used when lr_decay_mode is steps""",
- )
- cmdline.add_argument(
- "--lr_decay_lrs", default="", type=str, help="""learning rates at specific epochs"""
- )
- cmdline.add_argument(
- "--lr_decay_mode",
- default="poly",
- help="""One of `steps` (decay by a factor at specified steps),
- `poly` or `poly_cycle` (polynomial_decay with degree 2),
- `cosine`, `cosine_decay_restarts` or `linear_cosine`""",
- )
-
- cmdline.add_argument(
- "--use_larc",
- type=str2bool,
- default=False,
- help="""Use Layer wise Adaptive Rate Control
- which helps convergence at really
- large batch sizes""",
- )
- cmdline.add_argument(
- "--leta",
- default=0.013,
- type=float,
- help="""The trust coefficient for LARC optimization,
- LARC Eta""",
- )
- cmdline.add_argument(
- "--cdr_first_decay_ratio",
- default=0.33,
- type=float,
- help="""Cosine Decay Restart First
- Decay Steps ratio""",
- )
- cmdline.add_argument(
- "--cdr_t_mul", default=2.0, type=float, help="""Cosine Decay Restart t_mul"""
- )
- cmdline.add_argument(
- "--cdr_m_mul", default=0.1, type=float, help="""Cosine Decay Restart m_mul"""
- )
- cmdline.add_argument(
- "--cdr_alpha", default=0.0, type=float, help="""Cosine Decay Restart alpha"""
- )
- cmdline.add_argument(
- "--lc_periods", default=0.47, type=float, help="""Linear Cosine num of periods"""
- )
- cmdline.add_argument("--lc_alpha", default=0.0, type=float, help="""Linear Cosine alpha""")
- cmdline.add_argument("--lc_beta", default=0.00001, type=float, help="""Linear Cosine beta""")
-
- cmdline.add_argument(
- "--increased_aug",
- type=str2bool,
- default=False,
- help="""Increase augmentations helpful when training
- with large number of GPUs such as 128 or 256""",
- )
- cmdline.add_argument("--contrast", default=0.6, type=float, help="""contrast factor""")
- cmdline.add_argument("--saturation", default=0.6, type=float, help="""saturation factor""")
- cmdline.add_argument(
- "--hue",
- default=0.13,
- type=float,
- help="""hue max delta factor,
- hue delta = hue * math.pi""",
- )
- cmdline.add_argument("--brightness", default=0.3, type=float, help="""Brightness factor""")
-
- # tornasole arguments
- cmdline.add_argument(
- "--enable_smdebug", type=str2bool, default=False, help="""enable Tornasole"""
- )
- cmdline.add_argument(
- "--smdebug_path",
- default="tornasole_outputs/default_run",
- help="""Directory in which to write tornasole data.
- This can be a local path or
- S3 path in the form s3://bucket_name/prefix_name""",
- )
- cmdline.add_argument(
- "--tornasole_save_all", type=str2bool, default=False, help="""save all tensors"""
- )
- cmdline.add_argument(
- "--tornasole_dryrun",
- type=str2bool,
- default=False,
- help="""If enabled, do not write data to disk""",
- )
- cmdline.add_argument(
- "--tornasole_exclude",
- nargs="+",
- default=[],
- type=str,
- action="append",
- help="""List of REs for tensors to exclude from
- Tornasole's default collection""",
- )
- cmdline.add_argument(
- "--tornasole_include",
- nargs="+",
- default=[],
- type=str,
- action="append",
- help="""List of REs for tensors to include in
- Tornasole's default collection""",
- )
- cmdline.add_argument(
- "--step_interval", default=10, type=int, help="""Save tornasole data every N runs"""
- )
- cmdline.add_argument("--save_weights", type=str2bool, default=False)
- cmdline.add_argument("--save_gradients", type=str2bool, default=False)
- cmdline.add_argument("--tornasole_save_inputs", type=str2bool, default=False)
- cmdline.add_argument("--tornasole_save_relu_activations", type=str2bool, default=False)
- cmdline.add_argument(
- "--tornasole_relu_reductions",
- type=str,
- help="""A comma separated list of reductions can be
- passed. If passed, saves relu activations
- in the form of these reductions.""",
- )
- cmdline.add_argument(
- "--tornasole_relu_reductions_abs",
- type=str,
- help="""A comma separated list of absolute reductions
- can be passed. If passed, saves relu activations
- in the form of these reductions on absolute values
- of the tensor.""",
- )
- cmdline.add_argument(
- "--constant_initializer",
- type=float,
- help="""if passed sets that constant as initial
- weight, if not uses default initialization
- strategies""",
- )
- return cmdline
-
-
-def create_hook(FLAGS):
- abs_reductions = []
- reductions = []
- if FLAGS.tornasole_relu_reductions:
- for r in FLAGS.tornasole_relu_reductions.split(","):
- reductions.append(r)
- if FLAGS.tornasole_relu_reductions_abs:
- for r in FLAGS.tornasole_relu_reductions_abs.split(","):
- abs_reductions.append(r)
- if reductions or abs_reductions:
- rnc = smd.ReductionConfig(reductions=reductions, abs_reductions=abs_reductions)
- else:
- rnc = None
-
- include_collections = ["losses"]
-
- hook = smd.SessionHook(
- out_dir=FLAGS.smdebug_path,
- save_config=smd.SaveConfig(save_interval=FLAGS.step_interval),
- reduction_config=rnc,
- include_collections=include_collections,
- save_all=FLAGS.tornasole_save_all,
- )
- if FLAGS.save_weights is True:
- include_collections.append("weights")
- if FLAGS.save_gradients is True:
- include_collections.append("gradients")
- if FLAGS.tornasole_save_relu_activations is True:
- include_collections.append("relu_activations")
- if FLAGS.tornasole_save_inputs is True:
- include_collections.append("inputs")
- if FLAGS.tornasole_include:
- hook.get_collection("default").include(FLAGS.tornasole_include)
- include_collections.append("default")
- return hook
-
-
-def main():
- gpu_thread_count = 2
- os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"
- os.environ["TF_GPU_THREAD_COUNT"] = str(gpu_thread_count)
- os.environ["TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT"] = "1"
- os.environ["TF_ENABLE_WINOGRAD_NONFUSED"] = "1"
- hvd.init()
-
- config = tf.ConfigProto()
- config.gpu_options.visible_device_list = str(hvd.local_rank())
- config.gpu_options.force_gpu_compatible = True # Force pinned memory
- config.intra_op_parallelism_threads = 1 # Avoid pool of Eigen threads
- config.inter_op_parallelism_threads = 5
-
- cmdline = add_cli_args()
- FLAGS, unknown_args = cmdline.parse_known_args()
-
- # These random seeds are intended for testing only.
- # If you change the seed settings, note that tensor values
- # at certain steps may vary.
- if FLAGS.random_seed:
- random.seed(5 * (1 + hvd.rank()))
- np.random.seed(7 * (1 + hvd.rank()))
- tf.set_random_seed(31 * (1 + hvd.rank()))
-
- if len(unknown_args) > 0:
- for bad_arg in unknown_args:
- print("ERROR: Unknown command line arg: %s" % bad_arg)
- raise ValueError("Invalid command line arg(s)")
-
- FLAGS.data_dir = None if FLAGS.data_dir == "" else FLAGS.data_dir
- FLAGS.log_dir = None if FLAGS.log_dir == "" else FLAGS.log_dir
- FLAGS.model_dir = None if FLAGS.model_dir == "" else FLAGS.model_dir
-
- if FLAGS.eval:
- FLAGS.log_name = "eval_" + FLAGS.log_name
- if FLAGS.local_ckpt:
- do_checkpoint = hvd.local_rank() == 0
- else:
- do_checkpoint = hvd.rank() == 0
- if hvd.local_rank() == 0 and FLAGS.clear_log and os.path.isdir(FLAGS.log_dir):
- shutil.rmtree(FLAGS.log_dir)
- barrier = hvd.allreduce(tf.constant(0, dtype=tf.float32))
- tf.Session(config=config).run(barrier)
-
- if hvd.local_rank() == 0 and not os.path.isdir(FLAGS.log_dir):
- os.makedirs(FLAGS.log_dir)
- barrier = hvd.allreduce(tf.constant(0, dtype=tf.float32))
- tf.Session(config=config).run(barrier)
-
- logger = logging.getLogger(FLAGS.log_name)
- logger.setLevel(logging.INFO) # INFO, ERROR
- if not hvd.rank():
- fh = logging.FileHandler(os.path.join(FLAGS.log_dir, FLAGS.log_name))
- fh.setLevel(logging.DEBUG)
- formatter = logging.Formatter("%(message)s")
- fh.setFormatter(formatter)
- # add handlers to logger
- logger.addHandler(fh)
-
- height, width = 224, 224
- global_batch_size = FLAGS.batch_size * hvd.size()
- rank0log(logger, "PY" + str(sys.version) + "TF" + str(tf.__version__))
- rank0log(logger, "Horovod size: ", hvd.size())
-
- if FLAGS.data_dir:
- filename_pattern = os.path.join(FLAGS.data_dir, "%s-*")
- train_filenames = sorted(tf.gfile.Glob(filename_pattern % "train"))
- eval_filenames = sorted(tf.gfile.Glob(filename_pattern % "validation"))
- num_training_samples = get_num_records(train_filenames)
- rank0log(logger, "Using data from: ", FLAGS.data_dir)
- if not FLAGS.eval:
- rank0log(logger, "Found ", num_training_samples, " training samples")
- else:
- if not FLAGS.synthetic:
- FLAGS.synthetic = True
- rank0log(
- logger,
- "data_dir missing. Using synthetic data. "
- "If you want to run on real data, "
- "pass --data_dir PATH_TO_DATA",
- )
- train_filenames = eval_filenames = []
- num_training_samples = 1281167
- training_samples_per_rank = num_training_samples // hvd.size()
-
- if FLAGS.num_epochs:
- nstep = num_training_samples * FLAGS.num_epochs // global_batch_size
- elif FLAGS.num_batches:
- nstep = FLAGS.num_batches
- FLAGS.num_epochs = max(nstep * global_batch_size // num_training_samples, 1)
- else:
- raise ValueError("Either num_epochs or num_batches has to be passed")
- nstep_per_epoch = num_training_samples // global_batch_size
- decay_steps = nstep
-
- if FLAGS.lr_decay_mode == "steps":
- steps = [int(x) * nstep_per_epoch for x in FLAGS.lr_decay_steps.split(",")]
- lr_steps = [float(x) for x in FLAGS.lr_decay_lrs.split(",")]
- else:
- steps = []
- lr_steps = []
-
- if not FLAGS.lr:
- if FLAGS.use_larc:
- FLAGS.lr = 3.7
- else:
- FLAGS.lr = (hvd.size() * FLAGS.batch_size * 0.1) / 256
- if not FLAGS.save_checkpoints_steps:
- # default to save one checkpoint per epoch
- FLAGS.save_checkpoints_steps = nstep_per_epoch
- if not FLAGS.save_summary_steps:
- # default to save one summary per epoch
- FLAGS.save_summary_steps = nstep_per_epoch
-
- if not FLAGS.eval:
- rank0log(logger, "Using a learning rate of ", FLAGS.lr)
- rank0log(logger, "Checkpointing every " + str(FLAGS.save_checkpoints_steps) + " steps")
- rank0log(logger, "Saving summary every " + str(FLAGS.save_summary_steps) + " steps")
-
- warmup_it = nstep_per_epoch * FLAGS.warmup_epochs
- if FLAGS.constant_initializer:
- initializer_conv = tf.constant_initializer(FLAGS.constant_initializer)
- elif FLAGS.adv_conv_init:
- initializer_conv = tf.variance_scaling_initializer()
- else:
- initializer_conv = None
-
- if FLAGS.model_dir is None:
- FLAGS.model_dir = FLAGS.log_dir
-
- if FLAGS.enable_smdebug is True and hvd.rank() == 0:
- hook = create_hook(FLAGS)
-
- classifier = tf.estimator.Estimator(
- model_fn=cnn_model_function,
- model_dir=FLAGS.model_dir,
- params={
- "model": FLAGS.model,
- "decay_steps": decay_steps,
- "n_classes": 1000,
- "dtype": tf.float16 if FLAGS.fp16 is True else tf.float32,
- "format": "channels_first",
- "device": "/gpu:0",
- "lr": FLAGS.lr,
- "mom": FLAGS.mom,
- "wdecay": FLAGS.wdecay,
- "use_larc": FLAGS.use_larc,
- "leta": FLAGS.leta,
- "steps": steps,
- "lr_steps": lr_steps,
- "lr_decay_mode": FLAGS.lr_decay_mode,
- "warmup_it": warmup_it,
- "warmup_lr": FLAGS.warmup_lr,
- "cdr_first_decay_ratio": FLAGS.cdr_first_decay_ratio,
- "cdr_t_mul": FLAGS.cdr_t_mul,
- "cdr_m_mul": FLAGS.cdr_m_mul,
- "cdr_alpha": FLAGS.cdr_alpha,
- "lc_periods": FLAGS.lc_periods,
- "lc_alpha": FLAGS.lc_alpha,
- "lc_beta": FLAGS.lc_beta,
- "loss_scale": FLAGS.loss_scale,
- "adv_bn_init": FLAGS.adv_bn_init,
- "conv_init": initializer_conv,
- },
- config=tf.estimator.RunConfig(
- tf_random_seed=31 * (1 + hvd.rank()),
- session_config=config,
- save_summary_steps=FLAGS.save_summary_steps if do_checkpoint else None,
- save_checkpoints_steps=FLAGS.save_checkpoints_steps if do_checkpoint else None,
- keep_checkpoint_max=None,
- ),
- )
-
- if not FLAGS.eval:
- num_preproc_threads = 5
- rank0log(logger, "Using preprocessing threads per GPU: ", num_preproc_threads)
- training_hooks = [hvd.BroadcastGlobalVariablesHook(0), PrefillStagingAreasHook()]
- if hvd.rank() == 0:
- training_hooks.append(
- LogSessionRunHook(
- global_batch_size, num_training_samples, FLAGS.display_every, logger
- )
- )
- if FLAGS.enable_smdebug is True:
- training_hooks.append(hook)
- try:
- if FLAGS.enable_smdebug is True:
- hook.set_mode(smd.modes.TRAIN)
- start_time = time.time()
- classifier.train(
- input_fn=lambda: make_dataset(
- train_filenames,
- training_samples_per_rank,
- FLAGS.batch_size,
- height,
- width,
- FLAGS.brightness,
- FLAGS.contrast,
- FLAGS.saturation,
- FLAGS.hue,
- training=True,
- num_threads=num_preproc_threads,
- shard=True,
- synthetic=FLAGS.synthetic,
- increased_aug=FLAGS.increased_aug,
- ),
- max_steps=nstep,
- hooks=training_hooks,
- )
- rank0log(logger, "Finished in ", time.time() - start_time)
- except KeyboardInterrupt:
- print("Keyboard interrupt")
- elif FLAGS.eval and not FLAGS.synthetic:
- rank0log(logger, "Evaluating")
- rank0log(logger, "Validation dataset size: {}".format(get_num_records(eval_filenames)))
- barrier = hvd.allreduce(tf.constant(0, dtype=tf.float32))
- tf.Session(config=config).run(barrier)
- time.sleep(5) # a little extra margin...
- if FLAGS.num_gpus == 1:
- rank0log(
- logger,
- """If you are evaluating checkpoints of a
- multi-GPU run on a single GPU, ensure you set --num_gpus to
- the number of GPUs it was trained on.
- This will ensure that the epoch number is
- accurately displayed in the below logs.""",
- )
- try:
- ckpts = sort_and_load_ckpts(FLAGS.log_dir)
- for i, c in enumerate(ckpts):
- if i < len(ckpts) - 1:
- if (not FLAGS.eval_interval) or (i % FLAGS.eval_interval != 0):
- continue
- eval_result = classifier.evaluate(
- input_fn=lambda: make_dataset(
- eval_filenames,
- get_num_records(eval_filenames),
- FLAGS.batch_size,
- height,
- width,
- FLAGS.brightness,
- FLAGS.contrast,
- FLAGS.saturation,
- FLAGS.hue,
- training=False,
- shard=True,
- increased_aug=False,
- ),
- checkpoint_path=c["path"],
- )
- c["epoch"] = c["step"] / (
- num_training_samples // (FLAGS.batch_size * FLAGS.num_gpus)
- )
- c["top1"] = eval_result["val-top1acc"]
- c["top5"] = eval_result["val-top5acc"]
- c["loss"] = eval_result["loss"]
- rank0log(logger, " step epoch top1 top5 loss checkpoint_time(UTC)")
- barrier = hvd.allreduce(tf.constant(0, dtype=tf.float32))
- for i, c in enumerate(ckpts):
- tf.Session(config=config).run(barrier)
- if "top1" not in c:
- continue
- rank0log(
- logger,
- "{:5d} {:5.1f} {:5.3f} {:6.2f} {:6.2f} {time}".format(
- c["step"],
- c["epoch"],
- c["top1"] * 100,
- c["top5"] * 100,
- c["loss"],
- time=time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(c["mtime"])),
- ),
- )
- rank0log(logger, "Finished evaluation")
- except KeyboardInterrupt:
- logger.error("Keyboard interrupt")
-
-
-if __name__ == "__main__":
- main()
diff --git a/examples/xgboost/README.md b/examples/xgboost/README.md
new file mode 100644
index 000000000..9818d371a
--- /dev/null
+++ b/examples/xgboost/README.md
@@ -0,0 +1,2 @@
+## Example Notebooks
+Please refer to the example notebooks in the [Amazon SageMaker Examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger).
diff --git a/examples/xgboost/notebooks/xgboost_abalone.ipynb b/examples/xgboost/notebooks/xgboost_abalone.ipynb
deleted file mode 100644
index 312cc5f42..000000000
--- a/examples/xgboost/notebooks/xgboost_abalone.ipynb
+++ /dev/null
@@ -1,813 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# XGBoost for Regression in Tornasole\n",
- "This notebook demonstrates the simplest kind of interactive analysis that can be run in smdebug. It focuses on predicting the age of abalone ([Abalone dataset](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html)) using [XGBoost](https://github.com/dmlc/xgboost) for regression.\n",
- "\n",
- "## Setup\n",
- "\n",
- "Some basic setup that's always helpful"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "%load_ext autoreload\n",
- "%autoreload 2"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Make sure that you can run `xgboost`. You can install `xgboost` by doing\n",
- "```shell\n",
- "$ pip3 install xgboost\n",
- "```\n",
- "You'll probably have to restart this notebook after doing this.\n",
- "\n",
- "Let's import some basic libraries for ML"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "import xgboost as xgb\n",
- "import matplotlib.pyplot as plt\n",
- "import seaborn as sns"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's copy the Tornasole libraries to this instance. This step needs to be executed only once.\n",
- "Please make sure that the AWS account you are using can access the `tornasole-external-preview-use1` bucket.\n",
- "\n",
- "To do so you'll need the appropriate AWS credentials. There are several ways of doing this:\n",
- "- inject temporary credentials \n",
- "- if running on EC2, use [EC2 roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html) that can access all S3 buckets\n",
- "- (preferred) run this notebook on a [SageMaker notebook instance](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html)\n",
- "\n",
- "The code below downloads the necessary `.whl` files and installs them in the current environment. Only run the first time!\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [],
- "source": [
- "#WARNING - uncomment this code only if you haven't done this before\n",
- "#!aws s3 sync s3://tornasole-external-preview-use1/sdk/ts-binaries/tornasole_xgboost/py3/latest/ tornasole_xgboost/\n",
- "#!pip install tornasole_xgboost/tornasole-*\n",
- "\n",
- "# If you run into a version conflict with boto, run the following\n",
- "#!pip uninstall -y botocore boto3 aioboto3 aiobotocore && pip install botocore==1.12.91 boto3==1.9.91 aiobotocore==0.10.2 aioboto3==6.4.1"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Fetching the dataset\n",
- "\n",
- "We use the [Abalone data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html) originally from the [UCI data repository](https://archive.ics.uci.edu/ml/datasets/abalone). More details about the original dataset can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.names). In the libsvm converted [version](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html), the nominal feature (Male/Female/Infant) has been converted into a real valued feature. Age of abalone is to be predicted from eight physical measurements."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "CPU times: user 28.5 ms, sys: 9.39 ms, total: 37.9 ms\n",
- "Wall time: 6.01 s\n"
- ]
- }
- ],
- "source": [
- "%%time\n",
- "import random\n",
- "import tempfile\n",
- "import urllib.request\n",
- "\n",
- "\n",
- "def load_abalone(train_split=0.8, seed=42):\n",
- "\n",
- " if not (0 < train_split <= 1):\n",
- " raise ValueError(\"'train_split' must be between 0 and 1.\")\n",
- "\n",
- " url = \"https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/abalone\"\n",
- "\n",
- " response = urllib.request.urlopen(url).read().decode(\"utf-8\")\n",
- " lines = response.strip().split('\\n')\n",
- " n = sum(1 for line in lines)\n",
- " indices = list(range(n))\n",
- " random.seed(seed)\n",
- " random.shuffle(indices)\n",
- " train_indices = set(indices[:int(n * train_split)])\n",
- "\n",
- " with tempfile.NamedTemporaryFile(mode='w', delete=False) as train_file:\n",
- " with tempfile.NamedTemporaryFile(mode='w', delete=False) as valid_file:\n",
- " for idx, line in enumerate(lines):\n",
- " if idx in train_indices:\n",
- " train_file.write(line + '\\n')\n",
- " else:\n",
- " valid_file.write(line + '\\n')\n",
- "\n",
- " return train_file.name, valid_file.name\n",
- "\n",
- "train_file, validation_file = load_abalone()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Model Training\n",
- "\n",
- "At this point we have all the ingredients installed on our machine. We can now start training."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [],
- "source": [
- "from smdebug import SaveConfig\n",
- "from smdebug.xgboost import SessionHook\n",
- "from smdebug.trials import LocalTrial"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can change the logging level if appropriate "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [],
- "source": [
- "#import logging\n",
- "#logging.getLogger(\"tornasole\").setLevel(logging.WARNING)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "To clean up from previous runs, we remove old data (warning: this assumes that `ts_output` is the directory into which we send data)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [],
- "source": [
- "!rm -rf ./ts_output/"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We load the datasets into [DMatrix](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.DMatrix) objects and define some hyperparameters; the exact values don't really matter here."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[20:28:55] 3341x9 matrix with 26728 entries loaded from /var/folders/8f/_jkz_jm54y56k334j6vkhlr51s_6bx/T/tmps8ervlbn\n",
- "[20:28:55] 836x9 matrix with 6688 entries loaded from /var/folders/8f/_jkz_jm54y56k334j6vkhlr51s_6bx/T/tmprvihe8rx\n"
- ]
- }
- ],
- "source": [
- "dtrain = xgb.DMatrix(train_file)\n",
- "dval = xgb.DMatrix(validation_file)\n",
- "\n",
- "watchlist = [(dtrain, 'train'), (dval, 'validation')]\n",
- "\n",
- "params = {\n",
- " \"max_depth\": 5,\n",
- " \"eta\": 0.2,\n",
- " \"gamma\": 4,\n",
- " \"min_child_weight\": 6,\n",
- " \"subsample\": 0.7,\n",
- " \"silent\": 0,\n",
- " \"objective\": \"reg:squarederror\",\n",
- " \"eval_metric\": [\"rmse\", \"mae\"]\n",
- "}\n",
- "\n",
- "num_round = 100"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Importantly, we **add the Tornasole Hook**. This hook will be run at every iteration and will save selected performance metrics, feature importances, or [SHAP](https://github.com/slundberg/shap) values (in this case, all of them) to the desired directory (in this case, `'{base_loc}/{run_id}'`).\n",
- "\n",
- "`{base_loc}` can be either a path on a local file system (for instance, `./ts_output/`) or an S3 bucket/object (`s3://mybucket/myprefix/`).\n",
- "\n",
- "See the documentation for more details."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[2019-09-04 20:28:55.303 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:86] Saving to ./ts_output\n"
- ]
- }
- ],
- "source": [
- "save_config = SaveConfig(save_interval=5)\n",
- "\n",
- "hook = SessionHook(\n",
- " out_dir=\"./ts_output\",\n",
- " save_config=save_config,\n",
- " shap_data=dtrain\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "At this point we are ready to train. We will train this simple model.\n",
- "\n",
- "Behind the scenes, the `SessionHook` is saving the data requested."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[2019-09-04 20:28:55.359 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 0.\n",
- "[0]\ttrain-rmse:8.12809\ttrain-mae:7.57979\tvalidation-rmse:7.9501\tvalidation-mae:7.42376\n",
- "[1]\ttrain-rmse:6.66075\ttrain-mae:6.0853\tvalidation-rmse:6.50774\tvalidation-mae:5.95214\n",
- "[2]\ttrain-rmse:5.49178\ttrain-mae:4.87746\tvalidation-rmse:5.35986\tvalidation-mae:4.76687\n",
- "[3]\ttrain-rmse:4.58214\ttrain-mae:3.91531\tvalidation-rmse:4.47214\tvalidation-mae:3.81942\n",
- "[4]\ttrain-rmse:3.87782\ttrain-mae:3.14893\tvalidation-rmse:3.79344\tvalidation-mae:3.07788\n",
- "[2019-09-04 20:28:55.438 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 5.\n",
- "[5]\ttrain-rmse:3.35051\ttrain-mae:2.57049\tvalidation-rmse:3.28752\tvalidation-mae:2.51613\n",
- "[6]\ttrain-rmse:2.95214\ttrain-mae:2.14503\tvalidation-rmse:2.91198\tvalidation-mae:2.11696\n",
- "[7]\ttrain-rmse:2.66585\ttrain-mae:1.85231\tvalidation-rmse:2.63266\tvalidation-mae:1.83957\n",
- "[8]\ttrain-rmse:2.45031\ttrain-mae:1.65847\tvalidation-rmse:2.45043\tvalidation-mae:1.6836\n",
- "[9]\ttrain-rmse:2.29575\ttrain-mae:1.53516\tvalidation-rmse:2.30491\tvalidation-mae:1.57687\n",
- "[2019-09-04 20:28:55.698 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 10.\n",
- "[10]\ttrain-rmse:2.18057\ttrain-mae:1.4608\tvalidation-rmse:2.21775\tvalidation-mae:1.51841\n",
- "[11]\ttrain-rmse:2.09701\ttrain-mae:1.41371\tvalidation-rmse:2.16059\tvalidation-mae:1.4858\n",
- "[12]\ttrain-rmse:2.04006\ttrain-mae:1.38316\tvalidation-rmse:2.12174\tvalidation-mae:1.46568\n",
- "[13]\ttrain-rmse:1.99211\ttrain-mae:1.35925\tvalidation-rmse:2.08493\tvalidation-mae:1.45008\n",
- "[14]\ttrain-rmse:1.96792\ttrain-mae:1.35184\tvalidation-rmse:2.06902\tvalidation-mae:1.44609\n",
- "[2019-09-04 20:28:55.906 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 15.\n",
- "[15]\ttrain-rmse:1.94583\ttrain-mae:1.34568\tvalidation-rmse:2.05138\tvalidation-mae:1.44183\n",
- "[16]\ttrain-rmse:1.92788\ttrain-mae:1.34136\tvalidation-rmse:2.04385\tvalidation-mae:1.43984\n",
- "[17]\ttrain-rmse:1.91056\ttrain-mae:1.33725\tvalidation-rmse:2.03618\tvalidation-mae:1.43834\n",
- "[18]\ttrain-rmse:1.89749\ttrain-mae:1.33499\tvalidation-rmse:2.0322\tvalidation-mae:1.4393\n",
- "[19]\ttrain-rmse:1.88413\ttrain-mae:1.33075\tvalidation-rmse:2.03752\tvalidation-mae:1.44543\n",
- "[2019-09-04 20:28:56.163 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 20.\n",
- "[20]\ttrain-rmse:1.87967\ttrain-mae:1.32948\tvalidation-rmse:2.03493\tvalidation-mae:1.44581\n",
- "[21]\ttrain-rmse:1.86798\ttrain-mae:1.32499\tvalidation-rmse:2.03925\tvalidation-mae:1.44907\n",
- "[22]\ttrain-rmse:1.8582\ttrain-mae:1.32131\tvalidation-rmse:2.03707\tvalidation-mae:1.44972\n",
- "[23]\ttrain-rmse:1.85303\ttrain-mae:1.31908\tvalidation-rmse:2.03875\tvalidation-mae:1.45157\n",
- "[24]\ttrain-rmse:1.84696\ttrain-mae:1.31813\tvalidation-rmse:2.04676\tvalidation-mae:1.45646\n",
- "[2019-09-04 20:28:56.518 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 25.\n",
- "[25]\ttrain-rmse:1.83281\ttrain-mae:1.31091\tvalidation-rmse:2.04628\tvalidation-mae:1.45422\n",
- "[26]\ttrain-rmse:1.821\ttrain-mae:1.30427\tvalidation-rmse:2.04828\tvalidation-mae:1.45734\n",
- "[27]\ttrain-rmse:1.81488\ttrain-mae:1.30071\tvalidation-rmse:2.04599\tvalidation-mae:1.45618\n",
- "[28]\ttrain-rmse:1.8002\ttrain-mae:1.29268\tvalidation-rmse:2.05193\tvalidation-mae:1.4594\n",
- "[29]\ttrain-rmse:1.79308\ttrain-mae:1.29057\tvalidation-rmse:2.05075\tvalidation-mae:1.46388\n",
- "[2019-09-04 20:28:56.806 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 30.\n",
- "[30]\ttrain-rmse:1.78557\ttrain-mae:1.28705\tvalidation-rmse:2.05292\tvalidation-mae:1.46461\n",
- "[31]\ttrain-rmse:1.77816\ttrain-mae:1.28089\tvalidation-rmse:2.05105\tvalidation-mae:1.46399\n",
- "[32]\ttrain-rmse:1.76314\ttrain-mae:1.27199\tvalidation-rmse:2.05361\tvalidation-mae:1.46855\n",
- "[33]\ttrain-rmse:1.75584\ttrain-mae:1.2695\tvalidation-rmse:2.04789\tvalidation-mae:1.4681\n",
- "[34]\ttrain-rmse:1.75165\ttrain-mae:1.26756\tvalidation-rmse:2.04984\tvalidation-mae:1.47014\n",
- "[2019-09-04 20:28:57.182 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 35.\n",
- "[35]\ttrain-rmse:1.74615\ttrain-mae:1.26296\tvalidation-rmse:2.0479\tvalidation-mae:1.4715\n",
- "[36]\ttrain-rmse:1.73885\ttrain-mae:1.25856\tvalidation-rmse:2.04948\tvalidation-mae:1.4712\n",
- "[37]\ttrain-rmse:1.736\ttrain-mae:1.257\tvalidation-rmse:2.04992\tvalidation-mae:1.47051\n",
- "[38]\ttrain-rmse:1.73002\ttrain-mae:1.25165\tvalidation-rmse:2.05359\tvalidation-mae:1.47123\n",
- "[39]\ttrain-rmse:1.72742\ttrain-mae:1.25118\tvalidation-rmse:2.05419\tvalidation-mae:1.47141\n",
- "[2019-09-04 20:28:57.649 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 40.\n",
- "[40]\ttrain-rmse:1.72441\ttrain-mae:1.2498\tvalidation-rmse:2.0541\tvalidation-mae:1.47108\n",
- "[41]\ttrain-rmse:1.71452\ttrain-mae:1.24448\tvalidation-rmse:2.05683\tvalidation-mae:1.47221\n",
- "[42]\ttrain-rmse:1.70619\ttrain-mae:1.23868\tvalidation-rmse:2.0529\tvalidation-mae:1.47012\n",
- "[43]\ttrain-rmse:1.6994\ttrain-mae:1.23462\tvalidation-rmse:2.05173\tvalidation-mae:1.4685\n",
- "[44]\ttrain-rmse:1.69531\ttrain-mae:1.2325\tvalidation-rmse:2.05062\tvalidation-mae:1.46729\n",
- "[2019-09-04 20:28:57.973 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 45.\n",
- "[45]\ttrain-rmse:1.68994\ttrain-mae:1.22921\tvalidation-rmse:2.05038\tvalidation-mae:1.46773\n",
- "[46]\ttrain-rmse:1.68733\ttrain-mae:1.22771\tvalidation-rmse:2.04775\tvalidation-mae:1.46629\n",
- "[47]\ttrain-rmse:1.6745\ttrain-mae:1.22109\tvalidation-rmse:2.05052\tvalidation-mae:1.46931\n",
- "[48]\ttrain-rmse:1.6677\ttrain-mae:1.21652\tvalidation-rmse:2.05003\tvalidation-mae:1.46771\n",
- "[49]\ttrain-rmse:1.66666\ttrain-mae:1.21571\tvalidation-rmse:2.05055\tvalidation-mae:1.46835\n",
- "[2019-09-04 20:28:58.286 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 50.\n",
- "[50]\ttrain-rmse:1.66069\ttrain-mae:1.21231\tvalidation-rmse:2.05735\tvalidation-mae:1.47412\n",
- "[51]\ttrain-rmse:1.65219\ttrain-mae:1.20659\tvalidation-rmse:2.05643\tvalidation-mae:1.47175\n",
- "[52]\ttrain-rmse:1.6502\ttrain-mae:1.20494\tvalidation-rmse:2.05467\tvalidation-mae:1.46944\n",
- "[53]\ttrain-rmse:1.64501\ttrain-mae:1.20097\tvalidation-rmse:2.0534\tvalidation-mae:1.46682\n",
- "[54]\ttrain-rmse:1.63611\ttrain-mae:1.1947\tvalidation-rmse:2.05339\tvalidation-mae:1.46562\n",
- "[2019-09-04 20:28:58.703 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 55.\n",
- "[55]\ttrain-rmse:1.63324\ttrain-mae:1.19463\tvalidation-rmse:2.05613\tvalidation-mae:1.47189\n",
- "[56]\ttrain-rmse:1.62148\ttrain-mae:1.18714\tvalidation-rmse:2.05909\tvalidation-mae:1.47115\n",
- "[57]\ttrain-rmse:1.61668\ttrain-mae:1.18364\tvalidation-rmse:2.05695\tvalidation-mae:1.46885\n",
- "[58]\ttrain-rmse:1.6119\ttrain-mae:1.18027\tvalidation-rmse:2.05439\tvalidation-mae:1.46791\n",
- "[59]\ttrain-rmse:1.61104\ttrain-mae:1.1797\tvalidation-rmse:2.05469\tvalidation-mae:1.46735\n",
- "[2019-09-04 20:28:59.112 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 60.\n",
- "[60]\ttrain-rmse:1.60972\ttrain-mae:1.17818\tvalidation-rmse:2.05268\tvalidation-mae:1.46727\n",
- "[61]\ttrain-rmse:1.60483\ttrain-mae:1.17471\tvalidation-rmse:2.05238\tvalidation-mae:1.46723\n",
- "[62]\ttrain-rmse:1.60038\ttrain-mae:1.17187\tvalidation-rmse:2.05876\tvalidation-mae:1.46983\n",
- "[63]\ttrain-rmse:1.58954\ttrain-mae:1.16512\tvalidation-rmse:2.05812\tvalidation-mae:1.46628\n",
- "[64]\ttrain-rmse:1.58059\ttrain-mae:1.16055\tvalidation-rmse:2.06236\tvalidation-mae:1.4686\n",
- "[2019-09-04 20:28:59.606 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 65.\n",
- "[65]\ttrain-rmse:1.57333\ttrain-mae:1.15614\tvalidation-rmse:2.06741\tvalidation-mae:1.47281\n",
- "[66]\ttrain-rmse:1.57042\ttrain-mae:1.15445\tvalidation-rmse:2.06701\tvalidation-mae:1.47235\n",
- "[67]\ttrain-rmse:1.56843\ttrain-mae:1.15295\tvalidation-rmse:2.0663\tvalidation-mae:1.47219\n",
- "[68]\ttrain-rmse:1.56508\ttrain-mae:1.15063\tvalidation-rmse:2.06515\tvalidation-mae:1.47276\n",
- "[69]\ttrain-rmse:1.56182\ttrain-mae:1.14841\tvalidation-rmse:2.06628\tvalidation-mae:1.47551\n",
- "[2019-09-04 20:29:00.062 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 70.\n",
- "[70]\ttrain-rmse:1.55922\ttrain-mae:1.14672\tvalidation-rmse:2.06947\tvalidation-mae:1.47674\n",
- "[71]\ttrain-rmse:1.55271\ttrain-mae:1.14336\tvalidation-rmse:2.07123\tvalidation-mae:1.4787\n",
- "[72]\ttrain-rmse:1.54876\ttrain-mae:1.14062\tvalidation-rmse:2.07051\tvalidation-mae:1.47639\n",
- "[73]\ttrain-rmse:1.54667\ttrain-mae:1.13933\tvalidation-rmse:2.075\tvalidation-mae:1.47814\n",
- "[74]\ttrain-rmse:1.54487\ttrain-mae:1.13787\tvalidation-rmse:2.07309\tvalidation-mae:1.47687\n",
- "[2019-09-04 20:29:00.610 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 75.\n",
- "[75]\ttrain-rmse:1.53595\ttrain-mae:1.1307\tvalidation-rmse:2.07264\tvalidation-mae:1.47924\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[76]\ttrain-rmse:1.53418\ttrain-mae:1.12846\tvalidation-rmse:2.07262\tvalidation-mae:1.47965\n",
- "[77]\ttrain-rmse:1.52683\ttrain-mae:1.12359\tvalidation-rmse:2.07258\tvalidation-mae:1.47693\n",
- "[78]\ttrain-rmse:1.52175\ttrain-mae:1.12025\tvalidation-rmse:2.0736\tvalidation-mae:1.47748\n",
- "[79]\ttrain-rmse:1.51673\ttrain-mae:1.11775\tvalidation-rmse:2.07555\tvalidation-mae:1.47927\n",
- "[2019-09-04 20:29:01.110 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 80.\n",
- "[80]\ttrain-rmse:1.51552\ttrain-mae:1.11628\tvalidation-rmse:2.07579\tvalidation-mae:1.47937\n",
- "[81]\ttrain-rmse:1.51042\ttrain-mae:1.11335\tvalidation-rmse:2.07407\tvalidation-mae:1.47882\n",
- "[82]\ttrain-rmse:1.49888\ttrain-mae:1.10658\tvalidation-rmse:2.07054\tvalidation-mae:1.47881\n",
- "[83]\ttrain-rmse:1.48993\ttrain-mae:1.10055\tvalidation-rmse:2.07382\tvalidation-mae:1.48155\n",
- "[84]\ttrain-rmse:1.48741\ttrain-mae:1.09893\tvalidation-rmse:2.07477\tvalidation-mae:1.48274\n",
- "[2019-09-04 20:29:01.640 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 85.\n",
- "[85]\ttrain-rmse:1.48121\ttrain-mae:1.09505\tvalidation-rmse:2.07463\tvalidation-mae:1.48405\n",
- "[86]\ttrain-rmse:1.47717\ttrain-mae:1.09075\tvalidation-rmse:2.07385\tvalidation-mae:1.48084\n",
- "[87]\ttrain-rmse:1.47266\ttrain-mae:1.08722\tvalidation-rmse:2.07319\tvalidation-mae:1.4827\n",
- "[88]\ttrain-rmse:1.46499\ttrain-mae:1.08129\tvalidation-rmse:2.07362\tvalidation-mae:1.4862\n",
- "[89]\ttrain-rmse:1.46304\ttrain-mae:1.08039\tvalidation-rmse:2.07298\tvalidation-mae:1.48554\n",
- "[2019-09-04 20:29:02.197 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 90.\n",
- "[90]\ttrain-rmse:1.45968\ttrain-mae:1.07896\tvalidation-rmse:2.07361\tvalidation-mae:1.487\n",
- "[91]\ttrain-rmse:1.45246\ttrain-mae:1.07513\tvalidation-rmse:2.07227\tvalidation-mae:1.48551\n",
- "[92]\ttrain-rmse:1.44731\ttrain-mae:1.07157\tvalidation-rmse:2.07416\tvalidation-mae:1.48621\n",
- "[93]\ttrain-rmse:1.44069\ttrain-mae:1.06819\tvalidation-rmse:2.07838\tvalidation-mae:1.49\n",
- "[94]\ttrain-rmse:1.43273\ttrain-mae:1.06434\tvalidation-rmse:2.08272\tvalidation-mae:1.49347\n",
- "[2019-09-04 20:29:03.068 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 95.\n",
- "[95]\ttrain-rmse:1.4278\ttrain-mae:1.06026\tvalidation-rmse:2.08366\tvalidation-mae:1.49517\n",
- "[96]\ttrain-rmse:1.42101\ttrain-mae:1.05588\tvalidation-rmse:2.0787\tvalidation-mae:1.49234\n",
- "[97]\ttrain-rmse:1.41775\ttrain-mae:1.05389\tvalidation-rmse:2.07722\tvalidation-mae:1.49251\n",
- "[98]\ttrain-rmse:1.41412\ttrain-mae:1.05038\tvalidation-rmse:2.07725\tvalidation-mae:1.49371\n",
- "[2019-09-04 20:29:03.090 38f9d36a2c42.ant.amazon.com:42958 INFO hook.py:182] Saved iteration 99.\n",
- "[99]\ttrain-rmse:1.40749\ttrain-mae:1.04556\tvalidation-rmse:2.07435\tvalidation-mae:1.49251\n"
- ]
- }
- ],
- "source": [
- "bst = xgb.train(\n",
- " params=params,\n",
- " dtrain=dtrain,\n",
- " evals=watchlist,\n",
- " num_boost_round=num_round,\n",
- " callbacks=[hook])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Data Analysis - Manual\n",
- "Now that we have trained the system we can analyze the data. Notice that this notebook focuses on after-the-fact analysis. Tornasole also provides a collection of tools to do automatic analysis as the training run is progressing, which will be covered in a different notebook.\n",
- "\n",
- "We import a basic analysis library, which defines a concept of `Trial`. A `Trial` is a single training run, which is depositing values in a local directory (`LocalTrial`) or S3 (`S3Trial`). In this case we are using a `LocalTrial` - if you wish, you can change the output from `./ts_output` to `s3://mybucket/myprefix` and use `S3Trial` instead of `LocalTrial`."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "And we read the data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[2019-09-04 20:29:03.117 38f9d36a2c42.ant.amazon.com:42958 INFO local_trial.py:22] Loading trial myrun at path ./ts_output\n",
- "[2019-09-04 20:29:03.119 38f9d36a2c42.ant.amazon.com:42958 INFO local_trial.py:58] Loaded 3 collections\n"
- ]
- }
- ],
- "source": [
- "trial = LocalTrial(\"myrun\", \"./ts_output\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can list all the tensors we know something about. Each one of these names is the name of a tensor - the name is a combination of the feature name (which, in these cases, is auto-assigned by XGBoost) and whether it's an evaluation metric, feature importance, or SHAP value."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[2019-09-04 20:29:06.817 38f9d36a2c42.ant.amazon.com:42958 INFO trial.py:98] Training has ended, will try to do a final refresh in 5 sec\n",
- "[2019-09-04 20:29:11.839 38f9d36a2c42.ant.amazon.com:42958 INFO trial.py:103] Marked loaded all steps to True\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "['train-rmse',\n",
- " 'train-mae',\n",
- " 'validation-rmse',\n",
- " 'validation-mae',\n",
- " 'f8/feature_importance',\n",
- " 'f2/feature_importance',\n",
- " 'f6/feature_importance',\n",
- " 'f1/feature_importance',\n",
- " 'f4/feature_importance',\n",
- " 'f1/average_shap',\n",
- " 'f2/average_shap',\n",
- " 'f4/average_shap',\n",
- " 'f6/average_shap',\n",
- " 'f3/feature_importance',\n",
- " 'f5/feature_importance',\n",
- " 'f7/feature_importance',\n",
- " 'f3/average_shap',\n",
- " 'f7/average_shap',\n",
- " 'f5/average_shap']"
- ]
- },
- "execution_count": 12,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "trial.tensors()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For each tensor we can ask for which steps we have data - in this case, every 5 steps"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]"
- ]
- },
- "execution_count": 13,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "trial.tensor(\"f8/feature_importance\").steps()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can obtain each tensor at each step as a `numpy` array"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "numpy.ndarray"
- ]
- },
- "execution_count": 14,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "type(trial.tensor(\"f8/feature_importance\").value(30))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Performance metrics\n",
- "\n",
- "We can also create a simple function that visualizes the training and validation errors\n",
- "as the training progresses.\n",
- "We expect each error to get smaller over time, as the system converges to a good solution.\n",
- "Now, remember that this is an interactive analysis - we are showing these tensors to give an idea of the data. \n",
- "\n",
- "Later on in this notebook we will run an automated analysis."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Define a function that, for the given tensor name, walks through all \n",
- "# the iterations for which we have data and fetches the value.\n",
- "# Returns the set of steps and the values\n",
- "\n",
- "def get_data(trial, tname):\n",
- " tensor = trial.tensor(tname)\n",
- " steps = tensor.steps()\n",
- " vals = [tensor.value(s) for s in steps]\n",
- " return steps, vals"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "metrics_to_plot = [\"train-rmse\", \"validation-rmse\"]\n",
- "for metric in metrics_to_plot:\n",
- " steps, data = get_data(trial, metric)\n",
- " plt.plot(steps, data, label=metric)\n",
- "plt.xlabel('Iteration')\n",
- "plt.ylabel('Root mean squared error')\n",
- "plt.legend()\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Feature importances\n",
- "\n",
- "We can also visualize the feature importances as determined by\n",
- "[xgboost.get_fscore()](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.get_fscore).\n",
- "Note that feature importances with zero values are not included here\n",
- "(which means that those features were not used in any split conditions)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [],
- "source": [
- "def plot_collections(trial, endswith, ylabel=''):\n",
- " \n",
- " plt.figure(\n",
- " num=1, figsize=(8, 8), dpi=80,\n",
- " facecolor='w', edgecolor='k')\n",
- "\n",
- " features_to_plot = [\n",
- " tname for tname in trial.tensors()\n",
- " if tname.endswith(endswith)\n",
- " ]\n",
- "\n",
- " for feature in sorted(features_to_plot):\n",
- " steps, data = get_data(trial, feature)\n",
- " label = feature.replace(endswith, '')\n",
- " plt.plot(steps, data, label=label)\n",
- "\n",
- " plt.legend(bbox_to_anchor=(1.04,1), loc='upper left')\n",
- " plt.xlabel('Iteration')\n",
- " plt.ylabel(ylabel)\n",
- " plt.show()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "plot_collections(trial, \"/feature_importance\", \"Feature importance\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### SHAP\n",
- "\n",
- "[SHAP](https://github.com/slundberg/shap) (SHapley Additive exPlanations) is\n",
- "another approach to explain the output of machine learning models.\n",
- "SHAP values represent a feature's contribution to a change in the model output."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "plot_collections(trial, \"/average_shap\", \"SHAP values\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Data Analysis - Automatic\n",
- "So far we have conducted a human analysis, but the real power of Tornasole comes from having automatic monitoring of training runs. To do so we will build a SageMaker-based system that monitors existing runs in real time. Data traces deposited in S3 are the exchange mechanism: \n",
- "- the training system deposits data into s3://mybucket/myrun/\n",
- "- the monitoring system watches and reads data from s3://mybucket/myrun/\n",
- "\n",
- "In this example we will simulate reading from that."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "metadata": {},
- "outputs": [],
- "source": [
- "from smdebug.rules.generic import LossNotDecreasing\n",
- "from smdebug.rules.rule_invoker import invoke_rule"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 21,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[2019-09-04 20:29:13.371 38f9d36a2c42.ant.amazon.com:42958 INFO loss_decrease.py:65] LossNotDecreasing rule created with num_steps: 5, diff_percent: 0.0, mode: GLOBAL, tensor_regex: , collection_names: metric\n",
- "[2019-09-04 20:29:13.372 38f9d36a2c42.ant.amazon.com:42958 INFO rule_invoker.py:76] Started execution of rule LossNotDecreasing at step 0\n",
- "[2019-09-04 20:29:13.376 38f9d36a2c42.ant.amazon.com:42958 INFO loss_decrease.py:180] 1 loss is not decreasing over the last 5 steps at step 20\n",
- "[2019-09-04 20:29:13.377 38f9d36a2c42.ant.amazon.com:42958 INFO loss_decrease.py:180] 2 losses are not decreasing over the last 5 steps at step 25\n",
- "[2019-09-04 20:29:13.380 38f9d36a2c42.ant.amazon.com:42958 INFO loss_decrease.py:180] 2 losses are not decreasing over the last 5 steps at step 30\n",
- "[2019-09-04 20:29:13.382 38f9d36a2c42.ant.amazon.com:42958 INFO loss_decrease.py:180] 1 loss is not decreasing over the last 5 steps at step 35\n",
- "[2019-09-04 20:29:13.384 38f9d36a2c42.ant.amazon.com:42958 INFO loss_decrease.py:180] 1 loss is not decreasing over the last 5 steps at step 40\n",
- "[2019-09-04 20:29:13.385 38f9d36a2c42.ant.amazon.com:42958 INFO loss_decrease.py:180] 2 losses are not decreasing over the last 5 steps at step 50\n",
- "[2019-09-04 20:29:13.388 38f9d36a2c42.ant.amazon.com:42958 INFO loss_decrease.py:180] 2 losses are not decreasing over the last 5 steps at step 65\n",
- "[2019-09-04 20:29:13.389 38f9d36a2c42.ant.amazon.com:42958 INFO loss_decrease.py:180] 2 losses are not decreasing over the last 5 steps at step 70\n",
- "[2019-09-04 20:29:13.391 38f9d36a2c42.ant.amazon.com:42958 INFO loss_decrease.py:180] 2 losses are not decreasing over the last 5 steps at step 75\n",
- "[2019-09-04 20:29:13.392 38f9d36a2c42.ant.amazon.com:42958 INFO loss_decrease.py:180] 2 losses are not decreasing over the last 5 steps at step 80\n",
- "[2019-09-04 20:29:13.393 38f9d36a2c42.ant.amazon.com:42958 INFO loss_decrease.py:180] 1 loss is not decreasing over the last 5 steps at step 85\n",
- "[2019-09-04 20:29:13.395 38f9d36a2c42.ant.amazon.com:42958 INFO loss_decrease.py:180] 1 loss is not decreasing over the last 5 steps at step 90\n",
- "[2019-09-04 20:29:13.396 38f9d36a2c42.ant.amazon.com:42958 INFO rule_invoker.py:90] Ended execution of rule LossNotDecreasing at end_step 94\n"
- ]
- }
- ],
- "source": [
- "loss_not_decreasing = LossNotDecreasing(\n",
- " trial,\n",
- " use_losses_collection=False,\n",
- " collection_names=\"metrics\",\n",
- " num_steps=5)\n",
- "invoke_rule(loss_not_decreasing, end_step=95)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "This concludes this notebook. For more information see the documentation at \n",
- "- https://github.com/awslabs/tornasole_core\n"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.4"
- },
- "pycharm": {
- "stem_cell": {
- "cell_type": "raw",
- "source": [],
- "metadata": {
- "collapsed": false
- }
- }
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/examples/xgboost/sagemaker-notebooks/xgboost_classification.ipynb b/examples/xgboost/sagemaker-notebooks/xgboost_classification.ipynb
deleted file mode 100644
index c7d7cfdb4..000000000
--- a/examples/xgboost/sagemaker-notebooks/xgboost_classification.ipynb
+++ /dev/null
@@ -1,726 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Debugging SageMaker XGBoost Training Jobs with Tornasole\n",
- "\n",
- "This notebook uses the MNIST dataset to demonstrate a classification task using Tornasole with XGBoost.\n",
- "For a regression problem, see [xgboost_regression.ipynb](xgboost_regression.ipynb).\n",
- "\n",
- "## Overview\n",
- "\n",
- "Tornasole is a new capability of Amazon SageMaker that allows debugging machine learning training. \n",
- "Tornasole helps you monitor your training in near real time using rules and alerts you\n",
- "once it detects an inconsistency in training. \n",
- "\n",
- "Using Tornasole is a two step process: Saving tensors and Analysis.\n",
- "Let's look at each one of them closely.\n",
- "\n",
- "### Saving tensors (and scalars)\n",
- "\n",
- "In deep learning algorithms, tensors define the state of the training job\n",
- "at any particular instant in its lifecycle.\n",
- "Tornasole exposes a library which allows you to capture these tensors and\n",
- "save them for analysis.\n",
- "Although XGBoost is not a deep learning algorithm, Tornasole is highly customizable\n",
- "and can help provide interpretability by saving insightful metrics, such as\n",
- "performance metrics or feature importances, at different frequencies.\n",
- "Refer to [DeveloperGuide_XGBoost](../DeveloperGuide_XG.md) for details on how to\n",
- "save the metrics you want.\n",
- "\n",
- "### Analysis\n",
- "\n",
- "Analysis of the tensors emitted is captured by the Tornasole concept called ***Rules***.\n",
- "At a high level, a rule is Python code that detects certain conditions during training.\n",
- "Some of the conditions that a data scientist training an algorithm may care about are\n",
- "monitoring for gradients getting too large or too small, detecting overfitting, and so on.\n",
- "Tornasole will come pre-packaged with certain rules.\n",
- "Users can write their own rules using the Tornasole APIs.\n",
- "You can also analyze raw tensor data outside of the Rules construct in, say, a SageMaker notebook,\n",
- "using Tornasole's full set of APIs. \n",
- "Please refer to [DeveloperGuide_Rules](../../../rules/DeveloperGuide_Rules.md) for more details about analysis.\n",
- "\n",
- "This example guides you through installation of the required components for emitting tensors in a \n",
- "SageMaker training job and applying a rule over the tensors to monitor the live status of the job. "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Setup\n",
- "\n",
- "We will also install the required tools which will allow emission of tensors (saving tensors) and application of rules to analyze them. This is only for the purposes of this private beta. Once we do this, we will be ready to use smdebug.\n",
- "\n",
- "You'll probably have to restart this notebook after running the following code cell."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "! aws s3 sync s3://tornasole-external-preview-use1/sdk/ ~/SageMaker/tornasole-preview-sdk/\n",
- "! pip3 -q install ~/SageMaker/tornasole-preview-sdk/ts-binaries/tornasole_xgboost/py3/latest/tornasole-* --user\n",
- "! chmod +x ~/SageMaker/tornasole-preview-sdk/installer.sh && ~/SageMaker/tornasole-preview-sdk/installer.sh"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### If you are running this notebook for the first time, please wait for the above setup to complete and restart the notebook by selecting *Kernel -> Restart Kernel* before proceeding."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We have built SageMaker XGBoost containers with smdebug. You can use them from ECR from SageMaker. Here are the links to the images. Please use the image from the appropriate region in which you want your jobs to run."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "import boto3\n",
- "from sagemaker import get_execution_role\n",
- "\n",
- "# Below changes the region to be one where this notebook is running\n",
- "REGION = boto3.Session().region_name\n",
- "ROLE = get_execution_role()\n",
- "os.environ[\"AWS_REGION\"] = REGION\n",
- "\n",
- "TAG = \"latest\"\n",
- "docker_image_name = \"072677473360.dkr.ecr.{}.amazonaws.com/tornasole-preprod-xgboost-0.90-cpu:{}\".format(REGION, TAG)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Training XGBoost models in SageMaker with Tornasole\n",
- "\n",
- "### SageMaker XGBoost as a framework\n",
- "\n",
- "We'll train a few XGBoost models in this notebook with Tornasole enabled and monitor the training jobs with Tornasole Rules. This will be done using SageMaker XGBoost 0.90 Container as a framework. The [XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) can be used as a built-in algorithm or as a framework such as TensorFlow. Using XGBoost as a framework provides more flexibility than using it as a built-in algorithm as it enables more advanced scenarios that allow pre-processing and post-processing scripts to be incorporated into your training script.\n",
- "\n",
- "Let us first train a simple example training script [xgboost_mnist_basic_hook_demo.py](../scripts/xgboost_mnist_basic_hook_demo.py) with XGBoost enabled in SageMaker using the SageMaker Estimator API, along with a LossNotDecreasing Rule to monitor the training job in real time. A Tornasole Rule is essentially Python code which analyzes tensors saved by Tornasole and validates some condition. The LossNotDecreasing rule is a first party (1P) rule provided by smdebug. For other 1P rules that can be used in XGBoost, refer to [FirstPartyRules.md](../../../rules/FirstPartyRules.md)\n",
- "\n",
- "During training, Tornasole will capture tensors as specified in its configuration and the LossNotDecreasing Rule job will monitor whether you are running into a situation where loss is not going down. The rule will emit a CloudWatch event if it finds that the performance metrics are not decreasing during training.\n",
- "\n",
- "### Enabling Tornasole in the script\n",
- "\n",
- "You can see in the script that we have made a couple of simple changes to enable smdebug. We created a SessionHook which we pass as a callback function when creating a Booster. We passed a SaveConfig object telling the hook to save the evaluation metrics, feature importances, and SHAP values at regular intervals. Note that Tornasole is highly configurable; you can choose exactly what to save. The changes are described in a bit more detail below after we train this example, and in even more detail in our [Developer Guide for XGBoost](../DeveloperGuide_XG.md). \n",
- "\n",
- "```python\n",
- "from smdebug.xgboost import SessionHook, SaveConfig\n",
- "\n",
- "save_config = SaveConfig(save_interval=frequency)\n",
- "hook = SessionHook(save_config=save_config)\n",
- "\n",
- "bst = xgboost.train(\n",
- " ...\n",
- " callbacks=[hook]\n",
- ")\n",
- "```\n",
- "\n",
- "### XGBoost for Classification\n",
- "\n",
- "We use the [MNIST data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html) stored in [LIBSVM](https://www.csie.ntu.edu.tw/~cjlin/libsvm/) format.\n",
- "\n",
- "Refer to [XGBoost for Classification](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_amazon_algorithms/xgboost_mnist)\n",
- "for an example of using classification from Amazon SageMaker's implementation of\n",
- "[XGBoost](https://github.com/dmlc/xgboost)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "entry_point_script = \"../scripts/xgboost_mnist_basic_hook_demo.py\"\n",
- "\n",
- "hyperparameters={\n",
- " \"max_depth\": \"5\",\n",
- " \"eta\": \"0.5\",\n",
- " \"gamma\": \"4\",\n",
- " \"min_child_weight\": \"6\",\n",
- " \"silent\": \"0\",\n",
- " \"objective\": \"multi:softmax\",\n",
- " \"num_class\": \"10\", # num_class is required for 'multi:*' objectives\n",
- " \"num_round\": \"10\",\n",
- " \"save_frequency\": \"1\"\n",
- "}"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from sagemaker.xgboost import XGBoost\n",
- "\n",
- "estimator = XGBoost(\n",
- " image_name=docker_image_name,\n",
- " base_job_name=\"demo-tornasole-xgboost-classification\",\n",
- " entry_point=entry_point_script,\n",
- " hyperparameters=hyperparameters,\n",
- " train_instance_type=\"ml.m4.4xlarge\",\n",
- " train_instance_count=1,\n",
- " framework_version=\"0.90-1\",\n",
- " py_version=\"py3\",\n",
- " role=ROLE,\n",
- " \n",
- " # These are Tornasole specific parameters, \n",
- " # debug=True means rule specified in rules_specification \n",
- " # will run as rule job. \n",
- " # Below, we specify to run the first party rule LossNotDecreasing\n",
- " # on a ml.c5.4xlarge instance\n",
- " debug=True,\n",
- " rules_specification=[\n",
- " {\n",
- " \"RuleName\": \"LossNotDecreasing\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " \"RuntimeConfigurations\": {\n",
- " \"use_losses_collection\": \"False\",\n",
- " \"tensor_regex\": \"train-merror,validation-merror\",\n",
- " \"num_steps\" : \"10\"\n",
- " }\n",
- " }\n",
- " ]\n",
- ")\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "pycharm": {
- "name": "#%% md\n"
- }
- },
- "source": [
- "*Note that Tornasole is only supported for `py_version='py3'` currently.*"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# This is a fire and forget event.\n",
- "# By setting wait=False, we just submit the job to run in the background.\n",
- "# In the background SageMaker will spin off 1 training job and 1 rule job for you.\n",
- "# Please follow this notebook to see status of the training job and the rule job.\n",
- "estimator.fit(wait=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Result\n",
- "As a result of the above command, SageMaker will spin off 1 training job and 1 rule job for you - the first one being the job which produces the tensors to be analyzed and the second one, which analyzes the tensors to check if `train-merror` and `validation-merror` are not decreasing at any point during training.\n",
- "\n",
- "### Describing the training job\n",
- "We can check the status of the training job by running the following command:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Below command will give the status of training job\n",
- "# Note: In the output of below command you will see DebugConfig parameter \n",
- "job_name = estimator.latest_training_job.name\n",
- "client = estimator.sagemaker_session.sagemaker_client\n",
- "description = client.describe_training_job(TrainingJobName=job_name)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# The status of the training job can be seen below\n",
- "description[\"TrainingJobStatus\"]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Once your training job is started SageMaker will spin up a rule execution job to run the LossNotDecreasing rule.\n",
- "\n",
- "### Tornasole specific parameters in the description\n",
- "**DebugConfig** parameter has details about Tornasole related configuration. The key parameters to look for below are\n",
- "\n",
- "*S3OutputPath* : This is the path where output tensors from Tornasole are saved. \n",
- "*RuleConfig* : This parameter holds the rule configuration that was passed when creating the training job. Here you should be able to see details of the rule that ran for training. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "description[\"DebugConfig\"]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Check the status of the Rule Execution Job\n",
- "To get the rule execution job that SageMaker started for you, run the command below and it shows you the `RuleName`, `RuleStatus`, `FailureReason` if any, and `RuleExecutionJobArn`. If the tensors meet a rule evaluation condition, the rule execution job throws a client error with `FailureReason: RuleEvaluationConditionMet`. These details are also available as part of the response `description` above under: `description['RuleMonitoringStatuses']`\n",
- "\n",
- "\n",
- "The logs of the training job are available in the CloudWatch Logstream `/aws/sagemaker/TrainingJobs` with `RuleExecutionJobArn`. \n",
- "\n",
- "Once the rule execution job starts, if it identifies a loss-not-decreasing situation in the training job, it raises the `RuleEvaluationConditionMet` exception and ends the job. \n",
- "\n",
- "**Note that the next cell blocks until the rule execution job ends. You can stop it at any point to proceed to the rest of the notebook. Once it says RuleStatus is Started, and shows the `RuleExecutionJobArn`, you can look at the status of the rule being monitored. At that point, we can also look at the logs as shown in the next cell**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "estimator.describe_rule_execution_jobs()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Check logs of the rule execution jobs\n",
- "\n",
- "If you want to access the logs of a particular rule job name, you can do the following. First, you need to get the rule job name (`RuleExecutionJobArn` field from the training job description). Note that this is only available after the rule job reaches Started stage. Hence the next cell waits till the job name is available."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import time\n",
- "\n",
- "rule_descr = client.describe_training_job(TrainingJobName=job_name)[\"RuleMonitoringStatuses\"]\n",
- "print(\"Waiting for rule execution job to start\")\n",
- "while \"RuleExecutionJobArn\" not in rule_descr[0]:\n",
- " time.sleep(5)\n",
- " rule_descr = client.describe_training_job(TrainingJobName=job_name)[\"RuleMonitoringStatuses\"]\n",
- "\n",
- "rule_job_arn = rule_descr[0][\"RuleExecutionJobArn\"]\n",
- "print(\"Rule execution job has started. The job ARN is {}\".format(rule_job_arn))\n",
- "rule_job_name = rule_job_arn.split('/')[1]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now we can attach to this job to see its logs"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from sagemaker.estimator import Estimator\n",
- "loss_not_decreasing = Estimator.attach(rule_job_name)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "In the above example, the `LossNotDecreasing` rule was completed without producing an alert because both `train-merror` and `validation-merror` decreased steadily throughout the training run. To see an example of the rule when performance metrics stop decreasing during training, see [xgboost_regression.ipynb](xgboost_regression.ipynb)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Data Analysis - Manual\n",
- "\n",
- "Now that we have trained the system we can analyze the data. Here we focus on after-the-fact analysis.\n",
- "\n",
- "We import a basic analysis library, which defines a concept of `Trial` that represents a single training run."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "from urllib.parse import urlparse\n",
- "from smdebug.trials import create_trial\n",
- "\n",
- "s3_output_path = description[\"DebugConfig\"][\"DebugHookConfig\"][\"S3OutputPath\"]\n",
- "trial = create_trial(s3_output_path)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can list all the tensors we know something about. Each one of these names is the name of a tensor - the name is a combination of the feature name (which, in these cases, is auto-assigned by XGBoost) and whether it's an evaluation metric, feature importance, or SHAP value. We also have `y/validation` for true labels from the validation set and `y_hat/validation` for predicted labels on the same validation set."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "trial.tensors()[:10]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For each tensor we can ask for which steps we have data - in this case, every 2 steps"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "print(list(trial.tensor(\"validation-merror\").steps()))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can obtain each tensor at each step as a `numpy` array"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "type(trial.tensor(\"train-merror\").value(5))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Performance metrics\n",
- "\n",
- "We can also create a simple function that visualizes the training and validation errors\n",
- "as the training progresses.\n",
- "We expect each training error to get smaller over time, as the system converges to a good solution.\n",
- "Now, remember that this is an interactive analysis - we are showing these tensors to give an idea of the data. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import matplotlib.pyplot as plt\n",
- "import seaborn as sns\n",
- "\n",
- "# Define a function that, for the given tensor name, walks through all \n",
- "# the iterations for which we have data and fetches the value.\n",
- "# Returns the set of steps and the values\n",
- "def get_data(trial, tname):\n",
- " tensor = trial.tensor(tname)\n",
- " steps = tensor.steps()\n",
- " vals = [tensor.value(s) for s in steps]\n",
- " return steps, vals"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "metrics_to_plot = [\"train-merror\", \"validation-merror\"]\n",
- "for metric in metrics_to_plot:\n",
- " steps, data = get_data(trial, metric)\n",
- " plt.plot(steps, data, label=metric)\n",
- "plt.xlabel('Iteration')\n",
- "plt.ylabel('Classification error')\n",
- "plt.legend()\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Feature importances\n",
- "\n",
- "We can also visualize the feature importances as determined by\n",
- "[xgboost.get_fscore()](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.get_fscore).\n",
- "Note that feature importances with zero values are not included here\n",
- "(which means that those features were not used in any split condisitons)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "def plot_collections(trial, collection_name, ylabel=''):\n",
- " \n",
- " plt.figure(\n",
- " num=1, figsize=(8, 8), dpi=80,\n",
- " facecolor='w', edgecolor='k')\n",
- "\n",
- " features = trial.collection(collection_name).tensor_names\n",
- "\n",
- " # to avoid cluttering, we will plot only one out of 20 features\n",
- " for feature in list(features)[::20]:\n",
- " steps, data = get_data(trial, feature)\n",
- " label = feature.replace('/' + collection_name, '')\n",
- " plt.plot(steps, data, label=label)\n",
- "\n",
- " plt.legend(bbox_to_anchor=(1.04,1), loc='upper left')\n",
- " plt.xlabel('Iteration')\n",
- " plt.ylabel(ylabel)\n",
- " plt.show()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "plot_collections(trial, \"feature_importance\", \"Feature importance\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### SHAP\n",
- "\n",
- "[SHAP](https://github.com/slundberg/shap) (SHapley Additive exPlanations) is\n",
- "another approach to explain the output of machine learning models.\n",
- "SHAP values represent a feature's contribution to a change in the model output."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "plot_collections(trial, \"average_shap\", \"SHAP values\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Confusion matrix"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "from sklearn.metrics import confusion_matrix\n",
- "from IPython.display import display, clear_output\n",
- "\n",
- "fig, ax = plt.subplots()\n",
- "\n",
- "for step in range(0, 9):\n",
- " cm = confusion_matrix(\n",
- " trial.tensor('labels').value(step),\n",
- " trial.tensor('predictions').value(step)\n",
- " )\n",
- " normalized_cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n",
- " sns.heatmap(normalized_cm, cmap=\"bone\", ax=ax, cbar=False, annot=cm, fmt='')\n",
- " print(f\"iteration: {step}\")\n",
- " display(fig)\n",
- " plt.pause(1)\n",
- " ax.clear()\n",
- " clear_output(wait=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 1P rule: Confusion matrix\n",
- "\n",
- "As another example of using a first party (1P) rule provided by Tornasole, let us again train the example training script [xgboost_mnist_basic_hook_demo.py](../scripts/xgboost_mnist_basic_hook_demo.py) and use a 1P rule `Confusion` to monitor the training job in real time.\n",
- "\n",
- "During training, the `Confusion` Rule job will monitor whether the ratio of on-diagonal and off-diagonal values in the confusion matrix falls outside a specified range. In other words, this rule evaluates the goodness of a confusion matrix for a classification problem. It creates a matrix of size `category_no` $\\times$ `category_no` and populates it with data coming from (`y`, `y_hat`) pairs. For each (`y`, `y_hat`) pair, the count in `confusion[y][y_hat]` is incremented by 1. Once the matrix is fully populated, the ratio of data on- and off-diagonal will be evaluated according to:\n",
- "\n",
- "- For elements on the diagonal:\n",
- "\n",
- "$$ \\frac{ \\text{confusion}_{ii} }{ \\sum_j \\text{confusion}_{jj} } \\geq \\text{min_diag} $$\n",
- "\n",
- "- For elements off the diagonal:\n",
- "\n",
- "$$ \\frac{ \\text{confusion}_{ji} }{ \\sum_j \\text{confusion}_{ji} } \\leq \\text{max_off_diag} $$\n",
- "\n",
- "If the condition is met, the rule will emit a CloudWatch event.\n",
- "\n",
- "Note that this rule will infer the default parameters if configurations are not specified, so you can simply use\n",
- "\n",
- "```python\n",
- "rules_specification = [\n",
- " {\n",
- " \"RuleName\": \"Confusion\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\"\n",
- " }\n",
- "]\n",
- "```\n",
- "If you want to specify the optional parameters, you can do so by using `RuntimeConfigurations`:\n",
- "\n",
- "```python\n",
- "rules_specification = [\n",
- " {\n",
- " \"RuleName\": \"Confusion\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " \"RuntimeConfigurations\": {\n",
- " \"category_no\": \"10\",\n",
- " \"min_diag\": \"0.8\",\n",
- " \"max_off_diag\": \"0.2\"\n",
- " }\n",
- " }\n",
- "]\n",
- "```\n",
- "\n",
- "For `Confusion` Rule API and other 1P rules that can be used in XGBoost, refer to [FirstPartyRules.md](../../../rules/FirstPartyRules.md)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "estimator = XGBoost(\n",
- " image_name=docker_image_name,\n",
- " base_job_name=\"demo-tornasole-xgboost-confusion\",\n",
- " entry_point=entry_point_script,\n",
- " hyperparameters=hyperparameters,\n",
- " train_instance_type=\"ml.m4.4xlarge\",\n",
- " train_instance_count=1,\n",
- " framework_version=\"0.90-1\",\n",
- " py_version=\"py3\",\n",
- " role=ROLE,\n",
- "\n",
- " debug=True,\n",
- " rules_specification=[\n",
- " {\n",
- " \"RuleName\": \"Confusion\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\"\n",
- " }\n",
- " ]\n",
- ")\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "estimator.fit(wait=False)\n",
- "\n",
- "job_name = estimator.latest_training_job.name\n",
- "client = estimator.sagemaker_session.sagemaker_client\n",
- "description = client.describe_training_job(TrainingJobName=job_name)\n",
- "\n",
- "description[\"TrainingJobStatus\"]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "description[\"DebugConfig\"]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "estimator.describe_rule_execution_jobs()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "This notebook showed two examples of using 1P rules provided by Tornasole, but you can also write your own rules, looking at these 1P rules for inspiration. Refer to [DeveloperGuide_Rules.md](../../../rules/DeveloperGuide_Rules.md) for more on the APIs you can use to write your own rules as well as descriptions for the 1P rules that we provide. [xgboost_regression.ipynb](xgboost_regression.ipynb) also demonstrates how to use a custom rule that monitors the ratio of feature importance values."
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "conda_python3",
- "language": "python",
- "name": "conda_python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.5"
- },
- "pycharm": {
- "stem_cell": {
- "cell_type": "raw",
- "metadata": {
- "collapsed": false
- },
- "source": []
- }
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/examples/xgboost/sagemaker-notebooks/xgboost_regression.ipynb b/examples/xgboost/sagemaker-notebooks/xgboost_regression.ipynb
deleted file mode 100644
index a8c358516..000000000
--- a/examples/xgboost/sagemaker-notebooks/xgboost_regression.ipynb
+++ /dev/null
@@ -1,962 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Debugging SageMaker XGBoost Training Jobs with Tornasole\n",
- "\n",
- "\n",
- "This notebook uses the [Abalone data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html) to demonstrate a regression task using Tornasole with XGBoost. For a classification problem, see [xgboost_classification.ipynb](xgboost_classification.ipynb).\n",
- "\n",
- "## Overview\n",
- "\n",
- "Tornasole is a new capability of Amazon SageMaker that allows debugging machine learning training. \n",
- "Tornasole helps you monitor your training in near real time using rules and alerts you\n",
- "once it has detected an inconsistency in training. \n",
- "\n",
- "Using Tornasole is a two-step process: Saving tensors and Analysis.\n",
- "Let's look at each one of them closely.\n",
- "\n",
- "### Saving tensors (and scalars)\n",
- "\n",
- "In deep learning algorithms, tensors define the state of the training job\n",
- "at any particular instant in its lifecycle.\n",
- "Tornasole exposes a library which allows you to capture these tensors and\n",
- "save them for analysis.\n",
- "Although XGBoost is not a deep learning algorithm, Tornasole is highly customizable\n",
- "and can help provide interpretability by saving insightful metrics, such as\n",
- "performance metrics or feature importances, at different frequencies.\n",
- "Refer to [DeveloperGuide_XGBoost](../DeveloperGuide_XG.md) for details on how to\n",
- "save the metrics you want.\n",
- "\n",
- "### Analysis\n",
- "\n",
- "Analysis of the tensors emitted is captured by the Tornasole concept called ***Rules***.\n",
- "At a very broad level, a rule is Python code used to detect certain conditions during training.\n",
- "Some of the conditions that a data scientist training an algorithm may care about are\n",
- "monitoring for gradients getting too large or too small, detecting overfitting, and so on.\n",
- "Tornasole will come pre-packaged with certain rules.\n",
- "Users can write their own rules using the Tornasole APIs.\n",
- "You can also analyze raw tensor data outside of the Rules construct in, say, a SageMaker notebook,\n",
- "using Tornasole's full set of APIs. \n",
- "Please refer to [DeveloperGuide_Rules](../../../rules/DeveloperGuide_Rules.md) for more details about analysis.\n",
- "\n",
- "This example guides you through installation of the required components for emitting tensors in a \n",
- "SageMaker training job and applying a rule over the tensors to monitor the live status of the job. "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Setup"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We will also install the required tools which will allow emission of tensors (saving tensors) and application of rules to analyze them. This is only for the purposes of this private beta. Once we do this, we will be ready to use smdebug.\n",
- "\n",
- "You'll probably have to restart this notebook after running the following code cell."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "! aws s3 sync s3://tornasole-external-preview-use1/sdk/ ~/SageMaker/tornasole-preview-sdk/\n",
- "! pip3 -q install ~/SageMaker/tornasole-preview-sdk/ts-binaries/tornasole_xgboost/py3/latest/tornasole-* --user\n",
- "! chmod +x ~/SageMaker/tornasole-preview-sdk/installer.sh && ~/SageMaker/tornasole-preview-sdk/installer.sh"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### If you are running this notebook for the first time, please wait for the above setup to complete and restart the notebook by selecting *Kernel -> Restart Kernel* before proceeding."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We have built SageMaker XGBoost containers with smdebug. They are available from ECR for use in SageMaker. Here are the links to the images. Please use the image from the appropriate region in which you want your jobs to run."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "import boto3\n",
- "from sagemaker import get_execution_role\n",
- "\n",
- "# Below changes the region to be one where this notebook is running\n",
- "REGION = boto3.Session().region_name\n",
- "ROLE = get_execution_role()\n",
- "os.environ[\"AWS_REGION\"] = REGION\n",
- "\n",
- "TAG = \"latest\"\n",
- "docker_image_name = \"072677473360.dkr.ecr.{}.amazonaws.com/tornasole-preprod-xgboost-0.90-cpu:{}\".format(REGION, TAG)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Training XGBoost models in SageMaker with Tornasole\n",
- "\n",
- "### SageMaker XGBoost as a framework\n",
- "\n",
- "We'll train a few XGBoost models in this notebook with Tornasole enabled and monitor the training jobs with Tornasole Rules. This will be done using SageMaker XGBoost 0.90 Container as a framework. The [XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) can be used as a built-in algorithm or as a framework such as TensorFlow. Using XGBoost as a framework provides more flexibility than using it as a built-in algorithm as it enables more advanced scenarios that allow pre-processing and post-processing scripts to be incorporated into your training script.\n",
- "\n",
- "Let us first train a simple example training script [xgboost_abalone_basic_hook_demo.py](../scripts/xgboost_abalone_basic_hook_demo.py) with XGBoost enabled in SageMaker using the SageMaker Estimator API, along with a LossNotDecreasing Rule to monitor the training job in real time. A Tornasole Rule is essentially Python code which analyzes tensors saved by Tornasole and validates some condition. The LossNotDecreasing rule is a first party (1P) rule provided by smdebug. For other 1P rules that can be used in XGBoost, refer to [FirstPartyRules.md](../../../rules/FirstPartyRules.md).\n",
- "\n",
- "During training, Tornasole will capture tensors as specified in its configuration and the LossNotDecreasing Rule job will monitor whether you are running into a situation where loss is not going down. The rule will emit a CloudWatch event if it finds that the performance metrics are not decreasing during training.\n",
- "\n",
- "### Enabling Tornasole in the script\n",
- "\n",
- "You can see in the script that we have made a couple of simple changes to enable smdebug. We created a SessionHook which we pass as a callback function when creating a Booster. We passed a SaveConfig object telling the hook to save the evaluation metrics, feature importances, and SHAP values at regular intervals. Note that Tornasole is highly configurable; you can choose exactly what to save. The changes are described in a bit more detail below after we train this example as well as in even more detail in our [Developer Guide for XGBoost](../DeveloperGuide_XG.md). \n",
- "\n",
- "```python\n",
- "from smdebug.xgboost import SessionHook, SaveConfig\n",
- "\n",
- "save_config = SaveConfig(save_interval=frequency)\n",
- "hook = SessionHook(save_config=save_config)\n",
- "\n",
- "bst = xgboost.train(\n",
- " ...\n",
- " callbacks=[hook]\n",
- ")\n",
- "```\n",
- "\n",
- "### XGBoost for Regression\n",
- "\n",
- "We use the [Abalone data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html) originally from the [UCI data repository](https://archive.ics.uci.edu/ml/datasets/abalone). More details about the original dataset can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.names). In the libsvm converted [version](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html), the nominal feature (Male/Female/Infant) has been converted into a real valued feature. Age of abalone is to be predicted from eight physical measurements.\n",
- "\n",
- "Refer to [XGBoost for Regression](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_amazon_algorithms/xgboost_abalone)\n",
- "for an example of using regression from Amazon SageMaker's implementation of\n",
- "[XGBoost](https://github.com/dmlc/xgboost).\n",
- "\n",
- "Just a quick reminder if you are not familiar with script mode in SageMaker. You can pass command line arguments taken by your training script with a hyperparameter dictionary which gets passed to the SageMaker XGBoost Estimator class. You can see this in the examples below."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "entry_point_script = \"../scripts/xgboost_abalone_basic_hook_demo.py\"\n",
- "\n",
- "hyperparameters={\n",
- " \"max_depth\": \"5\",\n",
- " \"eta\": \"0.2\",\n",
- " \"gamma\": \"4\",\n",
- " \"min_child_weight\": \"6\",\n",
- " \"subsample\": \"0.7\",\n",
- " \"silent\": \"0\",\n",
- " \"objective\": \"reg:linear\",\n",
- " \"num_round\": \"50\",\n",
- " \"save_frequency\": \"2\"\n",
- "}"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from sagemaker.xgboost import XGBoost\n",
- "\n",
- "estimator = XGBoost(\n",
- " image_name=docker_image_name,\n",
- " base_job_name=\"demo-tornasole-xgboost\",\n",
- " entry_point=entry_point_script,\n",
- " hyperparameters=hyperparameters,\n",
- " train_instance_type=\"ml.m4.4xlarge\",\n",
- " train_instance_count=1,\n",
- " framework_version=\"0.90-1\",\n",
- " py_version=\"py3\",\n",
- " role=ROLE,\n",
- " \n",
- " # These are Tornasole specific parameters, \n",
- " # debug=True means rule specified in rules_specification \n",
- " # will run as rule job. \n",
- " # Below, we specify to run the first party rule LossNotDecreasing\n",
- " # on a ml.c5.4xlarge instance\n",
- " debug=True,\n",
- " rules_specification=[\n",
- " {\n",
- " \"RuleName\": \"LossNotDecreasing\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " \"RuntimeConfigurations\": {\n",
- " \"use_losses_collection\": \"False\",\n",
- " \"tensor_regex\": \"train-rmse,validation-rmse\",\n",
- " \"num_steps\" : \"10\"\n",
- " }\n",
- " }\n",
- " ]\n",
- ")\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "pycharm": {
- "name": "#%% md\n"
- }
- },
- "source": [
- "*Note that Tornasole is only supported for `py_version='py3'` currently.*"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# This is a fire and forget event.\n",
- "# By setting wait=False, we just submit the job to run in the background.\n",
- "# In the background SageMaker will spin off 1 training job and 1 rule job for you.\n",
- "# Please follow this notebook to see status of the training job and the rule job.\n",
- "estimator.fit(wait=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Result\n",
- "As a result of the above command, SageMaker will spin off 1 training job and 1 rule job for you - the first one being the job which produces the tensors to be analyzed and the second one, which analyzes the tensors to check if `train-rmse` and `validation-rmse` are not decreasing at any point during training.\n",
- "\n",
- "### Describing the training job\n",
- "We can check the status of the training job by running the following command:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Below command will give the status of training job\n",
- "# Note: In the output of below command you will see DebugConfig parameter \n",
- "job_name = estimator.latest_training_job.name\n",
- "client = estimator.sagemaker_session.sagemaker_client\n",
- "description = client.describe_training_job(TrainingJobName=job_name)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# The status of the training job can be seen below\n",
- "description[\"TrainingJobStatus\"]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Once your training job is started, SageMaker will spin up a rule execution job to run the LossNotDecreasing rule.\n",
- "\n",
- "### Tornasole specific parameters in the description\n",
- "**DebugConfig** parameter has details about Tornasole related configuration. The key parameters to look for below are\n",
- "\n",
- "*S3OutputPath*: This is the path where the output tensors from Tornasole are saved. \n",
- "*RuleConfig*: This parameter shows the rule configuration that was passed when creating the training job. Here you should be able to see details of the rule that ran for training. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "description[\"DebugConfig\"]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Check the status of the Rule Execution Job\n",
- "To get the rule execution job that SageMaker started for you, run the command below and it shows you the `RuleName`, `RuleStatus`, `FailureReason` if any, and `RuleExecutionJobArn`. If the tensors meet a rule evaluation condition, the rule execution job throws a client error with `FailureReason: RuleEvaluationConditionMet`. These details are also available as part of the response `description` above under: `description['RuleMonitoringStatuses']`\n",
- "\n",
- "\n",
- "The logs of the training job are available in the CloudWatch Logstream `/aws/sagemaker/TrainingJobs` with `RuleExecutionJobArn`. \n",
- "\n",
- "You will see that once the rule execution job starts, it identifies the loss not decreasing situation in the training job, raises the `RuleEvaluationConditionMet` exception and ends the job. \n",
- "\n",
- "**Note that the next cell blocks until the rule execution job ends. You can stop it at any point to proceed to the rest of the notebook. Once it says RuleStatus is Started, and shows the `RuleExecutionJobArn`, you can look at the status of the rule being monitored. At that point, we can also look at the logs as shown in the next cell**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "estimator.describe_rule_execution_jobs()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Check logs of the rule execution jobs\n",
- "\n",
- "If you want to access the logs of a particular rule job name, you can do the following. First, you need to get the rule job name (`RuleExecutionJobArn` field from the training job description). Note that this is only available after the rule job reaches Started stage. Hence the next cell waits till the job name is available."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import time\n",
- "\n",
- "rule_descr = client.describe_training_job(TrainingJobName=job_name)[\"RuleMonitoringStatuses\"]\n",
- "print(\"Waiting for rule execution job to start\")\n",
- "while \"RuleExecutionJobArn\" not in rule_descr[0]:\n",
- " time.sleep(5)\n",
- " rule_descr = client.describe_training_job(TrainingJobName=job_name)[\"RuleMonitoringStatuses\"]\n",
- "\n",
- "rule_job_arn = rule_descr[0][\"RuleExecutionJobArn\"]\n",
- "print(\"Rule execution job has started. The job ARN is {}\".format(rule_job_arn))\n",
- "rule_job_name = rule_job_arn.split('/')[1]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now we can attach to this job to see its logs"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from sagemaker.estimator import Estimator\n",
- "loss_not_decreasing = Estimator.attach(rule_job_name)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Receive a CloudWatch Event for Rules\n",
- "When the status of a training job or rule execution job changes (e.g., starting or failed), TrainingJobStatus [CloudWatch events](https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html) are emitted. For more details, see [below](#CloudWatch-Event-Integration-for-Rules). \n",
- "\n",
- "\n",
- "### Making this a good run\n",
- "\n",
- "In the above example, we saw how a LossNotDecreasing Rule was run, which analyzed the performance metrics while training was running and produced an alert in the form of a CloudWatch event.\n",
- "\n",
- "You can go back, change the `hyperparameters` passed to the estimator, and start a new training job (e.g., use a smaller learning rate `eta=0.05`). You will see that the LossNotDecreasing rule is not fired in that case as both `train-rmse` and `validation-rmse` keep decreasing steadily throughout the entire training duration."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Data Analysis - Manual\n",
- "\n",
- "Now that we have trained the system we can analyze the data. Here we focus on after-the-fact analysis.\n",
- "\n",
- "We import a basic analysis library, which defines a concept of `Trial` that represents a single training run."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "from urllib.parse import urlparse\n",
- "from smdebug.trials import create_trial\n",
- "\n",
- "s3_output_path = description[\"DebugConfig\"][\"DebugHookConfig\"][\"S3OutputPath\"]\n",
- "trial = create_trial(s3_output_path)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can list all the tensors we know something about. Each one of these names is the name of a tensor - the name is a combination of the feature name (which, in these cases, is auto-assigned by XGBoost) and whether it's an evaluation metric, feature importance, or SHAP value."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "trial.tensors()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For each tensor we can ask for which steps we have data - in this case, every 2 steps"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "print(list(trial.tensor(\"train-rmse\").steps()))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can obtain each tensor at each step as a `numpy` array"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "type(trial.tensor(\"train-rmse\").value(30))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Performance metrics\n",
- "\n",
- "We can also create a simple function that visualizes the training and validation errors\n",
- "as the training progresses.\n",
- "We expect each error to get smaller over time, as the system converges to a good solution.\n",
- "Now, remember that this is an interactive analysis - we are showing these tensors to give an idea of the data. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import matplotlib.pyplot as plt\n",
- "import seaborn as sns\n",
- "\n",
- "# Define a function that, for the given tensor name, walks through all \n",
- "# the iterations for which we have data and fetches the value.\n",
- "# Returns the set of steps and the values\n",
- "def get_data(trial, tname):\n",
- " tensor = trial.tensor(tname)\n",
- " steps = tensor.steps()\n",
- " vals = [tensor.value(s) for s in steps]\n",
- " return steps, vals"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "metrics_to_plot = [\"train-rmse\", \"validation-rmse\"]\n",
- "for metric in metrics_to_plot:\n",
- " steps, data = get_data(trial, metric)\n",
- " plt.plot(steps, data, label=metric)\n",
- "plt.xlabel('Iteration')\n",
- "plt.ylabel('Root mean squared error')\n",
- "plt.legend()\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Feature importances\n",
- "\n",
- "We can also visualize the feature importances as determined by\n",
- "[xgboost.get_fscore()](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.get_fscore).\n",
- "Note that feature importances with zero values are not included here\n",
- "(which means that those features were not used in any split conditions)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "def plot_collections(trial, collection_name, ylabel=''):\n",
- " \n",
- " plt.figure(\n",
- " num=1, figsize=(8, 8), dpi=80,\n",
- " facecolor='w', edgecolor='k')\n",
- "\n",
- " features = trial.collection(collection_name).tensor_names\n",
- "\n",
- " for feature in sorted(features):\n",
- " steps, data = get_data(trial, feature)\n",
- " label = feature.replace('/' + collection_name, '')\n",
- " plt.plot(steps, data, label=label)\n",
- "\n",
- " plt.legend(bbox_to_anchor=(1.04,1), loc='upper left')\n",
- " plt.xlabel('Iteration')\n",
- " plt.ylabel(ylabel)\n",
- " plt.show()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "plot_collections(trial, \"feature_importance\", \"Feature importance\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### SHAP\n",
- "\n",
- "[SHAP](https://github.com/slundberg/shap) (SHapley Additive exPlanations) is\n",
- "another approach to explain the output of machine learning models.\n",
- "SHAP values represent a feature's contribution to a change in the model output."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "plot_collections(trial, \"average_shap\", \"SHAP values\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We also have an example at the end of this notebook that demonstrates how to use a custom rule in smdebug. Before moving further, let's take a more detailed look at Tornasole, parts of which were touched upon above.\n",
- "\n",
- "\n",
- "## Enabling Tornasole in the training script\n",
- "\n",
- "The first step to using Tornasole is to save tensors from the training job. The containers we provide in SageMaker come with the Tornasole library installed, which needs to be used to enable Tornasole in your training script.\n",
- "\n",
- "To enable Tornasole in the training script, you need to create and pass SessionHook, a construct Tornasole exposes to save tensors. Here's how you will need to modify your training script.\n",
- "\n",
- "First, you need to import `smdebug.xgboost`. \n",
- "```\n",
- "# SessionHook and SaveConfig are both exposed by smdebug.xgboost\n",
- "import smdebug.xgboost as tx\n",
- "```\n",
- "Then create the SessionHook by specifying what you want to save and when you want to save it.\n",
- "```\n",
- "hook = tx.SessionHook(include_collections=['metric','feature_importance'],\n",
- " save_config=tx.SaveConfig(save_interval=5))\n",
- "```\n",
- "Now pass this hook as a callback to the `xgboost.train` function.\n",
- "```\n",
- "import xgboost\n",
- "\n",
- "bst = xgboost.train(..., callbacks=[hook])\n",
- "```\n",
- "\n",
- "Refer to our example script [xgboost_abalone_basic_hook_demo.py](../scripts/xgboost_abalone_basic_hook_demo.py) for examples of using Tornasole with the XGBoost interface.\n",
- "\n",
- "Refer to [DeveloperGuide_XGBoost.md](../DeveloperGuide_XG.md) for more details on the APIs Tornasole provides to help you save tensors."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Enabling Tornasole with SageMaker\n",
- "\n",
- "#### Storage\n",
- "The tensors saved by Tornasole are, by default, stored in the S3 output path of the training job, under the folder **`/tensors-`**. This is done to ensure that we don't accidentally overwrite the tensors from one training job with those from another. Rule evaluation requires separate tensor paths to be evaluated correctly.\n",
- "\n",
- "If you don't provide an S3 output path to the estimator, SageMaker creates one for you as: **`s3://sagemaker--/`**\n",
- "\n",
- "This path is used to create a Tornasole Trial taken by Rules (see below).\n",
- "\n",
- "#### New Parameters \n",
- "The new parameters in the SageMaker Estimator to look out for are\n",
- "\n",
- "- `debug`: (bool)\n",
- "This indicates that debugging should be enabled for the training job. \n",
- "Setting this to `True` makes Tornasole available for use with the job.\n",
- "\n",
- "- `rules_specification`: (list[*dict*])\n",
- "You can specify any number of rules to monitor your SageMaker training job. This parameter takes a list of python dictionaries, one for each rule you want to enable. Each `dict` is of the following form:\n",
- "```\n",
- "{\n",
- " \"RuleName\": \n",
- " # The name of the class implementing the Tornasole Rule interface. (required)\n",
- "\n",
- " \"SourceS3Uri\": \n",
- " # S3 URI of the rule script containing the class in 'RuleName'. \n",
- " # This is not required if you want to use one of the\n",
- " # First Party rules provided to you by Amazon. \n",
- " # In such a case you can leave it empty or not pass it. \n",
- " # If you want to run a custom rule \n",
- " # defined by you, you will need to define the custom rule class in a python \n",
- " # file and provide it to SageMaker as a S3 URI. \n",
- " # SageMaker will fetch this file and try to look for the rule class \n",
- " # identified by RuleName in this file.\n",
- " \n",
- " \"InstanceType\": \n",
- " # The ML instance type which should be used to run the rule evaluation job\n",
- " \n",
- " \"VolumeSizeInGB\": \n",
- " # The volume size to store the runtime artifacts from the rule evaluation \n",
- " \n",
- " \"RuntimeConfigurations\": {\n",
- " # Map defining the parameters required to instantiate the Rule class and\n",
- " # parameters regarding invocation of the rule (start-step and end-step)\n",
- " # This can be any parameter taken by the rule. \n",
- " # Every value here needs to be a string. \n",
- " # So when you write custom rules, ensure that you can parse each argument \n",
- " # from a string.\n",
- " #\n",
- " # PARAMS CAN BE\n",
- " #\n",
- " # STANDARD PARAMS FOR RULE EXECUTION\n",
- " # \"start-step\": \n",
- " # \"end-step\": \n",
- " # \"other-trials-paths\": (';' separated list of s3 paths as a string)\n",
- " # \"logging-level\": (can be one of \"CRITICAL\", \"FATAL\", \"ERROR\", \n",
- " # \"WARNING\", \"WARN\", \"DEBUG\", \"NOTSET\")\n",
- " #\n",
- " # ANY OTHER PARAMETER TAKEN BY THE RULE\n",
- " # \"parameter\" : \n",
- " # : \n",
- " }\n",
- "}\n",
- "```"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Rules\n",
- "Rules are the mechanism by which Tornasole runs a piece of code at regular intervals over the steps of the job.\n",
- "They can be used to assert conditions during training and to raise CloudWatch Events based on them,\n",
- "which you can then process in any way you like. \n",
- "\n",
- "Tornasole comes with a set of **First Party rules** (1P rules).\n",
- "You can also write your own rules, using these 1P rules as a starting point. \n",
- "Refer to [DeveloperGuide_Rules.md](../../../rules/DeveloperGuide_Rules.md) for more on the APIs you can use to write your own rules, as well as descriptions of the 1P rules that we provide. \n",
- " \n",
- "Here we will talk about how to use SageMaker to evaluate these rules on training jobs.\n",
- "\n",
- "\n",
- "##### 1P Rule \n",
- "To use a 1P rule, set the `RuleName` field to the name of the 1P rule, and the rule will be applied automatically. You can pass any parameters accepted by the rule in the `RuntimeConfigurations` dictionary. A rule's constructor takes a trial as a parameter. \n",
- "A Trial in Tornasole's context refers to a training job. It is identified by the path where the saved tensors for the job are stored. \n",
- "A rule takes a `base_trial`, which refers to the job whose run invokes the rule execution. \n",
- "\n",
- "**Note:** A rule can be written to compare and analyze tensors across training jobs. A rule that needs to compare tensors across trials takes the argument `other_trials`. The argument `base_trial` is set automatically by SageMaker when it executes the rule. The parameter `other_trials` (if taken by the rule) is supplied by passing `other-trials-paths` in the `RuntimeConfigurations` dictionary. Its value should be a `;`-separated list of the S3 output paths where the tensors for those trials are stored.\n",
- "\n",
- "Here's an example of a more complex configuration for the `SimilarAcrossRuns` rule (which accepts one other trial and a regex pattern), where we ask for the rule to be invoked for the steps between 10 and 100.\n",
- "\n",
- "``` \n",
- "rules_specification = [ \n",
- " {\n",
- " \"RuleName\": \"SimilarAcrossRuns\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " \"VolumeSizeInGB\": 10,\n",
- " \"RuntimeConfigurations\": {\n",
- " \"other-trials-paths\": \"s3://sagemaker--/past-job\",\n",
- " \"include_regex\": \".*\",\n",
- " \"start-step\": \"10\",\n",
- " \"end-step\": \"100\"\n",
- " }\n",
- " }\n",
- "]\n",
- "```\n",
- "The list of 1P rules and their details can be found in the *First party rules* section of [DeveloperGuide_Rules.md](../../../rules/DeveloperGuide_Rules.md) \n",
- "\n",
- "\n",
- "##### Custom rule\n",
- "In this case you need to define a custom rule class that inherits from the `smdebug.rules.Rule` class.\n",
- "You need to give SageMaker the S3 location of the file that defines your custom rule classes as the value of the `SourceS3Uri` field. Again, you can pass any arguments taken by this rule through the `RuntimeConfigurations` dictionary. Note that every argument of a custom rule must accept a string value, except the two arguments that specify trials to the rule. Refer to the section *Writing a rule* in [DeveloperGuide_Rules.md](../../../rules/DeveloperGuide_Rules.md) for more details.\n",
- "\n",
- "Here's an example:\n",
- "```\n",
- "rules_specification = [\n",
- " {\n",
- " \"RuleName\": \"CustomRule\",\n",
- " \"SourceS3Uri\": \"s3://tornasole-test/rule-script/custom_rule.py\",\n",
- " \"InstanceType\": \"ml.c5.4xlarge\",\n",
- " \"VolumeSizeInGB\": 10,\n",
- " \"RuntimeConfigurations\": {\n",
- " \"threshold\" : \"0.5\"\n",
- " }\n",
- " }\n",
- "]\n",
- "```"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### CloudWatch Event Integration for Rules\n",
- "When the status of a training job or rule execution job changes (e.g. starting, failed), TrainingJobStatus [CloudWatch events](https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html) are emitted.\n",
- "\n",
- "After GA, you can configure a CloudWatch event rule to receive and process these events by setting up a target (Lambda function, SNS) as follows:\n",
- "\n",
- "- The SageMaker TrainingJobStatus CW event (https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#sagemaker_event_types) will include the rule job statuses associated with the training job\n",
- "- A CW event will be emitted when a RuleStatus changes\n",
- "- You can create a CloudWatch event rule that monitors the training job you started\n",
- "- You can set a target (Lambda function, SQS) for the CloudWatch event rule that processes the event and triggers an alarm based on the RuleStatus. \n",
- "\n",
- "Refer to [this page](https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html) for more details. "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Custom rule: Single Feature Importance\n",
- "\n",
- "In this case you need to define a custom rule class that inherits from the `smdebug.rules.Rule` class.\n",
- "You need to give SageMaker the S3 location of the file that defines your custom rule classes as the value of the `SourceS3Uri` field.\n",
- "Again, you can pass any arguments taken by this rule through the `RuntimeConfigurations` dictionary. \n",
- "Note that every argument of a custom rule must accept a string value, except the two arguments \n",
- "that specify trials to the rule. Refer to [DeveloperGuide_Rules.md](../../../rules/DeveloperGuide_Rules.md) for more.\n",
- "\n",
- "In the following code cell, we write a custom rule named `SingleFeatureImportance`\n",
- "that checks if any feature importance in a given collection goes out of the\n",
- "specified range."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%%writefile /tmp/custom_feature_importance_rule.py\n",
- "\n",
- "from smdebug.rules.rule import Rule\n",
- "\n",
- "class SingleFeatureImportance(Rule):\n",
- "    def __init__(\n",
- "        self,\n",
- "        base_trial,\n",
- "        collection_name,\n",
- "        num_features=None,\n",
- "        min_importance_ratio=0,\n",
- "        max_importance_ratio=1\n",
- "    ):\n",
- "        \"\"\"\n",
- "        This rule checks the following statement:\n",
- "        - In a given collection, each feature should have importance\n",
- "          satisfying the following conditions:\n",
- "          a) min_importance_ratio*(1/num_features) <= feature importance\n",
- "          b) feature importance <= max_importance_ratio*(1/num_features)\n",
- "\n",
- "        :param base_trial: the trial whose execution will invoke the rule\n",
- "        :param collection_name: the collection whose tensors hold the feature importances\n",
- "        :param num_features: the number of features (defaults to the number of tensors in the collection)\n",
- "        :param min_importance_ratio: the minimum allowed importance (as a proportion of 1/num_features)\n",
- "        :param max_importance_ratio: the maximum allowed importance (as a proportion of 1/num_features)\n",
- "        \"\"\"\n",
- "        self.collection_name = collection_name\n",
- "        self.tensor_names = base_trial.collection(self.collection_name).tensor_names\n",
- "        self.num_features = len(self.tensor_names) if num_features is None else int(num_features)\n",
- "        self.min_importance_ratio = float(min_importance_ratio)\n",
- "        self.max_importance_ratio = float(max_importance_ratio)\n",
- "\n",
- "        super().__init__(base_trial, other_trials=None)\n",
- "\n",
- "        self.logger.info(\"FeatureImportance rule created.\")\n",
- "\n",
- "    def invoke_at_step(self, step, **kwargs):\n",
- "        min_importance = self.min_importance_ratio * (1 / self.num_features)\n",
- "        max_importance = self.max_importance_ratio * (1 / self.num_features)\n",
- "\n",
- "        failed = []\n",
- "\n",
- "        for name in self.tensor_names:\n",
- "            if step not in self.base_trial.tensor(name).steps():\n",
- "                importance = 0\n",
- "            else:\n",
- "                importance = self.base_trial.tensor(name).value(step)\n",
- "\n",
- "            if importance < min_importance:\n",
- "                self.logger.debug(f\"Step {step} feature {name} has importance {importance}<{min_importance}\")\n",
- "                failed.append((name, importance))\n",
- "            elif max_importance < importance:\n",
- "                self.logger.debug(f\"Step {step} feature {name} has importance {importance}>{max_importance}\")\n",
- "                failed.append((name, importance))\n",
- "\n",
- "        self.logger.info(failed)\n",
- "        self.logger.info(f\"Step {step} had {len(failed)} features with out-of-band values\")\n",
- "\n",
- "        return bool(failed)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We need to upload this to a bucket in the same region where we want to run the job. We have chosen a default bucket below. Please change it to the bucket you want. We will now create this bucket if it does not exist, and upload this file. We will then specify this path when starting the job as `SourceS3Uri`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "\n",
- "ACCOUNT_ID = boto3.client('sts').get_caller_identity().get('Account')\n",
- "BUCKET = f'tornasole-resources-{ACCOUNT_ID}-{REGION}'\n",
- "\n",
- "CUSTOM_RULE_PATH = '/tmp/custom_feature_importance_rule.py'\n",
- "PREFIX = os.path.join('rules', os.path.basename(CUSTOM_RULE_PATH))\n",
- "\n",
- "s3 = boto3.resource('s3')\n",
- "bucket = s3.Bucket(BUCKET)\n",
- "if not bucket.creation_date:\n",
- "    s3.create_bucket(Bucket=BUCKET, CreateBucketConfiguration={'LocationConstraint': REGION})\n",
- "s3.Object(BUCKET, PREFIX).put(Body=open(CUSTOM_RULE_PATH, 'rb'))\n",
- "SOURCE_S3_URI = f's3://{BUCKET}/{PREFIX}'\n",
- "print(f\"Uploaded to {SOURCE_S3_URI}\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "estimator = Estimator(\n",
- " base_job_name=\"xgboost-tornasole-feature-importance\",\n",
- " hyperparameters=hyperparameters,\n",
- " image_name=docker_image_name,\n",
- " role=ROLE,\n",
- " train_instance_count=1,\n",
- " train_instance_type=\"ml.m4.4xlarge\",\n",
- " debug=True,\n",
- " rules_specification = [\n",
- " {\n",
- " \"RuleName\": \"SingleFeatureImportance\",\n",
- " \"SourceS3Uri\": SOURCE_S3_URI,\n",
- " \"InstanceType\": \"ml.c5.xlarge\",\n",
- " \"VolumeSizeInGB\": 10,\n",
- " \"RuntimeConfigurations\": {\n",
- " \"collection_name\": \"average_shap\",\n",
- " \"num_features\": \"8\",\n",
- " \"min_importance_ratio\": \"0.0\",\n",
- " \"max_importance_ratio\": \"2.0\"\n",
- " }\n",
- " }\n",
- " ]\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "estimator.fit(wait=False)\n",
- "\n",
- "job_name = estimator.latest_training_job.name\n",
- "client = estimator.sagemaker_session.sagemaker_client\n",
- "description = client.describe_training_job(TrainingJobName=job_name)\n",
- "\n",
- "description[\"TrainingJobStatus\"]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "description[\"DebugConfig\"]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "As we can visually verify in the [Data Analysis](#Data-Analysis---Manual) section above,\n",
- "the SHAP value of feature will become significant, and we expect our rule to throw a\n",
- "`RuleEvaluationConditionMet` exception."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "estimator.describe_rule_execution_jobs()"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "conda_python3",
- "language": "python",
- "name": "conda_python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.5"
- },
- "pycharm": {
- "stem_cell": {
- "cell_type": "raw",
- "metadata": {
- "collapsed": false
- },
- "source": []
- }
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/examples/xgboost/scripts/xgboost_abalone_basic_hook_demo.py b/examples/xgboost/scripts/xgboost_abalone_basic_hook_demo.py
deleted file mode 100644
index ec5551d1e..000000000
--- a/examples/xgboost/scripts/xgboost_abalone_basic_hook_demo.py
+++ /dev/null
@@ -1,125 +0,0 @@
-# Standard Library
-import argparse
-import os
-import random
-import tempfile
-import urllib.request
-
-# Third Party
-import xgboost
-
-# First Party
-from smdebug import SaveConfig
-from smdebug.xgboost import Hook
-
-
-def parse_args():
-
-    parser = argparse.ArgumentParser()
-
-    parser.add_argument("--max_depth", type=int, default=5)
-    parser.add_argument("--eta", type=float, default=0.2)
-    parser.add_argument("--gamma", type=int, default=4)
-    parser.add_argument("--min_child_weight", type=int, default=6)
-    parser.add_argument("--subsample", type=float, default=0.7)
-    parser.add_argument("--silent", type=int, default=0)
-    parser.add_argument("--objective", type=str, default="reg:squarederror")
-    parser.add_argument("--num_round", type=int, default=50)
-    parser.add_argument("--smdebug_path", type=str, default=None)
-    parser.add_argument("--save_frequency", type=int, default=1)
-    parser.add_argument(
-        "--output_uri",
-        type=str,
-        default="/opt/ml/output/tensors",
-        help="Local path or S3 URI where tensor data will be stored.",
-    )
-
-    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
-    parser.add_argument("--validation", type=str, default=os.environ.get("SM_CHANNEL_VALIDATION"))
-
-    args = parser.parse_args()
-
-    return args
-
-
-def load_abalone(train_split=0.8, seed=42):
-
-    if not (0 < train_split <= 1):
-        raise ValueError("'train_split' must be between 0 and 1.")
-
-    url = "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/abalone"
-
-    response = urllib.request.urlopen(url).read().decode("utf-8")
-    lines = response.strip().split("\n")
-    n = len(lines)
-    indices = list(range(n))
-    random.seed(seed)
-    random.shuffle(indices)
-    train_indices = set(indices[: int(n * train_split)])
-
-    with tempfile.NamedTemporaryFile(mode="w", delete=False) as train_file:
-        with tempfile.NamedTemporaryFile(mode="w", delete=False) as valid_file:
-            for idx, line in enumerate(lines):
-                if idx in train_indices:
-                    train_file.write(line + "\n")
-                else:
-                    valid_file.write(line + "\n")
-
-    return train_file.name, valid_file.name
-
-
-def create_hook(out_dir, train_data=None, validation_data=None, frequency=1):
-
-    save_config = SaveConfig(save_interval=frequency)
-    hook = Hook(
-        out_dir=out_dir,
-        save_config=save_config,
-        train_data=train_data,
-        validation_data=validation_data,
-    )
-
-    return hook
-
-
-def main():
-
-    args = parse_args()
-
-    if args.train and args.validation:
-        train, validation = args.train, args.validation
-    else:
-        train, validation = load_abalone()
-
-    dtrain = xgboost.DMatrix(train)
-    dval = xgboost.DMatrix(validation)
-
-    watchlist = [(dtrain, "train"), (dval, "validation")]
-
-    params = {
-        "max_depth": args.max_depth,
-        "eta": args.eta,
-        "gamma": args.gamma,
-        "min_child_weight": args.min_child_weight,
-        "subsample": args.subsample,
-        "silent": args.silent,
-        "objective": args.objective,
-    }
-
-    # The output_uri is the URI for the s3 bucket where the metrics will be
-    # saved.
-    output_uri = args.smdebug_path if args.smdebug_path is not None else args.output_uri
-
-    hook = create_hook(out_dir=output_uri, frequency=args.save_frequency, train_data=dtrain)
-
-    bst = xgboost.train(
-        params=params,
-        dtrain=dtrain,
-        evals=watchlist,
-        num_boost_round=args.num_round,
-        callbacks=[hook],
-    )
-
-
-if __name__ == "__main__":
-
-    main()
diff --git a/examples/xgboost/scripts/xgboost_mnist_basic_hook_demo.py b/examples/xgboost/scripts/xgboost_mnist_basic_hook_demo.py
deleted file mode 100644
index 6221a7320..000000000
--- a/examples/xgboost/scripts/xgboost_mnist_basic_hook_demo.py
+++ /dev/null
@@ -1,132 +0,0 @@
-# Standard Library
-import argparse
-import bz2
-import os
-import random
-import tempfile
-import urllib.request
-
-# Third Party
-import xgboost
-
-# First Party
-from smdebug import SaveConfig
-from smdebug.xgboost import Hook
-
-
-def parse_args():
-
-    parser = argparse.ArgumentParser()
-
-    parser.add_argument("--max_depth", type=int, default=5)
-    parser.add_argument("--eta", type=float, default=0.05)  # 0.2
-    parser.add_argument("--gamma", type=int, default=4)
-    parser.add_argument("--min_child_weight", type=int, default=6)
-    parser.add_argument("--silent", type=int, default=0)
-    parser.add_argument("--objective", type=str, default="multi:softmax")
-    parser.add_argument("--num_class", type=int, default=10)
-    parser.add_argument("--num_round", type=int, default=10)
-    parser.add_argument("--smdebug_path", type=str, default=None)
-    parser.add_argument("--save_frequency", type=int, default=1)
-    parser.add_argument(
-        "--output_uri",
-        type=str,
-        default="/opt/ml/output/tensors",
-        help="Local path or S3 URI where tensor data will be stored.",
-    )
-
-    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
-    parser.add_argument("--validation", type=str, default=os.environ.get("SM_CHANNEL_VALIDATION"))
-
-    args = parser.parse_args()
-
-    return args
-
-
-def load_mnist(train_split=0.8, seed=42):
-
-    if not (0 < train_split <= 1):
-        raise ValueError("'train_split' must be between 0 and 1.")
-
-    url = "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist.bz2"
-
-    with tempfile.NamedTemporaryFile(mode="wb", delete=False) as mnist_bz2:
-        urllib.request.urlretrieve(url, mnist_bz2.name)
-
-    with bz2.open(mnist_bz2.name, "r") as fin:
-        content = fin.read().decode("utf-8")
-    lines = content.strip().split("\n")
-    n = len(lines)
-    indices = list(range(n))
-    random.seed(seed)
-    random.shuffle(indices)
-    train_indices = set(indices[: int(n * train_split)])
-
-    with tempfile.NamedTemporaryFile(mode="w", delete=False) as train_file:
-        with tempfile.NamedTemporaryFile(mode="w", delete=False) as valid_file:
-            for idx, line in enumerate(lines):
-                if idx in train_indices:
-                    train_file.write(line + "\n")
-                else:
-                    valid_file.write(line + "\n")
-
-    return train_file.name, valid_file.name
-
-
-def create_hook(out_dir, train_data=None, validation_data=None, frequency=1):
-
-    save_config = SaveConfig(save_interval=frequency)
-    hook = Hook(
-        out_dir=out_dir,
-        save_config=save_config,
-        train_data=train_data,
-        validation_data=validation_data,
-    )
-
-    return hook
-
-
-def main():
-
-    args = parse_args()
-
-    if args.train and args.validation:
-        train, validation = args.train, args.validation
-    else:
-        train, validation = load_mnist()
-
-    dtrain = xgboost.DMatrix(train)
-    dval = xgboost.DMatrix(validation)
-
-    watchlist = [(dtrain, "train"), (dval, "validation")]
-
-    params = {
-        "max_depth": args.max_depth,
-        "eta": args.eta,
-        "gamma": args.gamma,
-        "min_child_weight": args.min_child_weight,
-        "silent": args.silent,
-        "objective": args.objective,
-        "num_class": args.num_class,
-    }
-
-    # The output_uri is the URI for the s3 bucket where the metrics will be
-    # saved.
-    output_uri = args.smdebug_path if args.smdebug_path is not None else args.output_uri
-
-    hook = create_hook(
-        out_dir=output_uri, frequency=args.save_frequency, train_data=dtrain, validation_data=dval
-    )
-
-    bst = xgboost.train(
-        params=params,
-        dtrain=dtrain,
-        evals=watchlist,
-        num_boost_round=args.num_round,
-        callbacks=[hook],
-    )
-
-
-if __name__ == "__main__":
-
-    main()
diff --git a/tests/tensorflow/hooks/test_training_end.py b/tests/tensorflow/hooks/test_training_end.py
index 541b6be56..ce59c14dc 100644
--- a/tests/tensorflow/hooks/test_training_end.py
+++ b/tests/tensorflow/hooks/test_training_end.py
@@ -16,15 +16,13 @@ def test_training_job_has_ended(out_dir):
subprocess.check_call(
[
sys.executable,
- "examples/tensorflow/scripts/simple.py",
- "--smdebug_path",
+ "examples/tensorflow/local/simple.py",
+ "--out_dir",
out_dir,
"--steps",
"10",
- "--save_frequency",
+ "--save_interval",
"5",
- "--script-mode",
- "y",
],
env={"CUDA_VISIBLE_DEVICES": "-1", "SMDEBUG_LOG_LEVEL": "debug"},
)