diff --git a/examples/mxnet/README.md b/examples/mxnet/README.md
index 9818d371a..a282a073a 100644
--- a/examples/mxnet/README.md
+++ b/examples/mxnet/README.md
@@ -1,2 +1,62 @@
-## Example Notebooks
+# Examples
+## Example notebooks
Please refer to the example notebooks in [Amazon SageMaker Examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger)
+
+## Example scripts
+The notebooks above come with example scripts that can be run through SageMaker. More example scripts are available in [scripts/](scripts/).
+
+## Example configurations for saving tensors through [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
+Example configurations for saving tensors through the hook are available at [docs/sagemaker.md](../docs/sagemaker.md).
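+
+For instance, here is a minimal sketch of a hook configuration, assuming the SageMaker Python SDK's debugger interface (see the linked docs for the full set of options):
+
+```python
+from sagemaker.debugger import CollectionConfig, DebuggerHookConfig
+
+hook_config = DebuggerHookConfig(
+    s3_output_path="s3://bucket/prefix",
+    collection_configs=[CollectionConfig(name="gradients")],
+)
+# Pass debugger_hook_config=hook_config to the SageMaker Estimator constructor.
+```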
+
+## Example configurations for running rules through [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
+Example configurations for running rules are available at [docs/sagemaker.md](../docs/sagemaker.md).
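+
+As a quick illustration, assuming the SageMaker Python SDK's debugger interface, a built-in rule can be attached to a training job as follows (see the linked docs for the full configuration):
+
+```python
+from sagemaker.debugger import Rule, rule_configs
+
+rules = [Rule.sagemaker(rule_configs.vanishing_gradient())]
+# Pass rules=rules to the SageMaker Estimator constructor.
+```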
+
+## Example of running a rule locally
+
+```python
+from smdebug.rules import invoke_rule
+from smdebug.trials import create_trial
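+
+# CustomRule is a placeholder for your own Rule subclass; param=value stands in for its arguments.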
+trial = create_trial('s3://bucket/prefix')
+rule_obj = CustomRule(trial, param=value)
+invoke_rule(rule_obj, start_step=0, end_step=10)
+```
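+
+For reference, below is a minimal sketch of what such a rule class could look like. The class name, the tensor name "loss", and the threshold are illustrative placeholders, not part of smdebug:
+
+```python
+from smdebug.rules.rule import Rule
+
+class CustomRule(Rule):
+    """Toy rule that fires when the loss tensor exceeds a threshold."""
+
+    def __init__(self, base_trial, param=10.0):
+        super().__init__(base_trial)
+        self.threshold = float(param)
+
+    def invoke_at_step(self, step):
+        # Returning True signals that the rule condition was met at this step.
+        return self.base_trial.tensor("loss").value(step) > self.threshold
+```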
diff --git a/examples/mxnet/notebooks/mxnet-tensor-plot.ipynb b/examples/mxnet/notebooks/mxnet-tensor-plot.ipynb
deleted file mode 100644
index a2e65675c..000000000
--- a/examples/mxnet/notebooks/mxnet-tensor-plot.ipynb
+++ /dev/null
@@ -1,310 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Visualizing Tensors "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Overview"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "SageMaker Debugger is a new capability of Amazon SageMaker that allows debugging machine learning models. \n",
- "It lets you go beyond just looking at scalars like losses and accuracies during training and gives \n",
- "you full visibility into all the tensors 'flowing through the graph' during training. SageMaker Debugger helps you to monitor your training in near real time using rules and would provide you alerts, once it has detected an inconsistency in the training flow.\n",
- "\n",
- "Using SageMaker Debugger is a two step process: Saving tensors and Analysis. In this notebook we will run an MXNet training job and configure SageMaker Debugger to store all tensors from this job. Afterwards we will visualize those tensors in our notebook.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Dependencies\n",
- "Before we begin, let us install the library plotly if it is not already present in the environment.\n",
- "If the below cell installs the library for the first time, you'll have to restart the kernel and come back to the notebook."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "! pip install plotly"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Configure and run the training job\n",
- "\n",
- "Now we'll call the Sagemaker MXNet Estimator to kick off a training job along with the VanishingGradient rule to monitor the job.\n",
- "\n",
- "The 'entry_point_script' points to the MXNet training script that has the SageMaker DebuggerHook integrated.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "entry_point_script = '../scripts/mnist_gluon_save_all_demo.py'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import boto3\n",
- "import os\n",
- "import sagemaker\n",
- "from sagemaker.mxnet import MXNet\n",
- "\n",
- "\n",
- "REGION='us-west-2'\n",
- "TAG='latest'\n",
- "\n",
- "estimator = MXNet(role=sagemaker.get_execution_role(),\n",
- " base_job_name='mxnet-trsl-test-nb',\n",
- " train_instance_count=1,\n",
- " train_instance_type='ml.m4.xlarge',\n",
- " entry_point=entry_point_script,\n",
- " framework_version='1.4.1',\n",
- " debug=True,\n",
- " py_version='py3')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Start the training job:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "estimator.fit()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Get S3 location of tensors\n",
- "\n",
- "We can check the status of the training job by running the following command:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "job_name = estimator.latest_training_job.name\n",
- "\n",
- "client = estimator.sagemaker_session.sagemaker_client\n",
- "\n",
- "description = client.describe_training_job(TrainingJobName=job_name)\n",
- "\n",
- "print('downloading tensors from training job: ', job_name)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can retrieve the S3 location of the tensors by accessing the dictionary `description`:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "path = description['DebugConfig']['DebugHookConfig']['S3OutputPath']\n",
- "\n",
- "print('Tensors are stored in: ', path)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Download tensors from S3\n",
- "\n",
- "Now we will download the tensors from S3, so that we can visualize them in our notebook."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "folder_name = path.split(\"/\")[-1]\n",
- "os.system(\"aws s3 cp --recursive \" + path + \" \" + folder_name)\n",
- "print('Downloading tensors into folder: ', folder_name)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Visualize\n",
- "The main purpose of this class (TensorPlot) is to visualise the tensors in your network. This could be to determine dead or saturated activations, or the features maps the network.\n",
- "\n",
- "To use this class (TensorPlot), you will need to supply the argument regex with the tensors you are interested in. e.g., if you are interested in activation outputs, then you need to supply the following regex .*relu|.*tanh|.*sigmoid.\n",
- "\n",
- "Another important argument is the `sample_batch_id`, which allows you to specify the index of the batch size to display. For example, given an input tensor of size (batch_size, channel, width, height), `sample_batch_id = n` will display (n, channel, width, height). If you set sample_batch_id = -1 then the tensors will be summed over the batch dimension (i.e., `np.sum(tensor, axis=0)`). If batch_sample_id is None then each sample will be plotted as separate layer in the figure.\n",
- "\n",
- "Here are some interesting use cases:\n",
- "\n",
- "1) If you want to determine dead or saturated activations for instance ReLus that are always outputting zero, then you would want to sum the batch dimension (sample_batch_id=-1). The sum gives an indication which parts of the network are inactive across a batch.\n",
- "\n",
- "2) If you are interested in the feature maps for the first image in the batch, then you should provide batch_sample_id=0. This can be helpful if your model is not performing well for certain set of samples and you want to understand which activations are leading to misprediction.\n",
- "\n",
- "An example visualization of layer outputs:\n",
- "\n",
- "\n",
- "\n",
- "`TensorPlot` normalizes tensor values to the range 0 to 1 which means colorscales are the same across layers. Blue indicates value close to 0 and yellow indicates values close to 1. This class has been designed to plot convolutional networks that take 2D images as input and predict classes or produce output images. You can use this for other types of networks like RNNs, but you may have to adjust the class as it is currently neglecting tensors that have more than 4 dimensions.\n",
- "\n",
- "Let's plot Relu output activations for the given MNIST training example."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- " \n",
- " "
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[2019-11-13 10:52:31.483 186590da3c0d.ant.amazon.com:3874 INFO local_trial.py:35] Loading trial at path /tmp/mxnet3/\n",
- "[2019-11-13 10:52:31.516 186590da3c0d.ant.amazon.com:3874 INFO trial.py:191] Training has ended, will refresh one final time in 1 sec.\n",
- "[2019-11-13 10:52:32.519 186590da3c0d.ant.amazon.com:3874 INFO trial.py:203] Loaded all steps\n"
- ]
- }
- ],
- "source": [
- "import tensor_plot \n",
- "\n",
- "visualization = tensor_plot.TensorPlot(\n",
- " regex=\".*relu_output\", \n",
- " path=folder_name,\n",
- " steps=10, \n",
- " batch_sample_id=0,\n",
- " color_channel = 1,\n",
- " title=\"Relu outputs\",\n",
- " label=\".*sequential0_input_0\",\n",
- " prediction=\".*sequential0_output_0\"\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "If we plot too many layers, it can crash the notebook. If you encounter performance or out of memroy issues, then either try to reduce the layers to plot by changing the `regex` or run this Notebook in JupyterLab instead of Jupyter. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n"
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "visualization.fig.show(renderer=\"iframe\")"
- ]
- }
- ],
- "metadata": {
- "hide_input": false,
- "kernelspec": {
- "display_name": "Python [conda env:.conda-tf1x] *",
- "language": "python",
- "name": "conda-env-.conda-tf1x-py"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.9"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/examples/mxnet/notebooks/tensor_plot.py b/examples/mxnet/notebooks/tensor_plot.py
deleted file mode 100644
index 31eaa94d1..000000000
--- a/examples/mxnet/notebooks/tensor_plot.py
+++ /dev/null
@@ -1,331 +0,0 @@
-# Third Party
-import numpy as np
-import plotly.graph_objects as go
-import plotly.offline as py
-
-# First Party
-from smdebug.trials import create_trial
-
-py.init_notebook_mode(connected=True)
-
-# This class provides methods to plot tensors as 3 dimensional objects. It is intended for plotting convolutional
-# neural networks and expects that inputs are images and that outputs are class labels or images.
-class TensorPlot:
- def __init__(
- self,
- regex,
- path,
- steps=10,
- batch_sample_id=None,
- color_channel=1,
- title="",
- label=None,
- prediction=None,
- ):
- """
-
- :param regex: tensor regex
- :param path:
- :param steps:
- :param batch_sample_id:
- :param color_channel:
- :param title:
- :param label:
- :param prediction:
- """
- self.trial = create_trial(path)
- self.regex = regex
- self.steps = steps
- self.batch_sample_id = batch_sample_id
- self.color_channel = color_channel
- self.title = title
- self.label = label
- self.prediction = prediction
- self.max_dim = 0
- self.dist = 0
- self.tensors = {}
- self.output = {}
- self.input = {}
- self.load_tensors()
- self.set_figure()
- self.plot_network()
- self.set_frames()
-
- # Loads all tensors into a dict where the key is the step.
-    # If batch_sample_id is None then the batch dimension is plotted as a separate dimension
- # if batch_sample_id is -1 then tensors are summed over batch dimension. Otherwise
- # the corresponding sample is plotted in the figure, and all the remaining samples
- # in the batch are dropped.
- def load_tensors(self):
- available_steps = self.trial.steps()
- for step in available_steps[0 : self.steps]:
- self.tensors[step] = []
-
- # input image into the neural network
- if self.label is not None:
- for tname in self.trial.tensor_names(regex=self.label):
- tensor = self.trial.tensor(tname).value(step)
- if self.color_channel == 1:
- self.input[step] = tensor[0, 0, :, :]
- elif self.color_channel == 3:
- self.input[step] = tensor[0, :, :, 3]
-
- # iterate over tensors that match the regex
- for tname in self.trial.tensor_names(regex=self.regex):
- tensor = self.trial.tensor(tname).value(step)
- # get max value of tensors to set axis dimension accordingly
- for dim in tensor.shape:
- if dim > self.max_dim:
- self.max_dim = dim
-
-                # layer inputs/outputs have the batch size as their first dimension
-                if self.batch_sample_id is not None:
-                    # average over the batch dimension
-                    if self.batch_sample_id == -1:
-                        tensor = np.sum(tensor, axis=0) / tensor.shape[0]
- # plot item from batch
-                    elif self.batch_sample_id >= 0 and self.batch_sample_id < tensor.shape[0]:
- tensor = tensor[self.batch_sample_id]
- # plot first item from batch
- else:
- tensor = tensor[0]
-
-                # normalize tensor values between 0 and 1 so that all tensors have the same color scheme
- tensor = tensor - np.min(tensor)
- if np.max(tensor) != 0:
- tensor = tensor / np.max(tensor)
- if len(tensor.shape) == 3:
- for l in range(tensor.shape[self.color_channel - 1]):
- if self.color_channel == 1:
- self.tensors[step].append([tname, tensor[l, :, :]])
- elif self.color_channel == 3:
- self.tensors[step].append([tname, tensor[:, :, l]])
- elif len(tensor.shape) == 1:
- self.tensors[step].append([tname, tensor])
- else:
-                    # normalize tensor values between 0 and 1 so that all tensors have the same color scheme
- tensor = tensor - np.min(tensor)
- if np.max(tensor) != 0:
- tensor = tensor / np.max(tensor)
- if len(tensor.shape) == 4:
- for i in range(tensor.shape[0]):
- for l in range(tensor.shape[1]):
- if self.color_channel == 1:
- self.tensors[step].append([tname, tensor[i, l, :, :]])
- elif self.color_channel == 3:
- self.tensors[step].append([tname, tensor[i, :, :, l]])
- elif len(tensor.shape) == 2:
- self.tensors[step].append([tname, tensor])
-
- # model output
- if self.prediction is not None:
- for tname in self.trial.tensor_names(regex=self.prediction):
- tensor = self.trial.tensor(tname).value(step)
-                    # predicted class (batch size, probabilities per class)
- if len(tensor.shape) == 2:
- self.output[step] = np.array([np.argmax(tensor, axis=1)[0]])
-                # predicts an image (batch size, color channel, width, height)
- elif len(tensor.shape) == 4:
- # MXNet has color channel in dim1
- if self.color_channel == 1:
- self.output[step] = tensor[0, 0, :, :]
- # TF has color channel in dim 3
- elif self.color_channel == 3:
- self.output[step] = tensor[0, :, :, 0]
-
- # Configure the plot layout
- def set_figure(self):
- self.fig = go.Figure(
- layout=go.Layout(
- autosize=False,
- title=self.title,
- width=1000,
- height=800,
- template="plotly_dark",
- font=dict(color="gray"),
- showlegend=False,
- updatemenus=[
- dict(
- type="buttons",
- buttons=[
- dict(
- label="Play",
- method="animate",
- args=[
- None,
- {
- "frame": {"duration": 1, "redraw": True},
- "fromcurrent": True,
- "transition": {"duration": 1},
- },
- ],
- )
- ],
- )
- ],
- scene=dict(
- xaxis=dict(
- range=[-self.max_dim / 2, self.max_dim / 2],
- autorange=False,
- gridcolor="black",
- zerolinecolor="black",
- showgrid=False,
- showline=False,
- showticklabels=False,
- showspikes=False,
- ),
- yaxis=dict(
- range=[-self.max_dim / 2, self.max_dim / 2],
- autorange=False,
- gridcolor="black",
- zerolinecolor="black",
- showgrid=False,
- showline=False,
- showticklabels=False,
- showspikes=False,
- ),
- zaxis=dict(
- gridcolor="black",
- zerolinecolor="black",
- showgrid=False,
- showline=False,
- showticklabels=False,
- showspikes=False,
- ),
- ),
- )
- )
-
-    # Create a sequence of frames: tensors from the same step will be stored in the same frame
- def set_frames(self):
- frames = []
- available_steps = self.trial.steps()
- for step in available_steps[0 : self.steps]:
- layers = []
- if self.label is not None:
- if len(self.input[step].shape) == 2:
- # plot predicted image
- layers.append({"type": "surface", "surfacecolor": self.input[step]})
- for i in range(len(self.tensors[step])):
- if len(self.tensors[step][i][1].shape) == 1:
- # set color of fully connected layer for corresponding step
- layers.append(
- {"type": "scatter3d", "marker": {"color": self.tensors[step][i][1]}}
- )
- elif len(self.tensors[step][i][1].shape) == 2:
- # set color of convolutional/pooling layer for corresponding step
- layers.append({"type": "surface", "surfacecolor": self.tensors[step][i][1]})
- if self.prediction is not None:
- if len(self.output[step].shape) == 1:
- # plot predicted class for first input in batch
- layers.append(
- {
- "type": "scatter3d",
- "text": "Predicted class " + str(self.output[step][0]),
- "textfont": {"size": 40},
- }
- )
- elif len(self.output[step].shape) == 2:
- # plot predicted image
- layers.append({"type": "surface", "surfacecolor": self.output[step]})
- frames.append(go.Frame(data=layers))
-
- self.fig.frames = frames
-
-    # Plot the different neural network layers.
-    # If batch_sample_id is set, convolutions are plotted as
-    # Surface and dense layers are plotted as Scatter3d.
-    # If batch_sample_id is None, convolutions and dense layers
-    # are plotted as Surface. We don't plot biases.
-    # If a convolution has shape [batch_size, 10, 24, 24] and batch_sample_id is set,
-    # then this function will plot 10 Surface layers of size 24x24.
- def plot_network(self):
- tensors = []
- dist = 0
- counter = 0
-
- first_step = self.trial.steps()[0]
- if self.label is not None:
- tensor = self.input[first_step].shape
- if len(tensor) == 2:
- tensors.append(
- go.Surface(
- z=np.zeros((tensor[0], tensor[1])) + self.dist,
- y=np.arange(-tensor[0] / 2, tensor[0] / 2),
- x=np.arange(-tensor[1] / 2, tensor[1] / 2),
- surfacecolor=self.input[first_step],
- showscale=False,
- colorscale="gray",
- opacity=0.7,
- )
- )
- self.dist += 2
- prev_name = None
- for tname, layer in self.tensors[first_step]:
- tensor = layer.shape
-
- if len(tensor) == 2:
- tensors.append(
- go.Surface(
- z=np.zeros((tensor[0], tensor[1])) + self.dist,
- y=np.arange(-tensor[0] / 2, tensor[0] / 2),
- x=np.arange(-tensor[1] / 2, tensor[1] / 2),
- text=tname,
- surfacecolor=layer,
- showscale=False,
- # colorscale='gray',
- opacity=0.7,
- )
- )
-
- elif len(tensor) == 1:
- tensors.append(
- go.Scatter3d(
- z=np.zeros(tensor[0]) + self.dist,
- y=np.zeros(tensor[0]),
- x=np.arange(-tensor[0] / 2, tensor[0] / 2),
- text=tname,
- mode="markers",
- marker=dict(size=3, opacity=0.7, color=layer),
- )
- )
- if tname == prev_name:
- self.dist += 0.2
- else:
- self.dist += 1
- counter += 1
- prev_name = tname
- # plot model output
- if self.prediction is not None:
-            # model predicts a class label (batch_size, class probabilities)
- if len(self.output[first_step].shape) == 1:
- tensors.append(
- go.Scatter3d(
- z=np.array([self.dist + 0.2]),
- x=np.array([0]),
- y=np.array([0]),
- text="Predicted class",
- mode="markers+text",
- marker=dict(size=3, color="black"),
- textfont=dict(size=18),
- opacity=0.7,
- )
- )
- # model predicts an output image (batch size, color channel, width, height)
- elif len(self.output[first_step].shape) == 2:
- tensor = self.output[first_step].shape
- tensors.append(
- go.Surface(
- z=np.zeros((tensor[0], tensor[1])) + self.dist + 3,
- y=np.arange(-tensor[0] / 2, tensor[0] / 2),
- x=np.arange(-tensor[1] / 2, tensor[1] / 2),
- text="Predicted image",
- surfacecolor=self.output[first_step],
- showscale=False,
- colorscale="gray",
- opacity=0.7,
- )
- )
-
- # add list of tensors to figure
- self.fig.add_traces(tensors)
diff --git a/examples/mxnet/notebooks/tensorplot.gif b/examples/mxnet/notebooks/tensorplot.gif
deleted file mode 100644
index 0f0df6e89..000000000
Binary files a/examples/mxnet/notebooks/tensorplot.gif and /dev/null differ
diff --git a/examples/mxnet/scripts/mnist_gluon_all_zero_demo.py b/examples/mxnet/scripts/mnist_gluon_all_zero_demo.py
deleted file mode 100644
index 669577d65..000000000
--- a/examples/mxnet/scripts/mnist_gluon_all_zero_demo.py
+++ /dev/null
@@ -1,189 +0,0 @@
-# Standard Library
-import argparse
-import random
-import time
-import uuid
-
-# Third Party
-import mxnet as mx
-import numpy as np
-from mxnet import autograd, gluon, init
-from mxnet.gluon import nn
-from mxnet.gluon.data.vision import datasets, transforms
-
-# First Party
-from smdebug.mxnet import Hook, SaveConfig, modes
-
-
-def parse_args():
- parser = argparse.ArgumentParser(
- description="Train a mxnet gluon model for FashonMNIST dataset"
- )
- parser.add_argument("--batch-size", type=int, default=256, help="Batch size")
- parser.add_argument(
- "--smdebug_path",
- type=str,
- default=f"s3://smdebug-testing/outputs/all-zero-hook/trial-{uuid.uuid4()}",
- help="S3 URI of the bucket where tensor data will be stored.",
- )
- parser.add_argument("--learning_rate", type=float, default=0.1)
- parser.add_argument("--random_seed", type=bool, default=False)
- parser.add_argument(
- "--num_steps",
- type=int,
- help="Reduce the number of training "
- "and evaluation steps to the give number if desired."
- "If this is not passed, trains for one epoch "
- "of training and validation data",
- )
- opt = parser.parse_args()
- return opt
-
-
-def acc(output, label):
- return (output.argmax(axis=1) == label.astype("float32")).mean().asscalar()
-
-
-def train_model(batch_size, net, train_data, valid_data, lr, hook, num_steps=None):
- softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
- trainer = gluon.Trainer(net.collect_params(), "sgd", {"learning_rate": lr})
- # Start the training.
- for epoch in range(1):
- train_loss, train_acc, valid_acc = 0.0, 0.0, 0.0
- tic = time.time()
- hook.set_mode(modes.TRAIN)
- for i, (data, label) in enumerate(train_data):
- if num_steps is not None and num_steps < i:
- break
- data = data.as_in_context(mx.cpu(0))
- # forward + backward
- with autograd.record():
- output = net(data)
- loss = softmax_cross_entropy(output, label)
- loss.backward()
- # update parameters
- trainer.step(batch_size)
- # calculate training metrics
- train_loss += loss.mean().asscalar()
- train_acc += acc(output, label)
- # calculate validation accuracy
- hook.set_mode(modes.EVAL)
- for i, (data, label) in enumerate(valid_data):
- if num_steps is not None and num_steps < i:
- break
- data = data.as_in_context(mx.cpu(0))
- valid_acc += acc(net(data), label)
- print(
- "Epoch %d: loss %.3f, train acc %.3f, test acc %.3f, in %.1f sec"
- % (
- epoch,
- train_loss / len(train_data),
- train_acc / len(train_data),
- valid_acc / len(valid_data),
- time.time() - tic,
- )
- )
-
-
-def prepare_data(batch_size):
- mnist_train = datasets.FashionMNIST(
- train=True, transform=lambda data, label: (data.astype(np.float32) * 0, label)
- )
- X, y = mnist_train[0]
- ("X shape: ", X.shape, "X dtype", X.dtype, "y:", y)
- text_labels = [
- "t-shirt",
- "trouser",
- "pullover",
- "dress",
- "coat",
- "sandal",
- "shirt",
- "sneaker",
- "bag",
- "ankle boot",
- ]
- X, y = mnist_train[0:10]
- transformer = transforms.Compose([transforms.ToTensor(), transforms.Normalize(0.0, 0.1)])
- mnist_train = mnist_train.transform_first(transformer)
- train_data = gluon.data.DataLoader(
- mnist_train, batch_size=batch_size, shuffle=True, num_workers=4
- )
- mnist_valid = gluon.data.vision.FashionMNIST(train=False)
- valid_data = gluon.data.DataLoader(
- mnist_valid.transform_first(transformer), batch_size=batch_size, num_workers=4
- )
- return train_data, valid_data
-
-
-# Create a model using the gluon API. The hook currently
-# supports MXNet gluon models only.
-def create_gluon_model():
- # Create Model in Gluon
- net = nn.HybridSequential()
- net.add(
- nn.Conv2D(channels=6, kernel_size=5, activation="relu"),
- nn.MaxPool2D(pool_size=2, strides=2),
- nn.Conv2D(channels=16, kernel_size=3, activation="relu"),
- nn.MaxPool2D(pool_size=2, strides=2),
- nn.Flatten(),
- nn.Dense(120, activation="relu"),
- nn.Dense(84, activation="relu"),
- nn.Dense(10),
- )
- net.initialize(init=init.Xavier(), ctx=mx.cpu())
- return net
-
-
-# Create a hook. The initialization of the hook determines which tensors
-# are logged while training is in progress.
-# The following function shows an initialization that enables logging of
-# weights, biases, gradients and ReLU activations in the model.
-def create_hook(output_s3_uri):
- # With the following SaveConfig, we will save tensors for steps 0, 1, 2 and 3
- # (indexing starts with 0).
- save_config = SaveConfig(save_steps=[0, 1, 2, 3])
-    # Create a hook that logs weights, biases, gradients and ReLU activations while training the model.
- hook = Hook(
- out_dir=output_s3_uri,
- save_config=save_config,
- include_collections=["ReluActivation", "weights", "biases", "gradients"],
- )
- hook.get_collection("ReluActivation").include(["relu*", "input_*"])
- return hook
-
-
-def main():
- opt = parse_args()
-
-    # These random seeds are only intended for test purposes.
-    # For now, seeds 128, 12 and 2 guarantee that the test asserts do not fail.
-    # If you wish to change them, note that the tensor values at certain steps may vary.
- if opt.random_seed:
- mx.random.seed(128)
- random.seed(12)
- np.random.seed(2)
-
- # Create a Gluon Model.
- net = create_gluon_model()
-
- # Create a hook for logging the desired tensors.
-    # The output_s3_uri is the URI of the S3 bucket where the tensors will be saved.
- # The trial_id is used to store the tensors from different trials separately.
- output_uri = opt.smdebug_path
- hook = create_hook(output_uri)
-
- # Register the hook to the top block.
- hook.register_hook(net)
-
- # Start the training.
- batch_size = opt.batch_size
- train_data, valid_data = prepare_data(batch_size)
-
- train_model(
- batch_size, net, train_data, valid_data, opt.learning_rate, hook, num_steps=opt.num_steps
- )
-
-
-if __name__ == "__main__":
- main()
diff --git a/examples/mxnet/scripts/mnist_gluon_block_input_output_demo.py b/examples/mxnet/scripts/mnist_gluon_block_input_output_demo.py
deleted file mode 100644
index 58da53a2e..000000000
--- a/examples/mxnet/scripts/mnist_gluon_block_input_output_demo.py
+++ /dev/null
@@ -1,185 +0,0 @@
-# Standard Library
-import argparse
-import time
-import uuid
-
-# Third Party
-import mxnet as mx
-from mxnet import autograd, gluon, init
-from mxnet.gluon import nn
-from mxnet.gluon.data.vision import datasets, transforms
-
-# First Party
-from smdebug.mxnet import Hook, SaveConfig, modes
-
-
-def parse_args():
- parser = argparse.ArgumentParser(
- description="Train a mxnet gluon model for FashionMNIST dataset"
- )
- parser.add_argument("--batch-size", type=int, default=256, help="Batch size")
- parser.add_argument(
- "--output-s3-uri",
- type=str,
- default=f"s3://smdebug-testing/outputs/block-io-mxnet-hook-{uuid.uuid4()}",
- help="S3 URI of the bucket where tensor data will be stored.",
- )
- parser.add_argument(
- "--smdebug_path",
- type=str,
- default=None,
- help="S3 URI of the bucket where tensor data will be stored.",
- )
- opt = parser.parse_args()
- return opt
-
-
-def acc(output, label):
- return (output.argmax(axis=1) == label.astype("float32")).mean().asscalar()
-
-
-def train_model(batch_size, net, train_data, valid_data, hook):
- softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
- trainer = gluon.Trainer(net.collect_params(), "sgd", {"learning_rate": 0.1})
- # Start the training.
- for epoch in range(1):
- train_loss, train_acc, valid_acc = 0.0, 0.0, 0.0
- tic = time.time()
- hook.set_mode(modes.TRAIN)
- for data, label in train_data:
- data = data.as_in_context(mx.cpu(0))
- # forward + backward
- with autograd.record():
- output = net(data)
- loss = softmax_cross_entropy(output, label)
- loss.backward()
- # update parameters
- trainer.step(batch_size)
- # calculate training metrics
- train_loss += loss.mean().asscalar()
- train_acc += acc(output, label)
- # calculate validation accuracy
- hook.set_mode(modes.EVAL)
- for data, label in valid_data:
- data = data.as_in_context(mx.cpu(0))
- valid_acc += acc(net(data), label)
- print(
- "Epoch %d: loss %.3f, train acc %.3f, test acc %.3f, in %.1f sec"
- % (
- epoch,
- train_loss / len(train_data),
- train_acc / len(train_data),
- valid_acc / len(valid_data),
- time.time() - tic,
- )
- )
-
-
-def prepare_data(batch_size):
- mnist_train = datasets.FashionMNIST(train=True)
- X, y = mnist_train[0]
- ("X shape: ", X.shape, "X dtype", X.dtype, "y:", y)
- text_labels = [
- "t-shirt",
- "trouser",
- "pullover",
- "dress",
- "coat",
- "sandal",
- "shirt",
- "sneaker",
- "bag",
- "ankle boot",
- ]
- X, y = mnist_train[0:10]
- transformer = transforms.Compose([transforms.ToTensor(), transforms.Normalize(0.13, 0.31)])
- mnist_train = mnist_train.transform_first(transformer)
- train_data = gluon.data.DataLoader(
- mnist_train, batch_size=batch_size, shuffle=True, num_workers=4
- )
- mnist_valid = gluon.data.vision.FashionMNIST(train=False)
- valid_data = gluon.data.DataLoader(
- mnist_valid.transform_first(transformer), batch_size=batch_size, num_workers=4
- )
- return train_data, valid_data
-
-
-# Create a model using the gluon API. The hook currently
-# supports MXNet gluon models only.
-def create_gluon_model():
- # Create Model in Gluon
- child_blocks = []
- net = nn.HybridSequential()
- conv2d_0 = nn.Conv2D(channels=6, kernel_size=5, activation="relu")
- child_blocks.append(conv2d_0)
- maxpool2d_0 = nn.MaxPool2D(pool_size=2, strides=2)
- child_blocks.append(maxpool2d_0)
- conv2d_1 = nn.Conv2D(channels=16, kernel_size=3, activation="relu")
- child_blocks.append(conv2d_1)
- maxpool2d_1 = nn.MaxPool2D(pool_size=2, strides=2)
- child_blocks.append(maxpool2d_1)
- flatten_0 = nn.Flatten()
- child_blocks.append(flatten_0)
- dense_0 = nn.Dense(120, activation="relu")
- child_blocks.append(dense_0)
- dense_1 = nn.Dense(84, activation="relu")
- child_blocks.append(dense_1)
- dense_2 = nn.Dense(10)
- child_blocks.append(dense_2)
-
- net.add(conv2d_0, maxpool2d_0, conv2d_1, maxpool2d_1, flatten_0, dense_0, dense_1, dense_2)
- net.initialize(init=init.Xavier(), ctx=mx.cpu())
- return net, child_blocks
-
-
-# Create a hook. The initialization of hook determines which tensors
-# are logged while training is in progress.
-# The following function shows the hook initialization that enables logging of
-# weights, biases and gradients in the model along with the inputs and output of the given
-# child block.
-def create_hook(output_s3_uri, block):
- # Create a SaveConfig that determines tensors from which steps are to be stored.
- # With the following SaveConfig, we will save tensors for steps 1, 2 and 3.
- save_config = SaveConfig(save_steps=[1, 2, 3])
-
- # Create a hook that logs weights, biases, gradients and inputs outputs of model while training.
- hook = Hook(
- out_dir=output_s3_uri,
- save_config=save_config,
- include_collections=["weights", "gradients", "biases", block.name],
- )
-
-    # The names of the input and output tensors of a block are in the following format:
- # Inputs : _input_, and
- # Output : _output
- # In order to log the inputs and output of a model, we will create a collection as follows
- hook.get_collection(block.name).add_block_tensors(block, inputs=True, outputs=True)
- return hook
-
-
-def main():
- opt = parse_args()
- # Create a Gluon Model.
- net, child_blocks = create_gluon_model()
-
- # Create a hook for logging the desired tensors.
-    # The output_s3_uri is the URI of the S3 bucket where the tensors will be saved.
- output_s3_uri = opt.smdebug_path if opt.smdebug_path is not None else opt.output_s3_uri
-
- # For creating a hook that can log inputs and output of the specific child block in the model,
- # we will pass the desired block object to the create_hook function.
-    # In the following case, we are attempting to log the inputs and output of the first Conv2D block.
- hook = create_hook(output_s3_uri, child_blocks[0])
-
- # Register the hook to the top block.
- hook.register_hook(net)
-
- # Start the training.
- batch_size = opt.batch_size
- train_data, valid_data = prepare_data(batch_size)
-
- train_model(batch_size, net, train_data, valid_data, hook)
-
-
-if __name__ == "__main__":
- main()
diff --git a/examples/mxnet/scripts/mnist_gluon_model_input_output_demo.py b/examples/mxnet/scripts/mnist_gluon_model_input_output_demo.py
deleted file mode 100644
index 1bfc8e4c1..000000000
--- a/examples/mxnet/scripts/mnist_gluon_model_input_output_demo.py
+++ /dev/null
@@ -1,174 +0,0 @@
-# Standard Library
-import argparse
-import time
-import uuid
-
-# Third Party
-import mxnet as mx
-from mxnet import autograd, gluon, init
-from mxnet.gluon import nn
-from mxnet.gluon.data.vision import datasets, transforms
-
-# First Party
-from smdebug.mxnet import Hook, SaveConfig, modes
-
-
-def parse_args():
- parser = argparse.ArgumentParser(
- description="Train a mxnet gluon model for FashonMNIST dataset"
- )
- parser.add_argument("--batch-size", type=int, default=256, help="Batch size")
- parser.add_argument(
- "--output-s3-uri",
- type=str,
- default=f"s3://smdebug-testing/outputs/model-io-mxnet-hook-{uuid.uuid4()}",
- help="S3 URI of the bucket where tensor data will be stored.",
- )
- parser.add_argument(
- "--smdebug_path",
- type=str,
- default=None,
- help="S3 URI of the bucket where tensor data will be stored.",
- )
- opt = parser.parse_args()
- return opt
-
-
-def acc(output, label):
- return (output.argmax(axis=1) == label.astype("float32")).mean().asscalar()
-
-
-def train_model(batch_size, net, train_data, valid_data, hook):
- softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
- trainer = gluon.Trainer(net.collect_params(), "sgd", {"learning_rate": 0.1})
- # Start the training.
- for epoch in range(1):
- train_loss, train_acc, valid_acc = 0.0, 0.0, 0.0
- tic = time.time()
- hook.set_mode(modes.TRAIN)
- for data, label in train_data:
- data = data.as_in_context(mx.cpu(0))
- # forward + backward
- with autograd.record():
- output = net(data)
- loss = softmax_cross_entropy(output, label)
- loss.backward()
- # update parameters
- trainer.step(batch_size)
- # calculate training metrics
- train_loss += loss.mean().asscalar()
- train_acc += acc(output, label)
- # calculate validation accuracy
- hook.set_mode(modes.EVAL)
- for data, label in valid_data:
- data = data.as_in_context(mx.cpu(0))
- valid_acc += acc(net(data), label)
- print(
- "Epoch %d: loss %.3f, train acc %.3f, test acc %.3f, in %.1f sec"
- % (
- epoch,
- train_loss / len(train_data),
- train_acc / len(train_data),
- valid_acc / len(valid_data),
- time.time() - tic,
- )
- )
-
-
-def prepare_data(batch_size):
- mnist_train = datasets.FashionMNIST(train=True)
- X, y = mnist_train[0]
- ("X shape: ", X.shape, "X dtype", X.dtype, "y:", y)
- text_labels = [
- "t-shirt",
- "trouser",
- "pullover",
- "dress",
- "coat",
- "sandal",
- "shirt",
- "sneaker",
- "bag",
- "ankle boot",
- ]
- X, y = mnist_train[0:10]
- transformer = transforms.Compose([transforms.ToTensor(), transforms.Normalize(0.13, 0.31)])
- mnist_train = mnist_train.transform_first(transformer)
- train_data = gluon.data.DataLoader(
- mnist_train, batch_size=batch_size, shuffle=True, num_workers=4
- )
- mnist_valid = gluon.data.vision.FashionMNIST(train=False)
- valid_data = gluon.data.DataLoader(
- mnist_valid.transform_first(transformer), batch_size=batch_size, num_workers=4
- )
- return train_data, valid_data
-
-
-# Create a model using the gluon API. The hook currently
-# supports MXNet gluon models only.
-def create_gluon_model():
- # Create Model in Gluon
- net = nn.HybridSequential()
- net.add(
- nn.Conv2D(channels=6, kernel_size=5, activation="relu"),
- nn.MaxPool2D(pool_size=2, strides=2),
- nn.Conv2D(channels=16, kernel_size=3, activation="relu"),
- nn.MaxPool2D(pool_size=2, strides=2),
- nn.Flatten(),
- nn.Dense(120, activation="relu"),
- nn.Dense(84, activation="relu"),
- nn.Dense(10),
- )
- net.initialize(init=init.Xavier(), ctx=mx.cpu())
- return net
-
-
-# Create a hook. The initialization of hook determines which tensors
-# are logged while training is in progress.
-# The following function shows the hook initialization that enables logging of
-# weights, biases and gradients in the model along with the inputs and outputs of the model.
-def create_hook(output_s3_uri, block):
- # Create a SaveConfig that determines tensors from which steps are to be stored.
- # With the following SaveConfig, we will save tensors for steps 1, 2 and 3.
- save_config = SaveConfig(save_steps=[1, 2, 3])
-
- # Create a hook that logs weights, biases, gradients and inputs outputs of model while training.
- hook = Hook(
- out_dir=output_s3_uri,
- save_config=save_config,
- include_collections=["weights", "gradients", "biases", "TopBlock"],
- )
-
-    # The names of the input and output tensors of a block are in the following format:
- # Inputs : _input_, and
- # Output : _output
- # In order to log the inputs and output of a model, we will create a collection as follows:
- hook.get_collection("TopBlock").add_block_tensors(block, inputs=True, outputs=True)
- return hook
-
-
-def main():
- opt = parse_args()
- # Create a Gluon Model.
- net = create_gluon_model()
-
- # Create a hook for logging the desired tensors.
-    # The output_s3_uri is the URI of the S3 bucket where the tensors will be saved.
- output_s3_uri = opt.smdebug_path if opt.smdebug_path is not None else opt.output_s3_uri
-
- # For creating a hook that can log inputs and output of the model,
- # we will pass the top block object to the create_hook function.
- hook = create_hook(output_s3_uri, net)
-
- # Register the hook to the top block.
- hook.register_hook(net)
-
- # Start the training.
- batch_size = opt.batch_size
- train_data, valid_data = prepare_data(batch_size)
-
- train_model(batch_size, net, train_data, valid_data, hook)
-
-
-if __name__ == "__main__":
- main()
diff --git a/examples/mxnet/scripts/mnist_gluon_realtime_visualize_demo.py b/examples/mxnet/scripts/mnist_gluon_realtime_visualize_demo.py
deleted file mode 100644
index 23994a0f2..000000000
--- a/examples/mxnet/scripts/mnist_gluon_realtime_visualize_demo.py
+++ /dev/null
@@ -1,151 +0,0 @@
-# Standard Library
-import argparse
-import time
-
-# Third Party
-import mxnet as mx
-from mxnet import autograd, gluon, init
-from mxnet.gluon import nn
-from mxnet.gluon.data.vision import datasets, transforms
-
-# First Party
-from smdebug import SaveConfig, modes
-from smdebug.mxnet import Hook
-
-
-def parse_args():
- parser = argparse.ArgumentParser(
- description="Train a mxnet gluon model for FashonMNIST dataset"
- )
- parser.add_argument("--batch-size", type=int, default=256, help="Batch size")
- parser.add_argument(
- "--epochs", type=int, default=50, help="Amount of epochs to run training loop"
- )
- parser.add_argument("--learning_rate", type=float, default=0.1)
- opt = parser.parse_args()
- return opt
-
-
-def acc(output, label):
- return (output.argmax(axis=1) == label.astype("float32")).mean().asscalar()
-
-
-def train_model(batch_size, net, train_data, valid_data, lr, epochs, hook):
- softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
- trainer = gluon.Trainer(net.collect_params(), "sgd", {"learning_rate": lr})
- # Start the training.
- for epoch in range(epochs):
- train_loss, train_acc, valid_acc = 0.0, 0.0, 0.0
- tic = time.time()
- hook.set_mode(modes.TRAIN)
- for data, label in train_data:
- data = data.as_in_context(mx.cpu(0))
- # forward + backward
- with autograd.record():
- output = net(data)
- loss = softmax_cross_entropy(output, label)
- loss.backward()
- # update parameters
- trainer.step(batch_size)
- # calculate training metrics
- train_loss += loss.mean().asscalar()
- train_acc += acc(output, label)
- # calculate validation accuracy
- hook.set_mode(modes.EVAL)
- for data, label in valid_data:
- data = data.as_in_context(mx.cpu(0))
- valid_acc += acc(net(data), label)
- print(
- "Epoch %d: loss %.3f, train acc %.3f, test acc %.3f, in %.1f sec"
- % (
- epoch,
- train_loss / len(train_data),
- train_acc / len(train_data),
- valid_acc / len(valid_data),
- time.time() - tic,
- )
- )
-
-
-def prepare_data(batch_size):
- mnist_train = datasets.FashionMNIST(train=True)
- X, y = mnist_train[0]
- ("X shape: ", X.shape, "X dtype", X.dtype, "y:", y)
- text_labels = [
- "t-shirt",
- "trouser",
- "pullover",
- "dress",
- "coat",
- "sandal",
- "shirt",
- "sneaker",
- "bag",
- "ankle boot",
- ]
- X, y = mnist_train[0:10]
- transformer = transforms.Compose([transforms.ToTensor(), transforms.Normalize(0.13, 0.31)])
- mnist_train = mnist_train.transform_first(transformer)
- train_data = gluon.data.DataLoader(
- mnist_train, batch_size=batch_size, shuffle=True, num_workers=4
- )
- mnist_valid = gluon.data.vision.FashionMNIST(train=False)
- valid_data = gluon.data.DataLoader(
- mnist_valid.transform_first(transformer), batch_size=batch_size, num_workers=4
- )
- return train_data, valid_data
-
-
-# Create a model using the gluon API. The hook currently
-# supports MXNet gluon models only.
-def create_gluon_model():
- # Create Model in Gluon
- net = nn.HybridSequential(prefix="sequential_")
- net.add(
- nn.Conv2D(channels=6, kernel_size=5, activation="relu"),
- nn.MaxPool2D(pool_size=2, strides=2),
- nn.Conv2D(channels=16, kernel_size=3, activation="relu"),
- nn.MaxPool2D(pool_size=2, strides=2),
- nn.Flatten(),
- nn.Dense(120, activation="relu"),
- nn.Dense(84, activation="relu"),
- nn.Dense(10),
- )
- net.initialize(init=init.Xavier(), ctx=mx.cpu())
- return net
-
-
-# Create a hook. The initialization of hook determines which tensors
-# are logged while training is in progress.
-# The following function shows the initialization that enables logging of
-# all tensors in the model.
-def create_hook():
-    # With the following SaveConfig, we will save tensors every 100 steps
- save_config = SaveConfig(save_interval=100)
-
-    # Create a hook that logs all tensors while training the model.
- hook = Hook(save_config=save_config, save_all=True)
- return hook
-
-
-def main():
- opt = parse_args()
-
- # Create a Gluon Model.
- net = create_gluon_model()
-
- # Create a hook for logging all tensors.
- hook = create_hook()
-
- # Register the hook to the top block.
- hook.register_hook(net)
-
- # Start the training.
- batch_size = opt.batch_size
- train_data, valid_data = prepare_data(batch_size)
-
- train_model(batch_size, net, train_data, valid_data, opt.learning_rate, opt.epochs, hook)
-
-
-if __name__ == "__main__":
- main()
diff --git a/examples/mxnet/scripts/mnist_gluon_save_all_demo.py b/examples/mxnet/scripts/mnist_gluon_save_all_demo.py
deleted file mode 100644
index c24d1562a..000000000
--- a/examples/mxnet/scripts/mnist_gluon_save_all_demo.py
+++ /dev/null
@@ -1,161 +0,0 @@
-# Standard Library
-import argparse
-import time
-import uuid
-
-# Third Party
-import mxnet as mx
-from mxnet import autograd, gluon, init
-from mxnet.gluon import nn
-from mxnet.gluon.data.vision import datasets, transforms
-
-# First Party
-from smdebug.mxnet import Hook, SaveConfig, modes
-
-
-def parse_args():
- parser = argparse.ArgumentParser(
- description="Train a mxnet gluon model for FashonMNIST dataset"
- )
- parser.add_argument("--batch-size", type=int, default=256, help="Batch size")
- parser.add_argument(
- "--output-s3-uri",
- type=str,
- default=f"s3://smdebug-testing/outputs/saveall-mxnet-hook-{uuid.uuid4()}",
- help="S3 URI of the bucket where tensor data will be stored.",
- )
- parser.add_argument(
- "--smdebug_path",
- type=str,
- default=None,
- help="S3 URI of the bucket where tensor data will be stored.",
- )
- opt = parser.parse_args()
- return opt
-
-
-def acc(output, label):
- return (output.argmax(axis=1) == label.astype("float32")).mean().asscalar()
-
-
-def train_model(batch_size, net, train_data, valid_data, hook):
- softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
- trainer = gluon.Trainer(net.collect_params(), "sgd", {"learning_rate": 0.1})
- # Start the training.
- for epoch in range(1):
- train_loss, train_acc, valid_acc = 0.0, 0.0, 0.0
- tic = time.time()
- hook.set_mode(modes.TRAIN)
- for data, label in train_data:
- data = data.as_in_context(mx.cpu(0))
- # forward + backward
- with autograd.record():
- output = net(data)
- loss = softmax_cross_entropy(output, label)
- loss.backward()
- # update parameters
- trainer.step(batch_size)
- # calculate training metrics
- train_loss += loss.mean().asscalar()
- train_acc += acc(output, label)
- # calculate validation accuracy
- hook.set_mode(modes.EVAL)
- for data, label in valid_data:
- data = data.as_in_context(mx.cpu(0))
- valid_acc += acc(net(data), label)
- print(
- "Epoch %d: loss %.3f, train acc %.3f, test acc %.3f, in %.1f sec"
- % (
- epoch,
- train_loss / len(train_data),
- train_acc / len(train_data),
- valid_acc / len(valid_data),
- time.time() - tic,
- )
- )
-
-
-def prepare_data(batch_size):
- mnist_train = datasets.FashionMNIST(train=True)
- X, y = mnist_train[0]
- ("X shape: ", X.shape, "X dtype", X.dtype, "y:", y)
- text_labels = [
- "t-shirt",
- "trouser",
- "pullover",
- "dress",
- "coat",
- "sandal",
- "shirt",
- "sneaker",
- "bag",
- "ankle boot",
- ]
- X, y = mnist_train[0:10]
- transformer = transforms.Compose([transforms.ToTensor(), transforms.Normalize(0.13, 0.31)])
- mnist_train = mnist_train.transform_first(transformer)
- train_data = gluon.data.DataLoader(
- mnist_train, batch_size=batch_size, shuffle=True, num_workers=4
- )
- mnist_valid = gluon.data.vision.FashionMNIST(train=False)
- valid_data = gluon.data.DataLoader(
- mnist_valid.transform_first(transformer), batch_size=batch_size, num_workers=4
- )
- return train_data, valid_data
-
-
-# Create a model using the gluon API. The hook currently
-# supports MXNet gluon models only.
-def create_gluon_model():
- # Create Model in Gluon
- net = nn.HybridSequential()
- net.add(
- nn.Conv2D(channels=6, kernel_size=5, activation="relu"),
- nn.MaxPool2D(pool_size=2, strides=2),
- nn.Conv2D(channels=16, kernel_size=3, activation="relu"),
- nn.MaxPool2D(pool_size=2, strides=2),
- nn.Flatten(),
- nn.Dense(120, activation="relu"),
- nn.Dense(84, activation="relu"),
- nn.Dense(10),
- )
- net.initialize(init=init.Xavier(), ctx=mx.cpu())
- return net
-
-
-# Create a hook. The initialization of hook determines which tensors
-# are logged while training is in progress.
-# The following function shows the initialization of the hook that enables logging of
-# all the tensors in the model.
-def create_hook(output_s3_uri):
- # Create a SaveConfig that determines tensors from which steps are to be stored.
- # With the following SaveConfig, we will save tensors for steps 1, 2 and 3.
- save_config = SaveConfig(save_steps=[1, 2, 3])
-
- # Create a hook that logs all the tensors seen while training the model.
- hook = Hook(out_dir=output_s3_uri, save_config=save_config, save_all=True)
- return hook
-
-
-def main():
- opt = parse_args()
- # Create a Gluon Model.
- net = create_gluon_model()
-
- # Create a hook for logging the desired tensors.
-    # The output_s3_uri is the URI of the S3 bucket where the tensors will be saved.
- output_s3_uri = opt.smdebug_path if opt.smdebug_path is not None else opt.output_s3_uri
- hook = create_hook(output_s3_uri)
-
- # Register the hook to the top block.
- hook.register_hook(net)
-
- # Start the training.
- batch_size = opt.batch_size
- train_data, valid_data = prepare_data(batch_size)
-
- train_model(batch_size, net, train_data, valid_data, hook)
-
-
-if __name__ == "__main__":
- main()
diff --git a/examples/mxnet/scripts/mnist_mxnet.py b/examples/mxnet/scripts/mnist_mxnet.py
deleted file mode 100644
index 321bb2a1e..000000000
--- a/examples/mxnet/scripts/mnist_mxnet.py
+++ /dev/null
@@ -1,173 +0,0 @@
-# Standard Library
-import argparse
-import random
-import time
-
-# Third Party
-import mxnet as mx
-import numpy as np
-from mxnet import autograd, gluon, init
-from mxnet.gluon import nn
-from mxnet.gluon.data.vision import datasets, transforms
-
-# First Party
-from smdebug import SaveConfig, modes
-from smdebug.mxnet import Hook
-
-
-def parse_args():
- parser = argparse.ArgumentParser(
- description="Train a mxnet gluon model for FashonMNIST dataset"
- )
- parser.add_argument("--batch-size", type=int, default=256, help="Batch size")
- parser.add_argument(
- "--output-uri",
- type=str,
- default="/opt/ml/output/tensors/smdebug",
- help="S3 URI of the bucket where tensor data will be stored.",
- )
- parser.add_argument("--learning_rate", type=float, default=0.1)
- parser.add_argument("--random_seed", type=bool, default=False)
- opt = parser.parse_args()
- return opt
-
-
-def acc(output, label):
- return (output.argmax(axis=1) == label.astype("float32")).mean().asscalar()
-
-
-def train_model(batch_size, net, train_data, valid_data, lr, hook):
- softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
- trainer = gluon.Trainer(net.collect_params(), "sgd", {"learning_rate": lr})
- # Start the training.
- for epoch in range(50):
- train_loss, train_acc, valid_acc = 0.0, 0.0, 0.0
- tic = time.time()
- hook.set_mode(modes.TRAIN)
- for data, label in train_data:
- data = data.as_in_context(mx.cpu(0))
- # forward + backward
- with autograd.record():
- output = net(data)
- loss = softmax_cross_entropy(output, label)
- loss.backward()
- # update parameters
- trainer.step(batch_size)
- # calculate training metrics
- train_loss += loss.mean().asscalar()
- train_acc += acc(output, label)
- # calculate validation accuracy
- hook.set_mode(modes.EVAL)
- for data, label in valid_data:
- data = data.as_in_context(mx.cpu(0))
- valid_acc += acc(net(data), label)
- print(
- "Epoch %d: loss %.3f, train acc %.3f, test acc %.3f, in %.1f sec"
- % (
- epoch,
- train_loss / len(train_data),
- train_acc / len(train_data),
- valid_acc / len(valid_data),
- time.time() - tic,
- )
- )
-
-
-def prepare_data(batch_size):
- mnist_train = datasets.FashionMNIST(train=True)
- X, y = mnist_train[0]
- ("X shape: ", X.shape, "X dtype", X.dtype, "y:", y)
- text_labels = [
- "t-shirt",
- "trouser",
- "pullover",
- "dress",
- "coat",
- "sandal",
- "shirt",
- "sneaker",
- "bag",
- "ankle boot",
- ]
- X, y = mnist_train[0:10]
- transformer = transforms.Compose([transforms.ToTensor(), transforms.Normalize(0.13, 0.31)])
- mnist_train = mnist_train.transform_first(transformer)
- train_data = gluon.data.DataLoader(
- mnist_train, batch_size=batch_size, shuffle=True, num_workers=4
- )
- mnist_valid = gluon.data.vision.FashionMNIST(train=False)
- valid_data = gluon.data.DataLoader(
- mnist_valid.transform_first(transformer), batch_size=batch_size, num_workers=4
- )
- return train_data, valid_data
-
-
-# Create a model using the gluon API. The hook currently
-# supports MXNet gluon models only.
-def create_gluon_model():
- # Create Model in Gluon
- net = nn.HybridSequential()
- net.add(
- nn.Conv2D(channels=6, kernel_size=5, activation="relu"),
- nn.MaxPool2D(pool_size=2, strides=2),
- nn.Conv2D(channels=16, kernel_size=3, activation="relu"),
- nn.MaxPool2D(pool_size=2, strides=2),
- nn.Flatten(),
- nn.Dense(120, activation="relu"),
- nn.Dense(84, activation="relu"),
- nn.Dense(10),
- )
- net.initialize(init=init.Xavier(), ctx=mx.cpu())
- return net
-
-
-# Create a hook. The initialization of hook determines which tensors
-# are logged while training is in progress.
-# The following function shows the default initialization that enables logging of
-# weights, biases and gradients in the model.
-def create_hook(output_uri):
-    # With the following SaveConfig, we will save tensors at every step
-    # (save_interval=1).
- save_config = SaveConfig(save_interval=1)
-
- # Create a hook that logs weights, biases and gradients while training the model.
- hook = Hook(
- out_dir=output_uri,
- save_config=save_config,
- include_collections=["weights", "gradients", "biases"],
- )
- return hook
-
-
-def main():
- opt = parse_args()
-
-    # These random seeds are only intended for test purposes.
-    # For now, seeds 128, 12 and 2 guarantee that the test asserts do not fail.
-    # If you wish to change them, note that the tensor values at certain steps may vary.
- if opt.random_seed:
- mx.random.seed(128)
- random.seed(12)
- np.random.seed(2)
-
- # Create a Gluon Model.
- net = create_gluon_model()
-
- # Create a hook for logging the desired tensors.
-    # The output_s3_uri is the URI of the S3 bucket where the tensors will be saved.
- # The trial_id is used to store the tensors from different trials separately.
- output_uri = opt.output_uri
- hook = create_hook(output_uri)
-
- # Register the hook to the top block.
- hook.register_hook(net)
-
- # Start the training.
- batch_size = opt.batch_size
- train_data, valid_data = prepare_data(batch_size)
-
- train_model(batch_size, net, train_data, valid_data, opt.learning_rate, hook)
-
-
-if __name__ == "__main__":
- main()
diff --git a/examples/mxnet/scripts/mnist_mxnet_hvd.py b/examples/mxnet/scripts/mnist_mxnet_hvd.py
deleted file mode 100644
index d0f9447c8..000000000
--- a/examples/mxnet/scripts/mnist_mxnet_hvd.py
+++ /dev/null
@@ -1,203 +0,0 @@
-# Standard Library
-import argparse
-import logging
-import os
-import time
-import zipfile
-
-# Third Party
-import horovod.mxnet as hvd
-import mxnet as mx
-from mxnet import autograd, gluon
-from mxnet.test_utils import download
-
-# First Party
-from smdebug import SaveConfig, modes
-from smdebug.mxnet import Hook
-
-# Training settings
-parser = argparse.ArgumentParser(description="MXNet MNIST Example")
-
-parser.add_argument("--batch-size", type=int, default=64, help="training batch size (default: 64)")
-parser.add_argument(
- "--dtype", type=str, default="float32", help="training data type (default: float32)"
-)
-parser.add_argument("--epochs", type=int, default=5, help="number of training epochs (default: 5)")
-parser.add_argument("--lr", type=float, default=0.01, help="learning rate (default: 0.01)")
-parser.add_argument("--momentum", type=float, default=0.9, help="SGD momentum (default: 0.9)")
-parser.add_argument(
- "--no-cuda", action="store_true", default=False, help="disable training on GPU (default: False)"
-)
-parser.add_argument(
- "--output-uri",
- type=str,
- default="/opt/ml/output/tensors",
- help="S3 URI of the bucket where tensor data will be stored.",
-)
-args = parser.parse_args()
-
-if not args.no_cuda:
- # Disable CUDA if there are no GPUs.
- if not mx.test_utils.list_gpus():
- args.no_cuda = True
-
-logging.basicConfig(level=logging.INFO)
-logging.info(args)
-
-
-# Function to get mnist iterator given a rank
-def get_mnist_iterator(rank):
- data_dir = "data-%d" % rank
- if not os.path.isdir(data_dir):
- os.makedirs(data_dir)
- zip_file_path = download("http://data.mxnet.io/mxnet/data/mnist.zip", dirname=data_dir)
- with zipfile.ZipFile(zip_file_path) as zf:
- zf.extractall(data_dir)
-
- input_shape = (1, 28, 28)
- batch_size = args.batch_size
-
- train_iter = mx.io.MNISTIter(
- image="%s/train-images-idx3-ubyte" % data_dir,
- label="%s/train-labels-idx1-ubyte" % data_dir,
- input_shape=input_shape,
- batch_size=batch_size,
- shuffle=True,
- flat=False,
- num_parts=hvd.size(),
- part_index=hvd.rank(),
- )
-
- val_iter = mx.io.MNISTIter(
- image="%s/t10k-images-idx3-ubyte" % data_dir,
- label="%s/t10k-labels-idx1-ubyte" % data_dir,
- input_shape=input_shape,
- batch_size=batch_size,
- flat=False,
- )
-
- return train_iter, val_iter
-
-
-# Function to define neural network
-def conv_nets():
- net = gluon.nn.HybridSequential()
- with net.name_scope():
- net.add(gluon.nn.Conv2D(channels=20, kernel_size=5, activation="relu"))
- net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))
- net.add(gluon.nn.Conv2D(channels=50, kernel_size=5, activation="relu"))
- net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))
- net.add(gluon.nn.Flatten())
- net.add(gluon.nn.Dense(512, activation="relu"))
- net.add(gluon.nn.Dense(10))
- return net
-
-
-# Function to evaluate accuracy for a model
-def evaluate(model, data_iter, context):
- data_iter.reset()
- metric = mx.metric.Accuracy()
- for _, batch in enumerate(data_iter):
- data = batch.data[0].as_in_context(context)
- label = batch.label[0].as_in_context(context)
- output = model(data.astype(args.dtype, copy=False))
- metric.update([label], [output])
-
- return metric.get()
-
-
-# Initialize Horovod
-hvd.init()
-
-# Horovod: pin context to local rank
-context = mx.cpu(hvd.local_rank()) if args.no_cuda else mx.gpu(hvd.local_rank())
-num_workers = hvd.size()
-
-# Load training and validation data
-train_data, val_data = get_mnist_iterator(hvd.rank())
-
-# Build model
-model = conv_nets()
-model.cast(args.dtype)
-model.hybridize()
-
-# Create optimizer
-optimizer_params = {"momentum": args.momentum, "learning_rate": args.lr * hvd.size()}
-opt = mx.optimizer.create("sgd", **optimizer_params)
-
-# Initialize parameters
-initializer = mx.init.Xavier(rnd_type="gaussian", factor_type="in", magnitude=2)
-model.initialize(initializer, ctx=context)
-
-# Horovod: fetch and broadcast parameters
-params = model.collect_params()
-if params is not None:
- hvd.broadcast_parameters(params, root_rank=0)
-
-# Horovod: create DistributedTrainer, a subclass of gluon.Trainer
-trainer = hvd.DistributedTrainer(params, opt)
-
-# Create loss function and train metric
-loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
-metric = mx.metric.Accuracy()
-
-
-def create_hook():
-    # With save_interval=1, the following SaveConfig saves tensors at every
-    # training step.
- save_config = SaveConfig(save_interval=1)
-
- # Create a hook that logs weights, biases and gradients while training the model.
- ts_hook = Hook(
- out_dir=args.output_uri,
- save_config=save_config,
- include_collections=["weights", "gradients", "biases"],
- )
- return ts_hook
-
-
-# Train model
-for epoch in range(args.epochs):
- tic = time.time()
- train_data.reset()
- metric.reset()
-
- # Create Hook
- hook = create_hook()
- hook.register_hook(model)
-
- for nbatch, batch in enumerate(train_data, start=1):
- hook.set_mode(modes.TRAIN)
- data = batch.data[0].as_in_context(context)
- label = batch.label[0].as_in_context(context)
- with autograd.record():
- output = model(data.astype(args.dtype, copy=False))
- loss = loss_fn(output, label)
- loss.backward()
- trainer.step(args.batch_size)
- metric.update([label], [output])
-
- if nbatch % 100 == 0:
- name, acc = metric.get()
- logging.info("[Epoch %d Batch %d] Training: %s=%f" % (epoch, nbatch, name, acc))
-
- if hvd.rank() == 0:
- elapsed = time.time() - tic
- speed = nbatch * args.batch_size * hvd.size() / elapsed
- logging.info("Epoch[%d]\tSpeed=%.2f samples/s\tTime cost=%f", epoch, speed, elapsed)
-
- # Evaluate model accuracy
- hook.set_mode(modes.EVAL)
- _, train_acc = metric.get()
- name, val_acc = evaluate(model, val_data, context)
- if hvd.rank() == 0:
- logging.info(
- "Epoch[%d]\tTrain: %s=%f\tValidation: %s=%f", epoch, name, train_acc, name, val_acc
- )
-
- if hvd.rank() == 0 and epoch == args.epochs - 1:
-        assert val_acc > 0.96, (
-            "Achieved accuracy (%f) is lower than expected (0.96)" % val_acc
-        )
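
In the deleted script above, `SaveConfig(save_interval=1)` saves tensors at every step. smdebug also supports per-mode save intervals; the following is a minimal sketch under the assumption that `SaveConfigMode` and the `mode_save_configs` parameter behave as in smdebug's documented API (not verified against the exact version pinned here):

```
from smdebug import SaveConfig, SaveConfigMode, modes

# Save every 100th step while training, but every 10th step during evaluation.
save_config = SaveConfig(
    mode_save_configs={
        modes.TRAIN: SaveConfigMode(save_interval=100),
        modes.EVAL: SaveConfigMode(save_interval=10),
    }
)
```
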
diff --git a/examples/mxnet/scripts/mnist_gluon_vg_demo.py b/examples/mxnet/scripts/mxnet_fashionmnist_w_g_b.py
similarity index 88%
rename from examples/mxnet/scripts/mnist_gluon_vg_demo.py
rename to examples/mxnet/scripts/mxnet_fashionmnist_w_g_b.py
index 1c8fe2981..becceeedd 100644
--- a/examples/mxnet/scripts/mnist_gluon_vg_demo.py
+++ b/examples/mxnet/scripts/mxnet_fashionmnist_w_g_b.py
@@ -1,7 +1,14 @@
+"""
+This script is a simple FashionMNIST training script which uses MXNet's Gluon API.
+It has been instrumented with the SageMaker Debugger hook to save tensors during training.
+Here, the hook is created through its constructor so that the script can be run locally for experimentation.
+When you want to run this script in SageMaker, it is recommended to create the hook from a JSON file.
+Please see the scripts in the /examples/tensorflow/sagemaker_byoc or /examples/tensorflow/sagemaker_official_container
+folder, depending on your use case.
+"""
# Standard Library
import argparse
import random
-import uuid
# Third Party
import mxnet as mx
@@ -17,18 +24,7 @@ def parse_args():
parser = argparse.ArgumentParser(
description="Train a mxnet gluon model for FashionMNIST dataset"
)
- parser.add_argument(
- "--output-uri",
- type=str,
- default=f"s3://smdebug-testing/outputs/vg-demo-{uuid.uuid4()}",
- help="S3 URI of the bucket where tensor data will be stored.",
- )
- parser.add_argument(
- "--smdebug_path",
- type=str,
- default=None,
- help="S3 URI of the bucket where tensor data will be stored.",
- )
+ parser.add_argument("--output-uri", type=str, help="Folder where tensor data will be stored.")
parser.add_argument("--random_seed", type=bool, default=False)
parser.add_argument(
"--num_steps",
@@ -135,7 +131,7 @@ def create_gluon_model():
# Create a hook. The initialization of hook determines which tensors
# are logged while training is in progress.
-# Following function shows the default initialization that enables logging of
+# The following function shows the initialization that enables logging of
# weights, biases and gradients in the model.
def create_hook(output_uri, save_frequency):
     # With the following SaveConfig, we will save tensors every 100 steps (save_interval=100).
@@ -166,7 +162,7 @@ def main():
# Create a hook for logging the desired tensors.
     # The output_uri is the URI where the tensors will be saved. It can be local or s3://bucket/prefix
- output_uri = opt.smdebug_path if opt.smdebug_path is not None else opt.output_uri
+ output_uri = opt.output_uri
hook = create_hook(output_uri, opt.save_frequency)
# Register the hook to the top block.
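
The docstring above recommends creating the hook from a JSON file when running in SageMaker. Here is a minimal sketch of that pattern, assuming smdebug's `create_from_json_file()` classmethod and the config file SageMaker provides at runtime (illustrative, not verified against a specific smdebug version):

```
from mxnet import gluon

from smdebug.mxnet import Hook

net = gluon.nn.Dense(10)  # placeholder block; use your real model here

# With no argument, the JSON config path is resolved from the environment;
# in SageMaker it is /opt/ml/input/config/debughookconfig.json.
hook = Hook.create_from_json_file()
hook.register_hook(net)
```
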
diff --git a/examples/pytorch/README.md b/examples/pytorch/README.md
index 9818d371a..a282a073a 100644
--- a/examples/pytorch/README.md
+++ b/examples/pytorch/README.md
@@ -1,2 +1,22 @@
-## Example Notebooks
+# Examples
+## Example notebooks
Please refer to the example notebooks in [Amazon SageMaker Examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger)
+
+## Example scripts
+The above notebooks come with example scripts which can be used through SageMaker. More example scripts are available in [scripts/](scripts/).
+
+## Example configurations for saving tensors through [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
+Example configurations for saving tensors through the hook are available at [docs/sagemaker.md](../docs/sagemaker.md)
+
+## Example configurations for running rules through [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
+Example configurations for running rules are available at [docs/sagemaker.md](../docs/sagemaker.md)
+
+## Example for running rule locally
+
+```
+from smdebug.rules import invoke_rule
+from smdebug.trials import create_trial
+trial = create_trial('s3://bucket/prefix')
+rule_obj = CustomRule(trial, param=value)  # CustomRule: a user-defined Rule subclass (see sketch below)
+invoke_rule(rule_obj, start_step=0, end_step=10)
+```
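
The snippet above uses a `CustomRule` that is never defined there. A hedged sketch of what such a rule could look like, following smdebug's documented custom-rule pattern (`Rule` base class with an `invoke_at_step` method); the collection name and threshold are illustrative:

```
from smdebug.rules import Rule


class CustomRule(Rule):
    def __init__(self, base_trial, threshold=10.0):
        super().__init__(base_trial)
        self.threshold = float(threshold)

    def invoke_at_step(self, step):
        # Return True (condition met) when any gradient's mean absolute
        # value exceeds the threshold at this step.
        for tname in self.base_trial.tensor_names(collection="gradients"):
            t = self.base_trial.tensor(tname)
            if t.reduction_value(step, "mean", abs=True) > self.threshold:
                return True
        return False
```

When `invoke_at_step` returns True, `invoke_rule` stops with a `RuleEvaluationConditionMet` exception, the same signal SageMaker uses to mark a rule as triggered.
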
diff --git a/examples/tensorflow/README.md b/examples/tensorflow/README.md
index 6a93b72df..a282a073a 100644
--- a/examples/tensorflow/README.md
+++ b/examples/tensorflow/README.md
@@ -5,10 +5,10 @@ Please refer to the example notebooks in [Amazon SageMaker Examples repository](
## Example scripts
 The above notebooks come with example scripts which can be used through SageMaker. More example scripts are available in [scripts/](scripts/).
-## Example configurations for saving tensors through SageMaker pySDK
+## Example configurations for saving tensors through [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
Example configurations for saving tensors through the hook are available at [docs/sagemaker.md](../docs/sagemaker.md)
-## Example configurations for running rules through SageMaker pySDK
+## Example configurations for running rules through [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
 Example configurations for running rules are available at [docs/sagemaker.md](../docs/sagemaker.md)
## Example for running rule locally
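
For orientation, here is a hedged sketch of such a configuration using the SageMaker Python SDK (v2 parameter names assumed; the role, bucket, instance type, and framework version are placeholders — docs/sagemaker.md has the authoritative examples):

```
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs
from sagemaker.mxnet import MXNet

hook_config = DebuggerHookConfig(
    s3_output_path="s3://bucket/prefix",  # placeholder output location
    collection_configs=[
        CollectionConfig(name="gradients", parameters={"save_interval": "100"}),
    ],
)

estimator = MXNet(
    entry_point="mxnet_fashionmnist_w_g_b.py",
    role="SageMakerRole",        # placeholder IAM role
    framework_version="1.6.0",   # placeholder framework version
    py_version="py3",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    debugger_hook_config=hook_config,
    rules=[Rule.sagemaker(rule_configs.vanishing_gradient())],
)
estimator.fit()  # tensors and rule statuses become available while the job runs
```
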
diff --git a/tests/mxnet/test_training_end.py b/tests/mxnet/test_training_end.py
index 1520d1381..7bab0da0d 100644
--- a/tests/mxnet/test_training_end.py
+++ b/tests/mxnet/test_training_end.py
@@ -20,7 +20,7 @@ def test_end_local_training():
subprocess.check_call(
[
sys.executable,
- "examples/mxnet/scripts/mnist_gluon_basic_hook_demo.py",
+ "tests/resources/mxnet/mnist_gluon_basic_hook_demo.py",
"--output-uri",
out_dir,
"--num_steps",
@@ -41,7 +41,7 @@ def test_end_s3_training():
subprocess.check_call(
[
sys.executable,
- "examples/mxnet/scripts/mnist_gluon_basic_hook_demo.py",
+ "tests/resources/mxnet/mnist_gluon_basic_hook_demo.py",
"--output-uri",
out_dir,
"--num_steps",
diff --git a/examples/mxnet/scripts/mnist_gluon_basic_hook_demo.py b/tests/resources/mxnet/mnist_gluon_basic_hook_demo.py
similarity index 100%
rename from examples/mxnet/scripts/mnist_gluon_basic_hook_demo.py
rename to tests/resources/mxnet/mnist_gluon_basic_hook_demo.py