22 changes: 11 additions & 11 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ information to effectively respond to your bug report or contribution.

We welcome you to use the GitHub issue tracker to report bugs or suggest features.

When filing an issue, please check [existing open](https://github.com/awslabs/tornasole_core/issues), or [recently closed](https://github.com/awslabs/tornasole_core/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aclosed%20), issues to make sure somebody else hasn't already
When filing an issue, please check [existing open](https://github.com/awslabs/sagemaker-debugger/issues), or [recently closed](https://github.com/awslabs/sagemaker-debugger/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aclosed%20), issues to make sure somebody else hasn't already
reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:

* A reproducible test case or series of steps
Expand Down Expand Up @@ -40,18 +40,18 @@ GitHub provides additional document on [forking a repository](https://help.githu
[creating a pull request](https://help.github.com/articles/creating-a-pull-request/).


## Developing Tornasole
To develop Tornasole on your machine, here are some tips:
1. Uninstall all existing Tornasole installs:
## Developing SageMaker Debugger
To develop on your machine, here are some tips:
1. Remove any existing installation:
```
pip uninstall tornasole
pip uninstall smdebug
```
2. Clone a copy of Tornasole from source:
2. Clone the package from source:
```
git clone https://github.com/awslabs/tornasole_core
cd tornasole_core
git clone https://github.com/awslabs/sagemaker-debugger
cd sagemaker-debugger
```
3. Install Tornasole in `develop` mode:
3. Install in `develop` mode:
```
python setup.py develop
```
Expand All @@ -62,7 +62,7 @@ pre-commit install
```

## Finding contributions to work on
Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any ['help wanted'](https://github.com/awslabs/tornasole_core/labels/help%20wanted) issues is a great place to start.
Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any ['help wanted'](https://github.com/awslabs/sagemaker-debugger/labels/help%20wanted) issues is a great place to start.


## Code of Conduct
Expand All @@ -77,6 +77,6 @@ If you discover a potential security issue in this project we ask that you notif

## Licensing

See the [LICENSE](https://github.com/awslabs/tornasole_core/blob/master/LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.

We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes.
2 changes: 1 addition & 1 deletion config/buildspec.yml
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ phases:
- cd $CODEBUILD_SRC_DIR && chmod +x config/protoc_downloader.sh && ./config/protoc_downloader.sh
- pip install -U pip
- pip install -q pytest wheel pyYaml pytest-html keras==2.3.1 tensorflow==1.15.0 mxnet torch xgboost pre-commit tensorflow_datasets
- pip uninstall -y boto3 awscli botocore
- pip uninstall -y boto3 botocore

pre_build:
commands:
Expand Down
6 changes: 3 additions & 3 deletions config/tests.sh
Original file line number Diff line number Diff line change
Expand Up @@ -23,8 +23,8 @@ export SMDEBUG_LOG_LEVEL=info

export OUT_DIR=upload/$CURRENT_COMMIT_PATH
export REPORT_DIR=$OUT_DIR/pytest_reports
python -m pytest -W=ignore --html=$REPORT_DIR/report_analysis.html --self-contained-html tests/analysis
python -m pytest -W=ignore --html=$REPORT_DIR/report_core.html --self-contained-html tests/core
python -m pytest -v -W=ignore --html=$REPORT_DIR/report_analysis.html --self-contained-html tests/analysis
python -m pytest -v -W=ignore --html=$REPORT_DIR/report_core.html --self-contained-html tests/core

if [ "$run_pytest_xgboost" = "enable" ] ; then
run_for_framework xgboost
Expand All @@ -45,7 +45,7 @@ fi
check_logs $REPORT_DIR/*

# Only look at newly added files
if [ -n "$(git status --porcelain | grep ^?? | grep -v tornasolecodebuildtest | grep -v upload)" ]; then
if [ -n "$(git status --porcelain | grep ^?? | grep -v smdebugcodebuildtest | grep -v upload)" ]; then
echo "ERROR: Test artifacts were created. Please place these in /tmp."
exit 1
fi
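
The untracked-file check in `config/tests.sh` can be hard to read as a grep pipeline. Below is a minimal Python sketch of the same filtering logic, assuming the `git status --porcelain` output is already available as a string. The function name `new_test_artifacts` and the sample input are hypothetical and are not part of this repository.

```python
def new_test_artifacts(porcelain_output, ignored=("smdebugcodebuildtest", "upload")):
    """Return untracked paths from `git status --porcelain` text, minus ignored ones."""
    artifacts = []
    for line in porcelain_output.splitlines():
        # Untracked entries start with "??" in the porcelain format.
        if not line.startswith("??"):
            continue
        # Mirror `grep -v`: drop any line containing an ignored substring.
        if any(name in line for name in ignored):
            continue
        # Strip the "?? " prefix to keep only the path.
        artifacts.append(line[3:])
    return artifacts


# Hypothetical sample: one modified file, three untracked entries.
sample = "\n".join(
    [
        " M config/tests.sh",
        "?? upload/abc/report.html",
        "?? stray_output/",
        "?? smdebugcodebuildtest/x",
    ]
)
print(new_test_artifacts(sample))
```

A non-empty result corresponds to the branch of the script that prints the error and exits with status 1.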
10 changes: 5 additions & 5 deletions config/upload_on_end.sh
Original file line number Diff line number Diff line change
Expand Up @@ -4,27 +4,27 @@ cat $CODEBUILD_SRC_DIR/upload/$CURRENT_COMMIT_PATH/pytest_reports/*.html >> $COD
upload_dirs() {
for var in "$@"
do
aws s3 sync --quiet $CODEBUILD_SRC_DIR/upload/$CURRENT_COMMIT_PATH/$var s3://tornasolecodebuildtest/$CURRENT_COMMIT_PATH/$var
aws s3 sync --quiet $CODEBUILD_SRC_DIR/upload/$CURRENT_COMMIT_PATH/$var s3://smdebugcodebuildtest/$CURRENT_COMMIT_PATH/$var
done
}

del_dirs() {
for var in "$@"
do
aws s3 rm --recursive --quiet s3://tornasolecodebuildtest/$CURRENT_COMMIT_PATH/$var
aws s3 rm --recursive --quiet s3://smdebugcodebuildtest/$CURRENT_COMMIT_PATH/$var
done
}

PR_ID=$(echo $CODEBUILD_WEBHOOK_TRIGGER | cut -d '/' -f 2-)
export GITHUB_PR_URL=https://github.com/awslabs/$CURRENT_REPO_NAME/pull/$PR_ID
export S3_TEST_REPORT_URL=https://s3.console.aws.amazon.com/s3/object/tornasolecodebuildtest/$CURRENT_COMMIT_PATH/pytest_reports/all_tests.html?region=us-west-1
export S3_TEST_REPORT_URL=https://s3.console.aws.amazon.com/s3/object/smdebugcodebuildtest/$CURRENT_COMMIT_PATH/pytest_reports/all_tests.html?region=us-west-1

if [ $CODEBUILD_BUILD_SUCCEEDING -eq 0 ]
then
upload_dirs local_trials integration_tests_logs pytest_reports
echo "ERROR BUILD FAILED , ACCESS BUILD LOGS THROUGH GITHUB OR TROUGH THE LINK PR:$GITHUB_PR_URL . CODEBUILD:$CODEBUILD_BUILD_URL . Test logs are on S3 here:$S3_TEST_REPORT_URL"
echo "ERROR BUILD FAILED , ACCESS BUILD LOGS THROUGH GITHUB OR THROUGH THE LINK PR: $GITHUB_PR_URL . CODEBUILD: $CODEBUILD_BUILD_URL . Test logs are on S3 here: $S3_TEST_REPORT_URL"
else
del_dirs s3_trials
upload_dirs integration_tests_logs pytest_reports wheels
echo "INFO BUILD SUCCEEDED !!! , ACCESS BUILD LOGS THROUGH GITHUB OR TROUGH THE LINK PR:$GITHUB_PR_URL . CODEBUILD:$CODEBUILD_BUILD_URL. Test logs are on S3 here:$S3_TEST_REPORT_URL"
echo "INFO BUILD SUCCEEDED!!! , ACCESS BUILD LOGS THROUGH GITHUB OR THROUGH THE LINK PR: $GITHUB_PR_URL . CODEBUILD: $CODEBUILD_BUILD_URL . Test logs are on S3 here: $S3_TEST_REPORT_URL"
fi
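
The `PR_ID` extraction in `config/upload_on_end.sh` relies on `cut -d '/' -f 2-`, which keeps everything after the first `/` of the webhook trigger. The sketch below reproduces that behavior in Python, assuming trigger values of the form `pr/<number>`. The helper names `pr_id_from_trigger` and `pr_url` are hypothetical and only illustrate the string handling.

```python
def pr_id_from_trigger(trigger):
    """Emulate `cut -d '/' -f 2-`: return everything after the first '/'."""
    _, sep, rest = trigger.partition("/")
    # Like cut without -s, a value containing no delimiter is returned unchanged.
    return rest if sep else trigger


def pr_url(repo_name, trigger):
    """Build the pull request URL the same way the script builds GITHUB_PR_URL."""
    return f"https://github.com/awslabs/{repo_name}/pull/{pr_id_from_trigger(trigger)}"


print(pr_url("sagemaker-debugger", "pr/42"))
```

Note that a trigger that is not a pull request (for example a branch trigger) still yields a string, so the resulting URL is only meaningful for `pr/...` values.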
2 changes: 1 addition & 1 deletion docs/api.md
Original file line number Diff line number Diff line change
Expand Up @@ -310,7 +310,7 @@ Sample JSON file:
In SageMaker environment, the presence of this JSON is necessary to log any Tensorboard artifact.
By default, this path is set to point to a pre-defined location in SageMaker.

tensorboard_dir can also be passed while creating the hook [Creating a hook](###Hook from Python) using the API or
tensorboard_dir can also be passed while creating the hook using the API or
in the JSON specified in SMDEBUG_CONFIG_FILE_PATH. For this, export_tensorboard should be set to True.
This option to set tensorboard_dir is available in both, SageMaker and non-SageMaker environments.

Expand Down
11 changes: 4 additions & 7 deletions examples/mxnet/notebooks/mxnet-tensor-plot.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -18,11 +18,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Tornasole is a new capability of Amazon SageMaker that allows debugging machine learning models. \n",
"SageMaker Debugger is a new capability of Amazon SageMaker that allows debugging machine learning models. \n",
"It lets you go beyond just looking at scalars like losses and accuracies during training and gives \n",
"you full visibility into all the tensors 'flowing through the graph' during training. Tornasole helps you to monitor your training in near real time using rules and would provide you alerts, once it has detected an inconsistency in the training flow.\n",
"you full visibility into all the tensors 'flowing through the graph' during training. SageMaker Debugger helps you to monitor your training in near real time using rules and would provide you alerts, once it has detected an inconsistency in the training flow.\n",
"\n",
"Using Tornasole is a two step process: Saving tensors and Analysis. In this notebook we will run an MXNet training job and configure Tornasole to store all tensors from this job. Afterwards we will visualize those tensors in our notebook.\n"
"Using SageMaker Debugger is a two step process: Saving tensors and Analysis. In this notebook we will run an MXNet training job and configure SageMaker Debugger to store all tensors from this job. Afterwards we will visualize those tensors in our notebook.\n"
]
},
{
Expand Down Expand Up @@ -51,7 +51,7 @@
"\n",
"Now we'll call the Sagemaker MXNet Estimator to kick off a training job along with the VanishingGradient rule to monitor the job.\n",
"\n",
"The 'entry_point_script' points to the MXNet training script that has the TornasoleHook integrated.\n"
"The 'entry_point_script' points to the MXNet training script that has the SageMaker DebuggerHook integrated.\n"
]
},
{
Expand All @@ -78,13 +78,10 @@
"REGION='us-west-2'\n",
"TAG='latest'\n",
"\n",
"docker_image_name= '072677473360.dkr.ecr.{}.amazonaws.com/tornasole-preprod-mxnet-1.4.1-cpu:{}'.format(REGION, TAG)\n",
"\n",
"estimator = MXNet(role=sagemaker.get_execution_role(),\n",
" base_job_name='mxnet-trsl-test-nb',\n",
" train_instance_count=1,\n",
" train_instance_type='ml.m4.xlarge',\n",
" image_name=docker_image_name,\n",
" entry_point=entry_point_script,\n",
" framework_version='1.4.1',\n",
" debug=True,\n",
Expand Down
3 changes: 2 additions & 1 deletion examples/mxnet/scripts/mnist_gluon_all_zero_demo.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
import argparse
import random
import time
import uuid

# Third Party
import mxnet as mx
Expand All @@ -22,7 +23,7 @@ def parse_args():
parser.add_argument(
"--smdebug_path",
type=str,
default="s3://tornasole-testing/all-zero-hook/trial-3",
default=f"s3://smdebug-testing/outputs/all-zero-hook/trial-{uuid.uuid4()}",
help="S3 URI of the bucket where tensor data will be stored.",
)
parser.add_argument("--learning_rate", type=float, default=0.1)
Expand Down
3 changes: 2 additions & 1 deletion examples/mxnet/scripts/mnist_gluon_basic_hook_demo.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
import argparse
import random
import time
import uuid

# Third Party
import mxnet as mx
Expand All @@ -22,7 +23,7 @@ def parse_args():
parser.add_argument(
"--output-uri",
type=str,
default="s3://tornasole-testing/basic-mxnet-hook",
default=f"s3://smdebug-testing/outputs/basic-mxnet-hook-{uuid.uuid4()}",
help="S3 URI of the bucket where tensor data will be stored.",
)
parser.add_argument(
Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# Standard Library
import argparse
import time
import uuid

# Third Party
import mxnet as mx
Expand All @@ -20,7 +21,7 @@ def parse_args():
parser.add_argument(
"--output-s3-uri",
type=str,
default="s3://tornasole-testing/block-io-mxnet-hook",
default=f"s3://smdebug-testing/outputs/block-io-mxnet-hook-{uuid.uuid4()}",
help="S3 URI of the bucket where tensor data will be stored.",
)
parser.add_argument(
Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# Standard Library
import argparse
import time
import uuid

# Third Party
import mxnet as mx
Expand All @@ -20,7 +21,7 @@ def parse_args():
parser.add_argument(
"--output-s3-uri",
type=str,
default="s3://tornasole-testing/model-io-mxnet-hook",
default=f"s3://smdebug-testing/outputs/model-io-mxnet-hook-{uuid.uuid4()}",
help="S3 URI of the bucket where tensor data will be stored.",
)
parser.add_argument(
Expand Down
3 changes: 2 additions & 1 deletion examples/mxnet/scripts/mnist_gluon_save_all_demo.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# Standard Library
import argparse
import time
import uuid

# Third Party
import mxnet as mx
Expand All @@ -20,7 +21,7 @@ def parse_args():
parser.add_argument(
"--output-s3-uri",
type=str,
default="s3://tornasole-testing/saveall-mxnet-hook",
default=f"s3://smdebug-testing/outputs/saveall-mxnet-hook-{uuid.uuid4()}",
help="S3 URI of the bucket where tensor data will be stored.",
)
parser.add_argument(
Expand Down
3 changes: 2 additions & 1 deletion examples/mxnet/scripts/mnist_gluon_vg_demo.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# Standard Library
import argparse
import random
import uuid

# Third Party
import mxnet as mx
Expand All @@ -19,7 +20,7 @@ def parse_args():
parser.add_argument(
"--output-uri",
type=str,
default="s3://tornasole-testing/vg-demo",
default=f"s3://smdebug-testing/outputs/vg-demo-{uuid.uuid4()}",
help="S3 URI of the bucket where tensor data will be stored.",
)
parser.add_argument(
Expand Down
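
The example scripts above switch their default output location to an f-string that embeds `uuid.uuid4()`. A minimal sketch of the effect, assuming a standalone parser: the `example-hook` prefix and the `build_parser` helper are hypothetical and do not appear in the scripts.

```python
import argparse
import uuid


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--output-uri",
        type=str,
        # The f-string is evaluated once, when the parser is built,
        # so each process gets a single fresh suffix.
        default=f"s3://smdebug-testing/outputs/example-hook-{uuid.uuid4()}",
        help="S3 URI where tensor data will be stored.",
    )
    return parser


# Two separately built parsers stand in for two separate runs of a script.
first = build_parser().parse_args([]).output_uri
second = build_parser().parse_args([]).output_uri
explicit = build_parser().parse_args(["--output-uri", "/tmp/run"]).output_uri

print(first != second)  # distinct defaults per run
print(explicit)         # an explicit value still overrides the default
```

The design choice here is that repeated runs no longer write into the same fixed prefix, while a caller who wants a stable location can still pass one explicitly.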
2 changes: 1 addition & 1 deletion examples/mxnet/scripts/mnist_mxnet.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ def parse_args():
parser.add_argument(
"--output-uri",
type=str,
default="/opt/ml/output/tensors/tornasole",
default="/opt/ml/output/tensors/smdebug",
help="S3 URI of the bucket where tensor data will be stored.",
)
parser.add_argument("--learning_rate", type=float, default=0.1)
Expand Down
2 changes: 1 addition & 1 deletion examples/mxnet/scripts/mnist_mxnet_hvd.py
Original file line number Diff line number Diff line change
Expand Up @@ -162,7 +162,7 @@ def create_hook():
train_data.reset()
metric.reset()

# Create Tornasole Hook
# Create Hook
hook = create_hook()
hook.register_hook(model)

Expand Down
2 changes: 1 addition & 1 deletion examples/pytorch/scripts/pytorch_hook_demos.py
Original file line number Diff line number Diff line change
Expand Up @@ -158,7 +158,7 @@ def main():
"--output-uri",
type=str,
help="output directory to save data in",
default="./tornasole-testing/demo/",
default="/tmp/testing/demo/",
)
parser.add_argument(
"--hook-type",
Expand Down
6 changes: 3 additions & 3 deletions smdebug/core/hook.py
Original file line number Diff line number Diff line change
Expand Up @@ -83,7 +83,7 @@ def __init__(
Attributes
----------
out_dir : str
represents a path into which tornasole outputs will be written to
represents a path into which outputs will be written
dry_run : bool
when dry run is set, behavior is only described in the log file.
tensors are not actually saved.
Expand Down Expand Up @@ -196,7 +196,7 @@ def __init__(
self.logger.info("Saving to {}".format(self.out_dir))
atexit.register(self._cleanup)

# Check if there is any last saved tornasole state. Initialize the hook based last saved state.
# Check if there is any last saved state. Initialize the hook based on the last saved state.
self.training_run = 0
self._initialize_to_last_saved_state()

Expand Down Expand Up @@ -633,7 +633,7 @@ def _save_for_tensor(self, tensor_name, tensor_value, check_before_write=True):
called if tensor should not be saved for this step.
:param tensor_name: str
The name of tensor. In TensorFlow's case, this is graph name of tensor
and will be converted to Tornasole name in write_for_tensor.
and will be converted to internal name in write_for_tensor.
:param tensor_value: dtype is tensor class of corresponding framework
value of the tensor to be saved
:param check_before_write: bool
Expand Down
20 changes: 10 additions & 10 deletions smdebug/core/json_config.py
Original file line number Diff line number Diff line change
Expand Up @@ -118,36 +118,36 @@ def collect_hook_config_params(params_dict) -> Dict:
# Build params dictionary from the json file

# Declare defaults
tornasole_params_dict = {
parsed_params_dict = {
CONFIG_RDN_CFG_KEY: None,
CONFIG_REDUCTION_CONFIGS_KEY: {},
CONFIG_SAVE_CONFIGS_KEY: {},
CONFIG_INCLUDE_REGEX_KEY: None,
}
# Set top-level path parameters
# SageMaker doesn't have any way to specify this for now, so default to using their path
tornasole_params_dict["out_dir"] = params_dict.get(CONFIG_OUTDIR_KEY, DEFAULT_SAGEMAKER_OUTDIR)
parsed_params_dict["out_dir"] = params_dict.get(CONFIG_OUTDIR_KEY, DEFAULT_SAGEMAKER_OUTDIR)

# Get the main HookParameters; pass these as defaults
hook_params = params_dict.get(CONFIG_HOOK_PARAMS_KEY, {})
# If we have {"HookParameters": null}, replace null with {}.
hook_params = {} if hook_params is None else hook_params
base_config_modes = parse_save_config_modes_dict(params=hook_params)
tornasole_params_dict["save_config_modes"] = base_config_modes
parsed_params_dict["save_config_modes"] = base_config_modes
# If we pass reduction=None, then the full tensor is saved by default
if "reductions" in hook_params:
tornasole_params_dict[CONFIG_RDN_CFG_KEY] = ReductionConfig.from_dict(hook_params)
parsed_params_dict[CONFIG_RDN_CFG_KEY] = ReductionConfig.from_dict(hook_params)
if "save_all" in hook_params:
tornasole_params_dict[CONFIG_SAVE_ALL_KEY] = parse_bool(hook_params["save_all"], False)
parsed_params_dict[CONFIG_SAVE_ALL_KEY] = parse_bool(hook_params["save_all"], False)
if "include_regex" in hook_params:
tornasole_params_dict[CONFIG_INCLUDE_REGEX_KEY] = split(hook_params["include_regex"])
parsed_params_dict[CONFIG_INCLUDE_REGEX_KEY] = split(hook_params["include_regex"])
if CONFIG_INCLUDE_WORKERS_KEY in hook_params:
tornasole_params_dict[CONFIG_INCLUDE_WORKERS_KEY] = hook_params[CONFIG_INCLUDE_WORKERS_KEY]
tornasole_params_dict[EXPORT_TENSORBOARD_KEY] = parse_bool(
parsed_params_dict[CONFIG_INCLUDE_WORKERS_KEY] = hook_params[CONFIG_INCLUDE_WORKERS_KEY]
parsed_params_dict[EXPORT_TENSORBOARD_KEY] = parse_bool(
hook_params.get(EXPORT_TENSORBOARD_KEY, False), False
)
tornasole_params_dict[TENSORBOARD_DIR_KEY] = hook_params.get(TENSORBOARD_DIR_KEY, None)
return tornasole_params_dict
parsed_params_dict[TENSORBOARD_DIR_KEY] = hook_params.get(TENSORBOARD_DIR_KEY, None)
return parsed_params_dict


def get_include_collections(params_dict):
Expand Down
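
The renamed `parsed_params_dict` in `collect_hook_config_params` follows a common pattern: declare defaults, normalize a possibly-null `HookParameters` entry, then copy over only the keys that are present. The sketch below is a simplified, self-contained stand-in for that pattern and is not the library's implementation. The key string `"LocalPath"`, the `DEFAULT_OUTDIR` value, and the function name `collect_params` are assumptions made for illustration.

```python
DEFAULT_OUTDIR = "/opt/ml/output/tensors"  # assumed placeholder default


def collect_params(params_dict):
    """Simplified illustration of the defaults-then-override parsing pattern."""
    # Declare defaults first so every caller sees the same keys.
    parsed = {"include_regex": None}
    # Fall back to the default output directory when none is given.
    parsed["out_dir"] = params_dict.get("LocalPath", DEFAULT_OUTDIR)

    hook_params = params_dict.get("HookParameters", {})
    # {"HookParameters": null} in JSON arrives as None; treat it as empty.
    hook_params = {} if hook_params is None else hook_params

    # Only copy over options that were actually specified.
    if "save_all" in hook_params:
        parsed["save_all"] = str(hook_params["save_all"]).lower() == "true"
    if "include_regex" in hook_params:
        parsed["include_regex"] = hook_params["include_regex"].split(",")
    return parsed


print(collect_params({"HookParameters": None}))
print(collect_params({"LocalPath": "/tmp/out", "HookParameters": {"save_all": "True"}}))
```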