diff --git a/docs/rules/README.md b/docs/rules/README.md deleted file mode 100644 index 44c28e3ef..000000000 --- a/docs/rules/README.md +++ /dev/null @@ -1,483 +0,0 @@ -# Tornasole Analysis -Tornasole is an upcoming AWS service designed to be a debugger for machine learning models. -It lets you go beyond just looking at scalars like losses and accuracies during training and gives you -full visibility into all tensors 'flowing through the graph' during training or inference. - -Tornasole's analysis module helps you analyze tensors saved from machine learning jobs. -It allows you to run Rules on these tensors as well as anything else you might want to do with -access to raw tensors such as inspection or visualization. It provides access to the tensors in the form of numpy arrays. - -## The Programming Model -The library is organized using the following constructs. - -### Trial -Trial the construct which lets you query for tensors for a given Tornasole run, specified by the path in which Tornasole artifacts are being saved or were saved. -You can pass a path which holds data for a past run (which has ended) as well as a path for a current run (to which tensors are being written). -Trial is capable of loading new tensors as and when they become available at the given location. - -There are two types of trials you can create: LocalTrial or S3Trial. -We provide a wrapper method to create the appropriate trial. - -The parameters you have to provide are: -- `name`: name can be any string. It is to help you manage different trials. -Make sure to give it a unique name to prevent confusion. -- `path`: path can be a local path or an S3 path of the form `s3://bucket/prefix`. This path should be where Tornasole hooks (TF or MXNet) save data to. -You should see the directory `events` and the file `collections.json` in this path. 
- -##### Creating local trial -``` -from smdebug.trials import create_trial -trial = create_trial(path='/home/ubuntu/tornasole_outputs/train', - name='resnet_training_run') -``` -##### Creating S3 trial -``` -from smdebug.trials import create_trial -trial = create_trial(path='s3://tornasole-testing-bucket/outputs/resnet', - name='resnet_training_run') -``` -###### Restricting analysis to a range of steps -To any of these methods you can optionally pass `range_steps` to restrict your analysis to a certain range of steps. -Note that if you do so, Trial will not load data from other steps. - -*Examples* -- `range_steps=(100, None)`: This will load all steps after 100 -- `range_steps=(None, 100)`: This will load all steps before 100 -- `range_steps=(100, 200)` : This will load steps between 100 and 200 -- `range_steps=None`: This will load all steps - -``` -lt = create_trial(path='ts_outputs/resnet', name='resnet_training', - range_steps=(100, 200)) -``` - - -### Mode -A machine learning job can be executing steps in multiple modes, such as training, evaluating, or predicting. -Tornasole provides you the construct of a `mode` to keep data from these modes separate -and make it easy for analysis. To leverage this functionality you have to -call the `set_mode` function of hook such as the following call `hook.set_mode(modes.TRAIN)`. -The different modes available are `modes.TRAIN`, `modes.EVAL` and `modes.PREDICT`. - -When you set a mode, steps in that mode have a sequence. We refer to these numbers -as `mode_step`. Each `mode_step` has a global step number associated with it, which represents the -sequence of steps across all modes executed by the job. - -For example, your job executes 10 steps, out of which the first 4 are training steps, 5th is evaluation step, 6-9 are training steps, and 10th is evaluation step. -Please note that indexing starts from 0. 
-In such a case, when you query for the global steps as below: -``` -trial.steps() -``` -you will see `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`. - -If you query for training steps as below: -``` -from tornasole_rules import modes -trial.steps(modes.TRAIN) -``` - you will see `[0, 1, 2, 3, 4, 5, 6, 7, 8]` because there were 8 training step. -The training step with mode_step 4 here refers to the global step number 5. -You can query this as follows: -``` -trial.global_step(mode=modes.TRAIN, mode_step=4) -``` - -If you did not explicitly set a mode during the running of the job, -the steps are global steps, and are in the `modes.GLOBAL` mode. -In such a case, `global_step` is the same as `mode_step` where mode is `modes.GLOBAL`. - -Below, we describe the above functions and others that the Trial API provides. - -#### Trial API -Once you have a trial object you can do the following - -**See names of all tensors available** - -``` -trial.tensors() -``` -This returns tensors seen for any mode if mode was set during the machine learning job. - -**See all steps seen by the Trial for a particular mode** - -The returned list is the step number within that mode. -Each of these mode steps has a global step number associated with it. -The global step represents the sequence of steps across all modes executed by the job. - -``` -from smdebug import modes -trial.steps(mode=modes.TRAIN) -``` - -**See all global steps seen by the Trial** - -This is the list of steps across all modes. 
- -``` -trial.steps() -``` - -**Get the mode and step number within mode for a given global step** - -You can get the `mode` of `global_step` 100 as follows: - -``` -mode = trial.mode(global_step=100) -``` - -You can get the `mode_step` for `global_step` 100 as follows: - -``` -mode_step = trial.mode_step(global_step=100) -``` - -**Know the global step number for a given mode step** - -``` -from smdebug import modes -global_step_num = trial.global_step(modes.TRAIN, mode_step=10) -``` - -**See all modes for which the trial has data** - -``` -trial.modes() -``` - -**Access a particular tensor** - -A tensor is identified by a string which represents its name. - -``` -trial.tensor('relu_activation:0') -``` - -**See the global steps for which tensor's value was saved** - -``` -trial.tensor('relu_activation:0').steps() -``` - -**See the steps for a given mode when tensor's value was saved** - -This returns the mode steps for those steps when this tensor's value was saved for this mode. - -``` -from smdebug import modes -trial.tensor('relu_activation:0').steps(mode=modes.TRAIN) -``` - -**Get the value of the tensor at a global step** - -This returns the tensor value as a numpy array for the 10th global step. - -``` -trial.tensor('relu_activation:0').value(10) -``` - -Please note that this can raise exceptions if the step is not available. -Please see [this section](#when-a-tensor-is-not-available-during-rule-execution) for more details on the different exceptions that can be raised. - -**Get the value of the tensor at a step number for a given mode** - -This returns the tensor value as a numpy array for the 10th training step. - -``` -from smdebug import modes -trial.tensor('relu_activation:0').value(10, mode=modes.TRAIN) -``` - -Please note that this can raise exceptions if the step is not available. -Please see [this section](#when-a-tensor-is-not-available-during-rule-execution) for more details on the different exceptions that can be raised. 
- -**Get reduction value of a tensor at a step** - -Tornasole provides a few reductions out of the box that you can query with the following API. -This below returns the mean of the absolute values at step 10. - -``` -trial.tensor('relu:0').reduction_value(10, 'mean', abs=True) -``` - -The different reductions you can query for are the same as what are allowed in [ReductionConfig](https://github.com/awslabs/tornasole_tf/blob/master/docs/api.md) when saving tensors. -This API thus allows you to access the reduction you might have saved instead of the full tensor. -If you had saved the full tensor, it will calculate the requested reduction now and cache it. - -- `min`, `max`, `mean`, `prod`, `std`, `sum`, `variance` -- `l1`, `l2` norms - -Each of these can be retrieved for the absolute value of the tensor or the original tensor. -Above was an example to get the mean of the absolute value of the tensor. -`abs` can be set to `False` if you want to see the `mean` of the actual tensor. - -*Note that if you had only saved a particular reduction, you will not be able -to access the full tensor value or any other reduction during analysis. -This also applies to the `abs` flag, meaning that if you had saved the -`mean` of `abs` values of the tensor you can not query for the non absolute values mean. -If you do so, Tornasole will return `None`.* - -If you had saved the tensor without any reduction, then you can retrieve the actual tensor -as a numpy array and compute any function you might be interested in. - -Please note that this can raise exceptions if the step is not available. -Please see [this section](#when-a-tensor-is-not-available-during-rule-execution) for more details on the different exceptions that can be raised. - -**Get names of tensors matching regex** - -This method takes a regex pattern or a list of regex patterns. -Each regex pattern is a python style regex pattern string. 
- -``` -trail.tensors_matching_regex(['relu_activation*']) -``` - -**List tensors in a collection** - -This returns names of all tensors saved in a given collection. -`gradients` below is the name of the collection we are interested in. - -``` -trial.tensors_in_collection('gradients') -``` - -**List collections** - -Below returns all collections belonging to the trial as a dictionary. -This dictionary is indexed by the name of the collection, and the value is the collection object. - -``` -trial.collections() -``` - -**Refresh or do not refresh tensors** - -By default Tornasole refreshes tensors each time you try to query the tensor. -It looks for whether this tensor is saved for new steps and if so fetches them. -If you know the saved data will not change (stopped the machine learning job), or -are not interested in the latest data, you can stop the refreshing of tensors as follows: - -`no_refresh` takes a trial or a list of trials, which should not be refreshed. -Anything executed inside the with `no_refresh` block will not be refreshed. - -``` -from smdebug.analysis.utils import no_refresh -with no_refresh(trials): - pass -``` - -Similarly if you want to refresh tensors only within a block, you can do: - -``` -from smdebug.analysis.utils import refresh -with refresh(trials): - pass -``` - -#### When a tensor is not available -Tornasole is designed to be aware that tensors required to execute a rule may not be available at every step. -Hence it raises a few exceptions which allow us to control what happens when a tensor is missing. -These are available in the `smdebug.exceptions` module. You can import them as follows: - -``` -from smdebug.exceptions import * -``` - -Here are the exceptions and their meanings: - -- `TensorUnavailableForStep` : This means that the tensor requested is not available for the step. 
This might mean that -this step might not be saved at all by the hook, or that this step might have saved some tensors but the requested -tensor is not part of them. Note that when you see this exception, it means that this tensor can never become available -for this step in the future. - -- `TensorUnavailable` : This means that this tensor is not being saved or has not been saved by smdebug. This means -that this tensor will never be seen for any step in smdebug. - -- `StepUnavailable`: This means that the step was not saved and Tornasole has no data from the step. - -- `StepNotYetAvailable`: This means that the step has not yet been seen by smdebug. It may be available in the future if the training is still going on. -Tornasole automatically loads new data as and when it becomes available. - -- `NoMoreData` : This will be raised when the training ends. Once you see this, you will know that there will be no more steps and no more tensors saved. - -### Rules -Rules are the medium by which Tornasole executes a certain piece of code regularly on different steps of the jobs. -A rule is assigned to a trial and can be invoked at each new step of the trial. -It can also access other trials for its execution. -You can evaluate a rule using tensors from the current step or any step before the current step. -Please ensure your logic respects these semantics, else you will get a `TensorUnavailableForStep` -exception as the data would not yet be available. - -#### Writing a rule -Writing a rule involves implementing the [Rule interface](../../smdebug/rules/rule.py). - -##### Constructor -Creating a rule involves first inheriting from the base Rule class Tornasole provides. -For this rule here we do not need to look at any other trials, so we set `other_trials` to None. 
- -``` -from smdebug.rules import Rule - -class VanishingGradientRule(Rule): - def __init__(self, base_trial, threshold=0.0000001): - super().__init__(base_trial, other_trials=None) - self.threshold = float(threshold) -``` - -Please note that apart from `base_trial` and `other_trials` (if required), we require all -arguments of the rule constructor to take a string as value. You can parse them to the type -that you want from the string. This means if you want to pass -a list of strings, you might want to pass them as a comma separated string. This restriction is -being enforced so as to let you create and invoke rules from json using Sagemaker's APIs. - -##### Function to invoke at a given step -In this function you can implement the core logic of what you want to do with these tensors. - -It should return a boolean value `True` or `False`. -This can be used to define actions that you might want to take based on the output of the rule. - -A simplified version of the actual invoke function for `VanishingGradientRule` is below: -``` - def invoke_at_step(self, step): - for tensor in self.base_trial.tensors_in_collection('gradients'): - abs_mean = tensor.reduction_value(step, 'mean', abs=True) - if abs_mean < self.threshold: - return True - else: - return False -``` - -##### Optional: RequiredTensors - -This is an optional construct that allows Tornasole to bulk-fetch all tensors that you need to -execute the rule. This helps the rule invocation be more performant so it does not fetch tensor values from S3 one by one. To use this construct, you need to implement a method which lets Tornasole know what tensors you are interested in for invocation at a given step. -This is the `set_required_tensors` method. - -Before we look at how to define this method, let us look at the API for `RequiredTensors` class which -needs to be used by this method. An object of this class is provided as a member of the rule class, so you can access it as `self.req_tensors`. 
- -**[RequiredTensors](../../smdebug/rules/req_tensors.py) API** - -***Adding a required tensor*** -When invoking a rule at a given step, you might require the values of a tensor at certain steps. -This method allows you to specify these steps as required for the tensor. -``` -self.req_tensors.add(name=tname, - steps=[step_num], - trial=None, - should_match_regex=False) -``` - -The arguments are described below: - -- `name`: name of the tensor -- `steps`: list of integers representing global step numbers at which this rule requires the values of this tensor -- `trial`: the trial whose tensor values are required. If this argument is None, it is assumed to -take the value of `self.base_trial` in the rule class. None is the default value for this argument. -- `should_match_regex`: boolean which when True means that the given name is treated as a regex pattern. -In such a case, all tensor names in the trial which match that regex pattern are treated as required -for the invocation of the rule at the given step. - -***Fetching required tensors*** - -If required tensors were added inside `set_required_tensors`, during rule invocation it is -automatically used to fetch all tensors at once by calling `req_tensors.fetch()`. -It can raise the exceptions `TensorUnavailable` and `TensorUnavailableForStep` if the trial does not have that tensor, or if the tensor value is not available for the requested step. - - -If required tensors were added elsewhere, or later, you can call the `req_tensors.fetch()` method -yourself to fetch all tensors at once. - -***Querying required tensors*** - -You can then query the required tensors -*Get names of required tensors* - -This method returns the names of the required tensors for a given trial. -``` -self.req_tensors.get_names(trial=None) -``` -- `trial`: the trial whose required tensors are being queried. If this argument is None, it is assumed to -take the value of `self.base_trial` in the rule class. 
None is the default value for this argument. - -*Get steps for a given required tensor* - -This method returns the steps for which the tensor is required to execute the rule at this step. -``` -self.req_tensors.get_tensor_steps(name, trial=None) -``` -- `trial`: the trial whose required tensors are being queried. If this argument is None, it is assumed to -take the value of `self.base_trial` in the rule class. None is the default value for this argument. - - -*Get required tensors* - -This method returns the list of required tensors for a given trial as `Tensor` objects. -``` -self.req_tensors.get(trial=None) -``` -- `trial`: the trial whose required tensors are being queried. If this argument is None, it is assumed to -take the value of `self.base_trial` in the rule class. None is the default value for this argument. - - -###### Declare required tensors -Here, let us define the `set_required_tensors` method to declare the required tensors -to execute the rule at a given `step`. -If we require the gradients of the base_trial to execute the rule at a given step, -then it would look as follows: -``` - def set_required_tensors(self, step): - for tname in self.base_trial.tensors_in_collection('gradients'): - self.req_tensors.add(tname, steps=[step]) -``` - -This function will be used by the rule execution engine to fetch all the -required tensors before it executes the rule. -The rule invoker executes the `set_required_tensors` and `invoke_at_step` -methods within a single `no_refresh` block, hence you are guaranteed that the -tensor values or steps numbers will stay the same during multiple calls. - -#### Executing a rule -Now that you have written a rule, here's how you can execute it. We provide a function to invoke rules easily. -Refer [smdebug/rules/rule_invoker.py](../../smdebug/rules/rule_invoker.py) -The invoke function has the following syntax. -It takes a instance of a Rule and invokes it for a series of steps one after the other. 
- -``` -from smdebug.rules import invoke_rule -invoke_rule(rule_obj, start_step=0, end_step=None) -``` - -You can invoking the VanishingGradientRule is -``` -trial_obj = create_trial(trial_dir) -vr = VanishingGradientRule(base_trial=trial_obj, threshold=0.0000001) -invoke_rule(vr, start_step=0, end_step=1000) -``` - -For first party Rules (see below) that we provide a rule_invoker module that you can use to run them as follows. You can pass any arguments that the rule takes as command line arguments. - -``` -python -m smdebug.rules.rule_invoker --trial-dir ~/ts_outputs/vanishing_gradients --rule-name VanishingGradient --threshold 0.0000000001 -``` - -``` -python -m smdebug.rules.rule_invoker --trial-dir s3://tornasole-runes/trial0 --rule-name UnchangedTensor --tensor_regex .* --num_steps 10 -``` - -#### First party rules -We provide a few rules which we built. These are supposed to be general purpose rules that you can use easily. -We also hope these serve as examples for you to build your own rules. These are described in [FirstPartyRules](FirstPartyRules.md). - - -## Examples - -We have end-to-end flow example from saving tensors to plotting using saved tensors for [MXNet](../../examples/mxnet/notebooks) and [PyTorch](../../examples/pytorch/notebooks). - - -## ContactUs -We would like to hear from you. If you have any question or feedback, -please reach out to us tornasole-users@amazon.com - -## License -This library is licensed under the Apache 2.0 License. diff --git a/documentation/analysis.md b/documentation/analysis.md index ad807de07..ee557d907 100644 --- a/documentation/analysis.md +++ b/documentation/analysis.md @@ -1,3 +1,532 @@ -# Analysis +# Programming Model for Analysis -TODO: Describe rules and trials. Merge this with Rahul's PR. +This page describes the programming model that SageMaker Debugger provides for your analysis, and introduces you to the constructs of Trial, Tensor and Rule. 
+ +## Table of Contents +* [Trial](#Trial) + * [Path of trial](#Path-of-trial) + * [SageMaker training job](#SageMaker-training-job) + * [Non SageMaker training jobs](#Non-SageMaker-training-jobs) + * [Creating a trial object](#Creating-a-trial-object) + * [Creating S3 trial](#Creating-S3-trial) + * [Creating local trial](#Creating-local-trial) + * [Restricting analysis to a range of steps](#Restricting-analysis-to-a-range-of-steps) + * [Trial API](#Trial-API) + * [tensor_names](#tensor_names) + * [tensor](#tensor) + * [has_tensor](#has_tensor) + * [steps](#steps) + * [modes](#modes) + * [mode](#mode) + * [mode_step](#mode_step) + * [global_step](#global_step) + * [workers](#workers) + * [collections](#collections) + * [collection](#collection) + * [wait\_for\_steps](#wait\_for\_steps) + * [has\_passed\_step](#has\_passed\_step) +* [Tensor](#Tensor-1) + * [Tensor API](#Tensor-API) + * [steps](#steps-1) + * [value](#value) + * [reduction_value](#reduction_value) + * [reduction_values](#reduction_values) + * [values](#values) + * [workers](#workers-1) + * [prev_steps](#prev_steps) +* [Rules](#Rules) + * [Built In Rules](#Built-In-Rules) + * [Writing a custom rule](#Writing-a-custom-rule) + * [Constructor](#Constructor) + * [Function to invoke at a given step](#Function-to-invoke-at-a-given-step) + * [Invoking a rule](#Invoking-a-rule) + * [invoke_rule](#invoke_rule) +* [Exceptions](#Exceptions) +* [Utils](#Utils) + * [Enable or disable refresh of tensors in a trial](#Enable-or-disable-refresh-of-tensors-in-a-trial) + +## Trial +Trial is an object which lets you query for tensors for a given training job, specified by the path where smdebug's artifacts are saved. +Trial is capable of loading new tensors as and when they become available at the given path, allowing you to do both offline as well as realtime analysis. + +### Path of trial +#### SageMaker training job +When running a SageMaker job this path is on S3. 
SageMaker saves data from your training job locally on the training instance first and uploads it to an S3 location in your account. When you start a SageMaker training job with the Python SDK, you can control this path using the parameter `s3_output_path` in the `DebuggerHookConfig` object. This parameter is optional; if you do not pass it, the Python SDK populates a default location for you. If you do pass it, make sure the bucket is in the same region as where the training job is running. If you're not using the Python SDK, set this path through the parameter `S3OutputPath` in the `DebugHookConfig` section of the `CreateTrainingJob` API. SageMaker appends the training job name and `debug-output` to this path to ensure a unique path for each training job.
+
+#### Non SageMaker training jobs
+If you are not running a SageMaker training job, this is the path you pass as `out_dir` when you create an smdebug [`Hook`](hook.md). Just like when creating the hook, you can pass either a local path or an S3 path (of the form `s3://bucket/prefix`).
+
+### Creating a trial object
+There are two types of trials you can create depending on the path: LocalTrial or S3Trial. We provide a wrapper method to create the appropriate trial.
+
+The parameters you have to provide are:
+- `path`: a local path or an S3 path of the form `s3://bucket/prefix`. You should see directories such as `collections`, `events` and `index` at this path once the training job starts.
+- `name`: any string, used to help you manage different trials. This parameter is optional and defaults to the basename of the path if not passed. Please make sure to give it a unique name to prevent confusion.
+
+#### Creating S3 trial
+```python
+from smdebug.trials import create_trial
+trial = create_trial(path='s3://smdebug-testing-bucket/outputs/resnet', name='resnet_training_run')
+```
+
+#### Creating local trial
+```python
+from smdebug.trials import create_trial
+trial = create_trial(path='/home/ubuntu/smdebug_outputs/resnet', name='resnet_training_run')
+```
+
+#### Restricting analysis to a range of steps
+You can optionally pass `range_steps` to restrict your analysis to a certain range of steps.
+Note that if you do so, Trial will not load data from other steps.
+
+*Examples*
+- `range_steps=(100, None)`: This will load all steps after 100
+- `range_steps=(None, 100)`: This will load all steps before 100
+- `range_steps=(100, 200)`: This will load steps between 100 and 200
+- `range_steps=None`: This will load all steps
+
+```python
+from smdebug.trials import create_trial
+tr = create_trial(path='s3://smdebug-testing-bucket/outputs/resnet', name='resnet_training',
+                  range_steps=(100, 200))
+```
+
+### Trial API
+
+Here's a list of methods that the Trial API provides to help you load data for analysis. Please click on a method to see all the parameters it takes and a detailed description. If you are not familiar with SMDebug constructs, you might want to review [this doc](#todo add link) before going through this page.
+
+| Method | Description |
+| ------------- |-------------|
+| [trial.tensor_names()](#tensor_names) | See names of all tensors available |
+| [trial.tensor(name)](#tensor) | Retrieve smdebug Tensor object |
+| [trial.has_tensor(name)](#has_tensor) | Query whether a tensor was saved |
+| [trial.steps()](#steps) | Query steps for which data was saved |
+| [trial.modes()](#modes) | Query modes for which data was saved |
+| [trial.mode(step)](#mode) | Query the mode for a given global step |
+| [trial.global_step(mode, step)](#global_step) | Query global step for a given step and mode |
+| [trial.mode_step(step)](#mode_step) | Query the mode step for a given global step |
+| [trial.workers()](#workers) | Query list of workers from the data saved |
+| [trial.collections()](#collections) | Query list of collections saved from the training job |
+| [trial.collection(name)](#collection) | Retrieve a single collection saved from the training job |
+| [trial.wait\_for\_steps(steps)](#wait_for_steps) | Wait till the requested steps are available |
+| [trial.has\_passed\_step(step)](#has_passed_step) | Query whether the requested step is available |
+
+
+#### tensor_names
+Retrieves names of tensors saved.
+```python
+trial.tensor_names(step=None,
+                   mode=modes.GLOBAL,
+                   regex=None,
+                   collection=None)
+```
+
+###### Arguments
+All arguments to this method are optional, and any arguments you do pass must be passed as keyword arguments.
+
+- `step (int)` If you want to retrieve the list of tensors saved at a particular step, pass the step number as an integer. The step number is interpreted relative to the mode passed below; by default it is treated as a global step.
+- `mode (smdebug.modes enum value)` If you want to retrieve the list of tensors saved for a particular mode, pass the mode here as `smd.modes.TRAIN`, `smd.modes.EVAL`, `smd.modes.PREDICT`, or `smd.modes.GLOBAL`.
+- `regex (str or list[str])` You can filter tensors by passing a regex pattern as a string or a list of strings. Each pattern is a Python-style regex. You can pass only one of the `regex` and `collection` parameters.
+- `collection (Collection or str)` You can filter tensors belonging to a collection by passing either a collection object or the name of a collection as a string. You can pass only one of the `regex` and `collection` parameters.
+
+###### Returns
+`list[str]`: Names of tensors matching all the given arguments, i.e. the intersection of the tensors matching each parameter for the given step and mode.
+
+###### Examples
+- `trial.tensor_names()` Returns all tensors saved for any step or mode.
+- `trial.tensor_names(step=10, mode=modes.TRAIN)` Returns tensors saved for training step 10.
+- `trial.tensor_names(regex='relu')` Returns all tensors matching the regex pattern `relu` saved for any step or mode.
+- `trial.tensor_names(collection='gradients')` Returns tensors from the `gradients` collection.
+- `trial.tensor_names(step=10, mode=modes.TRAIN, regex='softmax')` Returns tensors saved at the 10th training step which match the regex `softmax`.
+
+
+#### tensor
+Retrieve the `smdebug.core.tensor.Tensor` object by the given name `tname`. You can review all the methods that this Tensor object provides [here](#Tensor).
+```python
+trial.tensor(tname)
+```
+###### Arguments
+- `tname (str)` Takes the name of the tensor
+
+###### Returns
+`smdebug.core.tensor.Tensor` object which has [this API](#Tensor)
+
+#### has_tensor
+Query whether the trial has a tensor with the given name.
+```python
+trial.has_tensor(tname)
+```
+
+###### Arguments
+- `tname (str)` Takes the name of the tensor
+
+###### Returns
+`bool`: `True` if the tensor has been seen by the trial so far, else `False`.
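The filtering semantics of `tensor_names` and the membership check of `has_tensor` can be illustrated with a small self-contained sketch. The tensor names, steps, and the in-memory `saved` record below are made up for illustration; the real methods query the data saved by smdebug at the trial path.

```python
import re

# Hypothetical record of saved data: tensor name -> set of (mode, step)
# pairs at which the tensor was saved. Names and steps are made up.
saved = {
    "relu_activation:0": {("TRAIN", 10), ("TRAIN", 11)},
    "gradients/dense/kernel": {("TRAIN", 10)},
    "softmax_output:0": {("EVAL", 5)},
}

def tensor_names(step=None, mode="GLOBAL", regex=None):
    """A name is returned only if it matches every filter that was passed,
    i.e. the result is the intersection of the per-parameter matches."""
    names = []
    for name, steps_saved in saved.items():
        if step is not None and (mode, step) not in steps_saved:
            continue
        if regex is not None and not re.search(regex, name):
            continue
        names.append(name)
    return sorted(names)

def has_tensor(tname):
    """True if the tensor has been seen by the trial at any step."""
    return tname in saved

# Usage: filter by step and mode, by regex, or with no filter at all.
train_step_10 = tensor_names(step=10, mode="TRAIN")
relu_only = tensor_names(regex="relu")
```

Note how passing both a step and a regex would return only the names satisfying both filters, which is the intersection behaviour described in the Returns section above.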
+
+#### steps
+Retrieve a list of steps seen by the trial.
+```python
+trial.steps(mode=None)
+```
+
+###### Arguments
+- `mode (smdebug.modes enum value)` Pass a mode here to retrieve the list of steps seen by the trial for that mode.
+If this is not passed, steps for all modes are returned.
+
+###### Returns
+`list[int]` List of integers representing step numbers. If a mode was passed, this returns steps within that mode, i.e. mode steps.
+Each of these mode steps has a global step number associated with it. The global step represents
+the sequence of steps across all modes executed by the job.
+
+#### modes
+Retrieve a list of modes seen by the trial.
+```python
+trial.modes()
+```
+
+###### Returns
+`list[smdebug.modes enum value]` List of modes for which data was saved from the training job across all steps seen.
+
+#### mode
+Given a global step number, you can identify the mode for that step using this method.
+```python
+trial.mode(global_step=100)
+```
+
+###### Arguments
+- `global_step (int)` Takes the global step as an integer
+
+###### Returns
+`smdebug.modes enum value` of the given global step
+
+#### mode_step
+Given a global step number, you can identify the mode step for that step using this method.
+```python
+trial.mode_step(global_step=100)
+```
+
+###### Arguments
+- `global_step (int)` Takes the global step as an integer
+
+###### Returns
+`int`: An integer representing the `mode_step` of the given global step. Typically used in conjunction with the `mode` method.
+
+#### global_step
+Given a mode and a mode step number, you can retrieve the corresponding global step using this method.
+```python
+trial.global_step(mode=modes.GLOBAL, mode_step=100)
+```
+
+###### Arguments
+- `mode (smdebug.modes enum value)` Takes the mode as an enum value
+- `mode_step (int)` Takes the mode step as an integer
+
+###### Returns
+`int` An integer representing the `global_step` of the given mode and mode step.
+
+#### workers
+Query for all the worker processes from which data was saved by smdebug during multi-worker training.
+```python
+trial.workers()
+```
+
+###### Returns
+`list[str]` A sorted list of names of worker processes from which data was saved. If using TensorFlow Mirrored Strategy for multi-worker training, these represent the names of the different devices in the process. For Horovod, torch.distributed and similar distributed training approaches, these represent names of the form `worker_0`, where 0 is the rank of the process.
+
+
+#### collections
+
+List the collections from the trial. Note that tensors which are part of these collections may not necessarily have been saved from the training job. Whether a collection was saved or not depends on the configuration of the Hook during training.
+
+```python
+trial.collections()
+```
+
+###### Returns
+`dict[str -> Collection]` A dictionary indexed by the name of the collection, with the Collection object as the value. Please refer to the [Collection API](api.md) for more details. #TODO fix link
+
+#### collection
+
+Get a specific collection from the trial. Note that tensors which are part of this collection may not necessarily have been saved from the training job. Whether this collection was saved or not depends on the configuration of the Hook during training.
+
+```python
+trial.collection(coll_name)
+```
+###### Arguments
+- `coll_name (str)` Name of the collection
+
+###### Returns
+`Collection` The requested Collection object. Please refer to the [Collection API](api.md) for more details. #TODO fix link
+
+
+#### wait\_for\_steps
+This method allows you to wait for steps before proceeding. You might want to use this method if you want to wait for smdebug to see the required steps so you can then query and analyze the tensors saved by that step. This method blocks till all data from the steps is seen by smdebug.
+```python
+trial.wait_for_steps(required_steps, mode=modes.GLOBAL)
+```
+
+###### Arguments
+- `required_steps (list[int])` Step numbers to wait for
+- `mode (smdebug.modes enum value)` The mode to which the given step numbers correspond. This defaults to modes.GLOBAL.
+
+###### Returns
+None. The method returns only once it is known definitively whether the given steps have been seen.
+
+###### Exceptions raised
+`StepUnavailable` and `NoMoreData`. See [Exceptions](#exceptions) section for more details.
+
+#### has\_passed\_step
+```python
+trial.has_passed_step(step, mode=modes.GLOBAL)
+```
+
+###### Arguments
+- `step (int)` The step number to check
+- `mode (smdebug.modes enum value)` The mode to which the given step number corresponds. This defaults to modes.GLOBAL.
+
+###### Returns
+`smdebug.core.tensor.StepState enum value` which can take one of three values: `UNAVAILABLE`, `AVAILABLE` and `NOT_YET_AVAILABLE`. `AVAILABLE` means the step has been seen by the trial, `NOT_YET_AVAILABLE` means the step has not been seen yet but may become available if training is still going on, and `UNAVAILABLE` means the step was not saved and will never become available.
+
+## Tensor
+An smdebug Tensor object can be retrieved through the `trial.tensor(name)` API. It is uniquely identified by the string representing its name.
+ It provides the following methods.
+ +| Method | Description| +| ---- | ----- | +| [steps()](#steps-1) | Query steps for which tensor was saved | +| [value(step)](#value) | Get the value of the tensor at a given step as a numpy array | +| [reduction_value(step)](#reduction_value) | Get the reduction value of the chosen tensor at a particular step | +| [reduction_values(step)](#reduction_values) | Get all reduction values saved for the chosen tensor at a particular step | +| [values(mode)](#values) | Get the values of the tensor for all steps of a given mode | +| [workers(step)](#workers-1) | Get all the workers for which this tensor was saved at a given step | +| [prev\_steps(step, n)](#prev_steps) | Get the last n step numbers of a given mode from a given step | + +### Tensor API +#### steps +Query for the steps at which the given tensor was saved +```python +trial.tensor(name).steps(mode=ModeKeys.GLOBAL, show_incomplete_steps=False) +``` + +###### Arguments +- `mode (smdebug.modes enum value)` The mode whose steps to return for the given tensor. Defaults to `modes.GLOBAL` +- `show_incomplete_steps (bool)` This parameter is relevant only for distributed training. By default this method only returns the steps which have been received from all workers. But if this parameter is set to True, this method will return steps received from at least one worker. + +###### Returns +`list[int]` A list of steps at which the given tensor was saved + +#### value +Get the value of the tensor at a given step as a numpy array +```python +trial.tensor(name).value(step_num, mode=ModeKeys.GLOBAL, worker=None) +``` + +###### Arguments +- `step_num (int)` The step number whose value is to be returned for the mode passed through the next parameter. +- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL` +- `worker (str)` This parameter is only applicable for distributed training. 
You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step; this is possible with the method: `trial.tensor(name).workers(step, mode)`
+
+###### Returns
+`numpy.ndarray` The value of the tensor at the given step and worker (if the training job saved data from multiple workers)
+
+#### reduction_value
+Get the reduction value of the chosen tensor at a particular step. A reduction value is a tensor reduced to a single value through reduction or aggregation operations. The different reductions you can query for are the same as what are allowed in [ReductionConfig](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md) when saving tensors.
+This API thus allows you to access the reduction you might have saved instead of the full tensor. If you had saved the full tensor, it will calculate the requested reduction at the time of this call.
+
+Reduction names allowed are `min`, `max`, `mean`, `prod`, `std`, `sum`, `variance` and `l1`, `l2` representing the norms.
+
+Each of these can be retrieved for the absolute value of the tensor or the original tensor. For example, passing `reduction_name='mean'` with `abs=True` returns the mean of the absolute value of the tensor; `abs` can be set to `False` if you want the `mean` of the actual tensor.
+
+If you had saved the tensor without any reduction, then you can retrieve the actual tensor as a numpy array and compute any reduction you might be interested in. In such a case you do not need this method.
+
+```python
+trial.tensor(name).reduction_value(step_num, reduction_name,
+                                   mode=modes.GLOBAL, worker=None, abs=False)
+```
+###### Arguments
+- `step_num (int)` The step number whose value is to be returned for the mode passed through the next parameter.
+- `reduction_name (str)` The name of the reduction to query for.
This can be one of `min`, `max`, `mean`, `std`, `variance`, `sum`, `prod` and the norms `l1`, `l2`.
+- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`
+- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step; this is possible with the method: `trial.tensor(name).workers(step, mode)`
+- `abs (bool)` If abs is True, this method tries to return the reduction passed through reduction_name after taking the absolute value of the tensor. It defaults to False.
+
+###### Returns
+`numpy.ndarray` The reduction value of the tensor at the given step and worker (if the training job saved data from multiple workers) as a 1x1 numpy array. If this reduction was saved for the tensor during training as part of the specification through reduction config, it will be loaded and returned. If the given reduction was not saved but the full tensor was saved, the reduction will be computed on the fly and returned. If neither the chosen reduction nor the full tensor is available, this method raises a `TensorUnavailableForStep` exception.
+
+
+#### reduction_values
+Get all reduction values saved for the chosen tensor at a particular step. A reduction value is a tensor reduced to a single value through reduction or aggregation operations. Please go through the description of the method `reduction_value` for more details.
+
+```python
+trial.tensor(name).reduction_values(step_num, mode=modes.GLOBAL, worker=None)
+```
+
+###### Arguments
+- `step_num (int)` The step number whose value is to be returned for the mode passed through the next parameter.
+- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above.
Defaults to `modes.GLOBAL`
+- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step; this is possible with the method: `trial.tensor(name).workers(step, mode)`
+
+###### Returns
+`dict[(str, bool) -> numpy.ndarray]` A dictionary with keys being tuples of the form `(reduction_name, abs)` to a 1x1 numpy ndarray value. `abs` here is a boolean that denotes whether the reduction was performed on the absolute value of the tensor or not. Note that this method only returns the reductions which were saved from the training job. It does not compute all known reductions and return them if only the raw tensor was saved.
+
+#### values
+Get the values of the tensor for all steps of a given mode.
+
+```python
+trial.tensor(name).values(mode=modes.GLOBAL, worker=None)
+```
+
+###### Arguments
+- `mode (smdebug.modes enum value)` The mode whose steps to return the values for. Defaults to `modes.GLOBAL`
+- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the values of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step; this is possible with the method: `trial.tensor(name).workers(step, mode)`
+
+###### Returns
+`dict[int -> numpy.ndarray]` A dictionary with step numbers as keys and numpy arrays representing the value of the tensor as values.
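As a sketch of how the returned dictionary might be used, the snippet below tracks the mean of a tensor across steps. The dictionary here is a hypothetical stand-in for the output of `trial.tensor(name).values(...)`, so the example runs without a real trial:

```python
import numpy as np

# Hypothetical stand-in for trial.tensor(name).values(mode=...):
# a dict mapping step number -> numpy array saved at that step
values = {0:   np.array([1.0, 2.0, 3.0]),
          100: np.array([0.5, 0.5, 0.5]),
          200: np.array([2.0, 4.0, 6.0])}

# Track the mean of the tensor across the saved steps, in step order
means = {step: float(np.mean(v)) for step, v in sorted(values.items())}
print(means)  # {0: 2.0, 100: 0.5, 200: 4.0}
```

The same pattern works for any per-step analysis, such as plotting a reduction of the tensor against the step number.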
+
+#### workers
+Get all the workers for which this tensor was saved at a given step
+
+```python
+trial.tensor(name).workers(step_num, mode=modes.GLOBAL)
+```
+
+###### Arguments
+- `step_num (int)` The step number to query workers for, for the mode passed through the next parameter.
+- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`
+
+###### Returns
+`list[str]` A list of worker names for which the tensor was saved at the given step.
+
+#### prev_steps
+Get the last n step numbers of a given mode from a given step.
+
+```python
+trial.tensor(name).prev_steps(step, n, mode=modes.GLOBAL)
+```
+###### Arguments
+- `step (int)` The step number from which to look back, for the mode passed.
+- `n (int)` Number of previous steps to return
+- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`
+
+###### Returns
+`list[int]` A list of size at most n representing the previous steps for the given step and mode. Note that this list can be of size less than n if fewer than n steps were saved before the given step in this trial.
+
+## Rules
+Rules are the medium by which SageMaker Debugger executes a certain piece of code regularly on different steps of the job. A rule is assigned to a trial and can be invoked at each new step of the trial. It can also access other trials for its execution. You can evaluate a rule using tensors from the current step or any step before the current step. Please ensure your logic respects these semantics, else you will get a `TensorUnavailableForStep` exception as the data would not yet be available for future steps.
+
+### Built In Rules
+Please refer to the list of built-in rules that SageMaker provides in the SageMaker Debugger documentation.
+
+### Writing a custom rule
+Writing a rule involves implementing the [Rule interface](../../smdebug/rules/rule.py).
Below let us look at a simplified version of a VanishingGradient rule.
+
+##### Constructor
+Creating a rule involves first inheriting from the base Rule class SMDebug provides.
+For this rule here we do not need to look at any other trials, so we set `other_trials` to None.
+
+```python
+from smdebug.rules import Rule
+
+class VanishingGradientRule(Rule):
+    def __init__(self, base_trial, threshold=0.0000001):
+        super().__init__(base_trial, other_trials=None)
+        self.threshold = float(threshold)
+```
+
+Please note that apart from `base_trial` and `other_trials` (if required), we require all
+arguments of the rule constructor to take a string as value. You can parse them to the type
+that you want from the string. This means if you want to pass a list of strings, you might want to pass them as a comma separated string. This restriction is
+being enforced so as to let you create and invoke rules from JSON using SageMaker's APIs.
+
+##### Function to invoke at a given step
+In this function you can implement the core logic of what you want to do with these tensors.
+It should return a boolean value `True` or `False`, where `True` means the rule evaluation condition has been met. When you invoke these rules through SageMaker, the rule job ends when the rule evaluation condition is met. SageMaker creates a Cloudwatch event for every rule job, which can be used to define actions that you might want to take based on the state of the rule.
+
+A simplified version of the actual invoke function for `VanishingGradientRule` is below. Note that `False` is returned only after all gradients have been checked:
+
+```python
+    def invoke_at_step(self, step):
+        for tensorname in self.base_trial.tensors(collection='gradients'):
+            tensor = self.base_trial.tensor(tensorname)
+            abs_mean = tensor.reduction_value(step, 'mean', abs=True)
+            if abs_mean < self.threshold:
+                return True
+        return False
+```
+
+That's it, writing a rule is as simple as that.
+
+### Invoking a rule
+Rules can be invoked on SageMaker through its APIs.
+
+#### invoke_rule
+You might want to invoke the rule locally during development. We provide a function to invoke rules easily. Refer [smdebug/rules/rule_invoker.py](../../smdebug/rules/rule_invoker.py). The invoke function has the following syntax. It takes an instance of a Rule and invokes it for a series of steps one after the other.
+
+```python
+from smdebug.rules import invoke_rule
+from smdebug.trials import create_trial
+
+trial = create_trial('s3://smdebug-dev-test/mnist-job/')
+rule_obj = VanishingGradientRule(trial, threshold=0.0001)
+invoke_rule(rule_obj, start_step=0, end_step=None)
+```
+
+###### Arguments
+- `rule_obj (Rule)` An instance of a subclass of `smdebug.rules.Rule` that you want to invoke.
+- `start_step (int)` Global step number to start invoking the rule from. Note that this refers to a global step. This defaults to 0.
+- `end_step (int or None)`: Global step number to end the invocation of the rule before. To clarify, end_step is an exclusive bound, so the rule is not invoked at `end_step`. This defaults to `None`, which means run till the end of the job.
+- `raise_eval_cond (bool)` This parameter controls whether to raise the exception `RuleEvaluationConditionMet` when raised by the rule, or to catch it, log the condition, and move to the next step. Defaults to False, meaning it catches the exception, logs that the evaluation condition was met for a step, and moves on to evaluate the next step.
+
+
+## Exceptions
+SMDebug is designed to be aware that tensors required to execute a rule may not be available at every step. Hence it raises a few exceptions which allow us to control what happens when a tensor is missing. These are available in the `smdebug.exceptions` module. You can import them as follows:
+
+```python
+from smdebug.exceptions import *
+```
+
+Here are the exceptions and their meanings:
+
+- `TensorUnavailableForStep` : This means that the tensor requested is not available for the step.
It may be saved for a different step number. Note that when you see this exception, it means that this tensor can never become available for this step in the future. + +- `TensorUnavailable` : This means that this tensor has not been saved from the training job. Note that if you have a SaveConfig which saves a certain tensor only after the time you queried for the tensor, you might get a TensorUnavailable exception even if the tensor may become available later for some step. + +- `StepUnavailable`: This means that the step was not saved from the training job. No tensor will be available for this step. + +- `StepNotYetAvailable`: This means that the step has not yet been seen from the training job. It may be available in the future if the training is still going on. We automatically load new data as and when it becomes available. This step may either become available in the future, or the exception might change to `StepUnavailable`. + +- `NoMoreData` : This will be raised when the training ends. Once you see this, you will know that there will be no more steps and no more tensors saved. + +- `RuleEvaluationConditionMet`: This is raised when the rule invocation returns True for some step. + +## Utils + +### Enable or disable refresh of tensors in a trial + +By default SMDebug refreshes tensors each time you try to query the tensor. +It looks for whether this tensor is saved for new steps and if so fetches them. +If you know the saved data will not change (stopped the machine learning job), or +are not interested in the latest data, you can stop the refreshing of tensors as follows: + +`no_refresh` takes a trial or a list of trials, which should not be refreshed. +Anything executed inside the with `no_refresh` block will not be refreshed. 
+ +```python +from smdebug.analysis.utils import no_refresh +with no_refresh(trials): + pass +``` + +Similarly if you want to refresh tensors only within a block, you can do: + +```python +from smdebug.analysis.utils import refresh +with refresh(trials): + pass +``` + +During rule invocation smdebug waits till the current step is available and then turns off refresh to ensure that you do not get different results for methods like `trial.tensor(name).steps()` and run into subtle issues. diff --git a/sagemaker-docs/DeveloperGuide_MXNet.md b/sagemaker-docs/DeveloperGuide_MXNet.md deleted file mode 100644 index 881675af5..000000000 --- a/sagemaker-docs/DeveloperGuide_MXNet.md +++ /dev/null @@ -1,262 +0,0 @@ -# Tornasole for MXNet -Tornasole is designed to be a debugger for machine learning models. It lets you go beyond just looking -at scalars like losses and accuracies during training and -gives you full visibility into all tensors 'flowing through the graph' -during training or inference. - - -## Quickstart -If you want to quickly run an end to end example, please refer to [mnist notebook example](examples/notebooks/mxnet.ipynb) to see tornasole working. - -Integrating Tornasole into the training job can be accomplished by following steps below. - -### Import the hook package -Import the SessionHook class along with other helper classes in your training script as shown below - -``` -from smdebug.mxnet.hook import SessionHook -from smdebug.mxnet import SaveConfig, Collection -``` - -### Instantiate and initialize hook - -``` - # Create SaveConfig that instructs engine to log graph tensors every 10 steps. - save_config = SaveConfig(save_interval=10) - # Create a hook that logs tensors of weights, biases and gradients while training the model. 
- output_s3_uri = 's3://my_mxnet_training_debug_bucket/12345678-abcd-1234-abcd-1234567890ab'
- hook = SessionHook(out_dir=output_s3_uri, save_config=save_config)
-```
-
-Using the _Collection_ object and/or _include\_regex_ parameter of SessionHook, users can control which tensors will be stored by the SessionHook.
-The section [How to save tensors](#how-to-save-tensors) explains various ways users can create a _Collection_ object to store the required tensors.
-
-The _SaveConfig_ object controls when these tensors are stored. The tensors can be stored for specific steps or after a certain interval of steps. If the save\_config parameter is not specified, the SessionHook will store tensors after every 100 steps.
-
-For additional details on SessionHook, SaveConfig and Collection please refer to the [API documentation](api.md)
-
-### Register the Tornasole hook to the model before starting the training
-
-#### NOTE: The hook can only be registered to Gluon Non-hybrid models.
-
-After creating or loading the desired model, users can register the hook with the model as shown below.
-
-```
-net = create_gluon_model()
-# Apply hook to the model (e.g. instruct engine to recognize hook configuration
-# and enable mode in which engine will log graph tensors)
-hook.register_hook(net)
-```
-
-#### Set the mode
-Set the mode you are running the job in. This helps you group steps by mode,
-for easier analysis.
-If you do not specify this, it saves steps under a `default` mode.
-```
-hook.set_mode(smd.modes.TRAIN)
-```
-
-## API
-Please refer to [this document](api.md) for a description of all the functions and parameters that our APIs support
-
-#### Hook
-SessionHook is the entry point for Tornasole into your program.
-Some key parameters to consider when creating the SessionHook are the following:
-
-- `out_dir`: This represents the path to which the outputs of tornasole will be written.
Note that for Sagemaker, you always need to specify the out_dir as `/opt/ml/output/tensors`. In the future, we will make this the default in Sagemaker environments.
-- `save_config`: This is an object of [SaveConfig](#saveconfig). The SaveConfig allows the user to specify when the tensors are to be stored. The user can choose to specify the number of steps or the intervals of steps when the tensors will be stored. If not specified, it defaults to a SaveConfig which saves every 100 steps.
-- `include_collections`: This represents the [collections](#collection) to be saved. With this parameter, the user can control which tensors are to be saved.
-- `include_regex`: This represents the regex patterns of names of tensors to save. With this parameter, the user can control which tensors are to be saved.
-
-**Examples**
-
-- Save weights and gradients every 100 steps to an S3 location
-
-```
-import smdebug.mxnet as smd
-smd.SessionHook(out_dir='s3://tornasole-testing/trial_job_dir',
-                save_config=smd.SaveConfig(save_interval=100),
-                include_collections=['weights', 'gradients'])
-```
-
-- Save custom tensors by regex pattern to a local path
-
-```
-import smdebug.mxnet as smd
-smd.SessionHook(out_dir='/home/ubuntu/tornasole-testing/trial_job_dir',
-                include_regex=['relu*'])
-```
-
-Refer [API](api.md) for all parameters available and detailed descriptions.
-
-### Mode
-A machine learning job can be executing steps in multiple modes, such as training, evaluating, or predicting.
-Tornasole provides you the construct of a `mode` to keep data from these modes separate
-and make it easy for analysis. To leverage this functionality you have to
-call the `set_mode` function of the hook such as the following call `hook.set_mode(modes.TRAIN)`.
-The different modes available are `modes.TRAIN`, `modes.EVAL` and `modes.PREDICT`.
-
-
-If the mode was not set, all steps will be available together.
-
-You can choose to have different save configurations (SaveConfigMode)
-for different modes.
You can configure this by passing a
-dictionary from mode to SaveConfigMode object.
-The hook's `save_config` parameter accepts such a dictionary, as well as a collection's `save_config` property.
-```
-from smdebug.mxnet import SessionHook, get_collection, modes, SaveConfigMode
-scm = {modes.TRAIN: SaveConfigMode(save_interval=100),
-       modes.EVAL: SaveConfigMode(save_interval=10)}
-
-hook = SessionHook(...,
-                   save_config=scm,
-                   ...)
-```
-
-```
-from smdebug.mxnet import get_collection, modes, SaveConfigMode
-get_collection('weights').save_config = {modes.TRAIN: SaveConfigMode(save_interval=10),
-                                         modes.EVAL: SaveConfigMode(save_interval=1000)}
-```
-#### Collection
-Collection object helps group tensors for easier handling of tensors being saved.
-A collection has its own list of tensors, include regex patterns, [reduction config](#reductionconfig) and [save config](#saveconfig).
-This allows setting of different save and reduction configs for different tensors.
-These collections are then also available during analysis.
-Tornasole will save the value of tensors in a collection, if the collection is included in the `include_collections` param of the [hook](#hook).
-
-Refer [API](api.md) for all methods available when using collections such
-as setting SaveConfig,
-ReductionConfig for a specific collection, or retrieving all collections.
-
-Please refer to [creating a collection](#creating-a-collection) to get an overview of how to
-create a collection and add tensors to it.
-
-#### SaveConfig
-SaveConfig class allows you to customize the frequency of saving tensors.
-The hook takes a SaveConfig object which is applied as
-default to all tensors included.
-A collection can also have its own SaveConfig object which is applied
-to the tensors belonging to that collection.
-
-SaveConfig also allows you to save tensors when certain tensors become nan.
-This list of tensors to watch for is taken as a list of strings representing names of tensors.
-
-The parameters taken by SaveConfig are:
-
-- `save_interval`: This allows you to save tensors every `n` steps; when `step_num % save_interval == 0`.
-- `start_step`: The step at which to start saving (inclusive), defaults to 0.
-- `end_step`: The step at which to stop saving (exclusive), defaults to None/Infinity.
-- `save_steps`: Allows you to pass a list of step numbers at which tensors should be saved; overrides `save_interval`, `start_step`, and `end_step`.
-
-Refer [API](api.md) for all parameters available and detailed descriptions for them, as well as example SaveConfig objects.
-
-#### ReductionConfig
-ReductionConfig allows the saving of certain reductions of tensors instead
-of saving the full tensor. By reduction here we mean an operation that converts the tensor to a scalar. The motivation here is to reduce the amount of data
-saved, and increase the speed in cases where you don't need the full tensor.
-The reduction operations are computed during the training process and the resulting values are then saved.
-During analysis, these are available as reductions of the original tensor.
-**Please note that using reduction config means that you will not have
-the full tensor available during analysis, so this can restrict what you can do with the tensor saved.**
-The hook takes a ReductionConfig object which is applied as default to all tensors included.
-A collection can also have its own ReductionConfig object which is applied
-to the tensors belonging to that collection.
-
-**Examples**
-
-- ```ReductionConfig(abs_reductions=['min','max','mean'])``` Save min, max, mean on absolute values of the tensors included
-
-- ```ReductionConfig(reductions=['min','max','mean'])``` Save min, max, mean of the tensors included
-
-- ```ReductionConfig(norms=['l1'])``` Saves l1 norm of the tensors included
-
-
-These reduction config instances can be passed to the hook as follows
-
-```
-import smdebug.mxnet as smd
-global_save_config = smd.SaveConfig(save_interval=100)
-global_reduce_config = smd.ReductionConfig(reductions=["max", "mean"])
-hook = smd.SessionHook(out_dir=out_dir, save_config=global_save_config, reduction_config=global_reduce_config)
-```
-
-Or ReductionConfig can be specified for an individual collection as follows
-
-```
-import smdebug.mxnet as smd
-smd.get_collection("ReluActivation").include(["relu*"])
-smd.get_collection("ReluActivation").save_config = smd.SaveConfig(save_steps=[4,5,6])
-smd.get_collection("ReluActivation").reduction_config = smd.ReductionConfig(reductions=["min"], abs_reductions=["max"])
-...
-smd.get_collection("flatten").include(["flatten*"])
-smd.get_collection("flatten").save_config = smd.SaveConfig(save_steps=[4,5,6])
-smd.get_collection("flatten").reduction_config = smd.ReductionConfig(norms=["l1"], abs_norms=["l2"])
-hook = smd.SessionHook(out_dir=out_dir, include_collections=['weights', 'biases', 'gradients',
-                                                             'default', 'ReluActivation', 'flatten'])
-```
-
-Refer [API](api.md) for a list of the reductions available as well as examples.
-
-
-### How to save tensors
-
-There are different ways to save tensors when using smdebug.
-Tornasole provides easy ways to save certain standard tensors by way of default collections (a Collection represents a group of tensors).
-Examples of such collections are 'weights', 'gradients', 'biases' and 'default'.
-Besides the tensors in the above default collections, you can save tensors by name or regex patterns on those names.
-Users can also specify a certain block in the model to save the inputs and outputs of that block.
-This section will take you through these ways in more detail.
-
-#### Saving the tensors with _include\_regex_
-The SessionHook API supports the _include\_regex_ parameter. Users can specify a regex pattern with this parameter. The SessionHook will store the tensors that match the specified regex pattern. With this approach, users can store tensors without explicitly creating a Collection object. The specified regex pattern will be associated with the 'default' Collection and the SaveConfig object that is associated with the 'default' collection.
-
-#### Default Collections
-Currently, the tornasole\_mxnet hook creates Collection objects for 'weights', 'gradients', 'biases' and 'default'. These collections contain the regex patterns that match tensors of type weight, gradient and bias. The regex pattern for the 'default' collection is set when the user specifies _include\_regex_ with SessionHook or sets _save\_all=True_. These collections use the SaveConfig parameter provided with the SessionHook initialization. The SessionHook will store the related tensors if the user does not specify any special collection with the _include\_collections_ parameter. If the user specifies a collection with _include\_collections_, the above default collections will not be in effect.
-
-#### Custom Collections
-You can also create any other customized collection yourself.
-You can create new collections as well as modify existing collections
-
-##### Creating a collection
-Each collection should have a unique name (which is a string). You can create collections by invoking helper methods as described in the [API](api.md) documentation
-
-```
-import smdebug.mxnet as smd
-smd.get_collection('weights').include(['weight'])
-```
-
-##### Adding tensors
-Tensors can be added to a collection by passing an include regex parameter to the collection.
-If you don't know the name of the tensors you want to add, you can also add the tensors to the collection
-by the variables representing the tensors in code. The following sections describe these two scenarios.
-
-###### Adding tensors by regex
-If you know the name of the tensors you want to save and can write regex
-patterns to match those tensor names, you can pass the regex patterns to the collection.
-The tensors which match these patterns are included and added to the collection.
-
-```
-import smdebug.mxnet as smd
-smd.get_collection('ReluActivation').include(["relu*", "input_*"])
-```
-
-###### Adding tensors from Gluon block
-If users want to log the inputs and outputs of a particular block in the Gluon model, they can do so by creating a collection as shown below.
-
-```
-import smdebug.mxnet as smd
-smd.get_collection('Conv2DBlock').add_block_tensors(conv2d, inputs=True, outputs=True)
-```
-
-For creating this collection, users must have access to the block object whose inputs and outputs are to be logged.
-
-#### Saving All Tensors
-Tornasole makes it easy to save all the tensors in the model. You just need to set the flag `save_all=True` when creating the hook. This creates a collection named 'all' and saves all the tensors under that collection.
-**NOTE : Storing all the tensors will slow down the training and will increase the storage consumption.**
-
-
-## ContactUs
-We would like to hear from you. If you have any question or feedback, please reach out to us at tornasole-users@amazon.com
-
-## License
-This library is licensed under the Apache 2.0 License.
diff --git a/sagemaker-docs/DeveloperGuide_PyTorch.md b/sagemaker-docs/DeveloperGuide_PyTorch.md
deleted file mode 100644
index 0ce03de90..000000000
--- a/sagemaker-docs/DeveloperGuide_PyTorch.md
+++ /dev/null
@@ -1,367 +0,0 @@
-# Tornasole for Pytorch
-Tornasole is an upcoming AWS service designed to be a debugger
-for machine learning models.
It lets you go beyond just looking
-at scalars like losses and accuracies during training and
-gives you full visibility into all tensors 'flowing through the graph'
-during training or inference.
-
-Using Tornasole is a two step process:
-
-**Saving tensors**
-This needs the `tornasole` package built for the appropriate framework. This package lets you collect the tensors you want at the frequency
-that you want, and save them for analysis. Sagemaker containers provided to you already have this package installed.
-
-**Analysis**
-Please refer to [this page](../../rules/DeveloperGuide_Rules.md) for more details about how to run rules and other analysis
-on tensors collected from the job. The analysis of these tensors can be done on a separate machine
-in parallel with the training job.
-
-## Quickstart
-Integrating Tornasole into the training job can be accomplished by following the steps below.
-
-### Import the hook package
-Import the SessionHook class along with other helper classes in your training script as shown below
-
-```
-from smdebug.pytorch import SessionHook
-from smdebug.pytorch import Collection
-from smdebug import SaveConfig
-import smdebug.pytorch as smd
-```
-
-### Instantiate and initialize hook
-Then create the SessionHook by specifying what you want
-to save, when you want to save them and
-where you want to save them. Note that for Sagemaker, you always need to specify the out_dir as `/opt/ml/output/tensors`. In the future, we will make this the default in Sagemaker environments.
-
-```
- # Create SaveConfig that instructs engine to log graph tensors every 10 steps.
- save_config = SaveConfig(save_interval=10)
- # Create a hook that logs tensors of weights, biases and gradients while training the model.
- hook = SessionHook(out_dir='/opt/ml/output/tensors', save_config=save_config)
-```
-
-For additional details on SessionHook, SaveConfig and Collection, please refer to the [API documentation](api.md).
-
-### Register the Tornasole hook to the model before starting the training
-
-Here is a sample PyTorch model you may use if you wish:
-```
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-class Net(nn.Module):
-    def __init__(self):
-        super(Net, self).__init__()
-        self.add_module('fc1', nn.Linear(20, 500))
-        self.add_module('relu1', nn.ReLU())
-        self.add_module('fc2', nn.Linear(500, 10))
-        self.add_module('relu2', nn.ReLU())
-        self.add_module('fc3', nn.Linear(10, 4))
-    def forward(self, x_in):
-        fc1_out = self.fc1(x_in)
-        relu1_out = self.relu1(fc1_out)
-        fc2_out = self.fc2(relu1_out)
-        relu2_out = self.relu2(fc2_out)
-        fc3_out = self.fc3(relu2_out)
-        out = F.log_softmax(fc3_out, dim=1)
-        return out
-
-def create_model():
-    device = torch.device("cpu")
-    return Net().to(device)
-```
-After creating or loading the desired model, users can register the hook with the model as shown below.
-
-```
-net = create_model()
-# Apply the hook to the model (i.e. instruct the engine to recognize the hook
-# configuration and enable the mode in which it will log graph tensors)
-hook.register_hook(net)
-```
-
-#### Set the mode
-Set the mode you are running the job in. This helps you group steps by mode,
-for easier analysis.
-If you do not specify this, steps are saved under a `default` mode.
-```
-hook.set_mode(smd.modes.TRAIN)
-```
-
-## API
-Please refer to [this document](api.md) for a description of all the functions and parameters that our APIs support.
-
-#### Hook
-SessionHook is the entry point for Tornasole into your program.
-Some key parameters to consider when creating the SessionHook are the following:
-
-- `out_dir`: This represents the path to which the outputs of Tornasole will be written. Note that for Sagemaker, you always need to specify the out_dir as `/opt/ml/output/tensors`.
In the future, we will make this the default in Sagemaker environments.
-- `save_config`: This is an object of [SaveConfig](#saveconfig). SaveConfig allows the user to specify when tensors are to be stored, either as specific step numbers or as an interval between steps.
-- `include_collections`: This represents the [collections](#collection) to be saved. Each collection can have its own SaveConfig item.
-
-Refer to the [API](api.md) for all parameters available and detailed descriptions.
-
-### Mode
-A machine learning job can execute steps in multiple modes, such as training, evaluating, or predicting.
-Tornasole provides the construct of a `mode` to keep data from these modes separate
-and make analysis easier. To leverage this functionality, call the hook's
-`set_mode` function, for example `hook.set_mode(modes.TRAIN)`.
-The different modes available are `modes.TRAIN`, `modes.EVAL` and `modes.PREDICT`.
-
-If the mode was not set, all steps will be available together.
-
-You can choose to have different save configurations (SaveConfigMode)
-for different modes. You can configure this by passing a
-dictionary mapping each mode to a SaveConfigMode object.
-The hook's `save_config` parameter accepts such a dictionary, as does a collection's `save_config` property.
-```
-from smdebug.pytorch import SessionHook, get_collection, modes, SaveConfigMode
-scm = {modes.TRAIN: SaveConfigMode(save_interval=100),
-    modes.EVAL: SaveConfigMode(save_interval=10)}
-
-hook = SessionHook(...,
-    save_config=scm,
-    ...)
-```
-
-```
-from smdebug.pytorch import get_collection, modes, SaveConfigMode
-get_collection('weights').save_config = {modes.TRAIN: SaveConfigMode(save_interval=10),
-                    modes.EVAL: SaveConfigMode(save_interval=1000)}
-```
-
-#### Collection
-The Collection object helps group tensors for easier handling of tensors being saved.
-A collection has its own list of tensors, include regex patterns, [reduction config](#reductionconfig) and [save config](#saveconfig).
-This allows setting different save and reduction configs for different tensors.
-These collections are then also available during analysis.
-Tornasole will save the value of the tensors in a collection if the collection is included in the `include_collections` param of the [hook](#hook).
-
-Refer to the [API](api.md) for all methods available when using collections, such as setting SaveConfig or
-ReductionConfig for a specific collection, or retrieving all collections.
-
-Please refer to [creating a collection](#creating-a-collection) for an overview of how to create a collection and add tensors to it.
-
-#### SaveConfig
-The SaveConfig class allows you to customize the frequency of saving tensors.
-The hook takes a SaveConfig object which is applied as the
-default to all tensors included.
-A collection can also have its own SaveConfig object which is applied
-to the tensors belonging to that collection.
-
-SaveConfig also allows you to save tensors when certain tensors become NaN.
-The list of tensors to watch for is taken as a list of strings representing the names of the tensors.
-
-The parameters taken by SaveConfig are:
-
-- `save_interval`: This allows you to save tensors every `n` steps, i.e. when `step_num % save_interval == 0`.
-- `start_step`: The step at which to start saving (inclusive); defaults to 0.
-- `end_step`: The step at which to stop saving (exclusive); defaults to None (no limit).
-- `save_steps`: Allows you to pass a list of step numbers at which tensors should be saved; overrides `save_interval`, `start_step`, and `end_step`.
-
-Refer to the [API](api.md) for all parameters available and detailed descriptions, as well as example SaveConfig objects.
-
-#### ReductionConfig
-ReductionConfig allows the saving of certain reductions of tensors instead
-of saving the full tensor.
By reduction here we mean an operation that converts the tensor to a scalar. The motivation is to reduce the amount of data
-saved, and increase speed in cases where you don't need the full tensor.
-The reduction operations are computed during the training process and then saved.
-During analysis, these are available as reductions of the original tensor.
-**Please note that using a reduction config means that you will not have
-the full tensor available during analysis, so this can restrict what you can do with the tensor saved.**
-The hook takes a ReductionConfig object which is applied as the default to all tensors included.
-A collection can also have its own ReductionConfig object which is applied
-to the tensors belonging to that collection.
-
-**Examples**
-- ```ReductionConfig(abs_reductions=['min','max','mean'])``` Saves the min, max, and mean of the absolute values of the tensors included
-
-- ```ReductionConfig(reductions=['min','max','mean'])``` Saves the min, max, and mean of the tensors included
-
-- ```ReductionConfig(norms=['l1'])``` Saves the l1 norm of the tensors included
-
-These reduction config instances can be passed to the hook as follows:
-```
-import smdebug.pytorch as smd
-hook = smd.SessionHook(..., reduction_config=smd.ReductionConfig(norms=['l1']), ...)
-```
-Refer to the [API](api.md) for a full list of the reductions available.
-
-
-### How to save tensors
-
-There are different ways to save tensors when using smdebug.
-Tornasole provides easy ways to save certain standard tensors by way of default collections (a Collection represents a group of tensors).
-Examples of such collections are 'weights' and 'gradients'.
-Besides these, you can save tensors by name or by regex patterns on those names.
-Users can also specify a certain module in the model to save the inputs and outputs of that module.
-This section will take you through these ways in more detail.
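Before going into those ways, it may help to make the reductions described under ReductionConfig concrete. Below is a small pure-Python sketch (not smdebug's actual implementation, which runs inside the training process) of what each named reduction computes over a tensor flattened to a list of floats:

```python
import math

def reduction(values, op, abs_values=False):
    """Compute one scalar reduction over a flat list of tensor values.

    Illustrative only: the names mirror the ReductionConfig options
    ('min', 'max', 'mean', 'l1', 'l2'); the real library may differ.
    """
    if abs_values:
        # abs_reductions apply the reduction to absolute values
        values = [abs(v) for v in values]
    if op == "min":
        return min(values)
    if op == "max":
        return max(values)
    if op == "mean":
        return sum(values) / len(values)
    if op == "l1":
        return sum(abs(v) for v in values)
    if op == "l2":
        return math.sqrt(sum(v * v for v in values))
    raise ValueError("unknown reduction: %s" % op)

grad = [-0.2, 0.1, 0.4, -0.1]
print(reduction(grad, "mean", abs_values=True))  # mean of absolute values, ~0.2
print(reduction(grad, "l2"))                     # l2 norm of the raw values
```

This is also what the analysis side hands back when you query a saved reduction: a single scalar per tensor per step, rather than the full array.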
-
-#### Default Collections
-Currently, Tornasole creates Collection objects for 'weights' and 'gradients' by default for every run.
-These collections store the tensors corresponding to the trainable parameters and their gradients.
-
-#### Custom Collections
-You can also create any other customized collection yourself.
-You can create new collections as well as modify existing collections.
-
-##### Creating a collection
-Each collection should have a unique name (which is a string). Users can create or retrieve a collection by name as follows.
-
-```
-weight_collection = smd.get_collection('weight')
-```
-
-##### Adding tensors
-Tensors can be added to a collection by passing an include regex parameter to the collection.
-If you don't know the names of the tensors you want to add, you can also add tensors to the collection
-via the variables representing them in code. The following sections describe these two scenarios.
-
-###### Adding tensors by regex
-If you know the names of the tensors you want to save and can write regex
-patterns to match those tensor names, you can pass the regex patterns to the collection.
-The tensors which match these patterns are added to the collection.
-
-```
-custom_collect = smd.get_collection("ReluActivation")
-custom_collect.include(["relu*", "input_*"])
-```
-
-###### Adding tensors from a torch.nn Module
-If users want to log the inputs and outputs of a particular module, they can do so by creating a collection as shown below. For the example below, assume `conv2d` is the module whose inputs and outputs we wish to log.
-
-```
-module_collection = smd.get_collection('Conv2DModule')
-module_collection.add_module_tensors(conv2d, inputs=True, outputs=True)
-```
-
-To create this collection, users must have access to the module object whose inputs and outputs are to be logged.
-
-#### Saving All Tensors
-Tornasole makes it easy to save all the tensors in the model.
You just need to set the flag `save_all=True` when creating the hook.
-This creates a collection named 'all' and saves all the tensors under that collection.
-**NOTE: Storing all the tensors will slow down training and increase storage consumption.**
-
-
-### More Examples
-
-#### Logging the weights, biases, and gradients of the model
-
-Here is how to create a hook for this purpose.
-
-```
- # Create the Tornasole hook. The initialization of the hook determines which tensors
- # are logged while training is in progress.
- # The following function shows the default initialization that enables logging of
- # weights, biases and gradients in the model.
- def create_hook(output_dir):
-     # Create a SaveConfig that determines the steps at which tensors are to be stored.
-     # With the following SaveConfig, we will save tensors for steps 1, 2 and 3.
-     save_config = SaveConfig(save_steps=[1, 2, 3])
-     # Create a hook that logs ONLY weights, biases, and gradients while training the model.
-     hook = SessionHook(out_dir=output_dir, save_config=save_config)
-     return hook
-```
-
-Here is how to register the hook:
-
-```
-# Assume your model is called net
-hook = create_hook(output_dir)
-hook.register_hook(net)
-```
-
-#### Logging the inputs and output of a model along with weights and gradients
-In order to achieve this, we need to create a collection as follows:
-
-```
-# In order to log the inputs and output of a module, we create a collection as follows:
-get_collection('l_mod').add_module_tensors(module, inputs=True, outputs=True)
-```
-
-The name of the collection is "l_mod". We have created it around the top-level module of the model, which represents the complete model itself. As a result, this collection will contain the tensors that were inputs and outputs of this module (i.e. the model itself) at the corresponding training steps.
-The following code shows how to initialize the hook with the above collection.
-
-```
-def create_hook(output_dir, module):
-    # The names of the input and output tensors of a module are in the following format:
-    # Inputs :  _input_, and
-    # Output :  _output
-    # In order to log the inputs and output of a module, we create a collection as follows:
-    assert module is not None
-    get_collection('l_mod').add_module_tensors(module, inputs=True, outputs=True)
-
-    # Create a hook that logs weights, biases, gradients and the inputs and outputs of the model while training.
-    hook = SessionHook(out_dir=output_dir, save_config=SaveConfig(save_steps=[i * 10 for i in range(5)]),
-                 include_collections=['weights', 'gradients', 'biases', 'l_mod'])
-    return hook
-```
-
-Here is how to register the above hook.
-
-```
-# Assume your model is called net
-hook = create_hook(output_dir=output_dir, module=net)
-hook.register_hook(net)
-```
-
-#### Logging the inputs and output of a module in the model along with weights and gradients
-Follow the same procedure as above; just pass
-the appropriate module into `create_hook`.
-
-#### Saving all tensors in the model
-To save all the tensors, users are not required to create a special collection.
-Users can set the _save_all_ flag while creating a SessionHook object in the manner shown below.
-
-```
- # Create the Tornasole hook. The initialization of the hook determines which tensors
- # are logged while training is in progress.
- # The following function shows an initialization that enables logging of
- # weights, biases, gradients, and module inputs and outputs in the model.
- def create_hook(output_dir):
-     # Create a SaveConfig that determines the steps at which tensors are to be stored.
-     # With the following SaveConfig, we will save tensors for steps 1, 2 and 3.
-     save_config = SaveConfig(save_steps=[1, 2, 3])
-     # Create a hook that logs weights, biases, gradients, module inputs, and module outputs of all layers while training the model.
-     hook = SessionHook(out_dir=output_dir, save_config=save_config, save_all=True)
-     return hook
-```
-
-Here is how to register the hook:
-
-```
-# Assume your model is called net
-hook = create_hook(output_dir)
-hook.register_hook(net)
-```
-All tensors will be saved as part of a collection named 'all'.
-
-## Analyzing the Results
-
-This library enables users to collect the desired tensors at the desired frequency while the PyTorch job is running.
-The tensor data generated during this job can be analyzed with various rules
-that check for vanishing gradients, exploding gradients, etc.
-For details regarding how to analyze the tensor data, usage of existing rules, or
-writing new rules, please refer to the [Rules documentation](../../rules/DeveloperGuide_Rules.md).
-
-
-## FAQ
-#### Logging
-You can control the logging from Tornasole by setting the appropriate
-level for the python logger `tornasole` using either of the following approaches.
-
-**In Python code**
-```
-import logging
-logging.getLogger('tornasole').setLevel(logging.INFO)
-```
-
-**Using environment variable**
-You can also set the environment variable `SMDEBUG_LOG_LEVEL` as below:
-
-```
-export SMDEBUG_LOG_LEVEL=INFO
-```
-The log levels available are 'INFO', 'DEBUG', 'WARNING', 'ERROR', 'CRITICAL', and 'OFF'.
-
-## ContactUs
-We would like to hear from you. If you have any questions or feedback, please reach out to us at tornasole-users@amazon.com
-
-## License
-This library is licensed under the Apache 2.0 License.
diff --git a/sagemaker-docs/DeveloperGuide_Rules.md b/sagemaker-docs/DeveloperGuide_Rules.md
deleted file mode 100644
index 32da1e42a..000000000
--- a/sagemaker-docs/DeveloperGuide_Rules.md
+++ /dev/null
@@ -1,502 +0,0 @@
-# Tornasole Analysis
-Tornasole is an upcoming AWS service designed to be a debugger for machine learning models.
It lets you go beyond just looking at scalars like losses and accuracies during training and gives you
-full visibility into all tensors 'flowing through the graph' during training or inference.
-
-Tornasole's analysis module helps you analyze tensors saved from machine learning jobs.
-It allows you to run Rules on these tensors, as well as anything else you might want to do with
-access to raw tensors, such as inspection or visualization. It provides access to the tensors in the form of numpy arrays.
-
-
-## Installation
-If you want to play around with data locally outside of the RuleExecution
-containers that Sagemaker provides, you have to install the Tornasole binary for analysis. We recommend
-that you spin up a Sagemaker notebook and install the binary as follows.
-
-#### Prerequisites
-- **Python 3.6**
-
-#### Instructions
-**Make sure that your aws account is whitelisted for smdebug. [ContactUs](#contactus)**.
-
-Once your account is whitelisted, you should be able to install the `tornasole` package
-built for analysis as follows. Note that this is not the same package as the one installed in the Sagemaker containers.
-
-```
-aws s3 sync s3://tornasole-external-preview-use1/rules/binary tornasole_rules_binary/
-pip install tornasole_rules_binary/*
-```
-
-**Please note**: If, while installing tornasole, you get a version conflict issue
-between botocore and boto3, you might need to run the following:
-```
-pip uninstall -y botocore boto3 aioboto3 aiobotocore && pip install botocore==1.12.91 boto3==1.9.91 aiobotocore==0.10.2 aioboto3==6.4.1
-```
-
-## The Programming Model
-The library is organized using the following constructs.
-
-### Trial
-Trial is the construct which lets you query for tensors for a given Tornasole run, specified by the path in which Tornasole artifacts are being saved or were saved.
-You can pass a path which holds data for a past run (which has ended) as well as a path for a current run (to which tensors are being written). -Trial is capable of loading new tensors as and when they become available at the given location. - -There are two types of trials you can create: LocalTrial or S3Trial. -We provide a wrapper method to create the appropriate trial. - -The parameters you have to provide are: -- `name`: name can be any string. It is to help you manage different trials. -Make sure to give it a unique name to prevent confusion. -- `path`: path can be a local path or an S3 path of the form `s3://bucket/prefix`. This path should be where Tornasole hooks (TF or MXNet) save data to. -You should see the directory `events` and the file `collections.json` in this path. - -##### Creating local trial -``` -from smdebug.trials import create_trial -trial = create_trial(path='/home/ubuntu/tornasole_outputs/train', - name='resnet_training_run') -``` -##### Creating S3 trial -``` -from smdebug.trials import create_trial -trial = create_trial(path='s3://tornasole-testing-bucket/outputs/resnet', - name='resnet_training_run') -``` -###### Restricting analysis to a range of steps -To any of these methods you can optionally pass `range_steps` to restrict your analysis to a certain range of steps. -Note that if you do so, Trial will not load data from other steps. - -*Examples* -- `range_steps=(100, None)`: This will load all steps after 100 -- `range_steps=(None, 100)`: This will load all steps before 100 -- `range_steps=(100, 200)` : This will load steps between 100 and 200 -- `range_steps=None`: This will load all steps - -``` -lt = create_trial(path='ts_outputs/resnet', name='resnet_training', - range_steps=(100, 200)) -``` - - -### Mode -A machine learning job can be executing steps in multiple modes, such as training, evaluating, or predicting. 
Tornasole provides the construct of a `mode` to keep data from these modes separate
-and make analysis easier. To leverage this functionality, call the hook's
-`set_mode` function, for example `hook.set_mode(modes.TRAIN)`.
-The different modes available are `modes.TRAIN`, `modes.EVAL` and `modes.PREDICT`.
-
-When you set a mode, steps in that mode have their own sequence. We refer to these numbers
-as `mode_step`. Each `mode_step` has a global step number associated with it, which represents the
-sequence of steps across all modes executed by the job.
-
-For example, say your job executes 10 steps, of which the first 4 are training steps, the 5th is an evaluation step, steps 6-9 are training steps, and the 10th is an evaluation step.
-Please note that indexing starts from 0.
-In such a case, when you query for the global steps as below:
-```
-trial.steps()
-```
-you will see `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`.
-
-If you query for training steps as below:
-```
-from smdebug import modes
-trial.steps(modes.TRAIN)
-```
-you will see `[0, 1, 2, 3, 4, 5, 6, 7]` because there were 8 training steps.
-The training step with mode_step 4 here refers to the global step number 5.
-You can query this as follows:
-```
-trial.global_step(mode=modes.TRAIN, mode_step=4)
-```
-
-If you did not explicitly set a mode during the running of the job,
-the steps are global steps, and are in the `modes.GLOBAL` mode.
-In such a case, `global_step` is the same as `mode_step` where the mode is `modes.GLOBAL`.
-
-Below, we describe the above functions and others that the Trial API provides.
-
-#### Trial API
-Once you have a trial object you can do the following.
-
-**See names of all tensors available**
-
-```
-trial.tensors()
-```
-This returns the tensors seen for any mode if a mode was set during the machine learning job.
-
-**See all steps seen by the Trial for a particular mode**
-
-The returned list is the step number within that mode.
-Each of these mode steps has a global step number associated with it. -The global step represents the sequence of steps across all modes executed by the job. - -``` -from smdebug import modes -trial.steps(mode=modes.TRAIN) -``` - -**See all global steps seen by the Trial** - -This is the list of steps across all modes. - -``` -trial.steps() -``` - -**Get the mode and step number within mode for a given global step** - -You can get the `mode` of `global_step` 100 as follows: - -``` -mode = trial.mode(global_step=100) -``` - -You can get the `mode_step` for `global_step` 100 as follows: - -``` -mode_step = trial.mode_step(global_step=100) -``` - -**Know the global step number for a given mode step** - -``` -from smdebug import modes -global_step_num = trial.global_step(modes.TRAIN, mode_step=10) -``` - -**See all modes for which the trial has data** - -``` -trial.modes() -``` - -**Access a particular tensor** - -A tensor is identified by a string which represents its name. - -``` -trial.tensor('relu_activation:0') -``` - -**See the global steps for which tensor's value was saved** - -``` -trial.tensor('relu_activation:0').steps() -``` - -**See the steps for a given mode when tensor's value was saved** - -This returns the mode steps for those steps when this tensor's value was saved for this mode. - -``` -from smdebug import modes -trial.tensor('relu_activation:0').steps(mode=modes.TRAIN) -``` - -**Get the value of the tensor at a global step** - -This returns the tensor value as a numpy array for the 10th global step. - -``` -trial.tensor('relu_activation:0').value(10) -``` - -Please note that this can raise exceptions if the step is not available. -Please see [this section](#when-a-tensor-is-not-available-during-rule-execution) for more details on the different exceptions that can be raised. - -**Get the value of the tensor at a step number for a given mode** - -This returns the tensor value as a numpy array for the 10th training step. 
- -``` -from smdebug import modes -trial.tensor('relu_activation:0').value(10, mode=modes.TRAIN) -``` - -Please note that this can raise exceptions if the step is not available. -Please see [this section](#when-a-tensor-is-not-available-during-rule-execution) for more details on the different exceptions that can be raised. - -**Get reduction value of a tensor at a step** - -Tornasole provides a few reductions out of the box that you can query with the following API. -This below returns the mean of the absolute values at step 10. - -``` -trial.tensor('relu:0').reduction_value(10, 'mean', abs=True) -``` - -The different reductions you can query for are the same as what are allowed in [ReductionConfig](https://github.com/awslabs/tornasole_tf/blob/master/docs/api.md) when saving tensors. -This API thus allows you to access the reduction you might have saved instead of the full tensor. -If you had saved the full tensor, it will calculate the requested reduction now and cache it. - -- `min`, `max`, `mean`, `prod`, `std`, `sum`, `variance` -- `l1`, `l2` norms - -Each of these can be retrieved for the absolute value of the tensor or the original tensor. -Above was an example to get the mean of the absolute value of the tensor. -`abs` can be set to `False` if you want to see the `mean` of the actual tensor. - -*Note that if you had only saved a particular reduction, you will not be able -to access the full tensor value or any other reduction during analysis. -This also applies to the `abs` flag, meaning that if you had saved the -`mean` of `abs` values of the tensor you can not query for the non absolute values mean. -If you do so, Tornasole will return `None`.* - -If you had saved the tensor without any reduction, then you can retrieve the actual tensor -as a numpy array and compute any function you might be interested in. - -Please note that this can raise exceptions if the step is not available. 
Please see [this section](#when-a-tensor-is-not-available-during-rule-execution) for more details on the different exceptions that can be raised.
-
-**Get names of tensors matching regex**
-
-This method takes a regex pattern or a list of regex patterns.
-Each regex pattern is a Python-style regex pattern string.
-
-```
-trial.tensors_matching_regex(['relu_activation*'])
-```
-
-**List tensors in a collection**
-
-This returns the names of all tensors saved in a given collection.
-`gradients` below is the name of the collection we are interested in.
-
-```
-trial.tensors_in_collection('gradients')
-```
-
-**List collections**
-
-The call below returns all collections belonging to the trial as a dictionary.
-This dictionary is indexed by the name of the collection, and the value is the collection object.
-
-```
-trial.collections()
-```
-
-**Refresh or do not refresh tensors**
-
-By default Tornasole refreshes tensors each time you try to query a tensor.
-It checks whether this tensor has been saved for new steps and, if so, fetches them.
-If you know the saved data will not change (e.g. the machine learning job has stopped), or
-are not interested in the latest data, you can stop the refreshing of tensors as follows:
-
-`no_refresh` takes a trial or a list of trials which should not be refreshed.
-Anything executed inside the `no_refresh` block will not be refreshed.
-
-```
-from smdebug.analysis.utils import no_refresh
-with no_refresh(trials):
-    pass
-```
-
-Similarly, if you want to refresh tensors only within a block, you can do:
-
-```
-from smdebug.analysis.utils import refresh
-with refresh(trials):
-    pass
-```
-
-#### When a tensor is not available
-Tornasole is designed to be aware that tensors required to execute a rule may not be available at every step.
-Hence it raises a few exceptions which allow us to control what happens when a tensor is missing.
-These are available in the `smdebug.exceptions` module.
You can import them as follows: - -``` -from smdebug.exceptions import * -``` - -Here are the exceptions and their meanings: - -- `TensorUnavailableForStep` : This means that the tensor requested is not available for the step. This might mean that -this step might not be saved at all by the hook, or that this step might have saved some tensors but the requested -tensor is not part of them. Note that when you see this exception, it means that this tensor can never become available -for this step in the future. - -- `TensorUnavailable` : This means that this tensor is not being saved or has not been saved by smdebug. This means -that this tensor will never be seen for any step in smdebug. - -- `StepUnavailable`: This means that the step was not saved and Tornasole has no data from the step. - -- `StepNotYetAvailable`: This means that the step has not yet been seen by smdebug. It may be available in the future if the training is still going on. -Tornasole automatically loads new data as and when it becomes available. - -- `NoMoreData` : This will be raised when the training ends. Once you see this, you will know that there will be no more steps -and no more tensors saved. - -### Rules -Rules are the medium by which Tornasole executes a certain piece of code regularly on different steps of the jobs. -A rule is assigned to a trial and can be invoked at each new step of the trial. -It can also access other trials for its execution. -You can evaluate a rule using tensors from the current step or any step before the current step. -Please ensure your logic respects these semantics, else you will get a `TensorUnavailableForStep` -exception as the data would not yet be available. - -#### Writing a rule -Writing a rule involves implementing the [Rule interface](../smdebug/rules/rule.py). - - -##### Constructor -Creating a rule involves first inheriting from the base Rule class Tornasole provides. 
For this rule we do not need to look at any other trials, so we set `other_trials` to None.
-
-```
-from smdebug.rules import Rule
-
-class VanishingGradientRule(Rule):
-    def __init__(self, base_trial, threshold=0.0000001):
-        super().__init__(base_trial, other_trials=None)
-        self.threshold = threshold
-```
-
-Please note that apart from `base_trial` and `other_trials` (if required), we require all
-arguments of the rule constructor to take a string as their value. This means if you want to pass
-a list of strings, you might want to pass them as a comma-separated string. This restriction is
-enforced so as to let you create and invoke rules from JSON using Sagemaker's APIs.
-
-
-##### Function to invoke at a given step
-In this function you can implement the core logic of what you want to do with these tensors.
-
-It should return a boolean value `True` or `False`.
-This can be used to define actions that you might want to take based on the output of the rule.
-
-A simplified version of the actual invoke function for `VanishingGradientRule` is below:
-```
-    def invoke_at_step(self, step):
-        # tensors_in_collection returns tensor names; flag the rule as soon as
-        # any gradient's mean absolute value falls below the threshold.
-        for tname in self.base_trial.tensors_in_collection('gradients'):
-            abs_mean = self.base_trial.tensor(tname).reduction_value(step, 'mean', abs=True)
-            if abs_mean < self.threshold:
-                return True
-        return False
-```
-
-##### Optional: RequiredTensors
-
-This is an optional construct that allows Tornasole to bulk-fetch all tensors that you need to
-execute the rule. This helps the rule invocation be more performant so it does not fetch tensor values from S3 one by one. To use this construct, you need to implement a method which lets Tornasole know what tensors you are interested in for invocation at a given step.
-This is the `set_required_tensors` method.
-
-Before we look at how to define this method, let us look at the API for the `RequiredTensors` class which
-needs to be used by this method.
An object of this class is provided as a member of the rule class, so you can access it as `self.req_tensors`. - -**[RequiredTensors](../../smdebug/rules/req_tensors.py) API** - -***Adding a required tensor*** -When invoking a rule at a given step, you might require the values of a tensor at certain steps. -This method allows you to specify these steps as required for the tensor. -``` -self.req_tensors.add(name=tname, - steps=[step_num], - trial=None, - should_match_regex=False) -``` - -The arguments are described below: - -- `name`: name of the tensor -- `steps`: list of integers representing global step numbers at which this rule requires the values of this tensor -- `trial`: the trial whose tensor values are required. If this argument is None, it is assumed to -take the value of `self.base_trial` in the rule class. None is the default value for this argument. -- `should_match_regex`: boolean which when True means that the given name is treated as a regex pattern. -In such a case, all tensor names in the trial which match that regex pattern are treated as required -for the invocation of the rule at the given step. - -***Fetching required tensors*** - -If required tensors were added inside `set_required_tensors`, during rule invocation it is -automatically used to fetch all tensors at once by calling `req_tensors.fetch()`. -It can raise the exceptions `TensorUnavailable` and `TensorUnavailableForStep` if the trial does not have that tensor, or if the tensor value is not available for the requested step. - - -If required tensors were added elsewhere, or later, you can call the `req_tensors.fetch()` method -yourself to fetch all tensors at once. - -***Querying required tensors*** - -You can then query the required tensors -*Get names of required tensors* - -This method returns the names of the required tensors for a given trial. -``` -self.req_tensors.get_names(trial=None) -``` -- `trial`: the trial whose required tensors are being queried. 
If this argument is None, it is assumed to
take the value of `self.base_trial` in the rule class. None is the default value for this argument.

*Get steps for a given required tensor*

This method returns the steps for which the tensor is required to execute the rule at this step.
```
self.req_tensors.get_tensor_steps(name, trial=None)
```
- `trial`: the trial whose required tensors are being queried. If this argument is None, it is assumed to
take the value of `self.base_trial` in the rule class. None is the default value for this argument.


*Get required tensors*

This method returns the list of required tensors for a given trial as `Tensor` objects.
```
self.req_tensors.get(trial=None)
```
- `trial`: the trial whose required tensors are being queried. If this argument is None, it is assumed to
take the value of `self.base_trial` in the rule class. None is the default value for this argument.


###### Declare required tensors
Here, let us define the `set_required_tensors` method to declare the tensors required
to execute the rule at a given `step`.
If we require the gradients of the base_trial to execute the rule at a given step,
it would look as follows:
```
    def set_required_tensors(self, step):
        for tname in self.base_trial.tensors_in_collection('gradients'):
            self.req_tensors.add(tname, steps=[step])
```

This function is used by the rule execution engine to fetch all the
required tensors before it executes the rule.
The rule invoker executes the `set_required_tensors` and `invoke_at_step`
methods within a single `no_refresh` block, so you are guaranteed that the
tensor values and step numbers will stay the same across multiple calls.

#### Executing a rule
Now that you have written a rule, here's how you can execute it. We provide a function to invoke rules easily.
Refer to [rule_invoker.py](rules_package/rule_invoker.py).
The invoke function has the following syntax.
It takes an instance of a Rule and invokes it for a series of steps one after the other.

```
invoke(rule_obj, start_step=0, end_step=None)
```

For the first party Rules (see below) that we provide, a rule_invoker module can be used to run them as follows:

```
python -m smdebug.rules.rule_invoker --trial-dir ~/ts_outputs/vanishing_gradients --rule-name VanishingGradient
```

You can pass any arguments that the rule takes as command line arguments, like below:

```
python -m smdebug.rules.rule_invoker --trial-dir s3://tornasole-runes/trial0 --rule-name UnchangedTensor --tensor_regex .* --num_steps 10
```

When running a SageMaker job, SageMaker will execute the rule for you. Refer to the SageMaker notebook example for more on how this is done.

#### First party rules
We provide a few rules which we built. These are general purpose rules that you can use easily.
We also hope these serve as examples for you to build your own rules. They are described in [FirstPartyRules](FirstPartyRules.md).

## ContactUs
We would like to hear from you. If you have any question or feedback,
please reach out to us at tornasole-users@amazon.com.

## License
This library is licensed under the Apache 2.0 License.

diff --git a/sagemaker-docs/DeveloperGuide_TF.md b/sagemaker-docs/DeveloperGuide_TF.md
deleted file mode 100644
index cf89e978b..000000000
--- a/sagemaker-docs/DeveloperGuide_TF.md
+++ /dev/null
@@ -1,424 +0,0 @@

# Tornasole for TensorFlow
Tornasole is an upcoming AWS service designed to be a debugger
for machine learning models. It lets you go beyond just looking
at scalars like losses and accuracies during training and
gives you full visibility into all tensors 'flowing through the graph'
during training or inference.

Using Tornasole is a two step process:

**Saving tensors**
This needs the `tornasole` package built for the appropriate framework.
This package lets you collect the tensors you want at the frequency -that you want, and save them for analysis. Sagemaker containers provided to you already have this package installed. - -**Analysis** -Please refer to [this page](../../rules/DeveloperGuide_Rules.md) for more details about how to run rules and other analysis -on tensors collection from the job. The analysis of these tensors can be done on a separate machine -in parallel with the training job. - - -## Quickstart -If you want to quickly run an end to end example in Sagemaker, -you can jump to the notebook [examples/notebooks/tensorflow.ipynb](examples/notebooks/tensorflow.ipynb). - -Integrating Tornasole into your job is as easy as adding the following lines of code: - -### Session based training -We need to add Tornasole Hook and use it to create a monitored session for the job. -First, we need to import `smdebug.tensorflow`. -``` -import smdebug.tensorflow as smd -``` -Then create the SessionHook by specifying what you want -to save, when you want to save them and -where you want to save them. Note that for Sagemaker, -you always need to specify the out_dir as `/opt/ml/output/tensors`. In the future, -we will make this the default in Sagemaker environments. -``` -hook = smd.SessionHook(out_dir='/opt/ml/output/tensors', - include_collections=['weights','gradients'], - save_config=smd.SaveConfig(save_interval=2)) -``` - -Set the mode you are running the job in. This helps you group steps by mode, -for easier analysis. -If you do not specify this, it saves steps under a `GLOBAL` mode. -``` -hook.set_mode(smd.modes.TRAIN) -``` - -Wrap your optimizer with wrap_optimizer so that -Tornasole can identify your gradients and automatically -provide these tensors as part of the `gradients` collection. -Use this new optimizer to minimize the loss. -``` -optimizer = hook.wrap_optimizer(optimizer) -``` - -Create a monitored session with the above hook, and use this for executing your TensorFlow job. 
-``` -sess = tf.train.MonitoredSession(hooks=[hook]) -``` - -### Estimator based training -We need to create SessionHook and provide it to the estimator's train, predict or evaluate methods. -First, we need to import `smdebug.tensorflow`. -``` -import smdebug.tensorflow as smd -``` -Then create the SessionHook by specifying what you want -to save, when you want to save them and -where you want to save them. Note that for Sagemaker, -you always need to specify the out_dir as `/opt/ml/output/tensors`. In the future, -we will make this the default in Sagemaker environments. -``` -hook = smd.SessionHook(out_dir='/opt/ml/output/tensors', - include_collections = ['weights','gradients'], - save_config = smd.SaveConfig(save_interval=2)) -``` -Set the mode you are running the job in. This helps you group steps by mode, for easier -analysis. -If you do not specify this, it saves steps under a `GLOBAL` mode. -``` -hook.set_mode(smd.modes.TRAIN) -``` -Wrap your optimizer with wrap_optimizer so that -Tornasole can identify your gradients and automatically -provide these tensors as part of the `gradients` collection. -Use this new optimizer to minimize the loss. -``` -opt = hook.wrap_optimizer(opt) -``` -Now pass this hook to the estimator object's train, predict or evaluate methods, whichever ones you want to monitor. -``` -classifier = tf.estimator.Estimator(...) - -classifier.train(input_fn, hooks=[hook]) -classifier.predict(input_fn, hooks=[hook]) -classifier.evaluate(input_fn, hooks=[hook]) -``` -Refer [TF Estimator](https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator) for information on the train, predict, evaluate functions. - -#### Note -**Keras** support is Work in Progress. Please stay tuned! -We will also support **Eager** mode in the future. - - -## Tornasole TensorFlow Concepts -In this section we briefly introduce the main constructs of the Tornasole TF API and some parameters important for their construction. 
Please refer to [this document](api.md) for a description of all the functions and parameters that our APIs support.

#### Hook
SessionHook is the entry point for Tornasole into your program.
It is a subclass of `tf.train.SessionRunHook` and can be used wherever that is suitable,
such as with MonitoredSession and Estimator's train/predict/evaluate methods.
Some key parameters to consider when creating the SessionHook are the following:
- `out_dir`: This represents the path to which the outputs of Tornasole will be written. Note that for SageMaker, you always need to specify the out_dir as `/opt/ml/output/tensors`. In the future, we will make this the default in SageMaker environments.
- `save_config`: The hook takes a SaveConfig object which controls when tensors are saved.
It defaults to a SaveConfig which saves every 100 steps.
- `include_regex`: This represents the regex patterns of names of tensors to save.
- `include_collections`: This represents the collections to be saved.


It also has an important method which can be used to set the appropriate mode.
Modes can refer to 'training', 'evaluation' or 'prediction'. They can be set as follows:
```hook.set_mode(smd.modes.TRAIN)```, ```hook.set_mode(smd.modes.EVAL)``` or ```hook.set_mode(smd.modes.PREDICT)```.
This allows you to group steps by mode, which allows for clearer analysis. Tornasole
also gives you a global ordering of steps, which makes it clear after how many training
steps a particular evaluation step happened. If you do not set this mode, all steps are saved under
the `GLOBAL` mode.
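The bookkeeping between per-mode steps and global steps can be sketched in plain Python. This is an illustrative model only, not smdebug's implementation; `StepCounter` is a hypothetical name:

```
# Illustrative sketch: how per-mode step numbers relate to the global
# step sequence when training and evaluation steps interleave.
from collections import defaultdict

class StepCounter:
    def __init__(self):
        self.global_step = 0
        self.mode_steps = defaultdict(int)  # next step number per mode
        self.log = []  # (mode, mode_step, global_step) per executed step

    def step(self, mode):
        self.log.append((mode, self.mode_steps[mode], self.global_step))
        self.mode_steps[mode] += 1
        self.global_step += 1

counter = StepCounter()
# 4 training steps, 1 eval step, 4 more training steps, 1 more eval step
for mode in ['TRAIN'] * 4 + ['EVAL'] + ['TRAIN'] * 4 + ['EVAL']:
    counter.step(mode)

# The first eval step is mode_step 0 but global step 4;
# the second eval step is mode_step 1 but global step 9.
assert ('EVAL', 0, 4) in counter.log
assert ('EVAL', 1, 9) in counter.log
```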
- -**Examples** -- Save weights and gradients every 100 steps to an S3 location -``` -import smdebug.tensorflow as smd -smd.SessionHook(out_dir='/opt/ml/output/tensors', - save_config=smd.SaveConfig(save_interval=100), - include_collections=['weights', 'gradients']) -``` - -- Save custom tensors by regex pattern to a local path -``` -import smdebug.tensorflow as smd -smd.SessionHook(out_dir='/opt/ml/output/tensors', - include_regex=['loss*']) -``` -Refer [API](api.md) for all parameters available and their detailed descriptions. - -#### Mode -A machine learning job can be executing steps in multiple modes, such as training, evaluating, or predicting. -Tornasole provides you the construct of a `mode` to keep data from these modes separate -and make it easy for analysis. To leverage this functionality you have to -call the `set_mode` function of hook such as the following call `hook.set_mode(modes.TRAIN)`. -The different modes available are `modes.TRAIN`, `modes.EVAL` and `modes.PREDICT`. - -If the mode was not set, all steps will be available together. - -You can choose to have different save configurations (SaveConfigMode) -for different modes. You can configure this by passing a -dictionary from mode to SaveConfigMode object. -The hook's `save_config` parameter accepts such a dictionary, as well as collection's `save_config` property. -``` -from smdebug.tensorflow import SessionHook, modes, SaveConfigMode -scm = {modes.TRAIN: SaveConfigMode(save_interval=100), - modes.EVAL: SaveConfigMode(save_interval=10)} - -hook = SessionHook(..., - save_config=scm, - ...) -``` - -``` -from smdebug.tensorflow import modes, SaveConfigMode -get_collection('weights').save_config = {modes.TRAIN: SaveConfigMode(save_interval=10), modes.EVAL: SaveConfigMode(save_interval=1000)} -``` - -#### Collection -Collection object helps group tensors for easier handling of tensors being saved. 
-A collection has its own list of tensors, include/exclude regex patterns, reduction config and save config. -This allows setting of different save and reduction configs for different tensors. -These collections are then also available during analysis with `tornasole_rules`. -- Creating or accessing a collection: The following method allows you to access a collection. -It also creates the collection if it does not exist. Here `biases` is the name of the collection. -``` -import smdebug.tensorflow as smd -smd.get_collection('biases') -``` -- Adding to a collection -``` -import smdebug.tensorflow as smd -smd.add_to_collection('inputs', features) -``` - -- Passing regex pattern to collection -``` -import smdebug.tensorflow as smd -smd.get_collection(collection_name).include(['loss*']) -``` -Refer [API](api.md) for all methods available when using collections such as setting SaveConfig, -ReductionConfig for a specific collection, retrieving all collections, or resetting all collections. - -#### SaveConfig -SaveConfig class allows you to customize the frequency of saving tensors. -The hook takes a SaveConfig object which is applied as -default to all tensors included. -A collection can also have its own SaveConfig object which is applied -to the tensors belonging to that collection. - -SaveConfig also allows you to save tensors when certain tensors become nan. -This list of tensors to watch for is taken as a list of strings representing names of tensors. - -The parameters taken by SaveConfig are: - -- `save_interval`: This allows you to save tensors every `n` steps; when `step_num % save_interval == 0`. -- `start_step`: The step at which to start saving (inclusive), defaults to 0. -- `end_step`: The step at which to stop saving (exclusive), default to None/Infinity. -- `save_steps`: Allows you to pass a list of step numbers at which tensors should be saved; overrides `save_interval`, `start_step`, and `end_step`. 
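The interaction of these parameters can be sketched as follows. This is an illustrative model of the semantics described above, not the library's actual code; `SaveConfigSketch` and `should_save_step` are hypothetical names:

```
# Illustrative sketch of SaveConfig semantics: save_steps overrides
# save_interval, start_step and end_step when it is provided.
class SaveConfigSketch:
    def __init__(self, save_interval=100, start_step=0, end_step=None,
                 save_steps=None):
        self.save_interval = save_interval
        self.start_step = start_step  # inclusive
        self.end_step = end_step      # exclusive; None means infinity
        self.save_steps = save_steps  # explicit step list; overrides the rest

    def should_save_step(self, step_num):
        if self.save_steps is not None:
            return step_num in self.save_steps
        if step_num < self.start_step:
            return False
        if self.end_step is not None and step_num >= self.end_step:
            return False
        return step_num % self.save_interval == 0

cfg = SaveConfigSketch(start_step=1000, save_interval=10)
assert not cfg.should_save_step(990)   # before start_step
assert cfg.should_save_step(1000)      # on the interval, within range
assert not cfg.should_save_step(1005)  # not on the interval
```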
**Examples**
- ```SaveConfig(save_interval=10)``` Saves every 10 steps

- ```SaveConfig(start_step=1000, save_interval=10)``` Saves every 10 steps from the 1000th step

- ```SaveConfig(save_steps=[10, 500, 10000, 20000])``` Saves only at the supplied steps

These save config instances can be passed to the hook as follows:
```
import smdebug.tensorflow as smd
hook = smd.SessionHook(..., save_config=smd.SaveConfig(save_interval=10), ...)
```
Refer to [API](api.md) for all parameters available and their detailed descriptions.

#### ReductionConfig
ReductionConfig allows the saving of certain reductions of tensors instead
of saving the full tensor. By reduction here we mean an operation that converts the tensor to a scalar.
The motivation is to reduce the amount of data
saved, and to increase the speed in cases where you don't need the full tensor.
The reduction operations are computed during the training process and then saved.
During analysis, these are available as reductions of the original tensor.
**Please note that using reduction config means that you will not have
the full tensor available during analysis, so this can restrict what you can do with the saved tensor.**
The hook takes a ReductionConfig object which is applied as default to all tensors included.
A collection can also have its own ReductionConfig object, which is applied
to the tensors belonging to that collection.

**Examples**
- ```ReductionConfig(abs_reductions=['min','max','mean'])``` Saves min, max, mean of the absolute values of the tensors included

- ```ReductionConfig(reductions=['min','max','mean'])``` Saves min, max, mean of the tensors included

- ```ReductionConfig(norms=['l1'])``` Saves the l1 norm of the tensors included

These reduction config instances can be passed to the hook as follows:
```
import smdebug.tensorflow as smd
hook = smd.SessionHook(..., reduction_config=smd.ReductionConfig(norms=['l1']), ...)
-``` -Refer [API](api.md) for a full list of the reductions available. - -## How to save tensors -There are different ways to save tensors when using smdebug. -Tornasole provides easy ways to save certain standard tensors by way of default collections (a Collection represents a group of tensors). -Examples of such collections are `weights`, `gradients`, `optimizer variables`. -Besides these tensors, you can save tensors by name or regex patterns on those names. -You can also save them by letting Tornasole know which variables in your code are to be saved. -This section will take you through these ways in more detail. - -### Default collections -Collection object helps group tensors for easier handling of tensors being saved. -These collections are then also available during analysis. - -Tornasole creates a few default collections and populates -them with the relevant tensors. - -#### Weights -Weights is a default collection managed by smdebug. -Saving weights is as easy as passing `weights` in the `include_collections` parameter of the hook. -``` -import smdebug.tensorflow as smd -hook = smd.SessionHook(..., include_collections = ['weights'], ...) -``` - -#### Gradients -We provide an easy way to populate the collection named `gradients` with the gradients wrt to the weights. -This can be done by wrapping around your optimizer with `wrap_optimizer` as follows. -This will also enable us to access the gradients during analysis without having to identify which tensors out of the saved ones are the gradients. - -``` -import smdebug.tensorflow as smd -... -opt = hook.wrap_optimizer(opt) -``` - -You can refer to [customize collections](#customizing-collections) for -information on how you can create the gradients collection manually. - -Then, you need to pass `gradients` in the `include_collections` parameter of the hook. -``` -import smdebug.tensorflow as smd -hook = smd.SessionHook(..., include_collections = ['gradients'], ...) 
-``` - -#### Losses -If you are using the default loss functions in Tensorflow, Tornasole can automatically pick up these losses from Tensorflow's losses collection. -In such a case, we only need to specify 'losses' in the `include_collections` argument of the hook. -If you do not pass this argument to the hook, it will save losses by default. -If you are using your custom loss function, you can either add this to Tensorflow's losses collection or Tornasole's losses collection as follows: -``` -import smdebug.tensorflow as smd - -# if your loss function is not a default TF loss function, -# but is a custom loss function -# then add to the collection losses -loss = ... -smd.add_to_collection('losses', loss) - -# specify losses in include_collections -# Note that this is included by default -hook = smd.SessionHook(..., include_collections = ['losses'..], ...) -``` - -#### Optimizer Variables -Optimizer variables such as momentum can also be saved easily with the -above approach of wrapping your optimizer with `wrap_optimizer` -followed by passing `optimizer_variables` in the `include_collections` parameter of the hook. -``` -import smdebug.tensorflow as smd -hook = smd.SessionHook(..., include_collections = ['optimizer_variables'], ...) -``` - -Please refer [API](api.md) for more details on using collections - -### Customizing collections -You can also create any other customized collection yourself. -You can create new collections as well as modify existing collections -(such as including gradients if you do not want to use the above `wrap_optimizer`) -#### Creating or accessing a collection -Each collection should have a unique name (which is a string). -You can get the collection named as `collection_name` by -calling the following function. -It creates the collection if it does not already exist. 
-``` -smd.get_collection('collection_name') -``` -#### Adding tensors -Tensors can be added to a collection by either passing an include regex parameter to the collection. -If you don't know the name of the tensors you want to add, you can also add the tensors to the collection -by the variables representing the tensors in code. The following sections describe these two scenarios. - -##### Adding tensors by regex -If you know the name of the tensors you want to save and can write regex -patterns to match those tensornames, you can pass the regex patterns to the collection. -The tensors which match these patterns are included and added to the collection. -``` -smd.get_collection('default').include(['foobar/weight*']) -``` - -**Quick note about names**: TensorFlow layers or operations take a name parameter which along with the name scope -of the layer or variable defines the full name of the operation. -For example, refer [`examples/simple/simple.py`](examples/scripts/simple.py#L20), -the weight there is named `foobar/weight1:0`. Here `foobar/weight1` refers to the -node representing operation in the graph, and the suffix `:0` indicates that this is the 0th output of the node. -To make clear the meaning of a given tensor, it helps to organize your code by name scopes and -set the names of different operations you might be interested in. - -##### Adding tensors from variables in the code -If you do not know the names of the tensors you are interested in, you can also just pass the variables to smdebug. -Collection has an add method which takes either a TensorFlow Operation, Variable, or Tensor. - -For example, say you want to log the activations of relu layers in your model. You can save them as follows to a -collection named 'relu_activations'. All the tensors represented by this variable (there could be multiple if this line is a loop for instance) -are saved to this collection. 
```
x = tf.nn.relu(x)

smd.add_to_collection('relu_activations', x)
```

### Regex pattern
A quick way to save tensors, when you know the names of the tensors you want to save and
can write a regex pattern to match those tensor names, is to pass the regex patterns to the hook.
You can use this approach if you just want to save a small number of tensors and do not care about collections.
The tensors which match these patterns are included and added to the collection named `default`.

```
hook = smd.SessionHook(...,
                       include_regex=['foobar/weight*'],
                       ...)
```

**Note** The above does the same as the regex approach described in Customizing collections.

### Saving all tensors
Tornasole makes it easy to save all the tensors in the model. You just need to set the flag `save_all=True` when creating the hook.
**Please note that this can severely reduce the performance of the job and will generate a lot of data.**

## Analyzing the Results
For full details on how to analyze the saved tensors, go to [DeveloperGuide_Rules](../../rules/DeveloperGuide_Rules.md).

## FAQ
#### Logging
You can control the logging from Tornasole by setting the appropriate
level for the python logger `tornasole` using either of the following approaches.

**In Python code**
```
import logging
logging.getLogger('tornasole').setLevel(logging.INFO)
```

**Using environment variable**
You can also set the environment variable `SMDEBUG_LOG_LEVEL` as below:

```
export SMDEBUG_LOG_LEVEL=INFO
```
Log levels available are 'INFO', 'DEBUG', 'WARNING', 'ERROR', 'CRITICAL', 'OFF'.


## ContactUs
We would like to hear from you. If you have any question or feedback, please reach out to us at tornasole-users@amazon.com.

## License
This library is licensed under the Apache 2.0 License.
diff --git a/sagemaker-docs/DeveloperGuide_XGBoost.md b/sagemaker-docs/DeveloperGuide_XGBoost.md
deleted file mode 100644
index 8d92416fb..000000000
--- a/sagemaker-docs/DeveloperGuide_XGBoost.md
+++ /dev/null
@@ -1,192 +0,0 @@

# Tornasole for XGBoost

Tornasole is a new capability of Amazon SageMaker that allows debugging machine learning training. Tornasole helps you monitor your training in near real time using rules, and alerts you once it has detected an inconsistency in training.

## Quickstart

If you want to quickly run an end-to-end example, please refer to the [XGBoost notebook example](examples/notebooks/xgboost.ipynb) to see Tornasole working.

Integrating Tornasole into the training job can be accomplished by following the steps below.

### Import the Tornasole package

Import the SessionHook class along with other helper classes in your training script as shown below:

```
from smdebug.xgboost import SessionHook
from smdebug import SaveConfig
```

### Instantiate and initialize hook

```
# Create a SaveConfig that instructs the engine to log graph tensors every 10 steps.
save_config = SaveConfig(save_interval=10)
# Create a hook that logs evaluation metrics and feature importances while training the model.
hook = SessionHook(save_config=save_config)
```

Using the *Collection* object and/or the *include\_regex* parameter of SessionHook, users can control which tensors will be stored by the SessionHook.
The section [How to save tensors](#how-to-save-tensors) explains the various ways users can create a *Collection* object to store the required tensors.

The *SaveConfig* object controls when these tensors are stored. The tensors can be stored for specific steps or after a certain interval of steps. If the *save\_config* parameter is not specified, the SessionHook will store tensors every 100 steps.
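The way a hook participates in training can be sketched as follows. This is a hypothetical, simplified stand-in for the real callback mechanism; `HookSketch` is not part of smdebug:

```
# Illustrative sketch: a hook object is called once per boosting
# iteration and records metric values only on save steps.
class HookSketch:
    def __init__(self, save_interval=10):
        self.save_interval = save_interval
        self.saved_steps = []

    def __call__(self, iteration, metrics):
        # Called by the training loop after each iteration.
        if iteration % self.save_interval == 0:
            self.saved_steps.append((iteration, dict(metrics)))

hook = HookSketch(save_interval=10)
for i in range(25):  # stand-in for xgboost's per-iteration callback calls
    hook(i, {'validation-auc': 0.5 + i * 0.01})

# Only iterations 0, 10 and 20 were recorded
assert [s for s, _ in hook.saved_steps] == [0, 10, 20]
```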
For additional details on SessionHook, SaveConfig and Collection, please refer to the [API documentation](api.md).

### Register the Tornasole hook with the model before starting the training.

Users can use the hook as a callback function when training a booster.

```
xgboost.train(params, dtrain, callbacks=[hook])
```

## API

Please refer to [this document](api.md) for a description of all the functions and parameters that our APIs support.

#### Hook

SessionHook is the entry point for Tornasole into your program.
Some key parameters to consider when creating the SessionHook are the following:

- `out_dir`: This represents the path under which the outputs of Tornasole will be written, in a directory with the name `out_dir`. Note that in a SageMaker environment the out_dir will be ignored and always default to `/opt/ml/output/tensors`.
- `save_config`: This is an object of [SaveConfig](#saveconfig). The SaveConfig allows the user to specify when the tensors are to be stored: at specific steps or at an interval of steps. If not specified, it defaults to a SaveConfig which saves every 100 steps.
- `include_collections`: This represents the [collections](#collection) to be saved. With this parameter, the user can control which tensors are to be saved.
- `include_regex`: This represents the regex patterns of names of tensors to save. With this parameter, the user can control which tensors are to be saved.

**Examples**

- Save evaluation metrics and feature importances every 10 steps to an S3 location:

```
import smdebug.xgboost as tx
tx.SessionHook(save_config=SaveConfig(save_interval=10),
               include_collections=['metrics', 'feature_importance'])
```

- Save custom tensors by regex pattern to a local path:

```
import smdebug.xgboost as tx
tx.SessionHook(include_regex=['validation*'])
```

Refer to [API](api.md) for all parameters available and detailed descriptions.
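How *include_regex* narrows down the saved tensors can be illustrated with a small sketch. The helper `select_tensors` is hypothetical, not part of the smdebug API, and the actual matching details may differ:

```
# Illustrative sketch: keep only the tensor names that match any of the
# given regex patterns (matched against the start of the name).
import re

def select_tensors(tensor_names, include_regex):
    """Return the tensor names matching any of the given regex patterns."""
    return [name for name in tensor_names
            if any(re.match(pattern, name) for pattern in include_regex)]

names = ['train-rmse', 'validation-auc', 'validation-rmse',
         'feature_importance/f0']
assert select_tensors(names, ['validation*']) == ['validation-auc',
                                                  'validation-rmse']
```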
- -#### Collection - -Collection object helps group tensors for easier handling of tensors being saved. -A collection has its own list of tensors, include regex patterns, and [save config](#saveconfig). -This allows setting of different save configs for different tensors. -These collections are then also available during analysis. -Tornasole will save the value of tensors in collection, if the collection is included in `include_collections` param of the [hook](#hook). - -Refer to [API](api.md) for all methods available when using collections such -as setting SaveConfig for a specific collection or retrieving all collections. - -Please refer to [creating a collection](#creating-a-collection) to get overview of how to -create collection and adding tensors to collection. - -#### SaveConfig - -SaveConfig class allows you to customize the frequency of saving tensors. -The hook takes a SaveConfig object which is applied as -default to all tensors included. -A collection can also have its own SaveConfig object which is applied -to the tensors belonging to that collection. - -SaveConfig also allows you to save tensors when certain tensors become nan. -This list of tensors to watch for is taken as a list of strings representing names of tensors. - -The parameters taken by SaveConfig are: - -- `save_interval`: This allows you to save tensors every `n` steps; when `step_num % save_interval == 0`. -- `start_step`: The step at which to start saving (inclusive), defaults to 0. -- `end_step`: The step at which to stop saving (exclusive), default to None/Infinity. -- `save_steps`: Allows you to pass a list of step numbers at which tensors should be saved; overrides `save_interval`, `start_step`, and `end_step`. - -Refer to [API](api.md) for all parameters available and detailed descriptions for them, as well as example SaveConfig objects. - -#### ReductionConfig - -ReductionConfig is not currently used in XGBoost smdebug. 
When Tornasole is used with deep learning frameworks, such as MXNet,
TensorFlow, or PyTorch, ReductionConfig allows the saving of certain
reductions of tensors instead of saving the full tensor.
By reduction here we mean an operation that converts the tensor to a scalar.
However, in XGBoost, we currently support evaluation metrics, feature
importances, and average SHAP values, which are all scalars and not tensors.
Therefore, if the `reduction_config` parameter is set in
`smdebug.xgboost.SessionHook`, it will be ignored and not used at all.

### How to save tensors

There are different ways to save tensors when using smdebug.
Tornasole provides easy ways to save certain standard tensors by way of default
collections (a Collection represents a group of tensors).
Examples of such collections are 'metrics', 'feature\_importance',
'average\_shap', and 'default'.
Besides the tensors in the above default collections, you can save tensors by name or regex patterns on those names.
This section will take you through these ways in more detail.

#### Saving the tensors with *include\_regex*
The SessionHook API supports an *include\_regex* parameter, with which users can specify a regex pattern. The SessionHook will store the tensors that match the specified regex pattern. With this approach, users can store tensors without explicitly creating a Collection object. The specified regex pattern will be associated with the 'default' Collection and the SaveConfig object that is associated with the 'default' collection.

#### Default Collections
Currently, the XGBoost SessionHook creates Collection objects for
'metrics', 'feature\_importance', 'average\_shap', and 'default'. These
collections contain the regex patterns that match
evaluation metrics, feature importances, and SHAP values. The regex pattern for
the 'default' collection is set when the user specifies *include\_regex* with
the SessionHook or sets *save_all=True*.
These collections use the SaveConfig
parameter provided with the SessionHook initialization. The SessionHook
will store the related tensors if the user does not specify any special collection
with the *include\_collections* parameter. If the user specifies a collection with
*include\_collections*, the above default collections will not be in effect.
Please refer to [this document](api.md) for a description of all the default
collections.

#### Custom Collections

You can also create any other customized collection yourself.
You can create new collections as well as modify existing collections.

##### Creating a collection

Each collection should have a unique name (which is a string). You can create
collections by invoking helper methods as described in the [API](api.md) documentation.

```
from smdebug.xgboost import get_collection
get_collection('metrics').include(['validation-auc'])
```

##### Adding tensors

Tensors can be added to a collection by passing an include regex parameter to the collection.
If you don't know the names of the tensors you want to add, you can also add the tensors to the collection
via the variables representing those tensors in code. The following sections describe these two scenarios.

##### Adding tensors by regex
If you know the names of the tensors you want to save and can write regex
patterns to match those tensor names, you can pass the regex patterns to the collection.
The tensors which match these patterns are included and added to the collection.

```
from smdebug.xgboost import get_collection
get_collection('metrics').include(["train*", ".*-auc"])
```

#### Saving All Tensors
Tornasole makes it easy to save all the tensors in the model. You just need to set the flag `save_all=True` when creating the hook. This creates a collection named 'all' and saves all the tensors under that collection.
**NOTE: Storing all the tensors will slow down training and will increase storage consumption.**


## ContactUs
We would like to hear from you. If you have any question or feedback, please reach out to us at tornasole-users@amazon.com.

## License
This library is licensed under the Apache 2.0 License.