Commit b71fe99

Merge latest changes from smdebug to smprofiler (#68)
1 parent ed56c1d

38 files changed: +1070 −339 lines

README.md

Lines changed: 6 additions & 4 deletions
```diff
@@ -63,21 +63,23 @@ The following frameworks are available AWS Deep Learning Containers with the dee
 
 | Framework | Version |
 | --- | --- |
-| [TensorFlow](docs/tensorflow.md) | 1.15, 2.1, 2.2 |
+| [TensorFlow](docs/tensorflow.md) | 1.15, 2.1.0, 2.2.0, 2.3.0 |
 | [MXNet](docs/mxnet.md) | 1.6 |
-| [PyTorch](docs/pytorch.md) | 1.4, 1.5 |
+| [PyTorch](docs/pytorch.md) | 1.4, 1.5, 1.6 |
 | [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 ([As a built-in algorithm](docs/xgboost.md#use-xgboost-as-a-built-in-algorithm))|
 
+**Note**: Debugger with zero script change is partially available for TensorFlow v2.1.0 and v2.3.0. The `inputs`, `outputs`, `gradients`, and `layers` built-in collections are currently not available for these TensorFlow versions.
+
 ### AWS training containers with script mode
 
 The `smdebug` library supports frameworks other than the ones listed above while using AWS containers with script mode. If you want to use SageMaker Debugger with one of the following framework versions, you need to make minimal changes to your training script.
 
 | Framework | Versions |
 | --- | --- |
-| [TensorFlow](docs/tensorflow.md) | 1.13, 1.14, 1.15, 2.1, 2.2 |
+| [TensorFlow](docs/tensorflow.md) | 1.13, 1.14, 1.15, 2.1.0, 2.2.0, 2.3.0 |
 | Keras (with TensorFlow backend) | 2.3 |
 | [MXNet](docs/mxnet.md) | 1.4, 1.5, 1.6 |
-| [PyTorch](docs/pytorch.md) | 1.2, 1.3, 1.4, 1.5 |
+| [PyTorch](docs/pytorch.md) | 1.2, 1.3, 1.4, 1.5, 1.6 |
 | [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 (As a framework)|
 
 ### Debugger on custom containers or local machines
```
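For the script-mode rows above, the "minimal changes" typically amount to creating a hook and registering the model. A minimal sketch for PyTorch; the output path and save interval are illustrative placeholders, not required values:

```python
import torch.nn as nn
import smdebug.pytorch as smd

# Stand-in model and loss, purely for illustration.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()

# Hook that writes tensors to a local path (placeholder); on SageMaker the
# hook is usually built from the job's JSON config via
# smd.Hook.create_from_json_file() instead.
hook = smd.Hook(out_dir="/tmp/smdebug_demo",
                save_config=smd.SaveConfig(save_interval=100))
hook.register_module(model)    # save tensors flowing through the module
hook.register_loss(criterion)  # save loss values as well
```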

docs/analysis.md

Lines changed: 35 additions & 5 deletions
```diff
@@ -30,8 +30,10 @@ This page describes the programming model that SageMaker Debugger provides for y
 * [steps](#steps-1)
 * [value](#value)
 * [reduction_value](#reduction_value)
-* [reduction_values](#reduction_values)
+* [shape](#shape)
 * [values](#values)
+* [reduction_values](#reduction_values)
+* [shapes](#shapes)
 * [workers](#workers-1)
 * [prev_steps](#prev_steps)
 * [Rules](#Rules)
```

````diff
@@ -356,6 +358,34 @@ trial.tensor(name).reduction_value(step_num, reduction_name,
 ###### Returns
 `numpy.ndarray` The reduction value of the tensor at the given step and worker (if the training job saved data from multiple workers) as a 1x1 numpy array. If this reduction was saved for the tensor during training through the reduction config, it is loaded and returned. If the given reduction was not saved but the full tensor was, the reduction is computed on the fly and returned. If neither the chosen reduction nor the full tensor is available, this method raises a `TensorUnavailableForStep` exception.
 
+#### shape
+Get the shape of the chosen tensor at a particular step.
+
+```python
+trial.tensor(name).shape(step_num, mode=modes.GLOBAL, worker=None)
+```
+
+###### Arguments
+- `step_num (int)` The step number whose shape is to be returned for the mode passed through the next parameter.
+- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`.
+- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the shape of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method, and the workers which saved the tensor at a specific step with `trial.tensor(name).workers(step, mode)`.
+
+###### Returns
+`tuple(int)` If only the shape of this tensor was saved, through the `save_shape` setting in ReductionConfig, it is returned directly. If the full tensor was saved, the shape is computed from it and returned. If neither the shape nor the full tensor is available, this method raises a `TensorUnavailableForStep` exception.
+
+#### values
+Get the values of the tensor for all steps of a given mode.
+
+```python
+trial.tensor(name).values(mode=modes.GLOBAL, worker=None)
+```
+
+###### Arguments
+- `mode (smdebug.modes enum value)` The mode whose steps to return values for. Defaults to `modes.GLOBAL`.
+- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the values of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method, and the workers which saved the tensor at a specific step with `trial.tensor(name).workers(step, mode)`.
+
+###### Returns
+`dict[int -> numpy.ndarray]` A dictionary with step numbers as keys and numpy arrays representing the value of the tensor as values.
 
 #### reduction_values
 Get all reduction values saved for the chosen tensor at a particular step. A reduction value is a tensor reduced to a single value through reduction or aggregation operations. See the description of the method `reduction_value` for more details.
````
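The new `shape` and `values` accessors slot into the usual trial workflow. A minimal sketch, assuming a finished run saved tensors to the illustrative path below:

```python
from smdebug.trials import create_trial

# Placeholder path: wherever the training job wrote its tensors.
trial = create_trial("/tmp/smdebug_demo")

name = trial.tensor_names()[0]               # pick any saved tensor
print(trial.tensor(name).shape(step_num=0))  # e.g. (64, 10) at step 0
print(trial.tensor(name).values())           # dict: step -> numpy.ndarray
```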
````diff
@@ -372,19 +402,19 @@ trial.tensor(name).reduction_values(step_num, mode=modes.GLOBAL, worker=None)
 ###### Returns
 `dict[(str, bool) -> numpy.ndarray]` A dictionary whose keys are tuples of the form `(reduction_name, abs)` and whose values are 1x1 numpy ndarrays. `abs` is a boolean that denotes whether the reduction was performed on the absolute value of the tensor. Note that this method only returns the reductions which were saved from the training job. It does not compute all known reductions and return them if only the raw tensor was saved.
 
-#### values
-Get the values of the tensor for all steps of a given mode.
+#### shapes
+Get the shapes of the tensor for all steps of a given mode.
 
 ```python
-trial.tensor(name).values(mode=modes.GLOBAL, worker=None)
+trial.tensor(name).shapes(mode=modes.GLOBAL, worker=None)
 ```
 
 ###### Arguments
 - `mode (smdebug.modes enum value)` The mode whose steps to return shapes for. Defaults to `modes.GLOBAL`.
 - `worker (str)` This parameter is only applicable for distributed training. You can retrieve the shapes of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method, and the workers which saved the tensor at a specific step with `trial.tensor(name).workers(step, mode)`.
 
 ###### Returns
-`dict[int -> numpy.ndarray]` A dictionary with step numbers as keys and numpy arrays representing the value of the tensor as values.
+`dict[int -> tuple(int)]` A dictionary with step numbers as keys and tuples of ints representing the shapes of the tensor as values.
 
 #### workers
 Get all the workers for which this tensor was saved at a given step.
````
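Likewise for the bulk accessors: `shapes` mirrors `values`, and `reduction_values` returns only the reductions the job actually saved. Continuing the sketch above, with the same placeholder path:

```python
from smdebug.trials import create_trial

trial = create_trial("/tmp/smdebug_demo")    # placeholder path
t = trial.tensor(trial.tensor_names()[0])

print(t.shapes())              # dict: step -> tuple(int)
print(t.reduction_values(0))   # dict: (reduction_name, abs) -> 1x1 ndarray
```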

docs/api.md

Lines changed: 2 additions & 1 deletion
```diff
@@ -96,6 +96,7 @@ include_workers
 include_regex
 reductions
 save_raw_tensor
+save_shape
 save_interval
 save_steps
 start_step
```
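The new `save_shape` option rides along with the existing reduction settings. A hedged sketch of enabling it from Python, assuming the PyTorch hook and that the key above is exposed as a `ReductionConfig` keyword argument; the path and interval are placeholders:

```python
import smdebug.pytorch as smd

# Save only the mean of each tensor plus its shape, every 10 steps.
reduction_config = smd.ReductionConfig(reductions=["mean"], save_shape=True)
hook = smd.Hook(out_dir="/tmp/smdebug_demo",          # placeholder path
                reduction_config=reduction_config,
                save_config=smd.SaveConfig(save_interval=10))
```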
```diff
@@ -163,6 +164,7 @@ Note that `smd` import below translates to `import smdebug.{framework} as smd`.
 |`create_from_json_file(`<br/>` json_file_path=None)` | `json_file_path (str)` | Takes the path of a file which holds the json configuration of the hook, and creates the hook from that configuration. This is an optional parameter. <br/> If this is not passed, it tries to get the file path from the value of the environment variable `SMDEBUG_CONFIG_FILE_PATH` and defaults to `/opt/ml/input/config/debughookconfig.json`. When training on SageMaker you do not have to specify any path because this is the default path that SageMaker writes the hook configuration to. |
 |`close()` | - | Closes all files that are currently open by the hook. |
 | `save_scalar()` | `name (str)` <br/> `value (float)` <br/> `sm_metric (bool)` | Saves a scalar value by the given name. Passing the `sm_metric=True` flag also makes this scalar available as a SageMaker Metric, which shows up in SageMaker Studio. Note that when `sm_metric` is False, this scalar always resides only in your AWS account, but setting it to True also saves the scalar on AWS servers. The default value of `sm_metric` for this method is False. |
+| `save_tensor()` | `tensor_name (str)`, `tensor_value (numpy.array or numpy.ndarray)`, `collections_to_write (str or list[str])` | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor of `save_tensor()`. |
 
 
 ### TensorFlow specific Hook API
```
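The arguments in the new `save_tensor()` row translate directly to a call. A minimal sketch, reusing a hook like the one above with a placeholder path and an illustrative tensor name:

```python
import numpy as np
import smdebug.pytorch as smd

hook = smd.Hook(out_dir="/tmp/smdebug_demo")  # placeholder path

# Write a custom metric tensor into the "default" collection.
hook.save_tensor(tensor_name="custom/confusion_matrix",
                 tensor_value=np.zeros((2, 2)),
                 collections_to_write="default")
```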
```diff
@@ -178,7 +180,6 @@ The following hook APIs are specific to training scripts using the TF 2.x Gradie
 | Method | Arguments | Returns | Behavior |
 | --- | --- | --- | --- |
 | `wrap_tape(tape)` | `tape` (tensorflow.python.eager.backprop.GradientTape) | Returns a tape object with three identifying markers to help `smdebug`. This returned tape should be used for training. | When not using Zero Script Change environments, calling this method on your tape is necessary for SageMaker Debugger to identify and save gradient tensors. Note that this method returns the same tape object passed. |
-| `save_tensor()`| tensor_name (str), tensor_value (float), collections_to_write (str) | - | Manually save metrics tensors while using TF 2.x GradientTape. Note: `record_tensor_value()` is deprecated.|
 
 ### MXNet specific Hook API
 
```
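For the GradientTape case, `wrap_tape` is the one script change needed outside Zero Script Change environments. A hedged sketch of a TF 2.x training step; the model, data, and output path are illustrative assumptions:

```python
import tensorflow as tf
import smdebug.tensorflow as smd

hook = smd.KerasHook(out_dir="/tmp/smdebug_demo")   # placeholder path
model = tf.keras.Sequential([tf.keras.layers.Dense(2)])
opt = tf.keras.optimizers.SGD(0.1)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

x = tf.random.normal((8, 4))                 # toy batch for illustration
y = tf.zeros((8,), dtype=tf.int32)

# Wrapping the tape lets the hook identify and save gradient tensors.
with hook.wrap_tape(tf.GradientTape()) as tape:
    loss = loss_fn(y, model(x, training=True))
grads = tape.gradient(loss, model.trainable_variables)
opt.apply_gradients(zip(grads, model.trainable_variables))
```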