Skip to content

Conversation

@NRauschmayr
Copy link
Contributor

Description of changes:

Extended smdebug's reductions to check for nan- and inf-values and to compute quantiles for PT tensors. Tensors are now also written out in Tensorboard format such that users can display all reductions for a specific tensor within the same visualization and visualizations will be grouped by Debugger collections.
Here is an example visualization:

Screen Shot 2021-11-14 at 2 52 42 PM

Style and formatting:

I have run pre-commit install && pre-commit run --all-files to ensure that auto-formatting happens with every commit.

Issue number, if available

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@codecov-commenter
Copy link

codecov-commenter commented Nov 14, 2021

Codecov Report

Merging #523 (66999aa) into master (b4dd4c1) will decrease coverage by 6.36%.
The diff coverage is 57.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #523      +/-   ##
==========================================
- Coverage   77.60%   71.24%   -6.37%     
==========================================
  Files         127      117      -10     
  Lines       11111    10614     -497     
==========================================
- Hits         8623     7562    -1061     
- Misses       2488     3052     +564     
Impacted Files Coverage Δ
smdebug/pytorch/utils.py 48.14% <23.52%> (-32.81%) ⬇️
smdebug/core/locations.py 85.71% <60.00%> (-5.96%) ⬇️
smdebug/core/reduction_config.py 86.58% <77.77%> (-9.52%) ⬇️
smdebug/core/hook.py 86.90% <81.25%> (-0.53%) ⬇️
smdebug/mxnet/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
smdebug/mxnet/singleton_utils.py 0.00% <0.00%> (-100.00%) ⬇️
...debug/profiler/analysis/notebook_utils/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
smdebug/mxnet/hook.py 0.00% <0.00%> (-84.85%) ⬇️
smdebug/mxnet/utils.py 0.00% <0.00%> (-78.13%) ⬇️
smdebug/rules/action/message_action.py 13.25% <0.00%> (-75.91%) ⬇️
... and 61 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b4dd4c1...66999aa. Read the comment docs.

Comment on lines +523 to +524
if subfolder == None:
subfolder = self.mode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use subfolder = os.path.join(self.mode)?
It makes the intentions of this line much clearer to the reader.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subfolder is just a string and not a filepath.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.path.join returns an object of type str. If I understand this part correctly, the subfolder variable contains the path to sub directory for tensorboard data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no subfolder is just the name of the reduction. each reduction will be its own subfolder in the tensorboard directory.

if subfolder == None:
subfolder = self.mode

if subfolder in self.tb_writers:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should rename the name of this map. Maybe something more explicit like self.tb_writer_to_dir_map ?


def _get_reduction_tensor_name(self, tensor_name, reduction_name, abs):
return get_reduction_tensor_name(tensor_name, reduction_name, abs, remove_colon_index=True)
def _get_reduction_tensor_name(self, tensor_name, reduction_name, abs, collection_name=""):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we accept an empty string collection name?
Should we use the DEFAULT collection key?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I extended this part, which is needed for Tensorboard to group visualizations by Debugger collections. To keep it consistent with previous code, the collection name will be empty per default.

reduction_name = "abs_" + reduction_name
tb_writer = self._maybe_get_tb_writer(subfolder=reduction_name)
if tb_writer:
scalar = self._make_numpy_array(tensor_data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the value of scalar if tb_writer = None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per default Debugger writes reductions like normal tensors into debug-output folder and users can retrieve the data via the smdebug API. I extended this part, so that reductions are also written in Tensorboard format (in case user provided a Tensorboard configuration)

if hasattr(torch.Tensor, reduction_name):
f = getattr(torch.Tensor, reduction_name)
op = f(tensor_data.float())
if reduction_name == "isnan" or reduction_name == "isinf":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we manage reduction_name values with Enums?

@NihalHarish NihalHarish self-requested a review November 23, 2021 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants