
Conversation

@rahul003
Contributor

Description of changes:

When SMP is used together with Horovod, there will be multiple Horovod "groups". In such cases, rank and size need to be queried from SMP rather than from Horovod.

This is not a problem for PyTorch, since there is a single torch.distributed group, nor for MXNet, since SMP does not support MXNet.

@rahul003 rahul003 requested review from NihalHarish and leleamol and removed request for NihalHarish December 14, 2020 22:49
@codecov-io

codecov-io commented Dec 14, 2020

Codecov Report

Merging #411 (42d300b) into master (9d2d0c3) will decrease coverage by 1.57%.
The diff coverage is 62.96%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #411      +/-   ##
==========================================
- Coverage   77.70%   76.13%   -1.58%     
==========================================
  Files         113      113              
  Lines       10139    10165      +26     
==========================================
- Hits         7879     7739     -140     
- Misses       2260     2426     +166     
Impacted Files Coverage Δ
smdebug/tensorflow/base_hook.py 70.28% <18.18%> (-5.75%) ⬇️
smdebug/core/utils.py 79.72% <74.41%> (+1.98%) ⬆️
smdebug/tensorflow/callable_cache.py 52.17% <0.00%> (-26.09%) ⬇️
smdebug/tensorflow/utils.py 64.59% <0.00%> (-23.45%) ⬇️
smdebug/tensorflow/singleton_utils.py 83.33% <0.00%> (-16.67%) ⬇️
smdebug/profiler/tf_profiler_parser.py 54.54% <0.00%> (-11.58%) ⬇️
smdebug/tensorflow/collection.py 84.53% <0.00%> (-11.35%) ⬇️
smdebug/tensorflow/keras.py 79.21% <0.00%> (-11.00%) ⬇️
smdebug/rules/action/stop_training_action.py 56.45% <0.00%> (-9.68%) ⬇️
smdebug/core/logger.py 66.12% <0.00%> (-8.07%) ⬇️
... and 31 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d2d0c3...42d300b. Read the comment docs.

@rahul003
Contributor Author

ok, will do

@rahul003 rahul003 changed the title [TF] Use SMP rank and size when applicable Use SMP rank and size when applicable Jan 11, 2021
@rahul003 rahul003 merged commit 07a3fd9 into master Jan 11, 2021
@rahul003 rahul003 deleted the smp branch January 11, 2021 19:12
NihalHarish added a commit that referenced this pull request Jan 15, 2021
@rahul003 rahul003 restored the smp branch January 16, 2021 00:45
ndodda-amazon added a commit that referenced this pull request Jan 16, 2021
This reverts commit 07a3fd9.

Co-authored-by: Nihal Harish <nihal42harish@gmail.com>
sophiayue1116 pushed a commit to sophiayue1116/sagemaker-debugger that referenced this pull request Jan 20, 2021
* Add smp rank

* Switch to core initialize

* Use smp size

* Cache whether SMP can be imported

* Lint

* Try import with noqa

* Add smp rank call in core

* Import only once

* Use nested except blocks
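The commit messages above name three patterns: caching whether SMP can be imported, importing only once, and nested except blocks. A hypothetical sketch of how those might combine (the function names and the single-process fallback are assumptions, not the PR's actual code):

```python
_SMP_IMPORTABLE = None  # None = not yet checked; True/False once cached


def smp_importable():
    """Cache whether smdistributed.modelparallel can be imported, so the
    (possibly slow) import attempt happens only once per process."""
    global _SMP_IMPORTABLE
    if _SMP_IMPORTABLE is None:
        try:
            import smdistributed.modelparallel.tensorflow  # noqa: F401
            _SMP_IMPORTABLE = True
        except ImportError:
            _SMP_IMPORTABLE = False
    return _SMP_IMPORTABLE


def get_size():
    """Nested except blocks: try SMP first, and only inside its failure
    handler try Horovod, falling back to a single-process default."""
    try:
        if not smp_importable():
            raise ImportError("smp not available")
        import smdistributed.modelparallel.tensorflow as smp
        return smp.size()
    except ImportError:
        try:
            import horovod.tensorflow as hvd
            return hvd.size()
        except (ImportError, ValueError):
            return 1  # illustrative single-process default
```

Caching the import check matters because the hook queries rank and size repeatedly, and re-attempting a failed import on every call is wasteful.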
sophiayue1116 pushed a commit to sophiayue1116/sagemaker-debugger that referenced this pull request Jan 20, 2021
NihalHarish pushed a commit that referenced this pull request Jan 25, 2021
NihalHarish added a commit that referenced this pull request Jan 25, 2021
This reverts commit 07a3fd9.

Co-authored-by: Nihal Harish <nihal42harish@gmail.com>

4 participants