Tensor-parallel communication overlap with userbuffer backend #6444

github-actions · 2023-04-18T22:50:09Z

What does this PR do ?

Add (1) interfaces to Transformer Engine (TE) and (2) process group initialization settings to support tensor-parallel communication overlap with the userbuffer backend.

Changelog

Add specific line by line info of high level changes in this PR.

Usage

Set ub_tp_comm_overlap to True
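
A minimal sketch of the usage step above. The flag name `ub_tp_comm_overlap` and the keys `tensor_model_parallel_size` and `fp8` appear in this PR; the plain dict standing in for the model config, the example values, and the helper `ub_overlap_enabled` are assumptions made for illustration only.

```python
# Stand-in for the model config. In NeMo this would normally be loaded from a
# YAML file; a plain dict is used here so the sketch runs without dependencies.
model_cfg = {
    "tensor_model_parallel_size": 2,  # example value, not taken from the PR
    "fp8": False,                     # example value, not taken from the PR
    "ub_tp_comm_overlap": True,       # the flag this PR asks users to set
}


def ub_overlap_enabled(cfg):
    """Hypothetical helper: report whether the overlap flag is set.

    A missing key is treated as "disabled".
    """
    return bool(cfg.get("ub_tp_comm_overlap", False))


if ub_overlap_enabled(model_cfg):
    print("userbuffer tensor-parallel communication overlap requested")
```

Reading the flag with `.get` mirrors the `self.cfg.get(...)` calls visible in the diff for `megatron_gpt_model.py`, so an older config without the key simply leaves the feature off.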

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

* add interfaces for tp_communication overlap
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Interface to provide custom userbuffer communicator settings by yaml file
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Jenkinsfile
  Signed-off-by: Sangkug Lym <slym@nvidia.com>

---------

Signed-off-by: Sangkug Lym <slym@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>

github-actions · 2023-05-03T01:51:11Z

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py

+                shape=input_shape,
+                tp_size=self.cfg.get('tensor_model_parallel_size'),
+                use_fp8=self.cfg.get('fp8'),
+                ub_cfgs=ub_cfgs,


timmoon10

This may require us to port the changes from NVIDIA/apex#1626 and maybe NVIDIA/apex#1620 to Megatron-LM. ~~That said, I don't see any megatron.core-related changes in this PR.~~

nemo/utils/app_state.py

nemo/collections/nlp/parts/nlp_overrides.py

timmoon10 · 2023-05-13T22:27:28Z

I've modified this PR to construct an MPI process group within NeMo, avoiding the need to port NVIDIA/apex#1626 to Megatron-LM. This would check off one of the Megatron-core bugs listed in #6625. This PR is probably dependent on #6627, which restores FP8 support.
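
As a rough illustration of what constructing an MPI process group can look like, here is a sketch under stated assumptions. It is not the code from this PR, which is not shown in this conversation. It assumes the group is created through `torch.distributed.new_group` with the `mpi` backend; the helper name `new_mpi_group_if_available` is hypothetical.

```python
def new_mpi_group_if_available(dist):
    """Return an MPI-backed process group, or None when one cannot be created.

    `dist` is expected to be the `torch.distributed` module. It is passed in
    as an argument so that this sketch does not import PyTorch at load time.
    """
    # PyTorch must be built with distributed support and with the MPI backend.
    if not dist.is_available() or not dist.is_mpi_available():
        return None
    # new_group requires the default process group to be initialized first.
    if not dist.is_initialized():
        return None
    # With ranks left unset, the new group spans every rank of the default
    # group. This is a collective call: all ranks must reach it.
    return dist.new_group(backend="mpi")


# Typical call site (not executed here):
#   import torch.distributed as dist
#   mpi_group = new_mpi_group_if_available(dist)
```

Returning `None` instead of raising keeps the sketch usable on builds without MPI; real code may prefer to fail loudly when the overlap feature was explicitly requested.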

* [TTS] Add callback for saving audio during FastPitch training Signed-off-by: Ryan <rlangman@nvidia.com> * [TTS] Allow NGC model name for vocoder Signed-off-by: Ryan <rlangman@nvidia.com> --------- Signed-off-by: Ryan <rlangman@nvidia.com>

* update batch size recommendation to min 32 for 43b Signed-off-by: Zhilin Wang <wangzhilin12061996@hotmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Zhilin Wang <wangzhilin12061996@hotmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* adding ssl config for fast-conformer adding boolean flags for ssl losses Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * renaming fast-conformer to fastconformer in config folder Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> --------- Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* add GPT FP8 ONNX export support
  Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
* changes
  1. Add dynamic axes for inputs
  2. Update model input_example to resolve size error by TE
  Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
* Conform to Python style guidelines
  Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
* refactor to avoid typecasting bf16 string
  Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
* fix attribute error in export_utils
  Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
* set constant_folding to False by default
  Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
* refactor exportable wrapper into model class definition
  Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
* remove conditional replacement of modules
  Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* set fp8_recipe to None by default
  Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
* address all comments
  Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* typecast precision check for fp16
  Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
* rename export script
  Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

---------

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Boris Fomitchev <borisfom@users.noreply.github.com>

* allows usage of pre-extracted base model Signed-off-by: arendu <adithya.r@gmail.com> * extracted model checking and loading Signed-off-by: arendu <adithya.r@gmail.com> * style Signed-off-by: arendu <adithya.r@gmail.com> * style Signed-off-by: arendu <adithya.r@gmail.com> * update Signed-off-by: arendu <adithya.r@gmail.com> * removed sft eval script, can use peft eval script for sft models Signed-off-by: arendu <adithya.r@gmail.com> --------- Signed-off-by: arendu <adithya.r@gmail.com>

erhoo82 · 2023-05-25T16:31:13Z

@ericharper What are the remaining issues of this PR?

ericharper · 2023-05-25T17:05:03Z

> @ericharper What are the remaining issues of this PR?

Conflicts should be resolved and it needs to pass CI

* preprocess squad in sft format Signed-off-by: arendu <adithya.r@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: arendu <adithya.r@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* add GraphTransducerLossBase abstract class with the interface for Graph-based loses * add RNN-T implementation in GraphRnntLoss with tests * add W-Transducer implementation in GraphWTransducerLoss with tests * add GraphRnntLoss + GraphWTransducerLoss to RNN-T loss resolver --------- Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* fix test fastpitch nightly Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu> * Reformat Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix if elif condition Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu> --------- Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

ericharper · 2023-05-31T23:07:36Z

Closing in favor of cherry picking the changes.

github-actions bot added the cherry-pick, CI, and NLP labels Apr 18, 2023

github-actions bot requested a review from erhoo82 April 18, 2023 22:50

github-actions bot added the stale label May 3, 2023

Merge branch 'main' into cherry-pick-main-68dadb924623f45a5e1e0f1d9ca…

64419bc

…3835f79fc2c20

github-advanced-security bot found potential problems May 5, 2023

github-actions bot removed the stale label May 6, 2023

Merge branch 'main' into cherry-pick-main-68dadb924623f45a5e1e0f1d9ca…

6d23781

…3835f79fc2c20

timmoon10 reviewed May 11, 2023

timmoon10 requested changes May 11, 2023

nemo/utils/app_state.py (outdated)

nemo/collections/nlp/parts/nlp_overrides.py (outdated)

Construct MPI process group for userbuffers support

ca76991

Signed-off-by: Tim Moon <tmoon@nvidia.com>

titu1994 closed this May 17, 2023

titu1994 deleted the cherry-pick-main-68dadb924623f45a5e1e0f1d9ca3835f79fc2c20 branch May 17, 2023 21:37

arendu and others added 11 commits May 17, 2023 19:48

minor fix for missing chat attr (#6671)

8aa80ee

Signed-off-by: arendu <adithya.r@gmail.com>

Make Note usage consistent in adapter_mixins.py (#6678)

2104862

Inconsistent usage of the word Note, which includes a broken reading in one case. I'm just doing some tidying -- not trying to be critical. Signed-off-by: Brian McBrayer <BrianMcBrayer@users.noreply.github.com>

Fix masking bug for TTS Aligner (#6677)

57824e0

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Update all invalid tree references to blobs for NeMo samples (#6679)

ae87a40

The tree is invalid as this points to a blob, and the links would not open in colab. Signed-off-by: Brian McBrayer <BrianMcBrayer@users.noreply.github.com> Co-authored-by: Brian McBrayer <brian@acceleratepath.com>

Update README.rst about container (#6686)

20717ba

Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com>

Fix a bug, use _ceil_to_nearest instead as _round_to_nearest is not d…

dc9dda0

…efined (#6681) (#6682) Co-authored-by: Li Tao <chntaoli@163.com> Co-authored-by: Yi Dong <43824965+yidong72@users.noreply.github.com>

[TTS] Add script for text preprocessing (#6541)

0838fe8

* [TTS] Add script for text preprocessing Signed-off-by: Ryan <rlangman@nvidia.com> * [TTS] Use Normalizer.input_case Signed-off-by: Ryan <rlangman@nvidia.com> --------- Signed-off-by: Ryan <rlangman@nvidia.com>

ericharper restored the cherry-pick-main-68dadb924623f45a5e1e0f1d9ca3835f79fc2c20 branch May 22, 2023 19:00

ericharper reopened this May 22, 2023

arendu and others added 5 commits May 23, 2023 16:06

Fix k2 installation in Docker with CUDA 12 (#6707) (#6709)

ae4d4ee

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com> Co-authored-by: Vladimir Bataev <vbataev@nvidia.com>

[TTS] Filter out silent audio files during preprocessing (#6716)

372f519

Signed-off-by: Ryan <rlangman@nvidia.com>

not pinning version (#6680)

8685468

Signed-off-by: Yi Dong <yidong@nvidia.com>

Tutorial fixes (#6717) (#6718)

0150b91

Signed-off-by: smajumdar <titu1994@gmail.com> Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>

arendu and others added 13 commits May 25, 2023 11:03

match argument names

931ed78

Fix Codeql (#6731)

5776152

Signed-off-by: smajumdar <titu1994@gmail.com>

[TTS] fix inconsistent type hints for IpaG2p (#6733)

84577c9

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>

VP Fixes for converter + Config management (#6698)

b6f46a0

* [Temp] VP Fixes Signed-off-by: smajumdar <titu1994@gmail.com> * Revert logging Signed-off-by: smajumdar <titu1994@gmail.com> --------- Signed-off-by: smajumdar <titu1994@gmail.com>

Fix for interctc test random failure (#6644)

b50ae98

Signed-off-by: Igor Gitman <igitman@nvidia.com>

check for first or last stage (#6708) (#6743)

8b814bc

* check for first or last stage * remove redundant check * fix typo * add map_location --------- Signed-off-by: ericharper <complex451@gmail.com> Co-authored-by: Eric Harper <complex451@gmail.com>

Merge branch 'main' into cherry-pick-main-68dadb924623f45a5e1e0f1d9ca…

6d14d87

…3835f79fc2c20 Signed-off-by: Eric Harper <complex451@gmail.com>

Update nemo/utils/app_state.py

9864feb

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by: Eric Harper <complex451@gmail.com>

add back transformers offline to jenkins

d750426

Signed-off-by: ericharper <complex451@gmail.com>

try base 23.04 container

59514ad

Signed-off-by: ericharper <complex451@gmail.com>

github-actions bot removed the CI label May 27, 2023

ericharper requested a review from timmoon10 May 27, 2023 19:23

erhoo82 changed the base branch from main to r1.19.0 May 31, 2023 17:16

Merge branch 'r1.19.0' into cherry-pick-main-68dadb924623f45a5e1e0f1d…

4e9b8ac

…9ca3835f79fc2c20

github-actions bot added the ASR, CI, core, and TTS labels May 31, 2023

ericharper closed this May 31, 2023
