
Tensor-parallel communication overlap with userbuffer backend #6444

Conversation

github-actions[bot]

What does this PR do ?

Add (1) interfaces to Transformer Engine (TE) and (2) initialization of the process group settings to support tensor-parallel communication overlap with the userbuffer backend.

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

Set ub_tp_comm_overlap to True
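
As a rough illustration of the usage above, the flag can be pictured as one more entry in the model config, next to the `tensor_model_parallel_size` and `fp8` keys that the diff in this PR reads through `self.cfg.get(...)`. This is only a sketch: a plain dict stands in for the real config object, the values are made up, and `overlap_requested` is a hypothetical helper, not a NeMo function.

```python
# Sketch only: a plain dict stands in for the model config object.
# Key names come from this PR; the values are illustrative assumptions.
model_cfg = {
    "tensor_model_parallel_size": 2,  # overlap targets tensor-parallel communication
    "fp8": False,
    "ub_tp_comm_overlap": True,  # the switch described under "Usage"
}


def overlap_requested(cfg: dict) -> bool:
    """Hypothetical helper: True when the overlap flag is set in the config."""
    # A missing key is treated as "disabled" (an assumption of this sketch).
    return bool(cfg.get("ub_tp_comm_overlap", False))


print(overlap_requested(model_cfg))
```

Treating a missing key as disabled keeps existing configs working unchanged, which seems to be the intent of an opt-in flag.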

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

* add interfaces for tp_communication overlap

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Interface to provide custom userbuffer communicator settings by yaml file

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Jenkinsfile

Signed-off-by: Sangkug Lym <slym@nvidia.com>

---------

Signed-off-by: Sangkug Lym <slym@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>

github-actions bot commented May 3, 2023

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the stale label May 3, 2023
shape=input_shape,
tp_size=self.cfg.get('tensor_model_parallel_size'),
use_fp8=self.cfg.get('fp8'),
ub_cfgs=ub_cfgs,

Check failure

Code scanning / CodeQL

Potentially uninitialized local variable

Local variable 'ub_cfgs' may be used before it is initialized.
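
A common way to clear this kind of warning is to bind the variable on every code path before it is read. The sketch below assumes the custom userbuffer settings sit under a config key named `ub_tp_comm_overlap_cfg`; that key and the `load_ub_cfgs` helper are hypothetical and are not taken from the diff in this PR.

```python
from typing import Any, Dict, Optional


def load_ub_cfgs(cfg: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return custom userbuffer settings, or None to fall back to defaults."""
    # Bind the name unconditionally so no path can read it uninitialized.
    ub_cfgs: Optional[Dict[str, Any]] = None
    # "ub_tp_comm_overlap_cfg" is a hypothetical key used for illustration.
    custom = cfg.get("ub_tp_comm_overlap_cfg")
    if custom is not None:
        ub_cfgs = dict(custom)  # copy so the caller's config is not mutated
    return ub_cfgs
```

With this shape, a later call that passes `ub_cfgs=ub_cfgs` receives either the custom settings or `None`, never an unbound name.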
@github-actions github-actions bot removed the stale label May 6, 2023

@timmoon10 timmoon10 left a comment


This may require us to port the changes from NVIDIA/apex#1626 and maybe NVIDIA/apex#1620 to Megatron-LM. That said, I don't see any megatron.core-related changes in this PR.

nemo/utils/app_state.py (outdated, resolved)
nemo/collections/nlp/parts/nlp_overrides.py (outdated, resolved)
Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10

I've modified this PR to construct an MPI process group within NeMo, avoiding the need to port NVIDIA/apex#1626 to Megatron-LM. This would check off one of the Megatron-core bugs listed in #6625. This PR is probably dependent on #6627, which restores FP8 support.

@titu1994 titu1994 closed this May 17, 2023
@titu1994 titu1994 deleted the cherry-pick-main-68dadb924623f45a5e1e0f1d9ca3835f79fc2c20 branch May 17, 2023 21:37
arendu and others added 11 commits May 17, 2023 19:48
Signed-off-by: arendu <adithya.r@gmail.com>
* [TTS] Add callback for saving audio during FastPitch training

Signed-off-by: Ryan <rlangman@nvidia.com>

* [TTS] Allow NGC model name for vocoder

Signed-off-by: Ryan <rlangman@nvidia.com>

---------

Signed-off-by: Ryan <rlangman@nvidia.com>
* update batch size recommendation to min 32 for 43b

Signed-off-by: Zhilin Wang <wangzhilin12061996@hotmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Zhilin Wang <wangzhilin12061996@hotmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Inconsistent usage of the word Note, which includes a broken reading in one case.

I'm just doing some tidying -- not trying to be critical.

Signed-off-by: Brian McBrayer <BrianMcBrayer@users.noreply.github.com>
Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>
* adding ssl config for fast-conformer
adding boolean flags for ssl losses

Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* renaming fast-conformer to fastconformer in config folder

Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com>

---------

Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com>
Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The tree is invalid as this points to a blob, and the links would not
open in colab.

Signed-off-by: Brian McBrayer <BrianMcBrayer@users.noreply.github.com>
Co-authored-by: Brian McBrayer <brian@acceleratepath.com>
Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com>
…efined (#6681) (#6682)

Co-authored-by: Li Tao <chntaoli@163.com>
Co-authored-by: Yi Dong <43824965+yidong72@users.noreply.github.com>
* add GPT FP8 ONNX export support

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* changes
1. Add dynamic axes for inputs
2. Update model input_example to resolve size error by TE

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* Conform to Python style guidelines

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* refactor to avoid typecasting bf16 string

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* fix attribute error in export_utils

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* set constant_folding to False by default

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* refactor exportable wrapper into model class definition

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* remove conditional replacement of modules

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set fp8_recipe to None by default

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* address all comments

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* typecast precision check for fp16

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* rename export script

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

---------

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Boris Fomitchev <borisfom@users.noreply.github.com>
* [TTS] Add script for text preprocessing

Signed-off-by: Ryan <rlangman@nvidia.com>

* [TTS] Use Normalizer.input_case

Signed-off-by: Ryan <rlangman@nvidia.com>

---------

Signed-off-by: Ryan <rlangman@nvidia.com>
@ericharper ericharper restored the cherry-pick-main-68dadb924623f45a5e1e0f1d9ca3835f79fc2c20 branch May 22, 2023 19:00
@ericharper ericharper reopened this May 22, 2023
arendu and others added 5 commits May 23, 2023 16:06
* allows usage of pre-extracted base model

Signed-off-by: arendu <adithya.r@gmail.com>

* extracted model checking and loading

Signed-off-by: arendu <adithya.r@gmail.com>

* style

Signed-off-by: arendu <adithya.r@gmail.com>

* style

Signed-off-by: arendu <adithya.r@gmail.com>

* update

Signed-off-by: arendu <adithya.r@gmail.com>

* removed sft eval script, can use peft eval script for sft models

Signed-off-by: arendu <adithya.r@gmail.com>

---------

Signed-off-by: arendu <adithya.r@gmail.com>
Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>
Co-authored-by: Vladimir Bataev <vbataev@nvidia.com>
Signed-off-by: Ryan <rlangman@nvidia.com>
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: smajumdar <titu1994@gmail.com>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>

erhoo82 commented May 25, 2023

@ericharper What are the remaining issues of this PR?

@ericharper

> @ericharper What are the remaining issues of this PR?

Conflicts should be resolved and it needs to pass CI

arendu and others added 13 commits May 25, 2023 11:03
* preprocess squad in sft format

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <adithya.r@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
* [Temp] VP Fixes

Signed-off-by: smajumdar <titu1994@gmail.com>

* Revert logging

Signed-off-by: smajumdar <titu1994@gmail.com>

---------

Signed-off-by: smajumdar <titu1994@gmail.com>
* add GraphTransducerLossBase abstract class with the interface for Graph-based losses
* add RNN-T implementation in GraphRnntLoss with tests
* add W-Transducer implementation in GraphWTransducerLoss with tests
* add GraphRnntLoss + GraphWTransducerLoss to RNN-T loss resolver

---------

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>
* fix test fastpitch nightly

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Reformat

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix if elif condition

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

---------

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
* check for first or last stage

* remove redundant check

* fix typo

* add map_location

---------

Signed-off-by: ericharper <complex451@gmail.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
…3835f79fc2c20

Signed-off-by: Eric Harper <complex451@gmail.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Eric Harper <complex451@gmail.com>
Signed-off-by: ericharper <complex451@gmail.com>
Signed-off-by: ericharper <complex451@gmail.com>
@github-actions github-actions bot removed the CI label May 27, 2023
@ericharper ericharper requested a review from timmoon10 May 27, 2023 19:23
@erhoo82 erhoo82 changed the base branch from main to r1.19.0 May 31, 2023 17:16
@github-actions github-actions bot added ASR CI core Changes to NeMo Core TTS labels May 31, 2023
@ericharper

Closing in favor of cherry picking the changes.

@ericharper ericharper closed this May 31, 2023