Virtual pipeline parallel support for MegatronGPTSFTModel #7964

vysarge · 2023-12-02T02:33:07Z

What does this PR do?

Enables running MegatronGPTSFTModel with virtual pipeline parallelism when peft_scheme=null.

Collection: NLP (language_modeling)

Changelog

Alters word embedding initialization and the passing of several parameters in MegatronGPTSFTModel to accommodate virtual pipeline parallelism when self.use_peft is false
Adds a guard in MegatronGPTSFTModel.__init__ that raises a ValueError if virtual pipeline parallelism is requested while self.use_peft is true
Corrects the TP group initialization check in MegatronGPTSFTModel to account for both the TE and MCore flags
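The two checks described in the changelog can be sketched roughly as follows. This is an illustrative sketch, not the code from the PR: the helper names `check_virtual_pipeline_with_peft` and `needs_te_tp_groups` are hypothetical, the error message is invented, and the config keys `transformer_engine` and `mcore_gpt` are assumed to be the flags the changelog refers to as TE and MCore.

```python
from typing import Any, Mapping, Optional


def check_virtual_pipeline_with_peft(use_peft: bool, virtual_pipeline_size: Optional[int]) -> None:
    """Hypothetical guard: reject PEFT when virtual pipeline parallelism is requested."""
    if use_peft and virtual_pipeline_size is not None:
        # The wording of this message is an assumption, not taken from NeMo.
        raise ValueError("Virtual pipeline parallelism is not supported when PEFT is enabled.")


def needs_te_tp_groups(cfg: Mapping[str, Any]) -> bool:
    """Hypothetical check: TP groups are set up if either the TE or the MCore flag is on."""
    # Assumed flag names; both default to False when absent.
    return bool(cfg.get("transformer_engine", False) or cfg.get("mcore_gpt", False))


# SFT without PEFT and with virtual pipeline parallelism passes the guard.
check_virtual_pipeline_with_peft(use_peft=False, virtual_pipeline_size=10)
```

Keeping the guard in the constructor means an unsupported combination fails before any model chunks are built, rather than partway through setup.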

Usage

Run a GPT model with SFT and set virtual_pipeline_model_parallel_size, for example:

python examples/nlp/language_modeling/tuning/megatron_gpt_peft_tuning.py \
++model.virtual_pipeline_model_parallel_size=10 \
...
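When picking a value for virtual_pipeline_model_parallel_size, it can help to check how the layers would divide. The sketch below assumes an interleaved schedule in which each pipeline rank holds virtual_pipeline_model_parallel_size equally sized model chunks. The function name and the layer and pipeline sizes are hypothetical, and the exact validation NeMo performs may differ.

```python
def layers_per_model_chunk(num_layers: int, pipeline_size: int, virtual_pipeline_size: int) -> int:
    """Return how many transformer layers land in each model chunk.

    Assumes every pipeline rank holds `virtual_pipeline_size` equally
    sized chunks, so the layers are split into
    pipeline_size * virtual_pipeline_size pieces.
    """
    total_chunks = pipeline_size * virtual_pipeline_size
    if num_layers % total_chunks != 0:
        raise ValueError(
            f"num_layers={num_layers} is not divisible by "
            f"pipeline_size * virtual_pipeline_size = {total_chunks}"
        )
    return num_layers // total_chunks


# Hypothetical sizes chosen only for illustration: 80 layers, 4 pipeline stages.
print(layers_per_model_chunk(num_layers=80, pipeline_size=4, virtual_pipeline_size=10))  # 2
```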

Jenkins CI

To run Jenkins, a NeMo User with write access must comment jenkins on the PR.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

github-actions · 2023-12-20T01:39:51Z

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

erhoo82 · 2024-01-02T09:10:34Z

jenkins

ericharper · 2024-01-17T05:05:27Z

jenkins

ericharper

LGTM. Thanks!

* Virtual pipeline parallel support for MegatronGPTSFTModel
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* Deduplicate word embedding init code in MegatronGPTModel and MegatronGPTSFTModel into one method
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* Correct TP group init call in MegatronGPTSFTModel to check for both TE and MCore, as in MegatronGPTModel
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* Correct accidental double pipeline parallel size check
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Correct get_gpt_module_list -> get_model_module_list from SFT model
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

---------

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

github-actions bot added the NLP label Dec 2, 2023

Virtual pipeline parallel support for MegatronGPTSFTModel

c1ad1ca

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

vysarge force-pushed the vsarge/gpt_sft_vp branch from 9eddd98 to c1ad1ca December 5, 2023 00:54

vysarge marked this pull request as ready for review December 5, 2023 00:59

Merge branch 'main' into vsarge/gpt_sft_vp

195aa3b

ericharper reviewed Dec 5, 2023

nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py (outdated, resolved)

ericharper reviewed Dec 5, 2023

nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py (outdated, resolved)

vysarge added 3 commits December 4, 2023 20:47

Deduplicate word embedding init code in MegatronGPTModel and MegatronGPTSFTModel into one method

36c7e83

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

Correct TP group init call in MegatronGPTSFTModel to check for both TE and MCore, as in MegatronGPTModel

2d6dbe1

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

Correct accidental double pipeline parallel size check

a2232ae

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

github-actions bot added the stale label Dec 20, 2023

vysarge requested a review from ericharper December 20, 2023 19:44

vysarge and others added 2 commits December 20, 2023 14:20

Merge branch 'main' into vsarge/gpt_sft_vp

6e30e01

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

3294988

for more information, see https://pre-commit.ci

github-actions bot removed the stale label Dec 21, 2023

vysarge added 3 commits January 8, 2024 16:04

Merge branch 'main' into vsarge/gpt_sft_vp

4cd576d

Merge branch 'main' into vsarge/gpt_sft_vp

51ce674

Merge branch 'main' into vsarge/gpt_sft_vp

0e95f00

ericharper reviewed Jan 14, 2024

nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py (outdated, resolved)

vysarge and others added 3 commits January 16, 2024 08:41

Correct get_gpt_module_list -> get_model_module_list from SFT model

cd8c603

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

Merge branch 'main' into vsarge/gpt_sft_vp

728d5c2

Merge branch 'main' into vsarge/gpt_sft_vp

df1c42b

ericharper approved these changes Jan 17, 2024

ericharper merged commit 8811946 into NVIDIA:main Jan 17, 2024
11 checks passed

vysarge mentioned this pull request Jan 22, 2024

Fix to peft & virtual pipeline parallel unsupported check #8216

Merged

8 tasks

vysarge deleted the vsarge/gpt_sft_vp branch March 12, 2024 22:36
