Start using ModelParallelConfig from Megatron Core #6885
Conversation
CodeQL found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.
LGTM! Very useful to collect configs into model parallel configs. See minor comments.
        # hidden size is needed for pipeline schedules but is not currently in ModelParallelConfig
        setattr(model_parallel_config, 'hidden_size', self.cfg.hidden_size)
    except AttributeError:
        logging.warning(
Why not also fail here? If it's missing and will fail later, wouldn't here be a good place to stop?
I found this was too brittle. Maybe we can add a `strict` argument?
What do you think about the suggestion?
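A sketch of what the suggested `strict` switch could look like (a hypothetical helper, not code from this PR): the flag escalates the warning to an error so the failure happens at config-build time.

    import logging

    def set_hidden_size(model_parallel_config, cfg, strict: bool = False):
        """Sketch of the suggested 'strict' behavior (hypothetical, not in the PR)."""
        try:
            # cfg attribute access raises AttributeError when hidden_size is absent
            model_parallel_config.hidden_size = cfg.hidden_size
        except AttributeError:
            if strict:
                # fail fast here instead of later in the pipeline schedule
                raise
            logging.warning('hidden_size not found in cfg; pipeline schedules may fail later')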
""" Hidden size needs to be set from the cfg.encoder for the pipeline schedule. | ||
""" | ||
|
||
model_parallel_config = super().build_model_parallel_config() |
Wouldn't the parent class return a warning if `hidden_size` is not in `cfg.model.hidden_size`? Perhaps this argument can be passed to the parent method?
Maybe you could expand more on your suggestion? I was adding this because the parent class didn't have hidden_size for this model.
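For context, a sketch of the override under discussion (method name follows the diff above; the `cfg.encoder` path is the author's stated source of hidden_size, and the return statement is assumed):

    def build_model_parallel_config(self):
        """ Hidden size needs to be set from the cfg.encoder for the pipeline schedule.
        """
        model_parallel_config = super().build_model_parallel_config()
        # this model keeps hidden_size under cfg.encoder rather than at the top level,
        # so the parent class cannot pick it up on its own
        setattr(model_parallel_config, 'hidden_size', self.cfg.encoder.hidden_size)
        return model_parallel_config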
LGTM, thank you!
The main concern I have is MPConfig vs TransformerConfig; we probably need to discuss more how we should structure the usages. Apart from that, this looks like it covers everything.
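For readers following the MPConfig vs TransformerConfig question: in Megatron Core, TransformerConfig extends ModelParallelConfig, roughly as in the simplified sketch below (field lists trimmed and defaults illustrative; not the actual class definitions).

    from dataclasses import dataclass

    @dataclass
    class ModelParallelConfig:
        # parallelism-only knobs: how the model is split across devices
        tensor_model_parallel_size: int = 1
        pipeline_model_parallel_size: int = 1
        sequence_parallel: bool = False

    @dataclass
    class TransformerConfig(ModelParallelConfig):
        # architecture knobs layered on top of the parallel settings
        hidden_size: int = 1024
        num_attention_heads: int = 16
        num_layers: int = 24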
LGTM, thank you!! Just one potential issue with the sequence length setting.
LGTM! I think we should merge this in now
@michalivne: let us know what your feedback is on Eric's response, and we can send fixes in later PRs accordingly!
* start adding gpt from megatron core path
* set model parallel config
* use model parallel config object
* update args
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* set vp size to none if it is 1
* set vp size to none if it is 1
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* add TransformerConfig
* start updating to TransformerConfig
* add todo
* revert to model parallel config
* add hidden_size to model_parallel_config
* remove imports
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* remove import
* small clean up
* update hidden size in peft base model, add mcore commit to jenkins
* update module args
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* add config obj to flash attention tests
* remove args
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* remove sequence parallel arg
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* update args
* add config to self
* update args
* update args
* update args
* add config to test
* get hidden_size from config
* add try except
* use default
* update config with hidden size
* remove arg
* comment out jenkins test
* revert import
* remove optimizer_idx
* prefetch num microbatches
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* remove import
* temporarily comment jenkins test
* update seq_length
* remove commented code
* update arg
* update mbs and gbs of test
* update batch size in test
* fix precision in test
* update precision
* move hidden_size out of conditional
* [pre-commit.ci] auto fixes from pre-commit.com hooks

Signed-off-by: ericharper <complex451@gmail.com>
Signed-off-by: eharper <eharper@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    @@ -16,6 +16,7 @@
     import pytest
     import torch
    +from megatron.core import ModelParallelConfig
This is breaking `pytest --cpu` when doing a basic setup without all the fluff.
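One possible fix, sketched under the assumption that the hard top-level import is the problem: guard the import so test collection still works when megatron.core is not installed (the HAVE_MEGATRON_CORE flag name is illustrative).

    # guard the optional dependency so `pytest --cpu` can collect tests
    # in environments where megatron.core is not installed
    try:
        from megatron.core import ModelParallelConfig

        HAVE_MEGATRON_CORE = True
    except (ImportError, ModuleNotFoundError):
        HAVE_MEGATRON_CORE = False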
What does this PR do?
This PR adds the ModelParallelConfig arguments to be used with the next release of Megatron Core.
Collection: NLP
Changelog
Usage
# Add a code snippet demonstrating how to use this
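A minimal sketch of the intended usage, assuming Megatron Core's ModelParallelConfig dataclass with illustrative values (this snippet is not taken from the PR itself):

    from megatron.core import ModelParallelConfig

    # collect the parallelism-related settings into a single config object
    config = ModelParallelConfig(
        tensor_model_parallel_size=2,
        pipeline_model_parallel_size=1,
        sequence_parallel=False,
    )

    # NeMo megatron models build this object from their cfg and pass it to
    # Megatron Core modules, e.g. via model.build_model_parallel_config()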
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.
Additional Information