Add Mcore DistributedDataParallel and distributed optimizer into Nemo #9034

Merged
merged 9 commits into NVIDIA:main on May 7, 2024

Conversation

@gdengk (Contributor) commented Apr 24, 2024

What does this PR do?

Add Mcore DistributedDataParallel and the distributed optimizer into NeMo (examples/nlp/language_modeling/megatron_gpt_pretraining.py).

Changelog

  • Port Mcore DistributedDataParallel to NeMo
  • Add a wrapper, McoreDistributedOptimizer, to bypass torch/PTL assertion checks (see the sketch after this list)
  • Add the optim name mcore_distributed_optim to turn on the mcore distributed optimizer, along with a few other mcore-related flags (details in the next section)
  • Verified memory usage and accuracy parity between the mcore and apex optimizers
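
To make the wrapper idea concrete, here is a minimal sketch (not the PR's exact code) of how a Megatron-Core optimizer can be presented through the torch.optim.Optimizer interface so that PyTorch/PTL type and attribute checks pass; the class name matches the changelog, but the method set and attribute mirroring are assumptions:

import torch


class McoreDistributedOptimizer(torch.optim.Optimizer):
    """Illustrative wrapper: delegate all real work to a wrapped
    Megatron-Core optimizer while looking like a torch optimizer."""

    def __init__(self, mcore_optimizer):
        # Deliberately skip torch.optim.Optimizer.__init__ and mirror the
        # wrapped optimizer's param_groups so PTL's assertions still pass.
        self.mcore_optimizer = mcore_optimizer
        self.param_groups = mcore_optimizer.param_groups

    def zero_grad(self, set_to_none: bool = True):
        self.mcore_optimizer.zero_grad(set_to_none)

    def step(self, closure=None):
        # The distributed update (grad reduce-scatter, param all-gather)
        # happens inside the Megatron-Core optimizer itself.
        loss = closure() if closure is not None else None
        self.mcore_optimizer.step()
        return loss

    def state_dict(self):
        return self.mcore_optimizer.state_dict()

    def load_state_dict(self, state_dict):
        self.mcore_optimizer.load_state_dict(state_dict)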

Usage

  • An example configuration for the mcore distributed optimizer is shown below:
optim:
    name: mcore_distributed_optim
    overlap_grad_sync: false
    overlap_param_sync: false
    grad_sync_dtype: fp32
    delay_param_gather: false
    delay_grad_reduce: true
    ddp_bucket_size: null
    check_for_nan_in_grad: false
    lr: 0.00012
    weight_decay: 0.1
    betas:
    - 0.9
    - 0.95
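
Assuming the standard Hydra override syntax that NeMo example scripts use (the model.optim.* config path is an assumption based on the usual NeMo GPT config layout), the same settings can also be supplied on the command line:

python examples/nlp/language_modeling/megatron_gpt_pretraining.py \
    model.optim.name=mcore_distributed_optim \
    model.optim.overlap_grad_sync=false \
    model.optim.overlap_param_sync=false \
    model.optim.lr=0.00012 \
    model.optim.weight_decay=0.1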

Jenkins CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

There is no need to comment "jenkins" on the PR to trigger Jenkins CI.
The GitHub Actions CI will run automatically when the PR is opened.
To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".

Before your PR is "Ready for review"

Pre-checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Gao Deng <gdeng@nvidia.com>
@github-actions bot added the core (Changes to NeMo Core) and NLP labels Apr 24, 2024
@gdengk changed the title from "merge mcore dist optim" to "Add Mcore DistributedDataParallel and distributed optimizer into Nemo" Apr 24, 2024
@gdengk gdengk marked this pull request as ready for review April 24, 2024 23:19
gdengk added 2 commits April 24, 2024 17:08
Signed-off-by: Gao Deng <gdeng@nvidia.com>
@ericharper ericharper requested a review from akoumpa April 25, 2024 04:54
@akoumpa (Member) previously approved these changes Apr 25, 2024 and left a comment:


I've confirmed this gives the same results as mcore (1353) with EP=2 NGPUS=2 GBS=8; so I think it's good to go from my side. Thank you.

@akoumpa (Member) commented Apr 26, 2024:

The following configurations also work correctly:
  • EP=2 NGPU=8
  • EP=1 TP=1 NGPU=8

Signed-off-by: Gao Deng <gdeng@nvidia.com>
nemo/core/optim/mcore_optim.py: 3 code scanning alerts marked as fixed.
Comment on lines +107 to +112
from megatron.core.utils import (
    drain_embedding_wgrad_compute,
    get_model_config,
    init_method_normal,
    scaled_init_method_normal,
)

Check notice (Code scanning / CodeQL): Unused import

Import of 'init_method_normal' is not used.
Import of 'scaled_init_method_normal' is not used.
@erhoo82 (Collaborator) commented Apr 26, 2024:

@ericharper Can you trigger the tests?

@ericharper (Collaborator) replied:

The tests are triggered; we only need to add the "Run CICD" label.

Signed-off-by: Gao Deng <gdeng@nvidia.com>
@gdengk (Contributor, Author) commented Apr 26, 2024:

@ericharper will the tests be re-triggered automatically after I make new changes with the CICD label? It looks like they still need to be manually re-triggered.

from megatron.core.optimizer import OptimizerConfig, get_megatron_optimizer
from megatron.core.utils import get_model_config

HAVE_MEGATRON_CORE = True

Check notice (Code scanning / CodeQL): Unused global variable

The global variable 'HAVE_MEGATRON_CORE' is not used.

except (ImportError, ModuleNotFoundError):

    HAVE_MEGATRON_CORE = False

Check notice (Code scanning / CodeQL): Unused global variable

The global variable 'HAVE_MEGATRON_CORE' is not used.
import torch

try:
    from megatron.core.optimizer.optimizer import MegatronOptimizer

Check notice (Code scanning / CodeQL): Unused import

Import of 'MegatronOptimizer' is not used.
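
All three notices above point at the same gap: the try/except import guard sets HAVE_MEGATRON_CORE, but nothing in the flagged file consumes it. A minimal sketch of the guard pattern the pre-checks ask for, with the flag actually used (the helper function below is hypothetical, added only for illustration):

try:
    from megatron.core.optimizer import OptimizerConfig, get_megatron_optimizer

    HAVE_MEGATRON_CORE = True
except (ImportError, ModuleNotFoundError):
    HAVE_MEGATRON_CORE = False


def build_mcore_optimizer(model_chunks, optimizer_config):
    # Hypothetical helper: fail with a clear message when the optional
    # megatron.core dependency is missing, instead of raising a bare
    # NameError/ImportError at call time.
    if not HAVE_MEGATRON_CORE:
        raise ImportError(
            "megatron.core is required for the mcore distributed optimizer; "
            "please install megatron-core."
        )
    return get_megatron_optimizer(optimizer_config, model_chunks)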
@ericharper ericharper merged commit fb850d1 into NVIDIA:main May 7, 2024
248 of 252 checks passed
@gdengk gdengk deleted the gao/moe/nemo_mcore_dist_optim_part1 branch May 7, 2024 20:58
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
…NVIDIA#9034)

* merge mcore dist optim

Signed-off-by: Gao Deng <gdeng@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean up

Signed-off-by: Gao Deng <gdeng@nvidia.com>

* address comments

Signed-off-by: Gao Deng <gdeng@nvidia.com>

* fix import and CodeQL comments

Signed-off-by: Gao Deng <gdeng@nvidia.com>

* remove two type check

Signed-off-by: Gao Deng <gdeng@nvidia.com>

---------

Signed-off-by: Gao Deng <gdeng@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Labels: core (Changes to NeMo Core), NLP, Run CICD
4 participants