Add Mcore DistributedDataParallel and distributed optimizer into Nemo #9034

Merged
merged 9 commits into NVIDIA:main on May 7, 2024

Conversation

@gdengk (Contributor) commented Apr 24, 2024

What does this PR do?

Add Mcore DistributedDataParallel and the distributed optimizer into NeMo (examples/nlp/language_modeling/megatron_gpt_pretraining.py).

Changelog

  • Port Mcore DistributedDataParallel to NeMo
  • Add a wrapper, McoreDistributedOptimizer, to bypass torch/PTL assertion checks (see the sketch after this list)
  • Add the optim name mcore_distributed_optim to turn on the mcore distributed optimizer, along with a few other mcore-related flags (details in the next section)
  • Verified memory usage and accuracy parity between the mcore and apex optimizers
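
To make the wrapper idea concrete, here is a minimal sketch (not the PR's exact code) of how a Megatron-Core optimizer can be presented through the torch.optim.Optimizer interface so that PyTorch/PTL type and attribute checks pass; the class name matches the changelog, but the method set and attribute mirroring are assumptions:

import torch


class McoreDistributedOptimizer(torch.optim.Optimizer):
    """Illustrative wrapper: delegate all real work to a wrapped
    Megatron-Core optimizer while looking like a torch optimizer."""

    def __init__(self, mcore_optimizer):
        # Deliberately skip torch.optim.Optimizer.__init__ and mirror the
        # wrapped optimizer's param_groups so PTL's assertions still pass.
        self.mcore_optimizer = mcore_optimizer
        self.param_groups = mcore_optimizer.param_groups

    def zero_grad(self, set_to_none: bool = True):
        self.mcore_optimizer.zero_grad(set_to_none)

    def step(self, closure=None):
        # The distributed update (grad reduce-scatter, param all-gather)
        # happens inside the Megatron-Core optimizer itself.
        loss = closure() if closure is not None else None
        self.mcore_optimizer.step()
        return loss

    def state_dict(self):
        return self.mcore_optimizer.state_dict()

    def load_state_dict(self, state_dict):
        self.mcore_optimizer.load_state_dict(state_dict)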

Usage

  • An example configuration for the mcore distributed optimizer is shown below:
optim:
    name: mcore_distributed_optim
    overlap_grad_sync: false
    overlap_param_sync: false
    grad_sync_dtype: fp32
    delay_param_gather: false
    delay_grad_reduce: true
    ddp_bucket_size: null
    check_for_nan_in_grad: false
    lr: 0.00012
    weight_decay: 0.1
    betas:
    - 0.9
    - 0.95
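
Assuming the standard Hydra override syntax that NeMo example scripts use (the model.optim.* config path is an assumption based on the usual NeMo GPT config layout), the same settings can also be supplied on the command line:

python examples/nlp/language_modeling/megatron_gpt_pretraining.py \
    model.optim.name=mcore_distributed_optim \
    model.optim.overlap_grad_sync=false \
    model.optim.overlap_param_sync=false \
    model.optim.lr=0.00012 \
    model.optim.weight_decay=0.1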

Jenkins CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

There is no need to comment "jenkins" on the PR to trigger Jenkins CI.
The GitHub Actions CI will run automatically when the PR is opened.
To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".

Before your PR is "Ready for review"

Pre-checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Gao Deng <gdeng@nvidia.com>
@github-actions bot added the core (Changes to NeMo Core) and NLP labels Apr 24, 2024
@gdengk changed the title from "merge mcore dist optim" to "Add Mcore DistributedDataParallel and distributed optimizer into Nemo" Apr 24, 2024
@gdengk gdengk marked this pull request as ready for review April 24, 2024 23:19
gdengk added 2 commits April 24, 2024 17:08
Signed-off-by: Gao Deng <gdeng@nvidia.com>
@ericharper ericharper requested a review from akoumpa April 25, 2024 04:54
@akoumpa (Member) previously approved these changes Apr 25, 2024 and left a comment:


I've confirmed this gives the same results as mcore (1353) with EP=2 NGPUS=2 GBS=8; so I think it's good to go from my side. Thank you.

@akoumpa (Member) commented Apr 26, 2024:

The following configurations also work correctly:
  • EP=2 NGPU=8
  • EP=1 TP=1 NGPU=8

Signed-off-by: Gao Deng <gdeng@nvidia.com>
nemo/core/optim/mcore_optim.py: 3 code scanning alerts marked as fixed.
Comment on lines +107 to +112
from megatron.core.utils import (
    drain_embedding_wgrad_compute,
    get_model_config,
    init_method_normal,
    scaled_init_method_normal,
)

Check notice (Code scanning / CodeQL): Unused import

Import of 'init_method_normal' is not used.
Import of 'scaled_init_method_normal' is not used.
@erhoo82 (Collaborator) commented Apr 26, 2024:

@ericharper Can you trigger the tests?

@ericharper (Collaborator) replied:

The tests are triggered; we only need to add the "Run CICD" label.

Signed-off-by: Gao Deng <gdeng@nvidia.com>
@gdengk (Contributor, Author) commented Apr 26, 2024:

@ericharper will the tests be re-triggered automatically after I make new changes with the CICD label? It looks like they still need to be manually re-triggered.

from megatron.core.optimizer import OptimizerConfig, get_megatron_optimizer
from megatron.core.utils import get_model_config

HAVE_MEGATRON_CORE = True

Check notice (Code scanning / CodeQL): Unused global variable

The global variable 'HAVE_MEGATRON_CORE' is not used.

except (ImportError, ModuleNotFoundError):

    HAVE_MEGATRON_CORE = False

Check notice (Code scanning / CodeQL): Unused global variable

The global variable 'HAVE_MEGATRON_CORE' is not used.
import torch

try:
    from megatron.core.optimizer.optimizer import MegatronOptimizer

Check notice (Code scanning / CodeQL): Unused import

Import of 'MegatronOptimizer' is not used.
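
All three notices above point at the same gap: the try/except import guard sets HAVE_MEGATRON_CORE, but nothing in the flagged file consumes it. A minimal sketch of the guard pattern the pre-checks ask for, with the flag actually used (the helper function below is hypothetical, added only for illustration):

try:
    from megatron.core.optimizer import OptimizerConfig, get_megatron_optimizer

    HAVE_MEGATRON_CORE = True
except (ImportError, ModuleNotFoundError):
    HAVE_MEGATRON_CORE = False


def build_mcore_optimizer(model_chunks, optimizer_config):
    # Hypothetical helper: fail with a clear message when the optional
    # megatron.core dependency is missing, instead of raising a bare
    # NameError/ImportError at call time.
    if not HAVE_MEGATRON_CORE:
        raise ImportError(
            "megatron.core is required for the mcore distributed optimizer; "
            "please install megatron-core."
        )
    return get_megatron_optimizer(optimizer_config, model_chunks)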
@ericharper ericharper merged commit fb850d1 into NVIDIA:main May 7, 2024
248 of 252 checks passed
@gdengk gdengk deleted the gao/moe/nemo_mcore_dist_optim_part1 branch May 7, 2024 20:58
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
…NVIDIA#9034)

* merge mcore dist optim

Signed-off-by: Gao Deng <gdeng@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean up

Signed-off-by: Gao Deng <gdeng@nvidia.com>

* address comments

Signed-off-by: Gao Deng <gdeng@nvidia.com>

* fix import and CodeQL comments

Signed-off-by: Gao Deng <gdeng@nvidia.com>

* remove two type check

Signed-off-by: Gao Deng <gdeng@nvidia.com>

---------

Signed-off-by: Gao Deng <gdeng@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Labels: core (Changes to NeMo Core), NLP, Run CICD
4 participants