Add MCore FSDP2 support #11216

BoxiangW · 2024-11-07T20:38:25Z

What does this PR do?

Add MCore FSDP2 support

Collection: [Note which collection this PR will affect]

Changelog

Add specific line by line info of high level changes in this PR.

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

github-actions · 2024-11-14T00:37:46Z

beep boop 🤖: 🙏 The following files have warnings. In case you are familiar with these, please try helping us to improve the code base.

Your code was analyzed with PyLint. The following annotations have been identified:

************* Module nemo.lightning.megatron_parallel
nemo/lightning/megatron_parallel.py:250:0: C0301: Line too long (127/119) (line-too-long)
nemo/lightning/megatron_parallel.py:251:0: C0301: Line too long (140/119) (line-too-long)
nemo/lightning/megatron_parallel.py:252:0: C0301: Line too long (130/119) (line-too-long)
nemo/lightning/megatron_parallel.py:558:0: C0301: Line too long (129/119) (line-too-long)
nemo/lightning/megatron_parallel.py:565:0: C0301: Line too long (135/119) (line-too-long)
nemo/lightning/megatron_parallel.py:863:0: C0301: Line too long (137/119) (line-too-long)
nemo/lightning/megatron_parallel.py:1093:0: C0301: Line too long (136/119) (line-too-long)
nemo/lightning/megatron_parallel.py:1660:0: C0301: Line too long (128/119) (line-too-long)
nemo/lightning/megatron_parallel.py:1699:0: C0301: Line too long (146/119) (line-too-long)
nemo/lightning/megatron_parallel.py:73:0: C0115: Missing class docstring (missing-class-docstring)
nemo/lightning/megatron_parallel.py:74:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:76:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:111:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:115:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:318:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:342:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:368:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:394:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:530:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:573:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:577:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:640:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:676:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:683:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:717:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:725:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:741:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:768:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:780:0: C0115: Missing class docstring (missing-class-docstring)
nemo/lightning/megatron_parallel.py:802:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:811:4: C0115: Missing class docstring (missing-class-docstring)
nemo/lightning/megatron_parallel.py:833:8: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:1353:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:1528:0: C0115: Missing class docstring (missing-class-docstring)
nemo/lightning/megatron_parallel.py:1534:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:1540:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:1544:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:1549:0: C0115: Missing class docstring (missing-class-docstring)
nemo/lightning/megatron_parallel.py:1554:0: C0115: Missing class docstring (missing-class-docstring)
nemo/lightning/megatron_parallel.py:1582:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:1628:8: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:1650:0: C0115: Missing class docstring (missing-class-docstring)
nemo/lightning/megatron_parallel.py:1723:0: C0115: Missing class docstring (missing-class-docstring)
nemo/lightning/megatron_parallel.py:1762:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:1776:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/lightning/megatron_parallel.py:51:0: W0611: Unused TrainerFn imported from pytorch_lightning.trainer.states (unused-import)

-----------------------------------
Your code has been rated at 9.54/10

Thank you for improving NeMo's documentation!

nemo/lightning/megatron_parallel.py

+    from megatron.core.distributed import TorchFullyShardedDataParallel as McoreTorchFSDP
+
+    HAVE_MCORE_FSDP2 = True
+except:


ericharper · 2024-11-14T20:33:34Z

nemo/lightning/pytorch/strategies/megatron_strategy.py

@@ -193,6 +195,7 @@ def __init__(
        ckpt_load_optimizer: bool = True,
        ckpt_save_optimizer: bool = True,
        ddp: Union[DDPLiteral, DistributedDataParallelConfig] = "megatron",
+        fsdp: bool = False,


Should we rename it to torch_fsdp, since we know we'll eventually have mcore_fsdp?
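One way to read the naming suggestion: an explicit torch_fsdp flag leaves room for a separate mcore_fsdp flag later. The sketch below is hypothetical. ShardingOptions, the mcore_fsdp field, and the mutual-exclusion check are illustrative names and assumptions, not part of MegatronStrategy or this PR.

```python
from dataclasses import dataclass


@dataclass
class ShardingOptions:
    """Hypothetical container for illustration; not a NeMo class."""

    torch_fsdp: bool = False  # proposed name for the PR's `fsdp` flag
    mcore_fsdp: bool = False  # assumed future flag; it does not exist yet

    def __post_init__(self) -> None:
        # Assumption: the two FSDP implementations would be mutually exclusive.
        if self.torch_fsdp and self.mcore_fsdp:
            raise ValueError("Enable only one of torch_fsdp and mcore_fsdp.")


opts = ShardingOptions(torch_fsdp=True)
print(opts)
```

With distinct names, a later mcore_fsdp option could be added without changing the meaning of an existing keyword argument.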

blahBlahhhJ and others added 6 commits September 19, 2024 09:57

add configuration for mcore torch fsdp2

4f8ed04

Signed-off-by: jasonwan <jasonwan@nvidia.com>

convert to dtensor and load optim

32b020b

Signed-off-by: jasonwan <jasonwan@nvidia.com>

add tests

273e708

Signed-off-by: jasonwan <jasonwan@nvidia.com>

Apply isort and black reformatting

e67de8d

Signed-off-by: blahBlahhhJ <blahBlahhhJ@users.noreply.github.com>

Merge branch 'jasonwan/mcore-fsdp2' into boxiangw/mcore-fsdp2

773a1e6

Apply isort and black reformatting

f568cdf

Signed-off-by: BoxiangW <BoxiangW@users.noreply.github.com>

github-advanced-security bot found potential problems Nov 7, 2024

tests/collections/llm/test_mnist_model_nemo2_fsdp2.py (fixed)

tests/collections/llm/test_mnist_model_nemo2_fsdp2.py (dismissed)

tests/collections/llm/test_mnist_model_nemo2_fsdp2.py (dismissed)

Fix test

6d79a66

github-advanced-security bot found potential problems Nov 7, 2024

tests/collections/llm/test_mnist_model_nemo2_fsdp2.py (dismissed)

BoxiangW self-assigned this Nov 8, 2024

BoxiangW added feature request/PR for a new feature Run CICD labels Nov 8, 2024

Fix param name changes

7cd52bd

BoxiangW added Run CICD and removed Run CICD labels Nov 8, 2024

Merge branch 'main' into boxiangw/mcore-fsdp2

9dc5659

Signed-off-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>

BoxiangW added Run CICD and removed Run CICD labels Nov 8, 2024

BoxiangW and others added 6 commits November 12, 2024 16:32

Add docstring

5b9a598

Add version check for FSDP

d7b29ef

Apply isort and black reformatting

941d9e9

Signed-off-by: BoxiangW <BoxiangW@users.noreply.github.com>

Fix bug

66b20ad

Signed-off-by: Boxiang Wang <boxiangw@nvidia.com>

Add version check

acb6104

Apply isort and black reformatting

2e2195e

Signed-off-by: BoxiangW <BoxiangW@users.noreply.github.com>

github-advanced-security bot found potential problems Nov 14, 2024

nemo/lightning/megatron_parallel.py

    from megatron.core.distributed import TorchFullyShardedDataParallel as McoreTorchFSDP

    HAVE_MCORE_FSDP2 = True
except:

Check notice

Code scanning / CodeQL

Except block handles 'BaseException' Note

Except block directly handles BaseException.
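
CodeQL flags the bare except because it also catches BaseException subclasses such as KeyboardInterrupt and SystemExit. A minimal sketch of a narrower guard follows, assuming an unavailable or outdated Megatron Core surfaces as an ImportError (ModuleNotFoundError is a subclass of it). The None fallback is an illustrative choice, not something this PR contains.

```python
try:
    from megatron.core.distributed import TorchFullyShardedDataParallel as McoreTorchFSDP

    HAVE_MCORE_FSDP2 = True
except ImportError:
    # Raised when megatron.core is missing (ModuleNotFoundError) or does not
    # export the class. KeyboardInterrupt and SystemExit still propagate.
    McoreTorchFSDP = None  # placeholder fallback, an assumption for illustration
    HAVE_MCORE_FSDP2 = False

print(f"MCore FSDP2 available: {HAVE_MCORE_FSDP2}")
```

Callers can then branch on HAVE_MCORE_FSDP2 before touching McoreTorchFSDP.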

ericharper reviewed Nov 14, 2024
