
Add distopt support for FP8 params and BF16 optimizer state #7909

Merged: 21 commits merged into NVIDIA:main from the distopt-fp8-bf16-state branch on Jan 12, 2024

Conversation

@timmoon10 (Collaborator) commented Nov 17, 2023

What does this PR do?

Adds support in the Apex distributed Adam optimizer for FP8 parameters (using experimental FP8 tensors from Transformer Engine) and for BF16 optimizer state.

Collection: NLP

Changelog

  • Adds distributed optimizer support for FP8 parameters
  • Adds the option to initialize GPT with FP8 parameters
  • Adds support for non-FP32 distributed optimizer state

Usage

Run GPT, e.g. with the config at https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml.

Enable FP8 support with model.fp8=True, FP8 parameters with model.fp8_params=True, the distributed optimizer with model.optim.name=distributed_fused_adam, and BF16 optimizer state with model.optim.dtype=bf16.
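
For reference, a full launch command might look like the sketch below. The pretraining script path is an assumption about the usual GPT entry point; the four overrides are the ones described above.

```bash
# Illustrative sketch only -- assumes the standard NeMo GPT pretraining script.
python examples/nlp/language_modeling/megatron_gpt_pretraining.py \
    model.fp8=True \
    model.fp8_params=True \
    model.optim.name=distributed_fused_adam \
    model.optim.dtype=bf16
```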

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.

Additional Information

@timmoon10 requested a review from erhoo82 on November 17, 2023.
@github-actions bot added the core (Changes to NeMo Core), NLP, and CI labels on Nov 17, 2023.
@timmoon10 removed the core (Changes to NeMo Core) and CI labels on Nov 17, 2023.
@github-actions bot added the core (Changes to NeMo Core) and CI labels on Nov 17, 2023.
@chiendb97 commented

I trained a LLaMA model on 2 nodes using model.fp8=True, model.fp8_params=True, model.optim.name=distributed_fused_adam, and model.optim.dtype=bf16.
I got this error when saving a checkpoint:

[Screenshot of the error traceback, 2023-11-20]

How can I solve this problem?

Thank you!

@timmoon10 (Collaborator, Author) commented

@chiendb97 Thanks for the report! I reproduced the error and have a fix at NVIDIA/TransformerEngine#529. I haven't tested it thoroughly, but I was able to save and load a checkpoint for LLaMA with FP8 params.

@lhb8125 (Contributor) commented Dec 8, 2023

@timmoon10 Is there any blocker for the review? With this PR, the memory allocation per parameter is 1 (FP8 weight) + 1 (FP8 transpose) + 2 (BF16 gradient) + [2 (BF16 main weight) + 2 (BF16 momentum) + 2 (BF16 variance)] / dp = 4 + 6/dp bytes, where dp is the data-parallel size. Is my understanding correct?
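
A minimal sketch of that per-parameter accounting (illustrative only; the helper function and example dp values are not part of this PR):

```python
# Rough bytes-per-parameter implied by the comment above, assuming the listed
# buffer dtypes: FP8 weight + FP8 transpose + BF16 gradient are kept per rank,
# while the BF16 main weight and Adam moments are sharded across dp ranks.
def distopt_bytes_per_param(dp_size: int) -> float:
    fp8_weight = 1              # FP8 parameter
    fp8_transpose = 1           # cached FP8 transpose
    bf16_grad = 2               # BF16 gradient buffer
    sharded_state = 2 + 2 + 2   # BF16 main weight + momentum + variance
    return fp8_weight + fp8_transpose + bf16_grad + sharded_state / dp_size

print(distopt_bytes_per_param(8))   # 4.75 bytes/param at dp=8
print(distopt_bytes_per_param(64))  # ~4.09 bytes/param at dp=64
```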

@timmoon10 force-pushed the distopt-fp8-bf16-state branch from 8c81a3b to b031db6 on December 11, 2023.

@timmoon10 (Collaborator, Author) commented Dec 12, 2023

@lhb8125 The only remaining blocker is fixing some convergence issues when running SFT. I don't fully understand them yet, but I don't expect they will require major changes. Starting a review now would be great.

@timmoon10 (Collaborator, Author) commented

I've found a bug when using this PR with LLaMA SFT. Bugfix: NVIDIA/TransformerEngine#567.

This does not affect GPT pretraining, though. I think this is ready to review and merge.

@timmoon10 requested a review from ericharper on December 15, 2023.
@ericharper (Collaborator) commented
jenkins

@ericharper previously approved these changes Dec 15, 2023

@ericharper (Collaborator) left a comment

LGTM. Thanks!

@timmoon10 (Collaborator, Author) commented

The Jenkins failure is because CI is using an old version of Apex. The Dockerfile and README have been updated with the required Apex version.

@ericharper (Collaborator) commented

What base PyTorch version is needed? We can update it in the Jenkinsfile.

@erhoo82 previously approved these changes Jan 3, 2024

@erhoo82 (Collaborator) commented Jan 3, 2024

@athitten
Can you help review this PR? I think @ericharper is away.

@timmoon10 (Collaborator, Author) commented

jenkins

@erhoo82 previously approved these changes Jan 9, 2024

@erhoo82 (Collaborator) left a comment

LGTM

@timmoon10 (Collaborator, Author) commented

jenkins

@erhoo82 (Collaborator) commented Jan 9, 2024

@timmoon10 Can you check why this fails the CI?

@timmoon10 (Collaborator, Author) commented Jan 9, 2024

The error message is "No space left on device", so I suspect it's related to the recent file-system issues on DLCluster. I often need to rerun a couple of times to get past these errors, as well as past segfaults coming from ASR.

@timmoon10 (Collaborator, Author) commented

jenkins

@timmoon10 (Collaborator, Author) commented

jenkins

@timmoon10 (Collaborator, Author) commented

jenkins

@timmoon10 (Collaborator, Author) commented

jenkins

@ericharper (Collaborator) commented

jenkins

@ericharper (Collaborator) left a comment

LGTM. Thanks!

@ericharper (Collaborator) commented

jenkins

@ericharper merged commit 6082d76 into NVIDIA:main on Jan 12, 2024 (11 checks passed).
minitu pushed a commit to minitu/NeMo that referenced this pull request on Jan 17, 2024, and again on Jan 19, 2024.
ssh-meister pushed a commit to ssh-meister/NeMo that referenced this pull request on Feb 15, 2024 (additionally Signed-off-by: Sasha Meister <ameister@nvidia.com>).
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request on Jun 25, 2024.

All of these carry the same squashed commit message:

* Add distopt support for FP8 params and BF16 optimizer state
* Removed unused import
* Update PyTorch container in Jenkins pipeline
* Use custom container with Apex bugfixes (see NVIDIA/apex#1760)
* Upgrade to PyTorch 23.11 container
* Update Apex commit

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Labels: CI, core (Changes to NeMo Core), NLP

5 participants