Add distopt support for FP8 params and BF16 optimizer state #7909
Conversation
@chiendb97 Thanks for the report! I reproduced the error and have a fix at NVIDIA/TransformerEngine#529. I haven't tested it thoroughly, but I was able to save and load a checkpoint for Llama with FP8 params.
@timmoon10 Is there any blocker for the review? With this PR, the memory allocation per param is: 1 (FP8 param) + 1 (FP8 transpose) + 2 (BF16 gradient) + [2 (BF16 weight) + 2 (BF16 momentum) + 2 (BF16 variance)] / dp = 4 + 6/dp bytes. Is my understanding correct?
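As a quick sanity check of that accounting, here is a small hypothetical helper (not part of this PR): units are bytes per parameter, with the optimizer state sharded across `dp` data-parallel ranks.

```python
# Hypothetical helper to sanity-check the bytes-per-parameter figure above.
# The FP8 param, its FP8 transpose, and the BF16 gradient are replicated on
# every rank; the BF16 weight, momentum, and variance are sharded over dp.
def bytes_per_param(dp: int) -> float:
    replicated = 1 + 1 + 2   # FP8 param + FP8 transpose + BF16 gradient
    sharded = 2 + 2 + 2      # BF16 weight + BF16 momentum + BF16 variance
    return replicated + sharded / dp  # = 4 + 6/dp

for dp in (1, 8, 64):
    print(f"dp={dp}: {bytes_per_param(dp):.3f} bytes/param")
# dp=1: 10.000, dp=8: 4.750, dp=64: 4.094
```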
Force-pushed from 8c81a3b to b031db6.
@lhb8125 The only remaining blocker is fixing some convergence issues when running with SFT. I don't fully understand that issue yet, but I don't expect it will require major changes. Starting a review would be great.
I've found a bug when using this PR with Llama SFT; the fix is at NVIDIA/TransformerEngine#567. It does not affect GPT pretraining. I think this is ready to review and merge.
jenkins
LGTM. Thanks!
The Jenkins failure is because it uses an old version of Apex. The Dockerfile and README have been updated with the required Apex version.
What base PyTorch version is needed? We can update it in the Jenkinsfile.
@athitten
jenkins
LGTM
jenkins
@timmoon10 Can you check why this fails CI?
The error message is "No space left on device", so I suspect it's related to the recent file system issues on DLCluster. I find I often need to rerun a couple of times to get past these errors, as well as segfaults coming from ASR.
jenkins
jenkins
jenkins
jenkins
jenkins
LGTM. Thanks!
jenkins
Revert "Add distopt support for FP8 params and BF16 optimizer state (NVIDIA#7909)". This reverts commit 6082d76.
Add distopt support for FP8 params and BF16 optimizer state (#7909):
* Add distopt support for FP8 params and BF16 optimizer state
* Removed unused import
* Update PyTorch container in Jenkins pipeline
* Use custom container with Apex bugfixes (see NVIDIA/apex#1760)
* Upgrade to PyTorch 23.11 container
* Update Apex commit

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
What does this PR do?
Adds support in the Apex distributed Adam optimizer for FP8 parameters (using experimental FP8 tensors from Transformer Engine) and BF16 optimizer state.
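For intuition, here is a minimal illustrative sketch in plain PyTorch of an Adam step that keeps low-precision optimizer state while doing the update math in FP32. This is not the actual Apex `DistributedFusedAdam` implementation; the function name is hypothetical, and BF16 buffers stand in for Transformer Engine's `Float8Tensor` params.

```python
import torch

# Illustrative only: NOT the Apex DistributedFusedAdam code. Shows the
# general pattern of storing moments (and params) in low precision while
# computing the Adam update in FP32.
def adam_step_low_precision(param, grad, exp_avg, exp_avg_sq, step,
                            lr=1e-4, betas=(0.9, 0.999), eps=1e-8):
    # Upcast low-precision buffers to FP32 for the update arithmetic.
    p32, g32 = param.float(), grad.float()
    m32, v32 = exp_avg.float(), exp_avg_sq.float()

    m32.mul_(betas[0]).add_(g32, alpha=1 - betas[0])           # first moment
    v32.mul_(betas[1]).addcmul_(g32, g32, value=1 - betas[1])  # second moment

    denom = (v32 / (1 - betas[1] ** step)).sqrt_().add_(eps)
    p32.addcdiv_(m32 / (1 - betas[0] ** step), denom, value=-lr)

    # Round results back down to the low-precision storage formats.
    exp_avg.copy_(m32)     # BF16 momentum
    exp_avg_sq.copy_(v32)  # BF16 variance
    param.copy_(p32)       # FP8 param (via Float8Tensor in the real code)
```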
Collection: NLP
Changelog
Usage
Run GPT, e.g. with the config at https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml.
Enable FP8 support with `model.fp8=True`, FP8 parameters with `model.fp8_params=True`, the distributed optimizer with `model.optim.name=distributed_fused_adam`, and BF16 optimizer state with `model.optim.dtype=bf16`.
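Putting it together, a launch might look like the following sketch; the pretraining script path and the `trainer.precision` override are assumptions not stated in this PR:

```bash
python examples/nlp/language_modeling/megatron_gpt_pretraining.py \
    --config-path=conf --config-name=megatron_gpt_config \
    trainer.precision=bf16 \
    model.fp8=True \
    model.fp8_params=True \
    model.optim.name=distributed_fused_adam \
    model.optim.dtype=bf16
```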
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs to various areas.
Additional Information
* [PyTorch] Float8Tensor uses cached transpose if available (TransformerEngine#524)
* [PyTorch] Support pickling Float8Tensor (TransformerEngine#529)
* `Float8Tensor` added in [PyTorch] Experimental FP8 tensor class (TransformerEngine#452)