Extend sequence length padding for GPT SFT to account for context parallel #8869
Conversation
jenkins
Hi @vysarge, I do not totally understand what the first input dim and the second dim mean; could you please describe them? And how does it work if sequence parallel is also enabled?
Hm, I thought sequence parallel and context parallel were mutually exclusive. If they're used together, do CP and TP counts split on the same dimension? If so, making the change you suggested should fix this case too; let me test it.
Sequence parallel and context parallel can work together: sequence parallel only splits the sequence for LayerNorm, while context parallel additionally splits the sequence for the whole transformer layer. Yeah, you can test my suggested fix; I think it should work for all cases. Thanks.
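A minimal sketch of the padding-multiple logic under discussion (hypothetical helper and argument names, not the PR's actual code): when the sequence dimension is shared across context-parallel ranks, and additionally across tensor-parallel ranks when sequence parallelism is on, the pre-split length must be padded so that each rank's chunk still satisfies TE's fp8 divisibility-by-16 requirement.

```python
# Hypothetical sketch, not the PR's exact code: compute the multiple that the
# (unsplit) sequence length should be padded to so each rank's chunk of the
# sequence dimension remains divisible by 16, as TE fp8 GEMMs require.
def get_pad_multiple(cp_size: int, tp_size: int, sequence_parallel: bool, fp8: bool) -> int:
    base = 16 if fp8 else 1  # TE fp8 divisibility requirement
    divisor = cp_size * (tp_size if sequence_parallel else 1)  # ranks sharing the seq dim
    return base * divisor
```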
Force-pushed eab5ecb to 76c7f71: "Pad outputs from GPTSFTDataset / GPTSFTPackedDataset to match TE requirements when using fp8" (Signed-off-by: Valerie Sarge <vsarge@nvidia.com>)
Tested and this fix is working well; the PR has been updated, thanks.
LGTM, thanks.
jenkins
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.
Extend sequence length padding for GPT SFT to account for context parallel (NVIDIA#8869)
* Pad outputs from GPTSFTDataset / GPTSFTPackedDataset to match TE requirements when using fp8
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* Account for SP + CP case
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
---------
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Co-authored-by: Pablo Garay <palenq@gmail.com>
What does this PR do?
Adds a check for MegatronGPTSFTModel to ensure the sequence dimension will be padded to a multiple of 16 after being split over context parallel ranks, as TE requires during fp8 training; this fixes errors like
AssertionError: FP8 execution requires 2D input matrices with height divisible by 8 and width divisible by 16, but got tensor with dims=[5504, 184]
when using context parallel with datasets that have a variable sequence length.

Collection: nlp
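For illustration, here is a minimal sketch of the kind of padding the check enforces (not the PR's code; pad_to_multiple is a hypothetical helper): a variable-length sequence is right-padded up to the required multiple before being split across CP ranks.

```python
import math

import torch


def pad_to_multiple(tokens: torch.Tensor, multiple: int, pad_id: int) -> torch.Tensor:
    """Right-pad a 1-D token tensor so its length is a multiple of `multiple`."""
    target = math.ceil(tokens.size(0) / multiple) * multiple
    pad = torch.full((target - tokens.size(0),), pad_id, dtype=tokens.dtype)
    return torch.cat([tokens, pad])


# e.g. with cp_size=2 and fp8, pad to a multiple of 16 * 2 = 32:
padded = pad_to_multiple(torch.arange(100), multiple=32, pad_id=0)
assert padded.size(0) == 128  # each CP rank then gets a 64-token chunk, divisible by 16
```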
Changelog
pad_seq_length_to_mult input to GPT SFT datasets to account for context parallel

Usage
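The template snippet was left empty; the following is a hedged usage sketch showing how the new pad_seq_length_to_mult argument might be passed when constructing the dataset directly. The file path and tokenizer are placeholders, other constructor arguments are omitted, and in normal training runs MegatronGPTSFTModel derives this value from the model config rather than the user passing it by hand.

```python
# Hedged usage sketch; not taken from the PR. The file path and tokenizer are
# placeholders, and in real runs MegatronGPTSFTModel derives this value from
# the model config (context_parallel_size, fp8) rather than a user passing it.
from nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_dataset import GPTSFTDataset

cp_size = 2          # context_parallel_size from the model config
tokenizer = ...      # placeholder: a NeMo tokenizer built elsewhere

dataset = GPTSFTDataset(
    file_path="/path/to/train.jsonl",
    tokenizer=tokenizer,
    max_seq_length=4096,
    # pad so each CP rank's chunk of the sequence stays divisible by 16 (fp8):
    pad_seq_length_to_mult=16 * cp_size,
)
```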
Jenkins CI
To run Jenkins, a NeMo User with write access must comment
jenkins
on the PR.

Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.
Additional Information