context parallelism #7739
Conversation
Hey @xrennvidia, during the checkpoint stage we ran into an issue. You should be able to reproduce it with CP=2 and a small GPT model. Let me know if you need more details.
""" | ||
cp_stream = torch.cuda.Stream() | ||
|
||
for module in self.get_gpt_module_list(): |
Couldn't find the method get_gpt_module_list. Is it get_model_module_list?
Please pull the latest commit; this is stale code.
Thanks. Did you mean the latest NeMo context parallel commit, or Megatron-LM/TransformerEngine?
latest NeMo context parallel commit
Thanks!
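For readers following this thread: the loop under review hands context-parallel state to each Transformer Engine layer. Below is a minimal, hypothetical sketch of that idea, assuming a get_model_module_list-style accessor and TE's set_context_parallel_group hook; the exact names and signatures may differ from the merged code.

```python
import torch

def setup_transformer_engine_cp_running(model_modules, cp_group, cp_global_ranks):
    """Hypothetical sketch of the setup loop discussed above."""
    # A dedicated stream lets CP point-to-point communication overlap with attention compute.
    cp_stream = torch.cuda.Stream()

    for module in model_modules:  # e.g. the list returned by get_model_module_list()
        for child in module.modules():
            # Transformer Engine layers expose set_context_parallel_group();
            # the signature may vary across TE versions.
            if hasattr(child, "set_context_parallel_group"):
                child.set_context_parallel_group(cp_group, cp_global_ranks, cp_stream)
```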
Hey @xrennvidia, I am using Megatron-LM (mcore r0.4.0). If I pull the latest changes in this PR, would I also need to cherry-pick NVIDIA/Megatron-LM@5eaa937 to r0.4.0? Again, thanks for developing this feature, it really benefits our use case a lot!
Hi @stu1130, very happy to know this is helpful :) . If you want to run with PP > 1, you need to cherry-pick it. Also FYI, I have a fix for your issue here. I think the fix should be merged to the MLM main branch soon, maybe tomorrow.
LGTM. Thanks!
* make NeMo recognize sequence_parallel_size
* add helper functions to set up SP running in TE
* slice seq length for a specific rank
* fix data_parallel_size calculation
* minor change
* add missing argument of self
* pass sp_global_ranks to TE transformer layer
* fix nsys setting
* fix seq_len calculation
* fix attn_mask split across seq-length dim
* code update of input split
* fix loss calculation
* fix loss_mask_sum calculation
* fix loss calculation
* rename sequence_parallelism to context_parallelism
* minor change
* fix loss_mask_sum calculation
* make sure megatron-core parallel_state is not called while cp_size is 1
* slice position embedding for different CP ranks
* fix missing property decorator
* typo fix
* fix rpe_bias CP slicing
* code style fix
* fix loss_mask_sum calculation
* do not load attention mask if it's not needed
* bug fix
* fix ubuf size with CP > 1
* address naming confusion of mixed DP and CP
* rewrite CP code assuming with_context_parallel=False
* pop context_parallel from dist opt kwargs
* make sure the amax reduction group is aware of context parallelism
* remove use_fp8 from initialize_model_parallel
* make implementations of setup_transformer_engine_tp_groups and setup_transformer_engine_cp_running consistent
* CP function renaming
* make loss logging broadcast aware of CP
* fix a typo
* var name fix
* import transformer layer specs from MCore
* upgrade MCore version
* add context_parallel into the kwargs of dist opt
* remove redundant CP check
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* code style fix
* recover docker file
* fix seq_length of CP
* recover seq-length, which has been fixed in MCore
* function name fix

Signed-off-by: xren <xren@nvidia.com>
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
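Several items in the list above touch the parallel-size bookkeeping (e.g. "fix data_parallel_size calculation"). As a hedged illustration of that relationship, not the PR's exact code: with context parallelism, each model replica spans TP * PP * CP ranks, so the data-parallel size shrinks accordingly.

```python
def infer_data_parallel_size(world_size: int, tp: int, pp: int, cp: int) -> int:
    """Illustrative only: each model replica occupies tp * pp * cp ranks."""
    ranks_per_replica = tp * pp * cp
    assert world_size % ranks_per_replica == 0, "world size must be divisible by tp * pp * cp"
    return world_size // ranks_per_replica

# Example: 64 GPUs with TP=4, PP=2, CP=2 leave 4 data-parallel replicas.
print(infer_data_parallel_size(64, tp=4, pp=2, cp=2))  # 4
```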
This reverts commit 58d6bce.
This reverts commit a16f6c6.
What does this PR do?
GPT training with long-context input (e.g., sequence lengths of 16K, 32K, or 64K) can easily overflow GPU memory because of the huge activations. Context parallelism splits the long-context input along the sequence-length dimension and parallelizes the partitioned sequence segments across multiple GPUs. This way, each GPU only needs to store the activations for a portion of the sequence, which avoids memory overflow.
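As a concrete illustration of this splitting, here is a simplified sketch; it is not the PR's actual implementation, which may use a different chunking scheme (e.g. for causal-attention load balancing).

```python
import torch

def get_cp_shard(batch: dict, cp_size: int, cp_rank: int) -> dict:
    """Simplified illustration: give each context-parallel rank a contiguous
    1/cp_size slice of the sequence dimension."""
    sharded = {}
    for key, tensor in batch.items():
        seq_dim = 1  # assume a [batch, seq, ...] layout for this sketch
        seq_len = tensor.size(seq_dim)
        assert seq_len % cp_size == 0, "sequence length must be divisible by cp_size"
        shard_len = seq_len // cp_size
        sharded[key] = tensor.narrow(seq_dim, cp_rank * shard_len, shard_len)
    return sharded

# Example: a 32K-token batch split across CP=4 -> each GPU holds 8K tokens
# (and correspondingly smaller activations).
batch = {"tokens": torch.zeros(2, 32768, dtype=torch.long)}
print(get_cp_shard(batch, cp_size=4, cp_rank=0)["tokens"].shape)  # torch.Size([2, 8192])
```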
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
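A possible usage sketch follows. The config key name (context_parallel_size) and the example config path are assumptions based on this PR's naming, not a confirmed interface.

```python
from omegaconf import OmegaConf

# Hypothetical example: enable context parallelism on top of the standard
# Megatron GPT pretraining config shipped with NeMo (path assumes the NeMo repo root).
cfg = OmegaConf.load("examples/nlp/language_modeling/conf/megatron_gpt_config.yaml")

cfg.model.context_parallel_size = 2    # split every sequence across 2 GPUs (assumed key)
cfg.model.encoder_seq_length = 32768   # the long-context setting that motivates CP
cfg.trainer.devices = 8                # total GPUs; must be divisible by TP * PP * CP

# The same overrides can be passed as Hydra-style command-line arguments,
# e.g. model.context_parallel_size=2, to the GPT pretraining script.
print(OmegaConf.to_yaml(cfg.model))    # inspect the resulting model config
```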
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information