context parallelism #7739

Merged: 88 commits merged into main from xren/context_parallelism on Jan 10, 2024

Conversation

xrennvidia (Collaborator)

What does this PR do?

GPT training with long-context input (e.g., sequence lengths of 16K, 32K, or 64K) can easily overflow GPU memory with huge activations. Context parallelism splits the long-context input along the sequence-length dimension and distributes the resulting sequence segments across multiple GPUs. Each GPU then only needs to store the activations for its portion of the sequence, which avoids the memory overflow. A minimal illustration of the idea is sketched below.
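For illustration only (not code from this PR), here is a minimal sketch of the core idea: split the sequence dimension of a batch across context-parallel ranks so that each rank only materializes activations for its own chunk. The function name and the cp_rank/cp_size arguments are assumptions made for this example; the real implementation lives in Megatron-Core/TransformerEngine and also handles CP-aware attention and load balancing.

import torch

def split_along_sequence(batch: torch.Tensor, cp_rank: int, cp_size: int) -> torch.Tensor:
    # Return this rank's slice of a [batch, seq_len, hidden] tensor.
    # Illustrative only: a plain chunk along the sequence dimension.
    seq_len = batch.size(1)
    assert seq_len % cp_size == 0, "sequence length must be divisible by cp_size"
    chunks = batch.chunk(cp_size, dim=1)   # split along the sequence dimension
    return chunks[cp_rank].contiguous()    # each rank keeps one contiguous segment

# Example: a 32K-token sequence split across 4 context-parallel ranks.
x = torch.randn(2, 32768, 1024)            # [batch, seq_len, hidden]
local_x = split_along_sequence(x, cp_rank=0, cp_size=4)
print(local_x.shape)                       # torch.Size([2, 8192, 1024])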

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line-by-line info of the high-level changes in this PR.

Usage

  • You can potentially add a usage example below; a hedged sketch follows the placeholder.
# Add a code snippet demonstrating how to use this 
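
A hedged usage sketch, not taken from this PR: it assumes the megatron GPT model config exposes a context_parallel_size field alongside the existing tensor/pipeline model-parallel sizes (the exact key names are assumptions; check the merged NeMo config and docs for the authoritative names).

# Hypothetical sketch of enabling context parallelism in a NeMo megatron GPT config.
# The config keys below, especially `context_parallel_size`, are assumptions.
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {
        "model": {
            "micro_batch_size": 1,
            "global_batch_size": 8,
            "tensor_model_parallel_size": 2,
            "pipeline_model_parallel_size": 1,
            "context_parallel_size": 2,   # split each sequence across 2 GPUs
            "encoder_seq_length": 32768,  # long-context input (32K tokens)
        }
    }
)

# GPUs per model replica = TP * PP * CP; remaining GPUs form the data-parallel dimension.
tp = cfg.model.tensor_model_parallel_size
pp = cfg.model.pipeline_model_parallel_size
cp = cfg.model.context_parallel_size
print(f"GPUs per model replica: {tp * pp * cp}")  # 2 * 1 * 2 = 4

With 8 GPUs total in this sketch, that leaves a data-parallel size of 2.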

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

xrennvidia added 30 commits June 5, 2023 18:40

stu1130 commented Dec 9, 2023

Hey @xrennvidia, during the checkpointing stage we ran into:

 self.trainer.strategy.save_checkpoint(_checkpoint, filepath, storage_options=storage_options)
  File "/workspace/src/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 305, in save_checkpoint
    dist_checkpointing.save(sharded_state_dict=checkpoint, checkpoint_dir=checkpoint_dir)
  File "/workspace/src/Megatron-LM/megatron/core/dist_checkpointing/serialization.py", line 221, in save
    validate_sharding_integrity(sharded_tensors)
  File "/workspace/src/Megatron-LM/megatron/core/dist_checkpointing/serialization.py", line 278, in validate_sharding_integrity
    _validate_sharding_for_key(shardings)
  File "/workspace/src/Megatron-LM/megatron/core/dist_checkpointing/serialization.py", line 316, in _validate_sharding_for_key
    raise CheckpointingException(f'Invalid access pattern for {rank_sharding[0][1]}')
megatron.core.dist_checkpointing.core.CheckpointingException: Invalid access pattern for ShardedTensor(key='optimizer.state.exp_avg.model.embedding.word_embeddings.weight')

You should be able to reproduce the issue with CP=2 and a small GPT model. Let me know if you need more details.

"""
cp_stream = torch.cuda.Stream()

for module in self.get_gpt_module_list():
stu1130 commented Dec 19, 2023

Couldn't find the method get_gpt_module_list. Is it get_model_module_list?

xrennvidia (Collaborator, Author)

Please pull the latest commit; this is stale code.

stu1130

Thanks. Did you mean the latest NeMo context parallel commit, or Megatron-LM/TransformerEngine?

xrennvidia (Collaborator, Author)

The latest NeMo context parallel commit.

stu1130

Thanks!

xrennvidia (Collaborator, Author)

jenkins

1 similar comment


stu1130 commented Jan 9, 2024

Hey @xrennvidia, I am using Megatron-LM (mcore r0.4.0). If I pull the latest change in this PR, would I also need to cherry-pick NVIDIA/Megatron-LM@5eaa937 onto r0.4.0? Again, thanks for developing this feature; it really benefits our use case a lot!


xrennvidia commented Jan 9, 2024


Hi @stu1130, very happy to know this is helpful :). If you want to run with PP > 1, you need to cherry-pick it.

Also FYI, I have a fix for your issue here. I think the fix should be merged into the MLM main branch soon, maybe tomorrow.

xrennvidia (Collaborator, Author)

jenkins

ericharper (Collaborator) left a comment

LGTM. Thanks!

ericharper merged commit 58d6bce into main on Jan 10, 2024
15 checks passed
ericharper deleted the xren/context_parallelism branch on January 10, 2024 at 06:32
minitu pushed a commit to minitu/NeMo that referenced this pull request Jan 19, 2024
* make nemo recognize sequence_parallel_size

Signed-off-by: xren <xren@nvidia.com>

* add helper functions to set up SP running in TE

Signed-off-by: xren <xren@nvidia.com>

* slice seq length for a specific rank

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix data_parallel_size calculation

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* minor change

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add missing argument of self

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* pass sp_global_ranks to TE transformer layer

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix nsys setting

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix seq_len calculation

Signed-off-by: xren <xren@nvidia.com>

* fix attn_mask split across seq-length dim

Signed-off-by: xren <xren@nvidia.com>

* code update of input split

Signed-off-by: xren <xren@nvidia.com>

* fix loss calculation

Signed-off-by: xren <xren@nvidia.com>

* fix loss_mask_sum calculation

Signed-off-by: xren <xren@nvidia.com>

* fix losss calculation

Signed-off-by: xren <xren@nvidia.com>

* rename sequence_parallelism to context_parallelism

Signed-off-by: xren <xren@nvidia.com>

* minor change

Signed-off-by: xren <xren@nvidia.com>

* fix loss_mask_sum calculation

Signed-off-by: xren <xren@nvidia.com>

* make sure do not call megatron-core parallel_state while cp_size is 1

Signed-off-by: xren <xren@nvidia.com>

* slice position embedding for different CP rank

Signed-off-by: xren <xren@nvidia.com>

* fix mising property decorator

Signed-off-by: xren <xren@nvidia.com>

* typo fix

Signed-off-by: xren <xren@nvidia.com>

* fix rpe_bias CP slicing

Signed-off-by: xren <xren@nvidia.com>

* code style fix

Signed-off-by: xren <xren@nvidia.com>

* fix loss_mask_sum calculation

Signed-off-by: xren <xren@nvidia.com>

* do not load attention mask if it's not needed

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* bug fix

Signed-off-by: xren <xren@nvidia.com>

* fix ubuf size with CP > 1

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* address naming confusion of mixed dp and cp

Signed-off-by: xren <xren@nvidia.com>

* rewrite cp code by assuming with_context_parallel=False

Signed-off-by: xren <xren@nvidia.com>

* pop context_parallel from dist opt kwargs

Signed-off-by: xren <xren@nvidia.com>

* make sure amax reduction group is aware of context parallelism

Signed-off-by: xren <xren@nvidia.com>

* remove use_fp8 from initialize_model_parallel

Signed-off-by: xren <xren@nvidia.com>

* make implementaitons of setup_transformer_engine_tp_groups and setup_transformer_engine_cp_running consistent

Signed-off-by: xren <xren@nvidia.com>

* cp function renaming

Signed-off-by: xren <xren@nvidia.com>

* make loss logging broadcast aware of cp

Signed-off-by: xren <xren@nvidia.com>

* fix a typo

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* var name fix

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* import transformer layer specs from MCore

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* upgrade MCore version

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add add context_parallel into the kwargs of dist opt

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* remove redundant cp check

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* code style fix

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* recover docker file

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix seq_length of CP

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* recover seq-length which has been fixed in mcore

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* function name fix

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

---------

Signed-off-by: xren <xren@nvidia.com>
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
minitu pushed a commit to minitu/NeMo that referenced this pull request Jan 19, 2024
minitu pushed a commit to minitu/NeMo that referenced this pull request Jan 22, 2024
minitu pushed a commit to minitu/NeMo that referenced this pull request Jan 22, 2024
minitu pushed a commit to minitu/NeMo that referenced this pull request Jan 24, 2024
minitu pushed a commit to minitu/NeMo that referenced this pull request Jan 29, 2024
layalir added a commit to layalir/NeMo that referenced this pull request Jan 31, 2024
minitu pushed a commit to minitu/NeMo that referenced this pull request Feb 1, 2024
jbaczek pushed a commit to jbaczek/NeMo that referenced this pull request Feb 2, 2024
This reverts commit 58d6bce.

Signed-off-by: Jan Baczek <jbaczek@nvidia.com>
ssh-meister pushed a commit to ssh-meister/NeMo that referenced this pull request Feb 15, 2024
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
Labels: core (Changes to NeMo Core), NLP
3 participants