Add context parallelism configurations and parallel group #26057
base: main
Conversation
Code Review
This pull request introduces context parallelism by adding configurations and communication groups. The changes are mostly correct and consistent with the existing parallelism structure. However, I've identified a critical bug in the creation of expert parallel groups when context parallelism is enabled. The logic for grouping ranks for expert parallelism is incorrect for the new 5D tensor layout, which will lead to incorrect behavior for MoE models. I've provided a fix for this issue.
vllm/distributed/parallel_state.py
Outdated
group_ranks = (all_ranks.transpose(1, 2).reshape(
    -1, data_parallel_size * tensor_model_parallel_size).unbind(0))
The logic for creating the expert parallel group (_EP) is incorrect with the introduction of the context parallelism dimension. The all_ranks tensor is now 5D with shape (ExtDP, DP, PP, CP, TP). The expert parallel group should group ranks that have the same (ExtDP, PP, CP) coordinates, which means it should span across DP and TP dimensions. The current transpose(1, 2) operation is incorrect for this 5D tensor and does not produce the correct grouping. This will lead to incorrect behavior for MoE models when context parallelism is enabled. It should be replaced with a permutation that brings the DP and TP dimensions to the end before reshaping.
group_ranks = (all_ranks.permute(0, 2, 3, 1, 4).reshape(
    -1, data_parallel_size * tensor_model_parallel_size).unbind(0))
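To make the effect of this permutation concrete, here is a small self-contained sketch (not the vLLM code itself; the dimension sizes are arbitrary) showing that moving the DP and TP axes to the end before reshaping yields groups that share the same (ExtDP, PP, CP) coordinates:

# Minimal sketch (not the vLLM implementation): expert-parallel grouping
# on a toy 5D rank grid. Dimension sizes are arbitrary examples.
import torch

ext_dp, dp, pp, cp, tp = 1, 2, 1, 2, 2
world_size = ext_dp * dp * pp * cp * tp

# Rank grid laid out as (ExtDP, DP, PP, CP, TP), matching the layout above.
all_ranks = torch.arange(world_size).reshape(ext_dp, dp, pp, cp, tp)

# Expert-parallel groups should span DP and TP while keeping (ExtDP, PP, CP)
# fixed, so move DP and TP to the trailing dimensions before reshaping.
ep_groups = (all_ranks.permute(0, 2, 3, 1, 4)
             .reshape(-1, dp * tp).unbind(0))

for group in ep_groups:
    print(group.tolist())
# With the sizes above this prints [0, 1, 4, 5] and [2, 3, 6, 7]:
# each group fixes the CP index and varies DP/TP, as intended.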
This pull request has merge conflicts that must be resolved before it can be merged.
These conflicts are caused by our migration to …
Purpose
Context parallelism improves performance as the context length grows by distributing both computation and the KV cache across multiple GPUs. This lowers processing latency and can also reduce the memory required per GPU, especially with extremely large KV caches (such as sequence lengths on the order of 1 million tokens), as shown in the figure below. This PR adds the initial context parallelism configuration and communication group.
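As a rough illustration of the memory argument (the model dimensions below are assumptions for the sketch, not taken from this PR), sharding the sequence dimension across a context-parallel group divides the per-GPU KV-cache footprint roughly by the CP size:

# Back-of-the-envelope sketch with assumed model dimensions (not from this PR):
# how sharding the sequence across a context-parallel group shrinks the
# per-GPU KV cache.
num_layers, num_kv_heads, head_dim = 32, 8, 128   # hypothetical model
dtype_bytes = 2                                   # fp16/bf16

def kv_cache_gib_per_gpu(seq_len: int, cp_size: int = 1) -> float:
    # Each token stores one key and one value vector per layer per KV head;
    # context parallelism shards the sequence dimension across cp_size GPUs.
    tokens_per_gpu = seq_len / cp_size
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * tokens_per_gpu
    return total_bytes / 1024**3

for cp in (1, 2, 4, 8):
    print(f"cp_size={cp}: {kv_cache_gib_per_gpu(1_000_000, cp):.1f} GiB per GPU")
# cp_size=1: ~122 GiB, cp_size=8: ~15 GiB for a 1M-token sequence.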
Test Plan
Unit tests and e2e tests will be submitted in follow-up PRs.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.