Add context parallelism configurations and parallel group #26057
base: main
Conversation
Code Review
This pull request introduces context parallelism by adding configurations and communication groups. The changes are mostly correct and consistent with the existing parallelism structure. However, I've identified a critical bug in the creation of expert parallel groups when context parallelism is enabled. The logic for grouping ranks for expert parallelism is incorrect for the new 5D tensor layout, which will lead to incorrect behavior for MoE models. I've provided a fix for this issue.
vllm/distributed/parallel_state.py
Outdated
group_ranks = (all_ranks.transpose(1, 2).reshape(
    -1, data_parallel_size * tensor_model_parallel_size).unbind(0))
The logic for creating the expert parallel group (_EP) is incorrect with the introduction of the context parallelism dimension. The all_ranks tensor is now 5D with shape (ExtDP, DP, PP, CP, TP). The expert parallel group should group ranks that have the same (ExtDP, PP, CP) coordinates, which means it should span across DP and TP dimensions. The current transpose(1, 2) operation is incorrect for this 5D tensor and does not produce the correct grouping. This will lead to incorrect behavior for MoE models when context parallelism is enabled. It should be replaced with a permutation that brings the DP and TP dimensions to the end before reshaping.
group_ranks = (all_ranks.permute(0, 2, 3, 1, 4).reshape(
    -1, data_parallel_size * tensor_model_parallel_size).unbind(0))
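To make the effect of this permutation concrete, here is a small self-contained sketch (not the vLLM code itself; the dimension sizes are arbitrary) showing that moving the DP and TP axes to the end before reshaping yields groups that share the same (ExtDP, PP, CP) coordinates:

# Minimal sketch (not the vLLM implementation): expert-parallel grouping
# on a toy 5D rank grid. Dimension sizes are arbitrary examples.
import torch

ext_dp, dp, pp, cp, tp = 1, 2, 1, 2, 2
world_size = ext_dp * dp * pp * cp * tp

# Rank grid laid out as (ExtDP, DP, PP, CP, TP), matching the layout above.
all_ranks = torch.arange(world_size).reshape(ext_dp, dp, pp, cp, tp)

# Expert-parallel groups should span DP and TP while keeping (ExtDP, PP, CP)
# fixed, so move DP and TP to the trailing dimensions before reshaping.
ep_groups = (all_ranks.permute(0, 2, 3, 1, 4)
             .reshape(-1, dp * tp).unbind(0))

for group in ep_groups:
    print(group.tolist())
# With the sizes above this prints [0, 1, 4, 5] and [2, 3, 6, 7]:
# each group fixes the CP index and varies DP/TP, as intended.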
This pull request has merge conflicts that must be resolved before it can be merged.
These conflicts are caused by our migration to …
Purpose
Context parallelism improves performance as the context length grows by distributing both computation and the KV cache across multiple GPUs. This lowers processing latency and can also reduce the memory required per GPU, especially with extremely large KV caches (such as sequence lengths on the order of 1 million tokens), as shown in the figure below. This PR adds the initial context parallelism configuration and communication group.
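As a rough illustration of the memory argument (the model dimensions below are assumptions for the sketch, not taken from this PR), sharding the sequence dimension across a context-parallel group divides the per-GPU KV-cache footprint roughly by the CP size:

# Back-of-the-envelope sketch with assumed model dimensions (not from this PR):
# how sharding the sequence across a context-parallel group shrinks the
# per-GPU KV cache.
num_layers, num_kv_heads, head_dim = 32, 8, 128   # hypothetical model
dtype_bytes = 2                                   # fp16/bf16

def kv_cache_gib_per_gpu(seq_len: int, cp_size: int = 1) -> float:
    # Each token stores one key and one value vector per layer per KV head;
    # context parallelism shards the sequence dimension across cp_size GPUs.
    tokens_per_gpu = seq_len / cp_size
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * tokens_per_gpu
    return total_bytes / 1024**3

for cp in (1, 2, 4, 8):
    print(f"cp_size={cp}: {kv_cache_gib_per_gpu(1_000_000, cp):.1f} GiB per GPU")
# cp_size=1: ~122 GiB, cp_size=8: ~15 GiB for a 1M-token sequence.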
Test Plan
Unit tests and e2e tests will be submitted in follow-up PRs.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.