Conversation

@FENP
Contributor

@FENP FENP commented Sep 15, 2025

Purpose

This PR adds Decode Context Parallel (DCP) support for GQA, following PR #23734. The current implementation is based on FlashAttention.

Unlike MLA inference, GQA does not distinguish between prefill and decode during the forward pass. To support DCP, this PR computes attention separately over the context KV and the query KV (the KV of the new tokens) within a sequence, then merges the two results.

  # |- tokenA -|......................|-- newTokens ---|
  # |---------- context_len ----------|-- query_len ---|
  • For the query KV, no collective communication is required within the DCP group.
  • For the context, the KV is distributed across the DCP ranks. This PR follows the DCP decode approach from MLA: all-gather Q and the LSE (log-sum-exp), then correct the attention output before performing a reduce-scatter.
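
The merge step described above can be illustrated with a minimal pure-Python sketch. It is not the PR's code: the function names, the round-robin sharding of context tokens, and the single-query, single-head setup are assumptions made for illustration, and a plain loop stands in for the all-gather and reduce-scatter collectives. It shows only the underlying identity: partial attention outputs computed over disjoint KV subsets can be recombined exactly when each is weighted by its log-sum-exp.

```python
import math


def attention_with_lse(q, keys, values):
    """Softmax attention of one query over a set of KV pairs.

    Returns the attention output and the log-sum-exp (LSE) of the scaled scores.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    denom = sum(exps)
    out = [sum(e * v[d] for e, v in zip(exps, values)) / denom
           for d in range(len(values[0]))]
    return out, peak + math.log(denom)


def merge_partials(partials):
    """Merge (output, lse) pairs computed over disjoint KV subsets."""
    peak = max(lse for _, lse in partials)
    weights = [math.exp(lse - peak) for _, lse in partials]
    total = sum(weights)
    dim = len(partials[0][0])
    merged = [sum(w * out[d] for w, (out, _) in zip(weights, partials)) / total
              for d in range(dim)]
    return merged, peak + math.log(total)


def split_attention(q, keys, values, context_len, num_ranks):
    """Attention for one new token: sharded context KV plus local query KV."""
    partials = []
    for rank in range(num_ranks):
        # Hypothetical sharding: context token i is stored on rank i % num_ranks.
        shard_k = keys[rank:context_len:num_ranks]
        shard_v = values[rank:context_len:num_ranks]
        if shard_k:
            partials.append(attention_with_lse(q, shard_k, shard_v))
    # KV of the new tokens visible to this query stays local (no communication).
    partials.append(attention_with_lse(q, keys[context_len:], values[context_len:]))
    return merge_partials(partials)


# Toy data: 3 context tokens followed by 2 new tokens, head dimension 2.
keys = [[0.1, 0.2], [0.4, -0.3], [-0.5, 0.6], [0.7, 0.1], [0.2, 0.9]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 3.0]]
q = [0.3, -0.8]  # query of the last new token, which sees all 5 KV pairs

full, full_lse = attention_with_lse(q, keys, values)
split, split_lse = split_attention(q, keys, values, context_len=3, num_ranks=2)
print(max(abs(a - b) for a, b in zip(full, split)))
```

The printed value is the largest element-wise difference between the single-pass result and the merged result; it should be at floating-point noise level.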

Test Plan

Qwen/Qwen3-235B-A22B-Instruct-2507

vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 --gpu-memory-utilization 0.9 --tensor-parallel-size 8 --decode-context-parallel-size 2

Test Result

  • KV Cache Size
[kv_cache_utils.py:859] Multiplying the GPU KV cache size by the dcp_world_size 2.
[kv_cache_utils.py:864] GPU KV cache size: 1,723,584 tokens
[kv_cache_utils.py:868] Maximum concurrency for 262,144 tokens per request: 6.57x
  • gsm8k eval
TP8
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8836|±  |0.0062|
|     |       |strict-match    |     5|exact_match|↑  |0.8628|±  |0.0067|

TP8DCP2
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8988|±  |0.0059|
|     |       |strict-match    |     5|exact_match|↑  |0.8870|±  |0.0062|
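
The figures in the KV cache log lines above can be cross-checked with a few lines of arithmetic. This sketch assumes that the reported 1,723,584 tokens already includes the dcp_world_size multiplier (inferred from the order of the log lines) and that "maximum concurrency" is the cache size divided by the per-request token limit; the variable names are illustrative only.

```python
kv_cache_tokens = 1_723_584   # "GPU KV cache size" from the log
max_model_len = 262_144       # tokens per request from the log
dcp_world_size = 2            # --decode-context-parallel-size

# Concurrency as reported: how many maximum-length requests fit in the cache.
max_concurrency = kv_cache_tokens / max_model_len

# Assumed cache size before the DCP multiplier was applied.
tokens_before_dcp = kv_cache_tokens // dcp_world_size

print(f"{max_concurrency:.2f}x")   # 6.57x, matching the log
print(tokens_before_dcp)           # 861792
```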

Additionally, I think we can reduce latency by overlapping query computation with context communication.


@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the v1 label Sep 15, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces Decode Context Parallel (DCP) support for GQA models using FlashAttention. The approach separates attention computation for the context and new tokens, computing partial attention on each rank for the distributed context KV cache and then combining them. While the overall logic is sound, I've identified a critical issue in the implementation concerning tensor shape manipulation that will lead to incorrect results. Specifically, there are incorrect transpose operations on the log-sum-exp (LSE) tensor that must be addressed. The other changes for integrating DCP, such as disabling cascade attention and adding necessary metadata, appear to be correct.

Contributor

@youzhedian youzhedian left a comment

Please add a GQA+DCP unit test in test_context_parallel.py.

Otherwise, LGTM. Clean and good job!

Member

@youkaichao youkaichao left a comment

Please fix the pre-commit linter error and add tests. Otherwise LGTM, thanks for the great work!

cc @LucasWilkinson to give the final sign off.

@FENP FENP requested a review from DarkLight1337 as a code owner September 16, 2025 12:12
@mergify

mergify bot commented Sep 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @FENP.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Collaborator

@LucasWilkinson LucasWilkinson left a comment

Apologies for the delay! LGTM! Thanks for the contribution!

FENP added 8 commits October 9, 2025 12:05
Signed-off-by: yuanyongjie.yyj <yuanyongjie.yyj@antgroup.com>
Signed-off-by: FENP <32334296+FENP@users.noreply.github.com>
FENP and others added 2 commits October 13, 2025 10:23
Signed-off-by: Jaya Yuan <yuanyongjie.yyj@antgroup.com>
Signed-off-by: FENP <32334296+FENP@users.noreply.github.com>
@youkaichao youkaichao enabled auto-merge (squash) October 14, 2025 11:17
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 14, 2025
@youkaichao youkaichao merged commit ea97940 into vllm-project:main Oct 14, 2025
53 checks passed
simondanielsson pushed a commit to simondanielsson/vllm that referenced this pull request Oct 14, 2025
…on (vllm-project#24864)

Signed-off-by: yuanyongjie.yyj <yuanyongjie.yyj@antgroup.com>
Signed-off-by: FENP <32334296+FENP@users.noreply.github.com>
Signed-off-by: Jaya Yuan <yuanyongjie.yyj@antgroup.com>
Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
@FENP FENP deleted the dcp-gqa-fa branch October 15, 2025 09:01
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025