[DCP] Support Decode Context Parallel (DCP) for GQA with FlashAttention #24864
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a small and essential subset of CI tests runs to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces Decode Context Parallel (DCP) support for GQA models using FlashAttention. The approach separates attention computation for the context and new tokens, computing partial attention on each rank for the distributed context KV cache and then combining them. While the overall logic is sound, I've identified a critical issue in the implementation concerning tensor shape manipulation that will lead to incorrect results. Specifically, there are incorrect transpose operations on the log-sum-exp (LSE) tensor that must be addressed. The other changes for integrating DCP, such as disabling cascade attention and adding necessary metadata, appear to be correct.
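For context on the merge step this review refers to: combining partial attention outputs computed over disjoint KV shards requires their log-sum-exp (LSE) values, and the LSE memory layout (some FlashAttention-style kernels return `[num_heads, num_tokens]`, others `[num_tokens, num_heads]`) is exactly where a stray transpose produces wrong results. Below is a minimal PyTorch sketch of the rescale-and-sum math; the function and shape conventions are illustrative, not the PR's actual helpers:

```python
import torch

def merge_attn_outputs(out_a, lse_a, out_b, lse_b):
    """Merge two partial attention results computed over disjoint KV sets.

    out_a, out_b: [num_tokens, num_heads, head_dim] partial attention outputs
    lse_a, lse_b: [num_tokens, num_heads] log-sum-exp of the attention logits
    """
    lse = torch.logaddexp(lse_a, lse_b)          # combined softmax normalizer
    w_a = torch.exp(lse_a - lse).unsqueeze(-1)   # rescale weight for part A
    w_b = torch.exp(lse_b - lse).unsqueeze(-1)   # rescale weight for part B
    return w_a * out_a + w_b * out_b
```

If a kernel returns LSE in `[num_heads, num_tokens]` layout, it must be transposed before a merge written against the convention above; mixing the two layouts broadcasts silently into incorrect values, which is consistent with the issue flagged here.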
Please add a GQA+DCP unit test in test_context_parallel.py.
Otherwise, LGTM. Clean and good job!
Please fix the pre-commit linter error and add tests. Otherwise LGTM, thanks for the great work!
cc @LucasWilkinson to give the final sign-off.
This pull request has merge conflicts that must be resolved before it can be merged.
Apologies for the delay! LGTM! Thanks for the contribution!
Signed-off-by: yuanyongjie.yyj <yuanyongjie.yyj@antgroup.com> Signed-off-by: FENP <32334296+FENP@users.noreply.github.com>
Signed-off-by: FENP <32334296+FENP@users.noreply.github.com>
Signed-off-by: Jaya Yuan <yuanyongjie.yyj@antgroup.com>
Signed-off-by: FENP <32334296+FENP@users.noreply.github.com>
…on (vllm-project#24864) Signed-off-by: yuanyongjie.yyj <yuanyongjie.yyj@antgroup.com> Signed-off-by: FENP <32334296+FENP@users.noreply.github.com> Signed-off-by: Jaya Yuan <yuanyongjie.yyj@antgroup.com>
…on (vllm-project#24864) Signed-off-by: yuanyongjie.yyj <yuanyongjie.yyj@antgroup.com> Signed-off-by: FENP <32334296+FENP@users.noreply.github.com> Signed-off-by: Jaya Yuan <yuanyongjie.yyj@antgroup.com> Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
…on (vllm-project#24864) Signed-off-by: yuanyongjie.yyj <yuanyongjie.yyj@antgroup.com> Signed-off-by: FENP <32334296+FENP@users.noreply.github.com> Signed-off-by: Jaya Yuan <yuanyongjie.yyj@antgroup.com> Signed-off-by: bbartels <benjamin@bartels.dev>
…on (vllm-project#24864) Signed-off-by: yuanyongjie.yyj <yuanyongjie.yyj@antgroup.com> Signed-off-by: FENP <32334296+FENP@users.noreply.github.com> Signed-off-by: Jaya Yuan <yuanyongjie.yyj@antgroup.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
…on (vllm-project#24864) Signed-off-by: yuanyongjie.yyj <yuanyongjie.yyj@antgroup.com> Signed-off-by: FENP <32334296+FENP@users.noreply.github.com> Signed-off-by: Jaya Yuan <yuanyongjie.yyj@antgroup.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
Purpose
This PR adds Decode Context Parallel (DCP) support for GQA, following PR #23734. The current implementation is based on FlashAttention.
Unlike MLA inference, GQA does not distinguish between prefill and decode during the forward pass. To support DCP, this PR separately computes attention over the context KV (sharded across DCP ranks) and the new tokens' KV within a sequence, and then merges the partial results.
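To make the split-and-merge concrete, here is a minimal, runnable PyTorch sketch (not the PR's actual code; it uses naive single-sequence attention and ignores causal masking and the DCP all-gather for brevity). Splitting the KV into a context part and a new-token part, attending to each independently, and merging the partial outputs via their log-sum-exp (LSE) values reproduces full attention exactly:

```python
import torch

torch.manual_seed(0)
T, H, D = 4, 8, 64   # query (new) tokens, heads, head dim
C = 32               # context tokens held in the (sharded) KV cache

q = torch.randn(T, H, D)
k = torch.randn(C + T, H, D)
v = torch.randn(C + T, H, D)

def attn_with_lse(q, k, v):
    """Naive attention that also returns the per-(token, head) log-sum-exp."""
    scores = torch.einsum("thd,shd->ths", q, k) / D**0.5
    lse = torch.logsumexp(scores, dim=-1)                 # [T, H]
    out = torch.einsum("ths,shd->thd", scores.softmax(-1), v)
    return out, lse

# Reference: attention over the full KV in one shot.
full_out, _ = attn_with_lse(q, k, v)

# DCP-style split: context KV and new-token KV attended to separately.
ctx_out, ctx_lse = attn_with_lse(q, k[:C], v[:C])
new_out, new_lse = attn_with_lse(q, k[C:], v[C:])

# Merge the partial results via their LSEs.
lse = torch.logaddexp(ctx_lse, new_lse)
merged = (torch.exp(ctx_lse - lse)[..., None] * ctx_out
          + torch.exp(new_lse - lse)[..., None] * new_out)

assert torch.allclose(merged, full_out, atol=1e-5)
```

Under DCP, each rank would compute `ctx_out`/`ctx_lse` over its own context shard and the partials from all ranks would be folded together the same way.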
Test Plan
Qwen/Qwen3-235B-A22B-Instruct-2507
Test Result
[kv_cache_utils.py:859] Multiplying the GPU KV cache size by the dcp_world_size 2.
[kv_cache_utils.py:864] GPU KV cache size: 1,723,584 tokens
[kv_cache_utils.py:868] Maximum concurrency for 262,144 tokens per request: 6.57x

(The concurrency figure is simply the cache size divided by the per-request token budget: 1,723,584 / 262,144 ≈ 6.57.)

Additionally, I think we can reduce latency by overlapping the query computation with the context communication; see the sketch below.
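A rough, hypothetical illustration of that overlap, using torch.distributed async collectives (`attn_with_lse` and the LSE merge are as in the sketches above; none of the names here are the PR's actual code):

```python
import torch
import torch.distributed as dist

def decode_step_overlapped(q, ctx_kv_local, new_kv, dcp_group):
    """Hypothetical sketch: hide the DCP context all-gather behind the
    attention over the new tokens. ctx_kv_local and new_kv are (k, v) pairs."""
    # Partial attention over this rank's shard of the context KV.
    ctx_out, ctx_lse = attn_with_lse(q, *ctx_kv_local)

    world = dist.get_world_size(dcp_group)
    outs = [torch.empty_like(ctx_out) for _ in range(world)]
    lses = [torch.empty_like(ctx_lse) for _ in range(world)]

    # Launch the all-gathers without blocking...
    work_o = dist.all_gather(outs, ctx_out, group=dcp_group, async_op=True)
    work_l = dist.all_gather(lses, ctx_lse, group=dcp_group, async_op=True)

    # ...and compute attention over the new tokens while they are in flight.
    new_out, new_lse = attn_with_lse(q, *new_kv)

    work_o.wait()
    work_l.wait()

    # Fold every partial result into one output via the LSE merge.
    out, lse = new_out, new_lse
    for o, l in zip(outs, lses):
        m = torch.logaddexp(lse, l)
        out = torch.exp(lse - m)[..., None] * out + torch.exp(l - m)[..., None] * o
        lse = m
    return out
```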
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.