[Attention][DCP] Support DCP with query length > 1 (MTP) with FA3 #25049
Conversation
Code Review
This pull request aims to enable multi-token prediction (MTP) with decode context parallelism (DCP) for FlashAttention-3. The changes involve removing a restriction on query length for DCP and passing cp_world_size and cp_rank to the attention kernel. While the changes in vllm/v1/worker/gpu_model_runner.py are correct, there is a critical issue in vllm/v1/attention/backends/mla/flashattn_mla.py. The newly used attributes self.dcp_world_size and self.dcp_rank are not properly initialized due to an issue in the MLACommonImpl base class, which will cause a TypeError at runtime. This must be addressed for the feature to function correctly.
Thanks for this contribution! Just wanted to leave a reminder to update the FlashAttention
    # assert once the custom mask support is added to FA3.
    if self.dcp_world_size > 1:
        assert self.reorder_batch_threshold == 1, \
            "DCP not support reorder_batch_threshold > 1 now."
Since only flash_attn_mla supports the custom mask, we can't just remove this assert right now, can we?
Makes sense. I'll add a whitelist here for FA3 MLA.
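A rough sketch of that whitelist idea (the backend names and the helper below are illustrative, not the code added in this PR):

```python
# Only backends known to handle DCP with query length > 1 skip the old
# restriction; everything else keeps the reorder_batch_threshold == 1 check.
DCP_QUERY_LEN_GT1_BACKENDS = {"FLASHATTN_MLA"}  # assumed whitelist contents


def validate_dcp_reorder(backend_name: str, dcp_world_size: int,
                         reorder_batch_threshold: int) -> None:
    if dcp_world_size > 1 and backend_name not in DCP_QUERY_LEN_GT1_BACKENDS:
        assert reorder_batch_threshold == 1, (
            f"DCP does not support reorder_batch_threshold > 1 for "
            f"backend {backend_name}.")
```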
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/v1/attention/backends/utils.py (outdated)
    # Needed by CrossAttentionBuilder
    encoder_seq_lens: Optional[np.ndarray] = None

    cp_seq_lens: Optional[torch.Tensor] = None
Sounds good, let me keep the dcp prefix.
Updated.
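For reference, the renamed field would look roughly like this (the dataclass and the exact field name are assumptions for illustration, not the final definition in utils.py):

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np
import torch


@dataclass
class AttentionMetadataSketch:
    # Needed by CrossAttentionBuilder (as in the quoted hunk above).
    encoder_seq_lens: Optional[np.ndarray] = None
    # Per-request KV lengths local to this DCP rank; None when DCP is off.
    dcp_seq_lens: Optional[torch.Tensor] = None
```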
LucasWilkinson left a comment
Overall looks good to me; left one nit.
Purpose
Combined with vllm-project/flash-attention#93, this PR enables MTP (multi-token prediction) with DCP (decode context parallelism). It also allows prefill and decode requests to be mixed in a batch.
See vllm-project/flash-attention#93 for the implementation and solution details. On the vLLM side we only need to pass the CP world size and CP rank down to the kernel.
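To illustrate why the kernel needs cp_world_size and cp_rank once the query length exceeds 1, here is the per-rank causal-mask arithmetic under an assumed round-robin KV layout (token t stored on rank t % cp_world_size); the layout and helper are assumptions for the example, not a description of vLLM's actual sharding:

```python
def local_visible_kv(total_kv_len: int, query_pos: int,
                     cp_world_size: int, cp_rank: int) -> int:
    """KV tokens held by `cp_rank` that a query at absolute position
    `query_pos` (0-based) may attend to under a causal mask, assuming a
    round-robin layout (illustrative only)."""
    visible = min(query_pos + 1, total_kv_len)  # causal prefix length
    full_rounds, remainder = divmod(visible, cp_world_size)
    return full_rounds + (1 if cp_rank < remainder else 0)


# With query length > 1 (MTP), consecutive query positions have different
# causal boundaries, so each rank needs a per-query mask, not a single length.
for pos in (10, 11, 12):
    print(pos, [local_visible_kv(13, pos, 4, r) for r in range(4)])
```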
Test Plan
Test Result
Benchmark
With MTP and TP8, DCP4
With MTP and TP8, DCP8
With MTP and TP8
With TP8, DCP8
With TP8, DCP4
With TP8
LM Eval
With TP8, DCP8 + MTP:
local-completions (model=deepseek-ai/DeepSeek-R1-0528,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=32), gen_kwargs: (None), limit: 100.0, num_fewshot: None, batch_size: 1