[Attention] MLA with chunked prefill #12639
Conversation
vllm/engine/arg_utils.py (outdated)

      # For multimodal models and models with MLA, chunked prefill is
      # disabled by default in V0, but enabled by design in V1
-     if model_config.is_multimodal_model and model_config.use_mla:
+     if model_config.is_multimodal_model or model_config.use_mla:
ok yeah that makes sense for some of the red tests
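For readers following along, here is a minimal sketch of the behaviour the suggested change produces. This is not the actual vLLM code: the `model_config` attributes come from the diff above, and the V0/V1 split follows the comment in it. With `or`, chunked prefill defaults to off in V0 whenever the model is multimodal or uses MLA, whereas the earlier `and` only disabled it when both were true.

```python
def default_enable_chunked_prefill(model_config, use_v1: bool) -> bool:
    """Illustrative sketch only, not vLLM's implementation."""
    if use_v1:
        # Per the comment in the diff: chunked prefill is enabled by design in V1.
        return True
    # V0: disabled by default for multimodal models or models using MLA.
    # The original `and` only covered models that were *both* multimodal and MLA.
    if model_config.is_multimodal_model or model_config.use_mla:
        return False
    return True
```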
Hi @LucasWilkinson, thanks for your wonderful work!
      # Make sure there is enough for 8 full-length requests or at least
      # 4 pages of cache per request
      max(
          8 * self.model_config.max_model_len, 4 *
@LucasWilkinson A dumb question: how were these magic numbers (8 full-length requests & 4 pages) decided?
Arbitrarily; just whatever seemed reasonable. We want it so that under common loads there's enough workspace that we don't have to chunk the context, but not so large that we materially impact KV-cache space. Feel free to tune these and open a PR if you come up with better numbers; we'd appreciate the contribution!
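To make the heuristic concrete, here is a rough sketch of the sizing rule being discussed. The trailing `4 *` in the quoted snippet is truncated, so the `max_num_seqs` and `block_size` factors below are assumptions chosen to match the "4 pages of cache per request" comment, not the code in this PR.

```python
def chunked_prefill_workspace_tokens(max_model_len: int,
                                     max_num_seqs: int,
                                     block_size: int) -> int:
    """Illustrative sketch of the workspace-sizing heuristic, not this PR's code."""
    return max(
        # room for the full context of 8 maximum-length requests ...
        8 * max_model_len,
        # ... or at least 4 KV-cache pages (blocks) per concurrent request
        # (the factors after `4 *` are assumed; the snippet above is truncated).
        4 * max_num_seqs * block_size,
    )

# Example: max_model_len=4096, max_num_seqs=256, block_size=16
#   -> max(32768, 16384) = 32768 tokens of workspace
```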
Need to do more benchmarking to see if it makes sense for this to be on by default in V0, but it lays the groundwork for a V1 implementation. (#13111 may help performance.)
Shout-out to @pathorn for assisting with hardening this PR
Future work:
- self.kv_b_proj(kv_c_normed) in the profile run (see the sketch below)
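For context on that last item: in MLA, kv_c_normed is the normalized compressed KV latent, and kv_b_proj is the linear layer that up-projects it into per-head keys and values before attention, so the profile run needs to account for that extra work and activation memory. The sketch below is only illustrative; the dimension values are hypothetical (loosely modelled on DeepSeek-style MLA) and not taken from this PR.

```python
import torch

# Hypothetical MLA dimensions, for illustration only.
kv_lora_rank = 512        # width of the compressed KV latent (kv_c)
num_heads = 16
qk_nope_head_dim = 128    # per-head key width (non-RoPE portion)
v_head_dim = 128          # per-head value width
num_tokens = 8

# kv_b_proj: up-projection from the latent to per-head K (nope part) and V.
kv_b_proj = torch.nn.Linear(
    kv_lora_rank, num_heads * (qk_nope_head_dim + v_head_dim), bias=False)

kv_c_normed = torch.randn(num_tokens, kv_lora_rank)
kv = kv_b_proj(kv_c_normed).view(
    num_tokens, num_heads, qk_nope_head_dim + v_head_dim)
k_nope, v = kv.split([qk_nope_head_dim, v_head_dim], dim=-1)
print(k_nope.shape, v.shape)  # torch.Size([8, 16, 128]) torch.Size([8, 16, 128])
```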