[ROCm] Enable chunked prefill/paged attention in MLA on ROCm #14316
Conversation
vincent-4 left a comment:
Thoughts?
Nice! Thank you - cc @mawong-amd
The changes look good to me. They only apply to HIP and are straightforward.
LGTM now, thanks!
This PR is largely just removing the guards in config.py to allow chunked prefill and paged attention with MLA. The LSE computation in the Triton kernel doesn't work, so in that case we always fall back to flash attention.
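A minimal sketch of the fallback described above. Every name here (`run_mla_prefill`, `_attention`) is hypothetical; the stubs only illustrate the dispatch logic, not vLLM's actual kernels:

```python
import torch

def _attention(q, k, v, return_lse=False):
    # Reference attention used as a stand-in for both the Triton MLA
    # kernel and the flash-attention path.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    out = torch.softmax(scores, dim=-1) @ v
    if return_lse:
        return out, torch.logsumexp(scores, dim=-1)
    return out

def run_mla_prefill(q, k, v, *, chunked_prefill: bool, on_rocm: bool):
    # Chunked prefill merges partial results across KV chunks, which
    # needs each chunk's per-row log-sum-exp (LSE). Since the Triton
    # kernel's LSE is broken, always take the flash-attention path
    # whenever an LSE is required.
    if on_rocm and chunked_prefill:
        return _attention(q, k, v, return_lse=True)  # flash-attn stand-in
    return _attention(q, k, v)                       # Triton MLA stand-in
```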
I ran:

```
lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct,trust_remote_code=True,enable_chunked_prefill=True --tasks gsm8k --num_fewshot 5 --batch_size auto
```

and got:

CC: @LucasWilkinson
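For background on why the LSE matters: chunked prefill computes attention over the KV cache in chunks and must combine the partial outputs exactly, which requires each chunk's per-row log-sum-exp. A minimal sketch of that merge (illustrative math only, not vLLM's actual merge kernel):

```python
import torch

def merge_attn_chunks(out1, lse1, out2, lse2):
    """Exactly combine attention outputs computed over two KV chunks.

    out1, out2: [num_tokens, num_heads, head_dim] partial outputs
    lse1, lse2: [num_tokens, num_heads] per-row log-sum-exp of the scores
    """
    lse = torch.logaddexp(lse1, lse2)          # combined normalizer
    w1 = torch.exp(lse1 - lse).unsqueeze(-1)   # weight of chunk 1
    w2 = torch.exp(lse2 - lse).unsqueeze(-1)   # weight of chunk 2
    return w1 * out1 + w2 * out2, lse
```

With this identity, a kernel whose LSE output is unusable cannot participate in the chunked path at all, hence the unconditional fallback to flash attention.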