Fp8 paged attention update #22222
Conversation
Code Review
This pull request introduces updates for FP8 paged attention on ROCm. I've identified two critical compilation errors. The first is a stray #endif directive that breaks the build. The second is a variable-scoping issue inside a loop caused by incorrect placement of preprocessor directives, which also prevents compilation. Addressing both will ensure the code compiles and functions as intended.
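Neither snippet is quoted in the review, but the two failure modes it describes usually look like the hypothetical sketch below (function, macro, and variable names are made up, not taken from the diff); the broken shape is shown in comments, followed by a compilable fixed shape.

```cpp
// Hypothetical illustration only; names are invented for this sketch.
// Broken shape described in the review (kept in comments so the file compiles):
//
//   for (int i = 0; i < n; ++i) {
//   #ifdef USE_FP8_MFMA
//     float q_scaled = q[i] * inv_scale;   // declared only when the macro is set
//   #endif
//     acc += q_scaled;                     // undeclared when the macro is off
//   }
//   #endif                                 // stray #endif with no matching #if
//
// Fixed shape: declare the variable on both preprocessor paths and drop the
// unmatched #endif so every guard pairs up.
float accumulate_query(const float* q, int n, float inv_scale) {
  float acc = 0.0f;
  for (int i = 0; i < n; ++i) {
#ifdef USE_FP8_MFMA
    float q_scaled = q[i] * inv_scale;  // fp8 path: pre-scaled query value
#else
    float q_scaled = q[i];              // default path: unscaled (inv_scale unused)
#endif
    acc += q_scaled;                    // in scope on either path
  }
  return acc;
}
```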
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
cc @gshtras for review and comments
@tlrmchlsmth Can you please help me review this PR?
Solid job overall.
Could we have unit tests to cover the new fp8 path? Extending the dedicated ROCm attention test would work.
Another major point is that we need a way to actually enable this (command line, environment variable, heuristic, etc.) rather than leaving it as dead code.
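As a rough illustration of the kind of gate being asked for, the sketch below reads the VLLM_ROCM_FP8_MFMA_PAGE_ATTN environment variable (the knob that appears later in the test plan) on the host side; the helper name and the commented call site are hypothetical and are not the PR's actual dispatch code.

```cpp
// Hypothetical sketch only: an environment-variable gate for the fp8 mfma path.
#include <cstdlib>
#include <cstring>

// True when VLLM_ROCM_FP8_MFMA_PAGE_ATTN is set to a non-zero value.
inline bool use_fp8_mfma_paged_attention() {
  const char* flag = std::getenv("VLLM_ROCM_FP8_MFMA_PAGE_ATTN");
  return flag != nullptr && std::strcmp(flag, "0") != 0;
}

// Possible call site (names are made up):
//   if (use_fp8_mfma_paged_attention()) {
//     launch_paged_attention_fp8(...);      // new fp8 mfma path
//   } else {
//     launch_paged_attention_default(...);  // existing path
//   }
```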
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Xiao Yu <xiao.yu@amd.com>
Signed-off-by: xiao-llm <xiao.yu.dc@outlook.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Co-authored-by: Xiao Yu <xiao.yu@metamaterial.com>
Co-authored-by: Xiao Yu <xiao.yu@amd.com>
Co-authored-by: Bowen Bao <bowenbao@amd.com>
Purpose
Support the fp8 mfma instruction with per-warp dynamic quantization of the Query to improve performance. This reduces the fp8-to-fp16 data-type conversion cost and improves mfma throughput on MI300X and later accelerators.
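To make the approach concrete, here is a minimal CUDA-flavored sketch of per-warp dynamic quantization of the query, assuming a 32-lane warp and the OCP fp8 e4m3 format. The actual PR targets ROCm/HIP on MI300-class GPUs (64-lane wavefronts, fnuz fp8), and the function below is illustrative only, not the PR's kernel code.

```cuda
// Illustrative only: each lane holds one query element; the warp agrees on a
// single dynamic scale, quantizes to fp8, and keeps the scale for dequantizing
// the attention scores later. Not the actual kernel from this PR.
#include <cuda_fp8.h>
#include <math.h>

__device__ __nv_fp8_e4m3 quantize_query_per_warp(float q_val, float* out_scale) {
  // 1. Warp-wide max-abs reduction so every lane derives the same scale.
  float max_abs = fabsf(q_val);
  for (int offset = 16; offset > 0; offset >>= 1) {
    max_abs = fmaxf(max_abs, __shfl_xor_sync(0xffffffffu, max_abs, offset));
  }
  // 2. One dynamic scale per warp (448 is the OCP e4m3 max; the ROCm fnuz
  //    variant tops out at 240, so the constant would differ there).
  float scale = fmaxf(max_abs / 448.0f, 1e-10f);
  *out_scale = scale;
  // 3. Quantize the query to fp8. With Q already in fp8, the fp8 mfma/wmma
  //    instruction can consume Q and the fp8 KV cache directly, avoiding the
  //    fp8 -> fp16 up-conversion on the hot path.
  return __nv_fp8_e4m3(q_val / scale);
}
```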
Test Plan
export VLLM_ROCM_FP8_MFMA_PAGE_ATTN=1
pytest -s tests/kernels/attention/test_attention.py
Sample script:
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MODEL_DIR=Meta-Llama-3.1-8B-Instruct
export VLLM_USE_V1=0
lm_eval --model vllm \
  --model_args pretrained=$MODEL_DIR \
  --tasks gsm8k \
  --trust_remote_code \
  --batch_size 8
Test Result
(Optional) Documentation Update