[Qwen][ROCm] Flash Attention Rotary Embeddings #24642

vllmellm · 2025-09-11T07:39:45Z

Purpose

Qwen VL models previously relied on a basic PyTorch implementation for applying rotary positional embeddings on ROCm architectures. This PR adds a ROCm specialisation to use flash_attn module's Triton-based operation which makes it much faster as seen in the benchmarking results below:

benchmark_serving script

python3 vllm/benchmarks/benchmark_serving.py  \
--backend openai-chat   \
--model Qwen/Qwen2-VL-7B-Instruct     \
--endpoint /v1/chat/completions   \
--dataset-name hf   \
--dataset-path lmarena-ai/VisionArena-Chat   \
--hf-split train   \
--num-prompts 2000 \

Metric	Qwen2.5-VL-7B-Instruct-After	Qwen2.5-VL-7B-Instruct-Before	Qwen2-VL-7B-Instruct-After	Qwen2-VL-7B-Instruct-Before
Successful requests	2000	2000	2000	2000
Benchmark duration (s)	1345.53	1813.95	1061.45	1245.72
Total input tokens	128725	128725	128725	128725
Total generated tokens	226206	226462	198558	197729
Request throughput (req/s)	1.49	1.10	1.88	1.61
Output token throughput (tok/s)	168.12	124.84	187.06	158.73
Total token throughput (tok/s)	263.78	195.81	308.34	262.06
Mean TTFT (ms)	518860.51	780409.88	393523.69	525409.53
Median TTFT (ms)	442684.40	734388.24	335331.87	524210.13
P99 TTFT (ms)	1339478.98	1784936.03	1034260.36	1217034.75
Mean TPOT (ms)	4134.86	5809.08	3351.70	4390.97
Median TPOT (ms)	4724.09	7592.68	3179.76	5423.96
P99 TPOT (ms)	9092.23	9980.45	8516.70	7897.34
Mean ITL (ms)	4826.60	6175.72	4119.97	4689.50
Median ITL (ms)	575.49	7872.18	470.59	520.93
P99 ITL (ms)	15527.60	15521.51	15152.06	14973.82

Test Plan

Evaluate accuracy with mistral-eval using chartqa dataset on Qwen2.5-VL-7B-Instruct and Qwen2-VL-7B-Instruct

Test Result

Qwen2.5-VL-7B-Instruct-Before

Metrics:
{
    "explicit_prompt_relaxed_correctness": 0.8676,
    "anywhere_in_answer_relaxed_correctness": 0.868
}

Qwen2-VL-7B-Instruct-Before

{
    "explicit_prompt_relaxed_correctness": 0.4552,
    "anywhere_in_answer_relaxed_correctness": 0.6968
}

Qwen2.5-VL-7B-Instruct-After

Metrics:
{
    "explicit_prompt_relaxed_correctness": 0.8632,
    "anywhere_in_answer_relaxed_correctness": 0.8632
}

Qwen2-VL-7B-Instruct-After

Metrics:
{
    "explicit_prompt_relaxed_correctness": 0.462,
    "anywhere_in_answer_relaxed_correctness": 0.7012
}

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

vllmellm · 2025-10-01T11:14:16Z

@DarkLight1337 would appreciate if you can review.

DarkLight1337 · 2025-10-01T12:47:41Z

vllm/model_executor/layers/rotary_embedding/common.py

+import importlib
+from functools import cache


Can you reorganize the imports to be in the correct order?

it is ordered using isort. Arranging it alphabetically causes isort to complain

Hmm weird, I thought isort would put it with the rest of the stdlib imports near L5

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

xuechendi · 2025-10-02T18:39:50Z

@DarkLight1337 @vllmellm , I found this PR introduced a bug to non_cuda/non_rocm, may you take a look of the fixing I provided here:
#26123

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

…d by #24642 (#26123) Signed-off-by: Chendi Xue <Chendi.Xue@intel.com> Signed-off-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Roger Wang <hey@rogerw.io>

…d by vllm-project#24642 (vllm-project#26123) Signed-off-by: Chendi Xue <Chendi.Xue@intel.com> Signed-off-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Roger Wang <hey@rogerw.io>

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>

…d by #24642 (#26123) Signed-off-by: Chendi Xue <Chendi.Xue@intel.com> Signed-off-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Roger Wang <hey@rogerw.io> Signed-off-by: yewentao256 <zhyanwentao@126.com>

…d by vllm-project#24642 (vllm-project#26123) Signed-off-by: Chendi Xue <Chendi.Xue@intel.com> Signed-off-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Roger Wang <hey@rogerw.io>

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

…d by vllm-project#24642 (vllm-project#26123) Signed-off-by: Chendi Xue <Chendi.Xue@intel.com> Signed-off-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Roger Wang <hey@rogerw.io> Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

…d by vllm-project#24642 (vllm-project#26123) Signed-off-by: Chendi Xue <Chendi.Xue@intel.com> Signed-off-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Roger Wang <hey@rogerw.io> Signed-off-by: Karan Goel <3261985+karan@users.noreply.github.com>

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

…d by vllm-project#24642 (vllm-project#26123) Signed-off-by: Chendi Xue <Chendi.Xue@intel.com> Signed-off-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Roger Wang <hey@rogerw.io>

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>

…d by vllm-project#24642 (vllm-project#26123) Signed-off-by: Chendi Xue <Chendi.Xue@intel.com> Signed-off-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Roger Wang <hey@rogerw.io> Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

…d by vllm-project#24642 (vllm-project#26123) Signed-off-by: Chendi Xue <Chendi.Xue@intel.com> Signed-off-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Roger Wang <hey@rogerw.io>

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

…d by vllm-project#24642 (vllm-project#26123) Signed-off-by: Chendi Xue <Chendi.Xue@intel.com> Signed-off-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Roger Wang <hey@rogerw.io>

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>

…d by vllm-project#24642 (vllm-project#26123) Signed-off-by: Chendi Xue <Chendi.Xue@intel.com> Signed-off-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Roger Wang <hey@rogerw.io> Signed-off-by: xuebwang-amd <xuebwang@amd.com>

use flash_attn rotary embeddings

c88c547

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

mergify bot added qwen Related to Qwen models rocm Related to AMD ROCm labels Sep 11, 2025

check if module exists; cache dispatch

81c8350

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

vllmellm marked this pull request as ready for review September 14, 2025 12:55

vllmellm requested a review from sighingnow as a code owner September 14, 2025 12:55

Merge branch 'main' into rocm-triton-rotary

270f938

pytorch-bot bot added the module: rocm label Sep 14, 2025

vllmellm added 2 commits September 16, 2025 06:49

consolidate dispatch logic in common module

43fcfdd

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

Merge remote-tracking branch 'origin/main' into rocm-triton-rotary

a377085

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

DarkLight1337 reviewed Oct 1, 2025

View reviewed changes

DarkLight1337 approved these changes Oct 2, 2025

View reviewed changes

DarkLight1337 enabled auto-merge (squash) October 2, 2025 02:22

github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 2, 2025

vllmellm added 2 commits October 2, 2025 13:21

Merge remote-tracking branch 'origin/main' into rocm-triton-rotary

6bb86ae

fix import order

61f988c

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

auto-merge was automatically disabled October 2, 2025 13:25
Head branch was pushed to by a user without write access

DarkLight1337 approved these changes Oct 2, 2025

View reviewed changes

DarkLight1337 enabled auto-merge (squash) October 2, 2025 13:26

vllm-bot merged commit 5e4a822 into vllm-project:main Oct 2, 2025
51 of 55 checks passed

xuechendi mentioned this pull request Oct 2, 2025

[BugFix][QWEN-VL]fix wrong apply_rotary_emb_torch selection introduced by #24642 #26123

Merged

5 tasks

pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025

[Qwen][ROCm] Flash Attention Rotary Embeddings (vllm-project#24642)

c48129c

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

yewentao256 pushed a commit that referenced this pull request Oct 3, 2025

[Qwen][ROCm] Flash Attention Rotary Embeddings (#24642)

06d102e

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>

tomeras91 pushed a commit to tomeras91/vllm that referenced this pull request Oct 6, 2025

[Qwen][ROCm] Flash Attention Rotary Embeddings (vllm-project#24642)

37c2551

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

southfreebird pushed a commit to southfreebird/vllm that referenced this pull request Oct 7, 2025

[Qwen][ROCm] Flash Attention Rotary Embeddings (vllm-project#24642)

e5a2e1b

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025

[Qwen][ROCm] Flash Attention Rotary Embeddings (vllm-project#24642)

3517926

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>

lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025

[Qwen][ROCm] Flash Attention Rotary Embeddings (vllm-project#24642)

044fe38

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025

[Qwen][ROCm] Flash Attention Rotary Embeddings (vllm-project#24642)

96e814a

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025

[Qwen][ROCm] Flash Attention Rotary Embeddings (vllm-project#24642)

05b9323

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>

vllmellm mentioned this pull request Oct 29, 2025

[Bugfix][ROCm] Fix ViT rotary embeddings for torch.compile compatibility on ROCm #27748

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Qwen][ROCm] Flash Attention Rotary Embeddings #24642

[Qwen][ROCm] Flash Attention Rotary Embeddings #24642

Uh oh!

vllmellm commented Sep 11, 2025 •

edited by github-actions bot

Loading

Uh oh!

vllmellm commented Oct 1, 2025

Uh oh!

DarkLight1337 Oct 1, 2025

Uh oh!

vllmellm Oct 1, 2025

Uh oh!

DarkLight1337 Oct 1, 2025 •

edited

Loading

Uh oh!

Uh oh!

xuechendi commented Oct 2, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

[Qwen][ROCm] Flash Attention Rotary Embeddings #24642

[Qwen][ROCm] Flash Attention Rotary Embeddings #24642

Uh oh!

Conversation

vllmellm commented Sep 11, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

Uh oh!

vllmellm commented Oct 1, 2025

Uh oh!

DarkLight1337 Oct 1, 2025

Choose a reason for hiding this comment

Uh oh!

vllmellm Oct 1, 2025

Choose a reason for hiding this comment

Uh oh!

DarkLight1337 Oct 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

xuechendi commented Oct 2, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

vllmellm commented Sep 11, 2025 •

edited by github-actions bot

Loading

DarkLight1337 Oct 1, 2025 •

edited

Loading