Conversation

@mawong-amd (Contributor) commented on Oct 8, 2025

Purpose

Enable the FlexAttention backend on ROCm when it is explicitly requested, e.g. via VLLM_ATTENTION_BACKEND=FLEX_ATTENTION. This makes progress towards batch-invariant inference on AMD.
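
Selecting the backend requires no code change. A minimal sketch, assuming a ROCm GPU and a vLLM build that includes this change (the serve invocation and model name are illustrative, not part of this PR):

```bash
# Opt into FlexAttention on ROCm by setting the backend override before launching
# any vLLM entrypoint; without the variable, the default ROCm backend is used.
export VLLM_ATTENTION_BACKEND=FLEX_ATTENTION
vllm serve meta-llama/Llama-3.1-8B
```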

Test Plan

Compare the correctness of the default attention backend against FlexAttention on GSM8K by running lm-eval against a local vLLM server (the server launches for the two runs are sketched after the command):
lm_eval --model local-completions --model_args model=meta-llama/Llama-3.1-8B,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=128,max_retries=5 --tasks gsm8k
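
The lm_eval command above targets a local OpenAI-compatible vLLM server on port 8000, so each run needs the server brought up first. A hedged sketch of the two server launches behind the "Default" and "FlexAttention" results below (the exact serve flags are an assumption; only the evaluation command is given here):

```bash
# Run 1 ("Default"): serve with the stock ROCm attention backend.
vllm serve meta-llama/Llama-3.1-8B --port 8000

# Run 2 ("FlexAttention"): identical launch, with the backend forced via the env var.
VLLM_ATTENTION_BACKEND=FLEX_ATTENTION vllm serve meta-llama/Llama-3.1-8B --port 8000

# The same lm_eval invocation is pointed at http://0.0.0.0:8000/v1/completions in both cases.
```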

Test Result

Default:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.5011|±  |0.0138|
|     |       |strict-match    |     5|exact_match|↑  |0.5011|±  |0.0138|

FlexAttention:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4882|±  |0.0138|
|     |       |strict-match    |     5|exact_match|↑  |0.4875|±  |0.0138|

mergify bot added the rocm (Related to AMD ROCm) label on Oct 8, 2025
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
@mawong-amd force-pushed the mawong/enable_flex_attn branch from 3cd0c5d to f643ec2 on October 8, 2025 19:43
@mawong-amd marked this pull request as ready for review on October 8, 2025 19:43
@DarkLight1337 enabled auto-merge (squash) on October 9, 2025 04:27
github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Oct 9, 2025
@DarkLight1337 merged commit de253d6 into vllm-project:main on Oct 9, 2025
46 checks passed
@mawong-amd deleted the mawong/enable_flex_attn branch on October 9, 2025 15:48
845473182 pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Oct 10, 2025
…to loader

* 'loader' of https://github.com/dsxsteven/vllm_splitPR: (778 commits)
  [torchao] Add support for ModuleFqnToConfig using regex (vllm-project#26001)
  Add: Support for multiple hidden layers in Eagle3 (vllm-project#26164)
  Enable `RMSNorm` substitution for Transformers backend (vllm-project#26353)
  [Model] Gemma3: Fix GGUF loading and quantization (vllm-project#26189)
  Bump Flashinfer to v0.4.0 (vllm-project#26326)
  Update Dockerfile and install runai-model-streamer[gcs] package (vllm-project#26464)
  [Core] Relax the LoRA  max rank (vllm-project#26461)
  [CI/Build] Fix model nightly tests (vllm-project#26466)
  [Hybrid]: Decouple Kernel Block Size from KV Page Size (vllm-project#24486)
  [Core][KVConnector] Propagate all tokens on resumed preemptions (vllm-project#24926)
  [MM][Doc] Add documentation for configurable mm profiling (vllm-project#26200)
  [Hardware][AMD] Enable FlexAttention backend on ROCm (vllm-project#26439)
  [Bugfix] Incorrect another MM data format in vllm bench throughput (vllm-project#26462)
  [Bugfix] Catch and log invalid token ids in detokenizer #2 (vllm-project#26445)
  [Minor] Change warning->warning_once in preprocess (vllm-project#26455)
  [Bugfix] Set the minimum python version for gpt-oss (vllm-project#26392)
  [Misc] Redact ray runtime env before logging (vllm-project#26302)
  Separate MLAAttention class from Attention (vllm-project#25103)
  [Attention] Register FLASHMLA_SPARSE (vllm-project#26441)
  [Kernels] Modular kernel refactor (vllm-project#24812)
  ...
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025

Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025

Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025

Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025

Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>