Support encoder_only attention for FlexAttention #22273
Conversation
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
- Fix boolean condition
- Reduce max tokens in generative tests to reduce the chance of divergence
- Clean up memory after LLM run

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
  model_name = "Qwen/Qwen2.5-1.5B-Instruct"
  seed = 42
- max_tokens = 32
+ max_tokens = 24
I've reduced the max tokens a bit because, as the sequence length grows, the chance of divergence increases. On the A100 where I'm testing this, I get the following output on the main branch:
flex_text= ' Paris. The capital of France is also the capital of which country?\nA) Germany\nB) Italy\nC) Spain\nD) United Kingdom\nE'
default_text=' Paris. The capital of France is also the capital of which country?\nA) Germany\nB) Italy\nC) Spain\nD) Belgium\nE)'
- key_cache,
- value_cache,
+ key_tensor,
+ value_tensor,
I've renamed the variables here because in one case they are key and value, and in the other case they are key_cache and value_cache.
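A neutral name reads correctly at both call sites. Purely as an illustration (the function and parameter names below are hypothetical, not the PR's actual code), the same attention helper may receive either the raw per-token key/value projections (encoder-only path, no KV cache) or tensors gathered from the paged KV cache (decoder path):

```python
import torch

# Hypothetical helper, not vLLM's real signature: "key_tensor"/"value_tensor"
# is accurate whether the caller passes raw projections (encoder-only) or
# tensors read back from the KV cache (decoder).
def attend(query: torch.Tensor,
           key_tensor: torch.Tensor,
           value_tensor: torch.Tensor,
           causal: bool) -> torch.Tensor:
    return torch.nn.functional.scaled_dot_product_attention(
        query, key_tensor, value_tensor, is_causal=causal)
```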
Code Review
This pull request adds support for encoder-only attention to FlexAttention, which is a key feature for running embedding models like BERT at high precision. The changes are well-structured, introducing a causal flag to differentiate between decoder and encoder attention paths. The new test case for encoder models is a great addition. My review includes a suggestion to enhance the new test to cover float32 as mentioned in the PR description, and a refactoring suggestion to improve code maintainability by reducing duplication.
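For readers unfamiliar with what the causal flag toggles at the kernel level, here is a minimal standalone sketch using PyTorch's FlexAttention API (illustrative shapes and mask helper, not the PR's code): the decoder path installs a causal block mask, while the encoder-only path simply omits the mask so attention is bidirectional, and float32 inputs are accepted.

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

B, H, S, D = 1, 8, 128, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float32)
           for _ in range(3))

# Decoder-style attention: a query position may only attend to earlier keys.
def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal_mask, B, H, S, S, device="cuda")
out_decoder = flex_attention(q, k, v, block_mask=block_mask)

# Encoder-only attention: no mask, every token attends to every other token.
out_encoder = flex_attention(q, k, v)
```

In practice flex_attention is usually wrapped in torch.compile for performance, but the masking logic is the same.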
LGTM as long as tests pass, but I'll have @drisspg take a look at this as well.
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Looks good
Can you merge from main to fix the build errors?
Test failures are not related, merging |
Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Noam Gat <noamgat@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Paul Pak <paulpak58@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Xiao Yu <xiao.yu@amd.com>
Purpose
This PR adds support for encoder-only attention, which is bidirectional and doesn't use the KV cache. This type of attention is used in embedding and classifier models such as BERT- and RoBERTa-architecture models. They are already supported with flash attention after PR #21270, but flash attention only supports float16 and bfloat16. To be able to run embedding benchmarks such as MTEB at the highest precision, we need support for float32.
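As a hedged usage sketch (the model name, the VLLM_ATTENTION_BACKEND value, and the exact API calls below are assumptions about typical usage, not taken from this PR), an embedding model could then be served at float32 roughly like this:

```python
import os

# Assumed backend selector value; flash attention would not accept float32 here.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLEX_ATTENTION"

from vllm import LLM

# BAAI/bge-base-en-v1.5 is just an illustrative BERT-style embedding model.
llm = LLM(model="BAAI/bge-base-en-v1.5", task="embed", dtype="float32")
outputs = llm.embed(["The capital of France is Paris."])
print(len(outputs[0].outputs.embedding))  # embedding dimensionality
```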
Test Plan
I've added an embedding test in the kernel tests.
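Roughly, such a test can be structured as a correctness comparison against a reference implementation (this is an illustrative pytest-style sketch with an assumed model, backend value, and tolerance, not the exact test added here):

```python
import os

import torch

def test_flex_attention_float32_embeddings():
    # Assumed way of forcing the FlexAttention backend.
    os.environ["VLLM_ATTENTION_BACKEND"] = "FLEX_ATTENTION"

    from vllm import LLM
    llm = LLM(model="BAAI/bge-base-en-v1.5", task="embed", dtype="float32")
    vllm_emb = torch.tensor(
        llm.embed(["The capital of France is Paris."])[0].outputs.embedding)

    # Reference embedding from sentence-transformers for the same model.
    from sentence_transformers import SentenceTransformer
    ref = SentenceTransformer("BAAI/bge-base-en-v1.5")
    ref_emb = torch.tensor(ref.encode(["The capital of France is Paris."])[0])

    sim = torch.nn.functional.cosine_similarity(vllm_emb, ref_emb, dim=-1)
    assert sim.item() > 0.99  # assumed tolerance
```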
Test Results
I've verified that the tests are passing on an A100.
cc: @DarkLight1337 , @russellb , @drisspg