[V1] Add sliding window support to Flex Attention backend #24089
Conversation
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Also cc @drisspg for visibility
```python
def build_block_mask(self) -> BlockMask:
    if self.causal:
        if self.sliding_window is not None:
            ...
```
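For context, the quoted branch is where a sliding-window mask would compose with the causal mask. A minimal, self-contained sketch of that composition using torch's flex_attention utilities (the helper name, sizes, and the `q_idx - kv_idx < window` off-by-one convention are illustrative, not vLLM's code):

```python
from torch.nn.attention.flex_attention import create_block_mask

def sliding_window_causal(window: int):
    """Return a mask_mod: q sees kv iff kv <= q and q - kv < window."""
    def mask_mod(b, h, q_idx, kv_idx):
        causal = q_idx >= kv_idx
        in_window = (q_idx - kv_idx) < window
        return causal & in_window
    return mask_mod

# Build a BlockMask for a 1024-token sequence with a 256-token window.
# B=None / H=None broadcast the mask over batch and heads.
block_mask = create_block_mask(
    sliding_window_causal(256),
    B=None, H=None, Q_LEN=1024, KV_LEN=1024, device="cpu",
)
```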
I think this would still work with the direct build path. Are your new tests checking this?
Fixed in 181f15d. The new test can cover the direct build code path; it's currently disabled for torch 2.8:

```python
use_direct_block_mask = is_torch_equal_or_newer("2.9.0.dev0")
if backend == "FLEX_ATTENTION_SLOW":
    actual_backend = _Backend.FLEX_ATTENTION
    use_direct_block_mask = False
```

I modified it to this locally and confirmed it passed on torch 2.8:

```python
use_direct_block_mask = True
if backend == "FLEX_ATTENTION_SLOW":
    actual_backend = _Backend.FLEX_ATTENTION
    use_direct_block_mask = False
```

But I didn't push it in that commit, just in case something still needs direct build disabled.
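As background (a hedged sketch, not vLLM's code): the "direct build" path refers to assembling a BlockMask from precomputed block indices via `BlockMask.from_kv_blocks`, rather than evaluating a `mask_mod` over every block pair with `create_block_mask`. A toy construction:

```python
import torch
from torch.nn.attention.flex_attention import BlockMask

# One batch, one head, two query blocks of 128 tokens each.
# Query block 0 sees 1 KV block (block 0); query block 1 sees blocks 0 and 1.
kv_num_blocks = torch.tensor([[[1, 2]]], dtype=torch.int32)
kv_indices = torch.tensor([[[[0, 0], [0, 1]]]], dtype=torch.int32)

block_mask = BlockMask.from_kv_blocks(
    kv_num_blocks,
    kv_indices,
    BLOCK_SIZE=128,
    # Partial blocks are still refined element-wise by mask_mod at kernel time.
    mask_mod=lambda b, h, q_idx, kv_idx: q_idx >= kv_idx,
)
```

The appeal is that block-level sparsity can come straight from existing metadata (e.g. the page table) instead of a full mask evaluation; the element-level `mask_mod` must still be correct, which is why the sliding-window case needs care.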
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Please test Alibaba-NLP/gte-reranker-modernbert-base and google/embeddinggemma-300m (need to manually set dtype = float32) to ensure the results of bi-directional attention + sliding window + Flex Attention are correct: `pytest tests/models/language/pooling/test_st_projector.py::test_embed_models_mteb[model_info1]` (see #24318)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@noooop Have confirmed both tests pass with fp32 locally now.
Different from vllm/vllm/v1/attention/backends/flex_attention.py, lines 459 to 467 in a8c0f59.

In fact, the key is:

```python
# update mask mod in attention metadata
attn_metadata.mask_mod = attn_metadata.get_mask_mod()
attn_metadata.block_mask = (
    attn_metadata._build_block_mask_direct())
```
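To make the quoted mechanism concrete, here is a minimal sketch of the rebuild pattern. The class and method names mirror the quoted snippet, but the bodies are illustrative stand-ins, not vLLM's FlexAttentionMetadata (whose direct variant assembles the mask from the page table rather than calling `create_block_mask`):

```python
from dataclasses import dataclass
from typing import Callable, Optional

from torch.nn.attention.flex_attention import BlockMask, create_block_mask

@dataclass
class ToyFlexMetadata:
    seq_len: int
    sliding_window: Optional[int] = None
    mask_mod: Optional[Callable] = None
    block_mask: Optional[BlockMask] = None

    def get_mask_mod(self) -> Callable:
        """Pick the mask_mod matching the current sliding_window setting."""
        window = self.sliding_window
        def mask_mod(b, h, q_idx, kv_idx):
            ok = q_idx >= kv_idx  # causal
            if window is not None:
                ok = ok & ((q_idx - kv_idx) < window)
            return ok
        return mask_mod

    def rebuild_block_mask(self) -> BlockMask:
        """Rebuild the BlockMask after mask_mod changed (the dynamic step)."""
        return create_block_mask(
            self.mask_mod, B=None, H=None,
            Q_LEN=self.seq_len, KV_LEN=self.seq_len, device="cpu",
        )

# Per-layer switch: swap in the window-aware mask_mod, then rebuild.
meta = ToyFlexMetadata(seq_len=512, sliding_window=128)
meta.mask_mod = meta.get_mask_mod()
meta.block_mask = meta.rebuild_block_mask()
```

This per-layer rebuild is the "dynamic creation" flagged as not ideal in the next comment.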
I think this is fine as is, but can you create some issues for the follow-up work? The dynamic creation of the block_mask is not ideal.
Did you try some e2e model? I tried one; the logs are attached.
I tried it and just ran it locally; full logs attached. I suspect the dynamic block_mask creation caused this issue when using the hybrid allocator, let me investigate.
+1, PTAL at #24872 (comment). (I only saw the keyword 'compile', so it may not be related.)
Oh, it seems we have to disable the hybrid allocator when using FlexAttention 😢.
By the way, welcome to use and fix what you need!
The direct build path should skip non-intra-window blocks if the page table correctly evicts those blocks.
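To illustrate the block-skipping arithmetic: with sparse block size B and window W, the query block covering positions [q_start, q_end] only needs KV blocks that intersect [q_start - W + 1, q_end]. A hedged, self-contained sketch (the helper is hypothetical, not vLLM code):

```python
def needed_kv_blocks(q_block: int, block_size: int, window: int,
                     num_kv_blocks: int) -> list[int]:
    """KV blocks a causal sliding-window query block must read."""
    q_start = q_block * block_size
    q_end = q_start + block_size - 1
    first = max(0, (q_start - window + 1) // block_size)
    last = min(num_kv_blocks - 1, q_end // block_size)
    return list(range(first, last + 1))

# block_size=16, window=32: query block 4 covers positions 64..79 and only
# needs KV blocks 2..4, so blocks 0-1 can be skipped (or evicted upstream).
print(needed_kv_blocks(4, 16, 32, 8))  # [2, 3, 4]
```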
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
LGTM!
Tracking some follow-ups:
- https://github.com/vllm-project/vllm/pull/24089/files#r2341783626
- support hybrid allocator
- support real sliding window when using direct build
Purpose
Test Plan
Test Result
Test should still pass.