Support encoder_only attention for FlexAttention #22273
Conversation
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
- Fix boolean condition
- Reduce max tokens in generative tests to reduce the chance of divergence
- Clean up memory after LLM run

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
  model_name = "Qwen/Qwen2.5-1.5B-Instruct"
  seed = 42
- max_tokens = 32
+ max_tokens = 24
I've reduced the max tokens a bit because, as the sequence length grows, the chance of divergence increases. On the A100 where I'm testing this, I get the following output on the main branch:
flex_text= ' Paris. The capital of France is also the capital of which country?\nA) Germany\nB) Italy\nC) Spain\nD) United Kingdom\nE'
default_text=' Paris. The capital of France is also the capital of which country?\nA) Germany\nB) Italy\nC) Spain\nD) Belgium\nE)'
- key_cache,
- value_cache,
+ key_tensor,
+ value_tensor,
I've renamed the variables here because in one case they are key and value, and in the other case they are key_cache and value_cache.
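A neutral name reads correctly at both call sites. Purely as an illustration (the function and parameter names below are hypothetical, not the PR's actual code), the same attention helper may receive either the raw per-token key/value projections (encoder-only path, no KV cache) or tensors gathered from the paged KV cache (decoder path):

```python
import torch

# Hypothetical helper, not vLLM's real signature: "key_tensor"/"value_tensor"
# is accurate whether the caller passes raw projections (encoder-only) or
# tensors read back from the KV cache (decoder).
def attend(query: torch.Tensor,
           key_tensor: torch.Tensor,
           value_tensor: torch.Tensor,
           causal: bool) -> torch.Tensor:
    return torch.nn.functional.scaled_dot_product_attention(
        query, key_tensor, value_tensor, is_causal=causal)
```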
Code Review
This pull request adds support for encoder-only attention to FlexAttention, which is a key feature for running embedding models like BERT at high precision. The changes are well-structured, introducing a causal flag to differentiate between decoder and encoder attention paths. The new test case for encoder models is a great addition. My review includes a suggestion to enhance the new test to cover float32 as mentioned in the PR description, and a refactoring suggestion to improve code maintainability by reducing duplication.
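For readers unfamiliar with what the causal flag toggles at the kernel level, here is a minimal standalone sketch using PyTorch's FlexAttention API (illustrative shapes and mask helper, not the PR's code): the decoder path installs a causal block mask, while the encoder-only path simply omits the mask so attention is bidirectional, and float32 inputs are accepted.

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

B, H, S, D = 1, 8, 128, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float32)
           for _ in range(3))

# Decoder-style attention: a query position may only attend to earlier keys.
def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal_mask, B, H, S, S, device="cuda")
out_decoder = flex_attention(q, k, v, block_mask=block_mask)

# Encoder-only attention: no mask, every token attends to every other token.
out_encoder = flex_attention(q, k, v)
```

In practice flex_attention is usually wrapped in torch.compile for performance, but the masking logic is the same.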
LGTM as long as tests pass, but I'll have @drisspg take a look at this as well.
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Looks good
Can you merge from main to fix the build errors?
Test failures are not related, merging |
Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Noam Gat <noamgat@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Paul Pak <paulpak58@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Xiao Yu <xiao.yu@amd.com>
Purpose
This PR adds support for encoder-only attention, which is bidirectional and doesn't use the KV cache. This type of attention is used in embedding and classifier models such as BERT- and RoBERTa-architecture models. They are already supported with flash attention after PR #21270, but flash attention only supports float16 and bfloat16. To be able to run embedding benchmarks such as MTEB at the highest precision, we need support for float32.
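As a hedged usage sketch (the model name, the VLLM_ATTENTION_BACKEND value, and the exact API calls below are assumptions about typical usage, not taken from this PR), an embedding model could then be served at float32 roughly like this:

```python
import os

# Assumed backend selector value; flash attention would not accept float32 here.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLEX_ATTENTION"

from vllm import LLM

# BAAI/bge-base-en-v1.5 is just an illustrative BERT-style embedding model.
llm = LLM(model="BAAI/bge-base-en-v1.5", task="embed", dtype="float32")
outputs = llm.embed(["The capital of France is Paris."])
print(len(outputs[0].outputs.embedding))  # embedding dimensionality
```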
Test Plan
I've added an embedding test in the kernel tests.
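Roughly, such a test can be structured as a correctness comparison against a reference implementation (this is an illustrative pytest-style sketch with an assumed model, backend value, and tolerance, not the exact test added here):

```python
import os

import torch

def test_flex_attention_float32_embeddings():
    # Assumed way of forcing the FlexAttention backend.
    os.environ["VLLM_ATTENTION_BACKEND"] = "FLEX_ATTENTION"

    from vllm import LLM
    llm = LLM(model="BAAI/bge-base-en-v1.5", task="embed", dtype="float32")
    vllm_emb = torch.tensor(
        llm.embed(["The capital of France is Paris."])[0].outputs.embedding)

    # Reference embedding from sentence-transformers for the same model.
    from sentence_transformers import SentenceTransformer
    ref = SentenceTransformer("BAAI/bge-base-en-v1.5")
    ref_emb = torch.tensor(ref.encode(["The capital of France is Paris."])[0])

    sim = torch.nn.functional.cosine_similarity(vllm_emb, ref_emb, dim=-1)
    assert sim.item() > 0.99  # assumed tolerance
```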
Test Results
I've verified that the tests are passing on an A100.
cc: @DarkLight1337 , @russellb , @drisspg