Upgrade FA3 for attention sink #22313
Conversation
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Code Review
This pull request updates the flash-attention dependency to a newer commit, with the stated goal of upgrading to a version that supports attention sinks in FlashAttention v3. While updating the dependency is a necessary step, the pull request appears to be incomplete. There are no corresponding code changes within the vLLM codebase to actually configure or utilize the new attention sink functionality. The implementation seems to continue using a sliding window with a hardcoded sink size of 0, which means the primary objective of this PR is not met.
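As a concrete illustration of the missing piece, here is a minimal sketch of how per-head sink logits might be forwarded to the kernel. It assumes the upgraded vllm-flash-attn build accepts them through a keyword argument (shown here as `s_aux`); the argument name, import path, and wiring are assumptions for illustration, not the confirmed API or the actual vLLM change.

```python
# Hedged sketch only: assumes the upgraded vllm-flash-attn FA3 build accepts
# per-head sink logits via a keyword argument (called `s_aux` here).
# Argument name, import path, and wiring are illustrative assumptions.
from typing import Optional

import torch

from vllm.vllm_flash_attn import flash_attn_varlen_func  # import path assumed


def varlen_attention_with_sinks(
    q: torch.Tensor,            # [total_q_tokens, num_heads, head_dim]
    k: torch.Tensor,            # [total_kv_tokens, num_kv_heads, head_dim]
    v: torch.Tensor,            # [total_kv_tokens, num_kv_heads, head_dim]
    cu_seqlens_q: torch.Tensor,
    cu_seqlens_k: torch.Tensor,
    max_seqlen_q: int,
    max_seqlen_k: int,
    scale: float,
    sinks: Optional[torch.Tensor] = None,  # [num_heads] learned sink logits
) -> torch.Tensor:
    extra_kwargs = {}
    if sinks is not None:
        # Only pass the sink logits when the model defines them, so the call
        # stays compatible with models that have no attention sink.
        extra_kwargs["s_aux"] = sinks
    return flash_attn_varlen_func(
        q,
        k,
        v,
        cu_seqlens_q=cu_seqlens_q,
        cu_seqlens_k=cu_seqlens_k,
        max_seqlen_q=max_seqlen_q,
        max_seqlen_k=max_seqlen_k,
        softmax_scale=scale,
        causal=True,
        **extra_kwargs,
    )
```

Something along these lines, plus plumbing the sink tensor from the model definition down to the attention backend, is what the kernel upgrade alone does not provide.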
  vllm-flash-attn
      GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
-     GIT_TAG 1c2624e53c078854e0637ee566c72fe2107e75f4
+     GIT_TAG b99f8c821771fd11feb66d5c89661e9858fde359
This change updates the flash-attention dependency, and the pull request title suggests this is for enabling 'attention sink' functionality. However, the rest of the codebase does not seem to be updated to use this feature. For example, in vllm/attention/backends/flash_attn.py, the sliding_window parameter for FlashAttention is initialized with a sink size of 0. This means that even with the updated dependency, the attention sink feature will not be active. To fulfill the goal of this PR, code changes are required to configure and utilize the attention sink size. Without them, this PR is incomplete.
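For context, the pattern this comment points at looks roughly like the sketch below. It is a simplified paraphrase under the assumption stated in the comment, not a verbatim excerpt of the backend: the sliding window is normalized into a (left, right) tuple and nothing about sinks is recorded or forwarded to the kernel, so the new dependency by itself changes nothing at runtime.

```python
# Simplified paraphrase of the backend behaviour the comment refers to;
# class and attribute names are illustrative, not an exact copy of
# vllm/attention/backends/flash_attn.py.
from typing import Optional, Tuple


class FlashAttentionImplSketch:
    def __init__(self, sliding_window: Optional[int] = None) -> None:
        # The window is stored as a (left, right) tuple for the kernel;
        # (-1, -1) means "no sliding window".
        self.sliding_window: Tuple[int, int] = (
            (sliding_window - 1, 0) if sliding_window is not None else (-1, -1)
        )
        # No sink size or sink logits are configured here, which is the gap
        # the review comment calls out: the upgraded kernel would need to
        # receive a sink tensor (or size) for the feature to take effect.
        self.sinks = None
```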
No description provided.