[Kernel][Model] PagedAttention: Support custom attention bias for T5 model (1/2) #11334
Conversation
qk += (alibi_slope != 0) ? alibi_slope * (token_idx - seq_len + 1) : 0;
qk += (attn_bias_vec != nullptr) ? attn_bias_vec[token_idx] : 0;
T5 uses relative bias, so I believe attn_bias_vec should instead be indexed with seq_len - token_idx - 1.

Suggested change:
- qk += (attn_bias_vec != nullptr) ? attn_bias_vec[token_idx] : 0;
+ const int custom_bias_idx = max(min(seq_len - token_idx - 1, padded_max_seq_len), 0);
+ qk += (attn_bias_vec != nullptr) ? attn_bias_vec[custom_bias_idx] : 0;
I have tested it with T5 using a similar implementation.
@NickLucche In the case of T5, it can be limited to the maximum relative distance, which for T5 Large is 128. Anything beyond 128 just uses the last value from the bias tensor.
Thanks for the quick input @sfc-gh-phalama. Personally, I'd rather not have the attn bias implementation depend on or be tied to T5/relative positional encoding mechanisms; otherwise we could probably implement the inline formula.
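(For context, a rough sketch of what such an inline formula could look like if the kernel did compute T5-style relative-position buckets itself; num_buckets=32 and max_distance=128 are the usual T5 defaults, and everything below is illustrative rather than code from this PR. The bucket would still need to index a learned bias table, which is part of why keeping the kernel model-agnostic is attractive.)

```cuda
// Illustrative sketch only (not part of this PR): T5-style relative-position
// bucketing for the causal/decoder case, with the usual defaults
// num_buckets=32, max_distance=128.
__device__ inline int t5_relative_position_bucket(int relative_position,
                                                  int num_buckets = 32,
                                                  int max_distance = 128) {
  // Causal attention only looks backwards, so work with the non-negative distance.
  int rel = relative_position < 0 ? -relative_position : 0;
  const int max_exact = num_buckets / 2;
  if (rel < max_exact) {
    return rel;  // small distances each get their own bucket
  }
  // Larger distances are mapped logarithmically onto the remaining buckets.
  int large = max_exact +
              static_cast<int>(logf(static_cast<float>(rel) / max_exact) /
                               logf(static_cast<float>(max_distance) / max_exact) *
                               (num_buckets - max_exact));
  return min(large, num_buckets - 1);
}
```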
Force-pushed from 7c8dd7e to 5c47f43.
Force-pushed from e858a1f to f37d776.
Fixed the failing kernel tests and rebased; not sure how related the rest of the errors are, tbh.
@NickLucche the gpu count issue might be fixed by this PR: #12111
Force-pushed from f37d776 to ca9562b.
CI failures still look unrelated to me :/
I believe this PR could also be useful in places like https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/mllama.py#L863, where we can pass in the custom attention mask as the bias term, while getting rid of the logic to handle the kv cache in the model.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from ca9562b to c7c983d.
Rebased
I'm partway through all the files, but I have 2 initial questions:
- Do the attn_bias arguments need to be added to the other attention backends?
- Is the new tensor materialised even if the attention bias is not used?
Hey, thanks for the review!
The initial effort was planned only for the pagedattn backend, so that we could run at least with xformers + pagedattn. FlashAttn would require a separate PR on the base repo or on the vLLM-maintained fork, so that requires more involvement.
Nope, attn_bias is an optional pointer; when the pointer is null it's a no-op.
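(For context, a minimal sketch of the optional-pointer pattern being described, mirroring how alibi_slopes is already dispatched on the host side; the helper name and signature below are assumptions for illustration, not this PR's exact code.)

```cpp
// Sketch only (assumed names/signature, not this PR's exact code): when no bias
// tensor is supplied the host passes nullptr, so the kernel's guarded bias read
// stays a no-op and no extra memory traffic happens.
#include <torch/all.h>

const float* get_attn_bias_ptr(const c10::optional<torch::Tensor>& attn_bias) {
  return attn_bias
             ? reinterpret_cast<const float*>(attn_bias.value().data_ptr())
             : nullptr;
}
// The resulting pointer is forwarded to the paged attention kernel, where each
// read is guarded, e.g.:
//   qk += (attn_bias_vec != nullptr) ? attn_bias_vec[token_idx] : 0;
```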
This contribution is the first of two PRs aiming to add support for the T5 model.
Based on the great work done in #7366 (comment) and #3117.
It implements:
In particular, the following aims to be a more generalized addition (not tied to T5), enabling any custom attention bias in the PagedAttention kernel.
While I am aware this introduces a fully materialized matrix (akin to what's done in xformers), currently padded to max_seq_len, I believe the flexibility brought by allowing any custom bias would still pay for itself by providing initial compatibility for yet-unsupported models (such as T5).
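(To make "padded to max_seq_len" concrete, here is a small sketch of how a kernel could locate the per-(sequence, head) bias row in such a materialized tensor; the [num_seqs, num_heads, padded_max_seq_len] layout and the names below are assumptions for illustration, not necessarily the exact layout used here.)

```cpp
// Illustrative sketch (assumed layout, not necessarily this PR's): with a fully
// materialized bias of shape [num_seqs, num_heads, padded_max_seq_len], each
// (sequence, head) pair owns one contiguous row of per-key-token bias values.
inline const float* attn_bias_row(const float* attn_bias,  // may be nullptr
                                  int seq_idx, int head_idx, int num_heads,
                                  int padded_max_seq_len) {
  if (attn_bias == nullptr) return nullptr;  // bias disabled: keep the no-op path
  return attn_bias + (static_cast<long>(seq_idx) * num_heads + head_idx) *
                         padded_max_seq_len;
}
// Inside the kernel's inner loop the row is then read per key token:
//   qk += (attn_bias_vec != nullptr) ? attn_bias_vec[token_idx] : 0;
```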
I'd be happy to focus on performance optimization later (e.g. a variable-length bias matrix or FlashAttention-like IO optimization), or even have yet another model-specific feature in the kernel for T5 to compute relative positional embeddings.
TODO work left for future contributions (i.e. extend to all platforms):