[Kernel][Model] PagedAttention: Support custom attention bias for T5 model (1/2) #11334
Conversation
qk += (alibi_slope != 0) ? alibi_slope * (token_idx - seq_len + 1) : 0;
qk += (attn_bias_vec != nullptr) ? attn_bias_vec[token_idx] : 0;
T5 uses relative bias, so I believe attn_bias_vec should instead be indexed with seq_len - token_idx - 1.

Suggested change:
- qk += (attn_bias_vec != nullptr) ? attn_bias_vec[token_idx] : 0;
+ const int custom_bias_idx = max(min(seq_len - token_idx - 1, padded_max_seq_len), 0);
+ qk += (attn_bias_vec != nullptr) ? attn_bias_vec[custom_bias_idx] : 0;
I have tested it with T5 using a similar implementation.
@NickLucche In the case of T5, it can be limited to the maximum relative distance, which for T5 Large is 128. Anything beyond 128 just uses the last value from the bias tensor.
Thanks for the quick input @sfc-gh-phalama. Personally, I'd rather not have the attn bias implementation depend on or be tied to T5/relative positional encoding mechanisms; otherwise we could probably implement the inline formula.
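(For context, a rough sketch of what such an inline formula could look like if the kernel did compute T5-style relative-position buckets itself; num_buckets=32 and max_distance=128 are the usual T5 defaults, and everything below is illustrative rather than code from this PR. The bucket would still need to index a learned bias table, which is part of why keeping the kernel model-agnostic is attractive.)

```cuda
// Illustrative sketch only (not part of this PR): T5-style relative-position
// bucketing for the causal/decoder case, with the usual defaults
// num_buckets=32, max_distance=128.
__device__ inline int t5_relative_position_bucket(int relative_position,
                                                  int num_buckets = 32,
                                                  int max_distance = 128) {
  // Causal attention only looks backwards, so work with the non-negative distance.
  int rel = relative_position < 0 ? -relative_position : 0;
  const int max_exact = num_buckets / 2;
  if (rel < max_exact) {
    return rel;  // small distances each get their own bucket
  }
  // Larger distances are mapped logarithmically onto the remaining buckets.
  int large = max_exact +
              static_cast<int>(logf(static_cast<float>(rel) / max_exact) /
                               logf(static_cast<float>(max_distance) / max_exact) *
                               (num_buckets - max_exact));
  return min(large, num_buckets - 1);
}
```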
Force-pushed from 7c8dd7e to 5c47f43.
Force-pushed from e858a1f to f37d776.
Fixed the failing kernel tests and rebased; not sure how related the rest of the errors are, tbh.
@NickLucche the gpu count issue might be fixed by this PR: #12111
Force-pushed from f37d776 to ca9562b.
CI failures still look unrelated to me :/
I believe this PR could also be useful in places like https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/mllama.py#L863, where we can pass in the custom attention mask as the bias term, while getting rid of the logic to handle the kv cache in the model.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from ca9562b to c7c983d.
Rebased
I'm partway through all the files, but I have 2 initial questions:
- Do the attn_bias arguments need to be added to the other attention backends?
- Is the new tensor materialised even if the attention bias is not used?
Hey, thanks for the review!
The initial effort was planned only for the pagedattn backend, so that we could run at least with xformers + pagedattn. FlashAttn would require a separate PR on the base repo or on the vLLM-maintained fork, so that requires more involvement.
Nope, attn_bias is an optional pointer; when the pointer is null it's a no-op.
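(For context, a minimal sketch of the optional-pointer pattern being described, mirroring how alibi_slopes is already dispatched on the host side; the helper name and signature below are assumptions for illustration, not this PR's exact code.)

```cpp
// Sketch only (assumed names/signature, not this PR's exact code): when no bias
// tensor is supplied the host passes nullptr, so the kernel's guarded bias read
// stays a no-op and no extra memory traffic happens.
#include <torch/all.h>

const float* get_attn_bias_ptr(const c10::optional<torch::Tensor>& attn_bias) {
  return attn_bias
             ? reinterpret_cast<const float*>(attn_bias.value().data_ptr())
             : nullptr;
}
// The resulting pointer is forwarded to the paged attention kernel, where each
// read is guarded, e.g.:
//   qk += (attn_bias_vec != nullptr) ? attn_bias_vec[token_idx] : 0;
```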
This contribution is the first of two PRs aiming to add support for the T5 model.
Based on the great work done in #7366 (comment) and #3117.
It implements:
In particular, the following aims to be a more generalized addition (not tied to T5), enabling any custom attention bias in the PagedAttention kernel.
While I am aware this introduces a fully materialized matrix (akin to what's done in xformers), currently padded to max_seq_len, I believe the flexibility brought by allowing any custom bias would still pay for itself by providing initial compatibility for yet-unsupported models (such as T5).
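(To make "padded to max_seq_len" concrete, here is a small sketch of how a kernel could locate the per-(sequence, head) bias row in such a materialized tensor; the [num_seqs, num_heads, padded_max_seq_len] layout and the names below are assumptions for illustration, not necessarily the exact layout used here.)

```cpp
// Illustrative sketch (assumed layout, not necessarily this PR's): with a fully
// materialized bias of shape [num_seqs, num_heads, padded_max_seq_len], each
// (sequence, head) pair owns one contiguous row of per-key-token bias values.
inline const float* attn_bias_row(const float* attn_bias,  // may be nullptr
                                  int seq_idx, int head_idx, int num_heads,
                                  int padded_max_seq_len) {
  if (attn_bias == nullptr) return nullptr;  // bias disabled: keep the no-op path
  return attn_bias + (static_cast<long>(seq_idx) * num_heads + head_idx) *
                         padded_max_seq_len;
}
// Inside the kernel's inner loop the row is then read per key token:
//   qk += (attn_bias_vec != nullptr) ? attn_bias_vec[token_idx] : 0;
```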
I'd be happy to focus on performance optimization later (e.g. a variable-length bias matrix or FlashAttention-like IO optimization), or even have yet another model-specific feature in the kernel for T5 to compute relative positional embeddings.
TODO work left for future contributions (i.e. extend to all platforms):