[Encoder Decoder] Add flash_attn kernel support for encoder-decoder models #9559
base: main
Conversation
Pull from head
I think it won't be too difficult to support mllama+flashattention. @sroy745 ping me if you need more background information. I'll go through the code later today.
Thanks for the great work. I left some comments, mainly about simplifying the logic for the different AttentionType cases.
kv_cache[0],
kv_cache[1],
updated_slot_mapping.flatten()
if updated_slot_mapping is not None else None,
I think we do not need this branch. In the decode phase with attn_type == ENCODER_DECODER, key and value should be None, so we never enter the true branch of if (attn_type != AttentionType.ENCODER) and (key is not None) and (value is not None):
I think we can add a comment to explain this and remove the branch.
Yes, this is not needed, as you mentioned. I had it because I was getting a mypy type-check error. Removed this condition and instead added an ignore annotation.
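(Editorial note: a minimal, self-contained sketch of the pattern discussed here, i.e. a guard on key/value plus a type-ignore instead of the extra None branch. The function and cache layout below are placeholders for illustration, not the actual vLLM cache-write op.)

from typing import Optional

import torch


def _write_prefill_kv(key: Optional[torch.Tensor],
                      value: Optional[torch.Tensor],
                      updated_slot_mapping: Optional[torch.Tensor],
                      kv_cache: torch.Tensor) -> None:
    """Toy stand-in for the backend's cache-write path (not vLLM code)."""
    if key is not None and value is not None:
        # In the decode phase of an encoder-decoder model, key/value are None,
        # so this branch is skipped. When we do get here, updated_slot_mapping
        # is always populated, so the extra "if ... is not None else None"
        # guard is unnecessary; mypy cannot infer that, hence the ignore.
        slots = updated_slot_mapping.flatten()  # type: ignore[union-attr]
        kv_cache[0].view(-1, *key.shape[1:])[slots] = key
        kv_cache[1].view(-1, *value.shape[1:])[slots] = value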
raise AttributeError(f"Invalid attention type {str(attn_type)}")

def _get_num_prefill_encode_decode_tokens( |
If this function is the same as the one in the xformers backend, can we move it to utils.py and call it from both the flash_attn and xformers backends?
Moved it to utils and am using it now in the xformers backend. However, there is a slight difference in the way I set num_encoder_tokens when attention_type = DECODER, as you noted in your other comment.
if (attn_type == AttentionType.ENCODER or \
        attn_type == AttentionType.ENCODER_DECODER):
    key = key[:num_encoder_tokens]
    value = value[:num_encoder_tokens]
else:
    key = key[:num_prefill_tokens]
    value = value[:num_prefill_tokens]
If you make _get_num_prefill_encode_decode_tokens() set num_encoder_tokens = attn_metadata.num_prefill_tokens when attn_type == AttentionType.DECODER, like the xformers backend does, you can avoid this branch (roughly as sketched below). num_encoder_tokens here plays a role similar to q_len.
Also, I am not sure it is correct to remove these lines and pass the full key and value to the attention kernel.
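(Editorial note: to illustrate the suggestion, a minimal sketch of the helper with the DECODER case folded in. The metadata field names and import path are assumptions based on this thread, not the exact vLLM code.)

from vllm.attention import AttentionType  # assumed import path


def _get_num_prefill_encode_decode_tokens(attn_metadata, attn_type):
    """Illustrative sketch only; field names are assumed from the thread."""
    if attn_type == AttentionType.DECODER:
        # Decoder self-attention: K/V length matches the prefill query length,
        # so the caller can always slice key/value by num_encoder_tokens.
        num_encoder_tokens = attn_metadata.num_prefill_tokens
    else:
        # ENCODER / ENCODER_DECODER: K/V come from the encoder sequence.
        num_encoder_tokens = attn_metadata.num_encoder_tokens
    return (attn_metadata.num_prefill_tokens, num_encoder_tokens,
            attn_metadata.num_decode_tokens)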
I am slightly in favor of this approach because, when the AttentionType is DECODER, it seems more intuitive to split the key based on prefill_tokens rather than encoder_tokens; encoder_tokens seems more relevant when the AttentionType is ENCODER or ENCODER_DECODER.
I modified the xformers code to use the same split, since I am now using the common get_num_prefill_encode_decode_tokens. Please let me know your preference and I will update both backends accordingly.
I think the goal of _get_num_prefill_encode_decode_tokens is to unify different attention types and avoid branching in the following code path as much as possible. What about renaming the three variables to make it clearer, e.g., num_prefill_query_tokens, num_prefill_kv_tokens, num_decode_query_tokens?
Done. Renamed them to num_prefill_query_tokens, num_prefill_kv_tokens, and num_decode_query_tokens, and removed the if branches in the backends.
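(Editorial note: a hedged sketch of what the branch-free prefill split can look like once the helper returns the renamed values. The wrapper function and the helper's final name/location are assumed for illustration, not quoted from the PR.)

def split_prefill_and_decode(query, key, value, attn_metadata, attn_type):
    """Illustration only: split tensors without per-attention-type branches."""
    # Shared helper, assumed to live in utils.py after the refactor.
    (num_prefill_query_tokens, num_prefill_kv_tokens,
     num_decode_query_tokens) = get_num_prefill_encode_decode_tokens(
         attn_metadata, attn_type)
    # Queries are split by query-token counts; keys/values by kv-token counts,
    # which already encode the ENCODER / ENCODER_DECODER vs DECODER difference.
    decode_query = query[num_prefill_query_tokens:]
    prefill_query = query[:num_prefill_query_tokens]
    prefill_key = key[:num_prefill_kv_tokens]
    prefill_value = value[:num_prefill_kv_tokens]
    # num_decode_query_tokens would be used later on the decode path.
    return prefill_query, prefill_key, prefill_value, decode_query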
Thanks for the review. Addressed your comments. PTAL.
Thanks for your fix. I left some comments.
Thanks for the review. Addressed comments. PTAL.
This pull request has merge conflicts that must be resolved before it can be merged.
LGTM. Thanks for your hard work on this. Looking forward to the follow-up PRs for test_encoder_decoder_attention and mllama support.
Also CC @WoosukKwon. You may need to sync this PR to v1 later.
@ywang96 PTAL when you get a chance. The PR has been LG'ed by @heheda12345, is synced to head, and all tests are passing.
This pull request has merge conflicts that must be resolved before it can be merged.
This PR adds support for the flash attention kernel for encoder-decoder models. For encoder-decoder models with dtype=bfloat16, the default backend choice is now FlashAttention instead of XFormers. However, for llama-3.2-11b-vision-instruct we still use the XFormers backend even with dtype=bfloat16, because the model implementation (models/mllama.py) has a dependency on PagedAttention.
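(Editorial note: a rough, toy illustration of the backend choice described above; this is not vLLM's actual backend selection code.)

import torch


def choose_encoder_decoder_backend(dtype: torch.dtype,
                                   model_needs_paged_attention: bool) -> str:
    """Toy illustration of the behavior described above, not vLLM's selector."""
    if dtype == torch.bfloat16 and not model_needs_paged_attention:
        return "FLASH_ATTN"  # new default for encoder-decoder models
    # e.g. llama-3.2-11b-vision-instruct (models/mllama.py) still relies on
    # PagedAttention, so it stays on XFormers for now.
    return "XFORMERS"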
To add this support, we make the following changes in this PR:
#7366