
[Flash Attention 2] Add flash attention 2 for GPT-J #28295

Merged
merged 11 commits into huggingface:main on Mar 13, 2024

Conversation

bytebarde (Contributor):

What does this PR do?

Adds Flash Attention 2 for GPT-J
Fixes #26350
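
Once this lands, enabling the new backend should look roughly like this (a minimal usage sketch, not code from this PR; it assumes the flash-attn package is installed and a CUDA GPU with fp16 support is available):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Request the Flash Attention 2 backend added by this PR; everything else is the
# standard GPT-J loading path.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6b",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```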

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

cc: @younesbelkada

@bytebarde (Contributor, Author) commented on Jan 1, 2024:

Current progress with running flash_attn_test. Will dive deeper to fix the error.

[screenshot: flash_attn_test run output]

@susnato (Contributor) commented on Jan 1, 2024:

Hi @bytebarde, what is the error message?
If it is something like "IndexError: tensors used as ...", then updating CUDA could solve it (at least that was the case for me with OPT).

BTW run make fixup to make the CI green!

@bytebarde (Contributor, Author) commented on Jan 2, 2024:

Hi @susnato, thank you so much for your attention to this PR!

I believe the error originates from two factors: (1) my preliminary implementation of GPTJFlashAttention2, which aimed to eliminate "redundant" transposing of the key and query, and (2) the execution of test_flash_attn_2_generate_padding_right using the testing configuration.

To address these issues, I have reinstated the original transposing operations and reverted the QKV cache concatenation. Additionally, I overrode test_flash_attn_2_generate_padding_right to use the actual checkpoint, and all eight tests now pass, similar to what @younesbelkada and you did for Llama 2 and Phi-2.

Currently, the code still has some problems with make fixup; I will work on that next.

[screenshot: flash-attention test run with all eight tests passing]
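
For context on the transposes mentioned above, the layout mismatch is roughly the following (an illustrative sketch, not this PR's code): GPT-J's eager attention and KV cache keep tensors as (batch, num_heads, seq_len, head_dim), while the flash-attn kernels expect (batch, seq_len, num_heads, head_dim).

```python
import torch

# Illustrative shapes only; GPT-J-6B uses 16 heads with head_dim 256.
bsz, num_heads, seq_len, head_dim = 2, 16, 32, 256

# Layout used by the eager attention path and the KV cache.
q = torch.randn(bsz, num_heads, seq_len, head_dim)

# Layout expected by the flash-attn kernels, hence the extra transpose
# reinstated in the Flash Attention 2 forward pass.
q_for_flash = q.transpose(1, 2)
assert q_for_flash.shape == (bsz, seq_len, num_heads, head_dim)
```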

@bytebarde bytebarde changed the title [Flash Attention 2] [WIP] Add flash attention 2 for GPT-J [Flash Attention 2] Add flash attention 2 for GPT-J Jan 4, 2024
@bytebarde (Contributor, Author) commented:

Hi @younesbelkada,

I believe this pull request is now ready for your review.

I'd like to highlight a few changes, especially regarding check_copies.py, that I'm not entirely confident about. To ensure the branch passes the make fixup check, I removed the "copies" lines before both modeling_codegen.CodeGenBlock and test_modeling_gptj.test_flash_attn_2_generate_padding_right, because the changes involved are somewhat complex.
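
(For readers unfamiliar with the mechanism: check_copies.py keys off `# Copied from ...` comments, and make fix-copies forces the marked definition to stay textually in sync with its source. The sketch below illustrates the convention; it does not reproduce the exact lines removed in this PR.)

```python
import torch.nn as nn

# `utils/check_copies.py` keeps the body of a definition carrying this marker in sync
# with the referenced source, applying the listed renames (here GPTJ -> CodeGen).
# Removing the marker exempts the definition from that check.

# Copied from transformers.models.gptj.modeling_gptj.GPTJBlock with GPTJ->CodeGen
class CodeGenBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        # ... layers kept in sync with GPTJBlock ...
```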

I would really appreciate your guidance on this. If there's a more standard or preferable way to handle such intricate changes, please let me know so I can make the necessary adjustments.

Thank you for your time on this!

@younesbelkada (Contributor) left a comment:

Looks clean on my end already! Would you be happy to address the comment about the copy mechanism that has been removed?
Also, can you run the benchmarking script here: https://gist.github.com/younesbelkada/02f35734da906cc0f2389ae4f665c58f with a GPT-J checkpoint and check the speedup (the result should look similar to #26414 (review))? I can take care of pushing the images to the Hub, and then we'll just need to update the docs similarly to 7d4c688.
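
(The linked gist essentially times generation with and without the new backend. A minimal sketch of that kind of comparison, with illustrative settings rather than the gist's exact code:)

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "EleutherAI/gpt-j-6b"  # any GPT-J checkpoint; fp16 to fit in memory
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token
prompts = ["Hello my dog is cute and"] * 8  # batch size 8, as in the runs reported below
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

for impl in ("eager", "flash_attention_2"):
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint, torch_dtype=torch.float16, attn_implementation=impl
    ).to("cuda")
    model.generate(**inputs, max_new_tokens=8, do_sample=False)  # warmup
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=32, do_sample=False)
    torch.cuda.synchronize()
    print(f"{impl}: {time.perf_counter() - start:.3f}s")
    del model
    torch.cuda.empty_cache()
```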

Review thread on tests/models/gptj/test_modeling_gptj.py (outdated, resolved)
@bytebarde (Contributor, Author) commented:

Hi @younesbelkada, thank you very much for your valuable input and guidance! I apologize for the delayed response.

I've addressed the comment regarding the copy mechanism, and the branch successfully passed the make fixup test.

Additionally, I've conducted the speed test. However, the observed speedup was not as significant as what we noted with OPT. The test was performed on an Nvidia RTX 4090, utilizing max-batch-size=8 and max-seqlen=32 to conserve memory. The model checkpoint used was EleutherAI/gpt-j-6b with the revision set to "float16". I've attached the speedup graph below for your review.

[speedup graph: eager vs. Flash Attention 2 generation on GPT-J]

Could you also perform the test on an A100 GPU for comparison?

Thank you once again for your time. I look forward to hearing your thoughts on this!

@younesbelkada (Contributor) left a comment:

Thank you! Can you just rebase / merge with main to make sure the CI passes?

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment:

LGTM. Thanks for adding flash attention support!

Review thread on src/transformers/models/gptj/modeling_gptj.py (outdated, resolved)
@@ -293,7 +560,11 @@ def __init__(self, config):
         super().__init__()
         inner_dim = config.n_inner if config.n_inner is not None else 4 * config.n_embd
         self.ln_1 = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
-        self.attn = GPTJAttention(config)
+        self.attn = (
@ArthurZucker (Collaborator) commented on this diff:

Let's define it and use it!

Suggested change:
-        self.attn = (
+GPTJ_ATTENTION_CLASSES = {
+    "eager": GPTJAttention,
+    "flash_attention_2": GPTJFlashAttention,
+}

younesbelkada and others added 2 commits January 30, 2024 02:54
@younesbelkada (Contributor) left a comment:

Hi @bytebarde
Can you address this comment? https://github.com/huggingface/transformers/pull/28295/files#r1470429885
It shouldn't be super hard; you just need to do something similar to what we do in Llama, specifically https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L746 and the line below:

self.self_attn = LLAMA_ATTENTION_CLASSES[config._attn_implementation](config=config, layer_idx=layer_idx)
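
(For readers following along, the analogous GPT-J wiring would look roughly like the sketch below. This is a sketch of code inside modeling_gptj.py, not a standalone script or the exact merged diff; GPTJAttention and GPTJMLP are defined earlier in that file, and the flash attention class name is assumed.)

```python
# Dispatch table in the spirit of the Llama line above; the "flash_attention_2" class
# name is an assumption about this PR's naming.
GPTJ_ATTENTION_CLASSES = {
    "eager": GPTJAttention,
    "flash_attention_2": GPTJFlashAttention2,
}


class GPTJBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        inner_dim = config.n_inner if config.n_inner is not None else 4 * config.n_embd
        self.ln_1 = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
        # Select the attention implementation from the config, mirroring Llama.
        self.attn = GPTJ_ATTENTION_CLASSES[config._attn_implementation](config)
        self.mlp = GPTJMLP(inner_dim, config)
```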

@bytebarde (Contributor, Author) commented:

Hi @ArthurZucker and @younesbelkada ,

Thank you so much for your additional suggestions!

I am sorry; I had assumed that GPTJ_ATTENTION_CLASSES had already been introduced by @ArthurZucker.

I have now added GPTJ_ATTENTION_CLASSES and made the necessary code modifications.
Furthermore, I re-ran the test suite and successfully passed all the tests.

Please let me know if there's anything more I can do!
Thank you so much!

@ArthurZucker (Collaborator) commented:

Good for me merging! 🤗

@ArthurZucker (Collaborator) left a comment:

Last nits!

@require_torch_gpu
@pytest.mark.flash_attn_test
@slow
def test_flash_attn_2_generate_padding_right(self):
@ArthurZucker (Collaborator) commented on this snippet:

require_bitsandbytes is needed here!
Also, let's add the expected text explicitly, to make sure we always have what we want!
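
(Applied to the snippet above, the requested change would look roughly like this. A sketch only: the decorator import path is the usual transformers.testing_utils one, and the body is elided because the explicit expected strings come from running the real checkpoint.)

```python
import pytest
from transformers.testing_utils import require_bitsandbytes, require_torch_gpu, slow


@require_torch_gpu
@require_bitsandbytes  # per the review comment above; typically because the 6B checkpoint is loaded quantized
@pytest.mark.flash_attn_test
@slow
def test_flash_attn_2_generate_padding_right(self):
    # Sketch: load EleutherAI/gpt-j-6b with attn_implementation="flash_attention_2",
    # generate from right-padded inputs, and compare the decoded generations against
    # an explicit EXPECTED_OUTPUTS list instead of re-deriving them at test time.
    ...
```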

github-actions bot commented on Mar 7, 2024:

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ArthurZucker (Collaborator) commented:

Hey @bytebarde could you rebase and add the explicit expected outputs? Or should I do it? 🤗

@bytebarde (Contributor, Author) commented on Mar 8, 2024:

Hi @ArthurZucker , good morning!

I have added @require_bitsandbytes and the expected outputs to the test function.
Please let me know if there is anything needed to be addressed!

Thank you so much!

@younesbelkada (Contributor) left a comment:

Hi @bytebarde
Thanks! Can you run the styling checks (make fixup and/or make fix-copies)? After that we can merge.

@bytebarde (Contributor, Author) commented:

Hi @younesbelkada,

Thank you for taking the time to review this!

I have run make fix-copies and believe that the previous consistency issues have been addressed.

Please let me know if any further changes are needed. Thank you!

@younesbelkada (Contributor) left a comment:

Thanks again!

@younesbelkada younesbelkada merged commit be3fd8a into huggingface:main Mar 13, 2024
19 checks passed
itazap pushed a commit that referenced this pull request May 14, 2024
* initial implementation of flash attention for gptj

* modify flash attention and overwrite test_flash_attn_2_generate_padding_right

* update flash attention support list

* remove the copy line in the `CodeGenBlock`

* address copy mechanism

* Update src/transformers/models/gptj/modeling_gptj.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Add GPTJ attention classes

* add expected outputs in the gptj test

* Ensure repo consistency with 'make fix-copies'

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>