[Whisper, Bart, MBart] Add Flash Attention 2 #27203
Conversation
The documentation is not available anymore as the PR was closed or merged.
Amazing piece of work! 🔥
Main comment is about the tests - I think some might be indexing on `outputs` when it should be using `outputs_fa`.
tests/test_modeling_common.py
```python
output = model(dummy_input, attention_mask=dummy_attention_mask, output_hidden_states=True)
logits = output.hidden_states[-1]
self.assertTrue(torch.allclose(logits_fa, logits, atol=4e-2, rtol=4e-2))
```
This... isn't that close. I can see it's the tolerance used elsewhere but seems like quite a big difference
Yeah, Flash Attention does lead to quite different results though. I think 0.04 is good enough tbh
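To make the tolerance discussion concrete, here is a minimal sketch of this kind of equivalence check (the helper name and the `model` / `model_fa` / `dummy_input` arguments are illustrative assumptions, not the actual test code):

```python
import torch


def check_fa2_equivalence(model, model_fa, dummy_input, dummy_attention_mask):
    # `model` uses the default (eager) attention implementation,
    # `model_fa` is the same checkpoint loaded with Flash Attention 2.
    output = model(dummy_input, attention_mask=dummy_attention_mask, output_hidden_states=True)
    output_fa = model_fa(dummy_input, attention_mask=dummy_attention_mask, output_hidden_states=True)

    # Index the eager and the FA2 outputs separately -- mixing up `output`
    # and `output_fa` is exactly the bug flagged above.
    logits = output.hidden_states[-1]
    logits_fa = output_fa.hidden_states[-1]

    # Flash Attention 2 runs in half precision, so only approximate equality
    # is expected; 4e-2 is the tolerance under discussion.
    assert torch.allclose(logits_fa, logits, atol=4e-2, rtol=4e-2)
```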
```python
input_dtype = query_states.dtype
if input_dtype == torch.float32:
    # Handle the case where the model is quantized
    if hasattr(self.config, "_pre_quantization_dtype"):
```
We need to have access to the config here
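For context, here is a sketch of what this dtype handling typically amounts to in Flash Attention 2 integrations (the helper and the `q_proj` fallback are illustrative assumptions, not the PR's exact code):

```python
import torch


def maybe_downcast_for_flash_attn(module, query_states, key_states, value_states):
    # Flash Attention 2 only supports fp16/bf16, so float32 states must be
    # cast down before calling the kernel.
    input_dtype = query_states.dtype
    if input_dtype == torch.float32:
        # Handle the case where the model is quantized: the original dtype is
        # stored on the config, which is why the attention layer needs access
        # to the config here.
        if hasattr(module.config, "_pre_quantization_dtype"):
            target_dtype = module.config._pre_quantization_dtype
        else:
            # Otherwise fall back to the dtype of a projection weight
            # (`q_proj` is assumed here for illustration).
            target_dtype = module.q_proj.weight.dtype
        query_states = query_states.to(target_dtype)
        key_states = key_states.to(target_dtype)
        value_states = value_states.to(target_dtype)
    return query_states, key_states, value_states
```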
Great. I appreciate the "# Copied from" statements which make the code simpler to review.
Very clean, it's ok for me to merge
```diff
 class BartEncoderLayer(nn.Module):
     def __init__(self, config: BartConfig):
         super().__init__()
         self.embed_dim = config.d_model
-        self.self_attn = BartAttention(
+        attn_type = "flash_attention_2" if getattr(config, "_flash_attn_2_enabled", False) else "default"
```
we should eventually move this to an enum to be cleaner (out of scope for this PR)
(brainstorming, still out of scope) It would be cleaner to eventually have the config return the appropriate attention name for all models:

```python
self.self_attn = BART_ATTENTION_CLASSES[config.attention_type](
    ...
)
```

with

```python
class PreTrainedConfig():
    ...

    @property
    def attention_type(self):
        return AttentionTypes.FA2 if getattr(self, "_flash_attn_2_enabled", False) else AttentionTypes.DEFAULT
```
Yeah, `attention_type` as a property is a good idea I think! We should then probably also allow users to change it even after the model has been loaded.
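A self-contained sketch of where that brainstorm could lead (all names below are hypothetical and not part of this PR):

```python
from enum import Enum


class AttentionTypes(str, Enum):
    DEFAULT = "default"
    FA2 = "flash_attention_2"


class PreTrainedConfig:
    _flash_attn_2_enabled = False

    @property
    def attention_type(self) -> AttentionTypes:
        # Expose the private flag as a typed, user-facing property.
        return AttentionTypes.FA2 if self._flash_attn_2_enabled else AttentionTypes.DEFAULT

    @attention_type.setter
    def attention_type(self, value: AttentionTypes):
        # A setter would let users switch implementations after loading, as suggested above.
        self._flash_attn_2_enabled = value == AttentionTypes.FA2


# Each model would then keep a mapping such as
#     BART_ATTENTION_CLASSES = {AttentionTypes.DEFAULT: BartAttention,
#                               AttentionTypes.FA2: BartFlashAttention2}
# and the encoder layer would simply do:
#     self.self_attn = BART_ATTENTION_CLASSES[config.attention_type](...)
```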
tests/test_modeling_common.py
```python
# make sure that all models have at least 40 position ids
if hasattr(config, "max_position_embeddings"):
    config.max_position_embeddings = 40
```
Why that minimum?
Ok ran some more tests and it should be good now. I'm getting some flaky behavior with the flash attention tests on my RTX 4090 (especially extreme for Whisper). We should maybe think about how we can make them more robust now that we've added some more models (cc @younesbelkada)
* add whisper fa2
* correct
* change all
* correct
* correct
* fix more
* fix more
* fix more
* fix more
* fix more
* fix more
* Apply suggestions from code review
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* fix more
* fix more
* fix more
* fix more
* fix more

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
What does this PR do?
This PR adds Flash Attention 2 for Whisper, Bart & MBart.
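As a usage sketch, loading one of these models with Flash Attention 2 enabled looked roughly like this at the time of the PR (the flag has since been superseded by `attn_implementation="flash_attention_2"` in newer versions; the checkpoint name is just an example):

```python
import torch
from transformers import WhisperForConditionalGeneration

# Flash Attention 2 requires half-precision weights and a supported GPU.
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-tiny",
    torch_dtype=torch.float16,
    use_flash_attention_2=True,
).to("cuda")
```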
Whisper depends quite a bit on Bart and MBart for Flash Attention, as do 20+ other model architectures.
As this is the first PR that adds Flash Attention 2 to an encoder-decoder model, I wanted to make sure it's done for the two template models (Bart and MBart) as well, so that Whisper (and all other encoder-decoder models that follow) don't lose their "# Copied from" statements.
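For readers unfamiliar with the mechanism, a "# Copied from" statement looks roughly like this (illustrative example; the exact source and target classes in this PR may differ):

```python
import torch.nn as nn


# The repo's consistency check copies the referenced implementation and keeps it in
# sync, so a change to BartAttention automatically propagates to the copy below.
# Copied from transformers.models.bart.modeling_bart.BartAttention with Bart->Whisper
class WhisperAttention(nn.Module):
    ...
```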
Note that while this PR changes 27 files, only 4 files are really relevant to review, because all the other files are just consequences of the "# Copied from" mechanism:
The following three files fully implement Flash Attention 2:
The test file is restructured so that the Flash Attention 2 tests can run nicely for different kinds of models (audio & NLP, as well as decoder-only and encoder-decoder).
I ran the Flash Attention 2 tests as well as the full model test suites to make sure everything works as expected.
All tests that also pass on "main" pass here. The only failures are related to disk offloading, which should be fixed in #27204.
There are some "error not raised" failures for flash attn and mistral, but they are also present on "main" and seem to be related to #27125 (cc @younesbelkada); I'd suggest fixing those in another PR as well.
Other CI test failures are unrelated.