Fix flash attention bugs with Mistral and Falcon #27625
Conversation
@@ -838,7 +838,7 @@ def forward(
     attention_mask is not None
     and hasattr(self.config, "_flash_attn_2_enabled")
     and self.config._flash_attn_2_enabled
-    and past_key_values is not None
+    and use_cache
In the first autoregressive pass, past_key_values is None, so the previous condition skipped the check entirely. In the following passes, we may have masks such as

[[1, 1, 0, 0, 1],
 [1, 1, 1, 1, 1]]

and is_padding_right is then wrongfully evaluated to False. Gating on use_cache instead runs the check on the first pass, while the padding is still visible at the end of the mask.
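For context, here is a minimal self-contained sketch of the padding-side guard this condition gates; the function name, the fa2_enabled flag, and the error message are illustrative, not the actual Mistral code:

```python
import torch

def check_padding_side(attention_mask: torch.Tensor, use_cache: bool, fa2_enabled: bool) -> None:
    """Illustrative version of the guard now gated on use_cache."""
    if attention_mask is not None and fa2_enabled and use_cache:
        batch_size = attention_mask.shape[0]
        # Right padding means the last position of at least one sequence is masked out.
        is_padding_right = attention_mask[:, -1].sum().item() != batch_size
        if is_padding_right:
            raise ValueError(
                "Batched generation with padding_side='right' is not supported "
                "with Flash Attention 2; use left padding instead."
            )

# With the old gate (past_key_values is not None) the check never ran on the first
# pass; gated on use_cache it runs before generated tokens extend the mask.
```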
Clean! Thanks!
Thanks for fixing!
model.save_pretrained(tmpdirname)

dummy_attention_mask = inputs_dict.get("attention_mask", torch.ones_like(dummy_input))
# NOTE: Mistral apparently does not support right padding + use_cache with FA2.
It works, but you'll get terrible results because the cache will cut the non-padded values first.
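To make the padding distinction concrete, here is a small sketch with hypothetical shapes, reusing the dummy tensor names from the test snippet above:

```python
import torch

# Hypothetical example: batch of 2, sequence length 5, first sequence padded by 2 tokens.
dummy_input = torch.randint(0, 100, (2, 5))
dummy_attention_mask = torch.ones_like(dummy_input)

# Left padding: the masked positions sit at the start, so a rolling cache evicts
# padding first and the FA2 guard passes.
dummy_attention_mask[0, :2] = 0

# Right padding would instead be `dummy_attention_mask[0, -2:] = 0`; with use_cache
# the FA2 path now raises for that combination.
```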
Thanks for adding!
I just have a small Q about the tests
Hi @amyeroberts, thank you for the review! I added it by mistake.
@fxmarty Thanks for clarifying. Tbh, I'm still a bit confused by the tests - it's not clear to me how this explicitly tests for the cache.
@amyeroberts Good catch indeed... I just checked, we are going through transformers/src/transformers/generation/utils.py (line 1602 in f93c1e9), which sets and uses it.
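Put differently, a hedged sketch (reusing the dummy tensors from the test snippet above and assuming some causal LM already loaded with FA2) of why the cache path is exercised: generate propagates use_cache into every forward call, so the guard above runs at each decoding step.

```python
# Sketch only: `model` is any FA2-enabled causal LM; generate() forwards
# use_cache (True by default in GenerationConfig) into the model's forward,
# so the padding-side guard runs even though the test never touches
# past_key_values directly.
outputs = model.generate(
    dummy_input,
    attention_mask=dummy_attention_mask,
    max_new_tokens=5,
    use_cache=True,
)
```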
@fxmarty OK - thanks for explaining! As a follow-up, could you add
For sure @amyeroberts, I will ping you there. Sorry, I should have waited before merging.
@fxmarty No worries! It doesn't affect the functionality of this PR, so it's fine to be done separately :)
This PR fixes some important bugs in the Mistral and Falcon integration.
#26933 broke flash attention for Falcon due to the modification of the layout, and Falcon with FA2 is not really usable on main due to an error in the shape (currently `[batch_size, num_head, seqlen, head_dim]` instead of the required `[batch_size, seqlen, num_head, head_dim]`).

The following tests were not passing:
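For reference, a minimal sketch (illustrative tensor names and sizes, not the actual Falcon modeling code) of the layout fix described above:

```python
import torch

batch_size, num_heads, seqlen, head_dim = 2, 8, 16, 64

# Layout currently produced on main for Falcon (per the description above):
query_states = torch.randn(batch_size, num_heads, seqlen, head_dim)

# Flash Attention 2 kernels expect [batch_size, seqlen, num_heads, head_dim],
# so the head and sequence dimensions must be swapped before the FA2 call:
query_states = query_states.transpose(1, 2)
assert query_states.shape == (batch_size, seqlen, num_heads, head_dim)
```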