static cache implementation is not compatible with attn_implementation==flash_attention_2 #32040

Status: Open
Labels: bug, Cache, Feature request

faaany (Contributor) opened this issue Jul 18, 2024 · 3 comments
faaany commented Jul 18, 2024

System Info

  • transformers version: 4.43.0.dev0
  • Platform: Linux-4.18.0-425.3.1.el8.x86_64-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.23.5
  • Safetensors version: 0.4.3
  • Accelerate version: 0.33.0.dev0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A100 80GB PCIe

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

pytest -rA tests/test_cache_utils.py::CacheIntegrationTest -k "test_static_cache_greedy_decoding_pad_left and flash_attention"

fails with

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.LongTensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Cache] = None,
        output_attentions: bool = False,
        use_cache: bool = False,
        cache_position: Optional[torch.LongTensor] = None,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        if isinstance(past_key_value, StaticCache):
>           raise ValueError(
                "`static` cache implementation is not compatible with `attn_implementation==flash_attention_2` "
                "make sure to use `sdpa` in the mean time, and open an issue at https://github.com/huggingface/transformers"
            )
E           ValueError: `static` cache implementation is not compatible with `attn_implementation==flash_attention_2` make sure to use `sdpa` in the mean time, and open an issue at https://github.com/huggingface/transformers

src/transformers/models/llama/modeling_llama.py:388: ValueError

And the right padding test case also fails:

pytest -rA tests/test_cache_utils.py::CacheIntegrationTest -k "test_static_cache_greedy_decoding_pad_right and flash_attention"
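
For reference, a minimal standalone sketch of what these tests exercise (not the actual test code; the checkpoint, prompt, and generation arguments below are illustrative assumptions):

    # Greedy decoding with a static cache on a model loaded with flash_attention_2.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-hf"  # assumption: any Llama checkpoint hits the same check
    tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        attn_implementation="flash_attention_2",
        device_map="auto",
    )

    inputs = tokenizer(["The best color is"], return_tensors="pt").to(model.device)
    # Raises the ValueError shown above once past_key_value becomes a StaticCache
    out = model.generate(
        **inputs, do_sample=False, max_new_tokens=10, cache_implementation="static"
    )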

Expected behavior

Either we shouldn't test flash_attention in this case, or we should add an if check that skips setting cache_implementation to static when flash_attention_2 is used.
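
If the second option is preferred, here is a rough sketch of such a guard inside the test (the attribute `config._attn_implementation` and the skip message are assumptions, not the actual test code):

    import pytest

    # Skip the static-cache variant when the model was loaded with flash_attention_2;
    # otherwise keep the existing behaviour of switching to the static implementation.
    if model.config._attn_implementation == "flash_attention_2":
        pytest.skip("`static` cache is not compatible with `attn_implementation==flash_attention_2`")
    model.generation_config.cache_implementation = "static"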

faaany added the bug label Jul 18, 2024
faaany (Contributor, Author) commented Jul 18, 2024

I made a possible fix suggestion in draft PR #32039, but I am not sure whether it is correct, so I also filed this issue.

amyeroberts (Collaborator) commented

cc @gante too

ArthurZucker added the Feature request label Jul 18, 2024
zucchini-nlp (Member) commented

This incompatibility also affects Gemma2 with flash-attn, since Gemma2 does not support a dynamic cache.
