
[GPTNeoX] Flex Attention + Refactor #34896

Open
wants to merge 8 commits into main

Conversation

vasqu (Contributor) commented Nov 23, 2024

What does this PR do?

Adds flex attention and the refactor according to #34809

However, I discovered several issues in the current version of gemma2 (#34282):

  • It seems that flex attention needs a transpose of the output afterwards, just like SDPA (see the sketch after this list)
  • Loading flex attention via from_pretrained didn't work, so the current tests fall back to another attention implementation (eager or SDPA, I don't remember which)
  • Testing would benefit from common tests like the ones for SDPA :D For now it's a bit of a hassle to add an integration test for every model when a more general test covering all subsequent models would do
  • I'm not familiar with BetterTransformer or the limitations of flex attention --> added some TODOs in case we need to check
  • Flex attention doesn't support dropout (or maybe I've overlooked something)
  • Setting model.config._attn_implementation = ... should be tracked somewhere and sanity-checked the same way it is on first load - for now it silently overwrites the value and could cause some ugly errors (tested by switching to flash attention 2 without FA2 installed)
  • Documentation should be added somewhere (probably the perf docs or something similar)
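
For reference, a minimal sketch of the transpose point from the first bullet, assuming torch >= 2.5 and the usual (batch, num_heads, seq_len, head_dim) layout; shapes and names are illustrative, not the PR's actual code:

import torch
from torch.nn.attention.flex_attention import flex_attention

batch, num_heads, seq_len, head_dim = 1, 8, 16, 64
query = torch.randn(batch, num_heads, seq_len, head_dim)
key = torch.randn(batch, num_heads, seq_len, head_dim)
value = torch.randn(batch, num_heads, seq_len, head_dim)

# Like SDPA, flex attention returns (batch, num_heads, seq_len, head_dim), so the
# output still needs a transpose back to (batch, seq_len, num_heads, head_dim)
# before it can be reshaped into hidden states.
attn_output = flex_attention(query, key, value)
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.reshape(batch, seq_len, num_heads * head_dim)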

So, tbh, I'm not sure whether to split this PR into several ones (e.g. a Gemma fix, general loading, general tests, docs, and then the subsequent models) or not.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker

vasqu (Contributor, Author) left a comment

A collection of comments which partially show the issues I listed above

Comment on lines +1821 to +1827
# TODO: add contribution notice?
raise ValueError(
    f"{cls.__name__} does not support an attention implementation through torch's flex_attention."
    " If you believe this error is a bug, please open an issue in Transformers GitHub repository"
    ' and load your model with the argument `attn_implementation="eager"` meanwhile.'
    ' Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="eager")`'
)

Likely linking to #34809

Comment on lines +1836 to +1839
# TODO check for more edge cases as done in the other implementations
# _is_bettertransformer = getattr(cls, "use_bettertransformer", False)
# if _is_bettertransformer:
#     return config

Just left it there since I'm not familiar with BetterTransformer and whether a check or something is needed here.

Comment on lines +367 to +369
# TODO check if some bugs cause push backs on the exact version
# NOTE: We require torch>=2.5.0 as it is the first release that ships flex attention
return version.parse(_torch_version) >= version.parse("2.5.0")

Also here unsure if we will encounter bugs ;)
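
For context, a hedged sketch of how such an availability gate is typically consumed when validating the requested implementation; the helper wrapper below is hypothetical, and it assumes the gate above is exposed as is_torch_flex_attn_available:

from transformers.utils import is_torch_flex_attn_available

def check_flex_attn_requested(attn_implementation: str) -> None:
    # Illustrative only: fail early on older torch instead of erroring inside the kernel.
    if attn_implementation == "flex_attention" and not is_torch_flex_attn_available():
        raise ImportError(
            "flex_attention requires torch>=2.5.0; upgrade torch or pick another "
            "attn_implementation such as 'sdpa' or 'eager'."
        )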

Comment on lines 462 to 484
@slow
def test_lm_generate_flex_attn_gptneox(self):
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m-deduped")
    for checkpointing in [True, False]:
        model = GPTNeoXForCausalLM.from_pretrained(
            "EleutherAI/pythia-410m-deduped", attn_implementation="flex_attention"
        )

        if checkpointing:
            model.gradient_checkpointing_enable()
        else:
            model.gradient_checkpointing_disable()
        model.to(torch_device)

        inputs = tokenizer("My favorite food is", return_tensors="pt").to(torch_device)
        # The hub repo. is updated on 2023-04-04, resulting in poor outputs.
        # See: https://github.com/huggingface/transformers/pull/24193
        expected_output = "My favorite food is a good old-fashioned, old-fashioned, old-fashioned.\n\nI'm not sure"

        output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=20)
        output_str = tokenizer.batch_decode(output_ids)[0]

        self.assertEqual(output_str, expected_output)

Would love to have common tests for this in the future (instead of per-model integration tests like this one).

Comment on lines -260 to -262
if key_length > self.bias.shape[-1]:
    self._init_bias(key_length, device=key.device)
causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length]

Removed since we handle this outside the forward for the attention masks; kept the buffers for BC so weights loading won't complain.

# TODO: Should be removed once Flash Attention for RoCm is bumped to 2.1.
# flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignement, that was made default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
# Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()

I think this is missing in gemma2 as well: it's using the config, but I'm unsure whether that's sufficient.
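
For context, a hedged sketch (not the exact library code) of how this flag is usually consumed in the FA2 path; the helper name and arguments are illustrative:

def resolve_causal_flag(is_causal: bool, query_length: int, uses_top_left_mask: bool) -> bool:
    # flash_attn<2.1 builds top-left aligned causal masks; for query_length == 1 no
    # mask is needed anyway, so causality is dropped to avoid the wrong alignment.
    if not uses_top_left_mask:
        return is_causal
    return is_causal and query_length != 1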

Comment on lines +378 to +386
if (
    self.training
    and self.config.attention_dropout > 0
    and self.config._attn_implementation == "flex_attention"
):
    logger.warning_once(
        f"Setting `attention_type` to `eager` because `dropout` is not supported in {attention_type}"
    )
    attention_type = "eager"

No dropout in flex attn

Comment on lines +266 to +267
# Reshape outputs
attn_output = attn_output.transpose(1, 2).contiguous()

Transpose which is (possibly) missing in gemma2

Comment on lines +175 to +187
input_dtype = query.dtype
if input_dtype == torch.float32:
    logger.warning_once(
        f"The input hidden states seems to be silently casted in float32, this might be related to"
        f" the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in"
        f" {target_dtype}."
    )

    query = query.to(target_dtype)
    key = key.to(target_dtype)
    value = value.to(target_dtype)

attention_dropout = attention_dropout if training else 0.0

This PEFT check also seems to be missing in gemma2.
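
Regarding target_dtype in the snippet above: it is defined just before the quoted lines. A hedged sketch of the usual derivation follows; the helper, module, and attribute names are assumptions for illustration only:

import torch

def resolve_target_dtype(attention_module, config) -> torch.dtype:
    # Illustrative only: pick the dtype to cast back to when inputs were silently
    # upcast to float32 (e.g. by upcasted LayerNorms or PEFT adapters), since the
    # FA2 kernels only run in fp16/bf16.
    if torch.is_autocast_enabled():
        return torch.get_autocast_gpu_dtype()
    if hasattr(config, "_pre_quantization_dtype"):
        return config._pre_quantization_dtype
    # Fall back to the dtype of the attention projection weights (GPTNeoX uses a
    # fused query_key_value projection).
    return attention_module.query_key_value.weight.dtype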

vasqu commented Nov 23, 2024

The CI failures seem unrelated; flaky tests (e.g. XLM, Qwen2VL).

vasqu commented Nov 23, 2024

Possible TODO -> fall back to eager when using a head mask in FA2/SDPA, and add head mask support in flex attention (should be possible via a score mod; see the sketch below).

Edit: Added now
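
A minimal sketch of the score-mod mechanism referred to above, using the canonical causal example; a head mask can be expressed the same way because the head index is passed into the score mod. Shapes and names are illustrative, not the code added in this PR:

import torch
from torch.nn.attention.flex_attention import flex_attention

batch, num_heads, seq_len, head_dim = 1, 4, 16, 32
query = torch.randn(batch, num_heads, seq_len, head_dim)
key = torch.randn(batch, num_heads, seq_len, head_dim)
value = torch.randn(batch, num_heads, seq_len, head_dim)

def causal_score_mod(score, batch_idx, head_idx, q_idx, kv_idx):
    # `score` is the raw query-key product for one (batch, head, query, key) entry;
    # returning -inf removes that entry from the softmax. A head mask would instead
    # look up a per-head bias via `head_idx`.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

attn_output = flex_attention(query, key, value, score_mod=causal_score_mod)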

dame-cell commented Nov 23, 2024

Yes, I was testing it for Gemma as well; it needed a transpose at the end too.

If you don't mind, could you check the pull request I did for Gemma? I keep failing some tests.

Also, Gemma 2 now supports new stuff in the configuration which confused me a lot: the attn logit soft capping.

Also, model.config._attn_implementation is not really handled correctly; for example, it does not actually use the chosen attention implementation.

Still working on the Gemma flex attention PR; I might help with the docs as well.

vasqu commented Nov 23, 2024

@dame-cell I'll take a look tomorrow! I'm busted for today :)

But as a quick fix to get loading handled correctly, look into my changes to the utils folder and modeling_utils. With those changes, loading should work as expected. Tbh, that's one of the main reasons why I think it might be better to split this into several PRs and get loading etc. right first before we start adding models.

Edit: One last thing to change would be to add _supports_flex_attn = True, like it's done for SDPA and FA2 (see the sketch below).
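
For illustration, the kind of opt-in flag meant in the edit above, mirroring the existing SDPA/FA2 flags on the pretrained model class; this is a sketch, not the PR's exact diff:

from transformers import PreTrainedModel

class GPTNeoXPreTrainedModel(PreTrainedModel):
    ...
    _supports_flash_attn_2 = True
    _supports_sdpa = True
    # Opt the architecture into the flex attention path, mirroring the flags above.
    _supports_flex_attn = True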

dame-cell replied:
Hmmm, ohh, I get it, I see. Thanks for letting me know 😀
