Add SDPA support for T5 Style Models #30375
Conversation
I ran tests for T5 locally (
Is the fix for 3 & 4 making separate encoder and decoder classes? I would like to get some suggestions on this before I go ahead and fix other T5-like models.
Thanks @abdulfatir. Did you get to compare the speed-up provided by SDPA? If not, no worries. If it helps, you can maybe adapt this script:

```python
from transformers import AutoTokenizer, CLIPTextModel
import argparse
import torch
import time


def bytes_to_giga_bytes(bytes):
    return bytes / 1024 / 1024 / 1024


def load_model(args):
    model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32", attn_implementation=args.attn_implementation).to("cuda")
    tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    return model, tokenizer


def get_inputs(args, tokenizer):
    inputs = tokenizer(["a photo of a cat"] * args.batch_size, padding=True, return_tensors="pt")
    inputs = {k: v.to("cuda") for k, v in inputs.items()}
    return inputs


@torch.no_grad()
def main(args):
    model, tokenizer = load_model(args)
    inputs = get_inputs(args, tokenizer)

    # warmup
    for _ in range(5):
        _ = model(**inputs)

    start = time.time()
    for _ in range(args.num_iters):
        _ = model(**inputs)
    end = time.time()
    avg_inference_time = (end - start) / args.num_iters

    print(f"{args.attn_implementation=}, {args.batch_size=}, {args.num_iters=}")
    print(f"avg_inference_time: {avg_inference_time:.3f} seconds")
    print(
        f"Max memory allocated: {bytes_to_giga_bytes(torch.cuda.max_memory_allocated())} GB"
    )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--attn_implementation", type=str, choices=["sdpa", "eager"], default="eager")
    parser.add_argument("--batch_size", type=int, default=2)
    parser.add_argument("--num_iters", type=int, default=10)
    args = parser.parse_args()
    main(args)
```
@sayakpaul Thanks for sharing! I was testing it but found the speedup and memory savings to be quite modest. Here's the script I used:

```python
import argparse
import time

import torch
from transformers import (
    GenerationConfig,
    T5ForConditionalGeneration,
    T5TokenizerFast,
)


def bytes_to_giga_bytes(bytes):
    return bytes / 1024 / 1024 / 1024


def load_model(args):
    model = T5ForConditionalGeneration.from_pretrained(
        "google/flan-t5-large",
        attn_implementation=args.attn_implementation,
        torch_dtype=getattr(torch, args.torch_dtype),
    ).to("cuda")
    tokenizer = T5TokenizerFast.from_pretrained("google/flan-t5-large")
    return model, tokenizer


def get_inputs(args, tokenizer):
    inputs = tokenizer(
        ["Pedro Pedro Pedro 🦝🙌" * 10] * args.batch_size,
        padding=True,
        return_tensors="pt",
    )
    inputs = {k: v.to("cuda") for k, v in inputs.items()}
    return inputs


@torch.no_grad()
def main(args):
    model, tokenizer = load_model(args)
    inputs = get_inputs(args, tokenizer)
    generation_config = GenerationConfig(max_new_tokens=20)

    # warmup
    for _ in range(5):
        _ = model.generate(**inputs, generation_config=generation_config)

    start = time.time()
    for _ in range(args.num_iters):
        _ = model.generate(**inputs, generation_config=generation_config)
    end = time.time()
    avg_inference_time = (end - start) / args.num_iters

    print(
        f"{args.attn_implementation=}, {args.torch_dtype=}, {args.batch_size=}, {args.num_iters=}"
    )
    print(f"avg_inference_time: {avg_inference_time:.3f} seconds")
    print(
        f"Max memory allocated: {bytes_to_giga_bytes(torch.cuda.max_memory_allocated())} GB"
    )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--attn_implementation", type=str, choices=["sdpa", "eager"], default="eager"
    )
    parser.add_argument(
        "--torch_dtype",
        type=str,
        choices=["float32", "bfloat16", "float16"],
        default="float32",
    )
    parser.add_argument("--batch_size", type=int, default=2)
    parser.add_argument("--num_iters", type=int, default=10)
    args = parser.parse_args()
    main(args)
```

Upon investigation, I came across something interesting that affects T5 (not sure about other models) and silently makes SDPA use the less efficient kernel. Torch SDPA expects that the query/key/value tensors have stride 1 in the last dimension.
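A minimal sketch of the stride quirk (illustrative shapes, not code from this PR): the contiguity check ignores size-1 dimensions, so a tensor can report `is_contiguous() == True` while `stride(-1) != 1`, which is exactly what the fused SDPA kernels check for; `.clone(memory_format=torch.contiguous_format)` restores canonical strides.

```python
import torch

# Shape chosen only so that the last dimension ends up with size 1 after a transpose.
x = torch.randn(2, 4, 1, 8)              # (batch, heads, seq=1, head_dim)
y = x.transpose(-1, -2)                  # shape (2, 4, 8, 1), stride(-1) == 8

print(y.is_contiguous())                 # True: size-1 dims are ignored by the check
print(y.contiguous().stride(-1))         # still 8, .contiguous() is a no-op here
print(y.clone(memory_format=torch.contiguous_format).stride(-1))  # 1
```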
@ArthurZucker @fxmarty any thoughts on this? How should I proceed?
Thank you @abdullaholuk! For the failing tests you mention, could you try running them on the main branch and check their status there? It may be they are failing there as well (if they are slow tests).
Could you try as well: `RUN_SLOW=True pytest tests/models/t5 -k "test_eager_matches_sdpa_inference" -s -vvvvv`
@fxmarty Thanks! These tests passed (will check on a
I would like to get your thoughts on the tracing test failures. Have you seen this before and do you know why these could be failing? If not, I can take a deeper look to figure out what's wrong.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I addressed the comments. Once we fix this and everything else looks okay, I will fix the copies of this model.
@abdulfatir is there still the issue with …?
```python
# sdpa kernels require tensors to have stride=1 in the last dimension
# .contiguous() does not behave correctly for tensors with singleton dimensions
# .clone(memory_format=torch.contiguous_format) is a workaround
```
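One hedged way to surface the silent fallback (assuming a CUDA device and a PyTorch version that still exposes `torch.backends.cuda.sdp_kernel`; newer releases provide `torch.nn.attention.sdpa_kernel` instead): disable the math backend so SDPA errors out instead of quietly taking the slow path.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes/dtypes only: fp16 CUDA tensors so the fused backends are eligible.
q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)

# With the math kernel disabled, SDPA raises a RuntimeError if the flash and
# memory-efficient kernels reject the inputs (e.g. because of the stride
# requirement above) instead of silently falling back to the slower path.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_mem_efficient=True, enable_math=False):
    out = F.scaled_dot_product_attention(q, k, v)
```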
You could link to pytorch/pytorch#127523 here
Added.
For …
@ArthurZucker I changed it to use
But I believe this should also be an issue for other sdpa calls, right? Why is only this test breaking?
Tried a higher …
@abdulfatir you can remove this test. Transformers ONNX export has been deprecated for >1 year now. Or alternatively, use … The test …
Thanks @fxmarty! I skipped the test. Can you please also review the PR (and run the remaining workflows)?
LGTM as long as the slow tests pass & some real models have been validated
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@fxmarty for slow tests do we need a label? |
@abdulfatir I don't think we can run them from GitHub Actions. You just need to run …
Thanks for all the work adding this!
cc @ylacombe regarding pop2piano
All of the models which have this added should have information added to their modeling page on how to use it and the expected speedups, e.g. like here for Mistral.
Are SDPA tests run for all of these models now?
The slow integration tests will need to be run for all the models with SDPA added. They can be triggered by pushing a commit with the message: `[run_slow] pop2piano,mt5,t5`. Another HF member or I will need to approve the workflow for it to run.
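For the model-page documentation, the usage snippet would presumably follow the same pattern as the other SDPA-enabled models; the checkpoint and dtype below are only illustrative:

```python
import torch
from transformers import T5ForConditionalGeneration

# Explicitly select the attention backend at load time.
model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-large",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
).to("cuda")
```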
```diff
@@ -610,7 +609,7 @@ def test_model_from_pretrained(self):
         model = Pop2PianoForConditionalGeneration.from_pretrained(model_name)
         self.assertIsNotNone(model)

     @require_onnx
     @unittest.skip("ONNX support deprecated")
```
This doesn't seem to be related to this PR?
We shouldn't be skipping tests because they're deprecated - deprecation happens with main code, but if a test is deprecated then we should just remove it
```diff
@@ -848,7 +848,7 @@ def test_export_to_onnx(self):
                 (config_and_inputs[1], config_and_inputs[3], config_and_inputs[2]),
                 f"{tmpdirname}/t5_test.onnx",
                 export_params=True,
-                opset_version=9,
+                opset_version=14,
```
What's the reason for this?
```diff
@@ -851,7 +851,7 @@ def test_export_to_onnx(self):
                 (config_and_inputs[1], config_and_inputs[3], config_and_inputs[2]),
                 f"{tmpdirname}/t5_test.onnx",
                 export_params=True,
-                opset_version=9,
+                opset_version=14,
```
And here
```python
query_states = self._shape(
    self.q(hidden_states), batch_size
)  # (batch_size, n_heads, seq_length, dim_per_head)
```
Library convention is for comments to go on the line above to avoid line splitting
```diff
-query_states = self._shape(
-    self.q(hidden_states), batch_size
-)  # (batch_size, n_heads, seq_length, dim_per_head)
+# (batch_size, n_heads, seq_length, dim_per_head)
+query_states = self._shape(self.q(hidden_states), batch_size)
```
```python
attn_output = self._unshape(
    torch.matmul(attn_weights, value_states), batch_size
)  # (batch_size, seq_length, dim)
```
```diff
-attn_output = self._unshape(
-    torch.matmul(attn_weights, value_states), batch_size
-)  # (batch_size, seq_length, dim)
+# (batch_size, seq_length, dim)
+attn_output = self._unshape(torch.matmul(attn_weights, value_states), batch_size)
```
```python
query_states = self._shape(
    self.q(hidden_states), batch_size
)  # (batch_size, n_heads, seq_length, dim_per_head)
```
```diff
-query_states = self._shape(
-    self.q(hidden_states), batch_size
-)  # (batch_size, n_heads, seq_length, dim_per_head)
+# (batch_size, n_heads, seq_length, dim_per_head)
+query_states = self._shape(self.q(hidden_states), batch_size)
```
```python
attn_output = self._unshape(
    torch.matmul(attn_weights, value_states), batch_size
)  # (batch_size, seq_length, dim)
```
```diff
-attn_output = self._unshape(
-    torch.matmul(attn_weights, value_states), batch_size
-)  # (batch_size, seq_length, dim)
+# (batch_size, seq_length, dim)
+attn_output = self._unshape(torch.matmul(attn_weights, value_states), batch_size)
```
```python
query_states = self._shape(
    self.q(hidden_states), batch_size
)  # (batch_size, n_heads, seq_length, dim_per_head)
```
```diff
-query_states = self._shape(
-    self.q(hidden_states), batch_size
-)  # (batch_size, n_heads, seq_length, dim_per_head)
+# (batch_size, n_heads, seq_length, dim_per_head)
+query_states = self._shape(self.q(hidden_states), batch_size)
```
```python
attn_output = self._unshape(
    torch.matmul(attn_weights, value_states), batch_size
)  # (batch_size, seq_length, dim)
```
```diff
-attn_output = self._unshape(
-    torch.matmul(attn_weights, value_states), batch_size
-)  # (batch_size, seq_length, dim)
+# (batch_size, seq_length, dim)
+attn_output = self._unshape(torch.matmul(attn_weights, value_states), batch_size)
```
```diff
@@ -1007,6 +1019,7 @@ def unshape(states):

+# Copied from transformers.models.t5.modeling_t5.T5LayerSelfAttention with T5->LongT5
 class LongT5LayerSelfAttention(nn.Module):
```
What's the reason for not adding support for LongT5?
Hey @amyeroberts, thanks for pinging, I'll take a closer look at pop2piano once your comments have been addressed if that's okay with you!
A small comment from my side @abdulfatir: have you been able to test speed-ups again now that you found the reasons for the apparent lack of speed-ups? Also, to properly measure inference speed-ups, you should modify the script you've used a bit, as in the version below.
I haven't tested the following script but it should work with little to no modification!

```python
import argparse
import time

import torch
from transformers import (
    GenerationConfig,
    T5ForConditionalGeneration,
    T5TokenizerFast,
)
from transformers import set_seed


def bytes_to_giga_bytes(bytes):
    return bytes / 1024 / 1024 / 1024


def load_model(args):
    model = T5ForConditionalGeneration.from_pretrained(
        "google/flan-t5-large",
        attn_implementation=args.attn_implementation,
        torch_dtype=getattr(torch, args.torch_dtype),
    ).to("cuda")
    tokenizer = T5TokenizerFast.from_pretrained("google/flan-t5-large")
    return model, tokenizer


def get_inputs(args, tokenizer):
    inputs = tokenizer(
        ["Pedro Pedro Pedro 🦝🙌" * 10] * args.batch_size,
        padding=True,
        return_tensors="pt",
    )
    inputs = {k: v.to("cuda") for k, v in inputs.items()}
    return inputs


def measure_latency_and_memory_use(model, device, inputs, generation_config, nb_loops):
    # define Events that measure start and end of the generate pass
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)

    # reset cuda memory stats and empty cache
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

    # get the start time
    start_event.record()

    # actually generate
    for _ in range(nb_loops):
        # set seed for reproducibility
        set_seed(0)
        generation = model.generate(**inputs, generation_config=generation_config)

    # get the end time
    end_event.record()
    torch.cuda.synchronize()

    # measure memory footprint and elapsed time
    max_memory = torch.cuda.max_memory_allocated(device)
    elapsed_time = start_event.elapsed_time(end_event) * 1.0e-3

    execution_time_in_s = elapsed_time / nb_loops
    max_memory_footprint_in_GB = bytes_to_giga_bytes(max_memory)

    return execution_time_in_s, max_memory_footprint_in_GB


@torch.no_grad()
def main(args):
    model, tokenizer = load_model(args)
    inputs = get_inputs(args, tokenizer)
    generation_config = GenerationConfig(min_new_tokens=args.num_tokens, max_new_tokens=args.num_tokens)

    # warmup
    for _ in range(5):
        _ = model.generate(**inputs, generation_config=generation_config)

    avg_inference_time, max_memory_footprint_in_GB = measure_latency_and_memory_use(model, model.device, inputs, generation_config, args.num_iters)

    print(
        f"{args.attn_implementation=}, {args.torch_dtype=}, {args.batch_size=}, {args.num_iters=}, {args.num_tokens=}"
    )
    print(f"avg_inference_time: {avg_inference_time:.3f} seconds")
    print(
        f"Max memory allocated: {max_memory_footprint_in_GB} GB"
    )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--attn_implementation", type=str, choices=["sdpa", "eager"], default="eager"
    )
    parser.add_argument(
        "--torch_dtype",
        type=str,
        choices=["float32", "bfloat16", "float16"],
        default="float32",
    )
    parser.add_argument("--batch_size", type=int, default=2)
    parser.add_argument("--num_iters", type=int, default=10)
    parser.add_argument("--num_tokens", type=int, default=20)
    args = parser.parse_args()
    main(args)
```
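For context on the measurement changes: CUDA kernels launch asynchronously, so timing `generate` with `time.time()` alone can measure launch overhead rather than actual device time. Recording CUDA events and calling `torch.cuda.synchronize()` before reading the elapsed time avoids that, and setting `min_new_tokens` equal to `max_new_tokens` keeps the amount of generated work identical across the eager and SDPA runs.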
Thanks @amyeroberts and @ylacombe! I will probably look into this later over the weekend.
@abdulfatir any update on when this PR will be merged?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi 👋🏽 any updates on this?
I think #34089 enabled it!
Hi @ArthurZucker, I've tested compiling the T5 model and I do get a very modest speedup:

```python
# %%
import time

import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# %% Example prompt
tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-base")
text_input = "The theory of relativity usually encompasses two interrelated physics theories by Albert Einstein: special relativity and general relativity, proposed and published in 1905 and 1915, respectively. Special relativity applies to all physical phenomena in the absence of gravity. General relativity explains the law of gravitation and its relation to the forces of nature. It applies to the cosmological and astrophysical realm, "
input_ids = tokenizer(text_input, return_tensors="pt").input_ids.to(device)

# %% Without compilation
model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-base").to(device)

# First pass to initialise any caches/JITs/whatever
_ = model.generate(input_ids, max_length=500)

times = []
for _ in tqdm(range(20)):
    start = time.time()
    _ = model.generate(input_ids, max_length=500)
    times.append(time.time() - start)
print(f"Times without compilation: {np.mean(times):.4f} ± {np.std(times):.4f}")

# %% With compilation
model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-base").to(device)
compiled_model = torch.compile(model)

# First pass to initialise any caches/JITs/whatever
_ = compiled_model.generate(input_ids, max_length=500)

times = []
for _ in tqdm(range(20)):
    start = time.time()
    _ = compiled_model.generate(input_ids, max_length=500)
    times.append(time.time() - start)
print(f"Times with compilation: {np.mean(times):.4f} ± {np.std(times):.4f}")
```

However, I've tried using:

```python
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google-t5/t5-base", device_map="cuda:1", attn_implementation="SDPA"
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google-t5/t5-base", device_map="cuda:1", attn_implementation="flash_attention_2"
)
```

but it complains that neither SDPA nor FlashAttention are implemented for the T5 model. Any ideas? 😄
Ah sorry, it's not implemented indeed!
Thanks, will keep an eye open!
Hi @ArthurZucker, wondering what the state of affairs is for a T5 with SDPA/flash attention and working compile?
Now that the PR is merged, if you want you can open one for T5 or I'll do it in a bit!
What does this PR do?
Adds torch's `scaled_dot_product_attention` support for the T5 model. Part of #28005.
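Roughly, the op being wired in; this is a simplified sketch, not the actual modeling_t5 code, and it assumes the relative position bias (plus any attention mask) is folded into SDPA's additive `attn_mask`, with `scale=1.0` because T5 does not scale attention scores (the `scale` argument needs torch>=2.1):

```python
import torch
import torch.nn.functional as F

batch, heads, q_len, kv_len, head_dim = 2, 8, 16, 16, 64
query = torch.randn(batch, heads, q_len, head_dim)
key = torch.randn(batch, heads, kv_len, head_dim)
value = torch.randn(batch, heads, kv_len, head_dim)

# T5 adds a learned relative-position bias (and the padding/causal mask) to the
# attention scores; as a float tensor it can be passed to SDPA as an additive mask.
position_bias = torch.randn(1, heads, q_len, kv_len)

attn_output = F.scaled_dot_product_attention(
    query,
    key,
    value,
    attn_mask=position_bias,
    dropout_p=0.0,
    scale=1.0,  # T5 omits the 1/sqrt(head_dim) scaling
)
```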
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ArthurZucker @fxmarty @sayakpaul