
Conversation

@ArthurZucker
Collaborator

ArthurZucker commented Jul 17, 2025

What does this PR do?

The snippet below loads Llama-3.2-3B-Instruct with flash_attention_2, then switches the attention implementation to the kernels-community/flash-attn3 kernel from the Hub and compares generation time.

# Install first: pip install transformers[torch] kernels
from transformers import AutoModelForCausalLM, AutoTokenizer
import time

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    device_map="auto",
    torch_dtype="auto",
    attn_implementation="flash_attention_2",
).eval()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer(
    ["Hello, how are you?", "is this life?"],
    padding=True,
    padding_side="left",
    return_tensors="pt",
).to(model.device)


start = time.time()
outputs = model.generate(**inputs, max_new_tokens=50)
print(f"Generation time: {time.time() - start:.2f} seconds")
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

model.set_attn_implementation("kernels-community/flash-attn3")
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=50)
print(f"Generation time: {time.time() - start:.2f} seconds")
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@kadirnar
Contributor

transformers install (from the PR branch):

uv pip install git+https://github.com/huggingface/transformers.git@kernels-flash-attn

env:


- `transformers` version: 4.54.0.dev0
- Platform: Linux-6.11.0-29-generic-x86_64-with-glibc2.40
- Python version: 3.10.16
- Huggingface_hub version: 0.33.4
- Safetensors version: 0.5.3
- Accelerate version: 1.9.0
- Accelerate config:    - compute_environment: LOCAL_MACHINE

Error Message:

[rank0]: ValueError: Specified `attn_implementation="https://huggingface.co/kernels-community/flash-attn3:flash_attention"` is not supported. The only possible arguments are `attn_implementation="eager"` (manual attention implementation), `"attn_implementation=flash_attention_3"` (implementation using flash attention 3), `"attn_implementation=flash_attention_2"` (implementation using flash attention 2), `"attn_implementation=sdpa"` (implementation using torch.nn.functional.scaled_dot_product_attention), `"attn_implementation=flex_attention"` (implementation using torch's flex_attention).

@ArthurZucker
Collaborator Author

You can't pass the full URL! You need to pass kernels-community/flash-attn3:flash_attention
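
For reference, a minimal sketch of the difference (based on the snippet in the PR description; the repo id and kernel name are the ones quoted in this thread):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    device_map="auto",
    torch_dtype="auto",
).eval()

# Rejected: a full URL is not a valid attn_implementation value
# model.set_attn_implementation("https://huggingface.co/kernels-community/flash-attn3:flash_attention")

# Accepted: "<org>/<repo>" with an optional ":<kernel_function>" suffix
model.set_attn_implementation("kernels-community/flash-attn3:flash_attention")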

@kadirnar
Contributor

You can't pass the full URL! You need to pass kernels-community/flash-attn3:flash_attention

I think passing the full URL is awkward too. However, since you shared it like that on Twitter, I gave it a try.
https://x.com/art_zucker/status/1945821883858915695

New Error Message:

[rank1]:     cache_position = torch.arange(
[rank1]: RuntimeError: CUDA error: device-side assert triggered
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1553: indexSelectLargeIndex: block: [...], thread: [...]: Assertion `srcIndex < srcSelectDimSize` failed.
(the assertion repeats for many block/thread indices; the original output was interleaved and garbled)
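
The traceback above suggests setting CUDA_LAUNCH_BLOCKING=1 so the failing kernel is reported synchronously; a minimal sketch of doing that from Python (the variable must be set before the first CUDA call, so it goes at the very top of the script):

import os

# Make CUDA kernel launches synchronous so the stack trace points at the real failing op.
# This must happen before torch initializes CUDA, hence before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# ...load the model and re-run the failing generate() call to get an accurate traceback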

Should I wait for you to finish your development?

@ArthurZucker
Collaborator Author

Ah, that's weird! Can you share a small reproducer?

@ArthurZucker
Collaborator Author

run-slow: llama,mistral,gemma

@github-actions
Contributor

This comment contains run-slow, running the specified jobs:

models: ['models/gemma', 'models/llama', 'models/mistral']
quantizations: [] ...

@kadirnar
Contributor

@ArthurZucker I tried it with a different LLM and it worked. It seems the dataset for the Qwen model is faulty. I will fix it and report back on performance.
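
For reference, the indexSelectLargeIndex assertion above typically fires when an index (for example a token id) is out of range for an embedding lookup. A minimal, hypothetical sanity check along those lines (the helper name is made up for illustration):

import torch

def check_token_ids(input_ids: torch.Tensor, model) -> None:
    # Verify every token id fits inside the model's input embedding table;
    # out-of-range ids are a common cause of `srcIndex < srcSelectDimSize` asserts.
    vocab_size = model.get_input_embeddings().num_embeddings
    bad = (input_ids < 0) | (input_ids >= vocab_size)
    if bad.any():
        raise ValueError(f"{int(bad.sum())} token id(s) fall outside [0, {vocab_size})")

# check_token_ids(inputs["input_ids"], model)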

@ArthurZucker
Collaborator Author

Thanks @kadirnar !

ArthurZucker and others added 5 commits July 22, 2025 14:49
* update docs

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* applu suggestions

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
@kadirnar
Contributor

kadirnar commented Aug 2, 2025

@ArthurZucker This method only supports LLMs, right? What should we do to add kernel support for speech models?

Example: https://huggingface.co/docs/transformers/main/en/model_doc/dia

@ArthurZucker
Collaborator Author

This should be supported by all models as long as they have the ALL_ATTENTION_FUNCTIONS refactor. Plus, you can set the attention for sub-modules!
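
A hypothetical sketch of what that could look like for a speech or multimodal model (the checkpoint id and sub-config keys are placeholders, and the per-submodule mapping is an assumption to check against the current transformers docs, not a confirmed API):

from transformers import AutoModel

# Placeholder checkpoint id; substitute the speech model you are using (e.g. Dia).
model = AutoModel.from_pretrained(
    "org/speech-model",
    device_map="auto",
    # Assumed form: a mapping from sub-config name to attention implementation.
    attn_implementation={
        "encoder": "kernels-community/flash-attn3:flash_attention",
        "decoder": "sdpa",
    },
)

# Alternatively (also assumed), sub-modules that are themselves PreTrainedModel
# instances could have their attention set directly:
# model.encoder.set_attn_implementation("kernels-community/flash-attn3:flash_attention")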

@vasqu mentioned this pull request Aug 25, 2025
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
* use partial to wrap around `transformers` utils!

* try to refactor?

* revert one wrong change

* just a nit

* push

* reverter watever was wrong!

* some nits

* fixes when there is no attention mask

* bring the licence back

* some fixes

* nit

* style

* remove prints

* correct dtype

* fa flags for testing

* update

* use paged attention if requested!

* updates

* a clone was needed, not sure why

* automatically create cu seq lens when input is flash, this at least makes sure layers don't re-compute

* simplify and improve?

* flash attention is kinda broken on recent cuda version so allow the opportunity to use something else

* fix!

* protect kernels import

* update

* properly parse generation config being passed

* revert and update

* add two tests

* some fixes

* fix test FA2

* takes comment into account

* fixup

* revert changes

* revert the clone, it is only needed because the metal kernel is not doing it?

* [docs] update attention implementation and cache docs (huggingface#39547)

* update docs

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* applu suggestions

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* fix mps on our side for now

* Update src/transformers/integrations/flash_paged.py

* no qa

---------

Co-authored-by: Vasqu <antonprogamer@gmail.com>
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>