[core] support flash attention through kernels
#12387
base: main
Conversation
```python
# `flash-attn`
FLASH = "flash"
FLASH_VARLEN = "flash_varlen"
FLASH_HUB = "flash_hub"
```
Flash Attention is stable. So, we don't have to mark it private like FA3.
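For context, a minimal sketch of the naming convention this refers to; the enum class name and the `_FLASH_3_HUB` value are assumptions beyond what the diff above shows, while the other values come from it:

```python
from enum import Enum

class AttentionBackendName(str, Enum):
    FLASH = "flash"
    FLASH_VARLEN = "flash_varlen"
    FLASH_HUB = "flash_hub"          # Flash Attention via the kernels hub: stable, so public
    _FLASH_3_HUB = "_flash_3_hub"    # FA3 via the kernels hub: experimental, kept private (assumed value)
```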
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Very cool integration 🔥! I just left some nits.
```python
fa3_interface_hub = _get_fa3_from_hub()
flash_attn_3_func_hub = fa3_interface_hub.flash_attn_func
fa_interface_hub = _get_fa_from_hub()
flash_attn_func_hub = fa_interface_hub.flash_attn_func
```
Why are we fetching both kernels here?
Because of the way the APIs for attention backends are designed, and also to support torch.compile with fullgraph traceability (when possible).
We will let it grow a bit and, based on feedback, we can revisit how to better deal with this.
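As a rough sketch of that design (the `kernels.get_kernel` call and the repo ids are assumptions, not part of this diff): resolving both hub kernels once at import time leaves plain callables for the dispatcher to pick from, which avoids graph breaks from lazy module loading when tracing with `torch.compile(fullgraph=True)`.

```python
# Sketch only: resolve hub kernels eagerly so the traced attention op sees plain callables.
# The repo ids below are assumptions; the attribute access mirrors the diff above.
from kernels import get_kernel  # Hugging Face `kernels` package

fa_interface_hub = get_kernel("kernels-community/flash-attn")    # assumed repo id
fa3_interface_hub = get_kernel("kernels-community/flash-attn3")  # assumed repo id

flash_attn_func_hub = fa_interface_hub.flash_attn_func
flash_attn_3_func_hub = fa3_interface_hub.flash_attn_func
```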
```python
FLASH = "flash"
FLASH_VARLEN = "flash_varlen"
FLASH_HUB = "flash_hub"
# FLASH_VARLEN_HUB = "flash_varlen_hub"  # not supported yet.
```
Is this related to the kernel, or does it just need more time to be integrated?
We don't have models that use varlen.
@sayakpaul Qwen-Image uses varlen. Also, native fused QKV+MLP attention requires the varlen function.
@DN6 a gentle ping on this one.
```python
raise


def _get_fa3_from_hub():
```
This is a very thin wrapper. I would just call `_get_from_hub("fa3")` directly in `attention_dispatch`.
```diff
-def _get_fa3_from_hub():
+def _get_from_hub(key: str):
```
```diff
-def _get_from_hub(key: str):
+def _get_kernel_from_hub(key: str):
```
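A possible shape for that helper, following the suggestion; only the function name comes from the suggestion, while the registry dict and repo ids below are assumptions for illustration:

```python
from kernels import get_kernel

# Hypothetical mapping of short keys to kernels-hub repo ids (the ids are assumptions).
_KERNEL_HUB_REPOS = {
    "fa": "kernels-community/flash-attn",
    "fa3": "kernels-community/flash-attn3",
}

def _get_kernel_from_hub(key: str):
    """Fetch a kernel interface from the Hugging Face kernels hub by short key."""
    if key not in _KERNEL_HUB_REPOS:
        raise ValueError(f"Unknown kernel key: {key!r}. Expected one of {sorted(_KERNEL_HUB_REPOS)}.")
    return get_kernel(_KERNEL_HUB_REPOS[key])
```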
What does this PR do?
Follow-up of #12236.
Testing code:
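The original test script is not reproduced on this page; the following is a minimal sketch of how the new backend could be exercised, assuming diffusers' `set_attention_backend` API and the `"flash_hub"` backend name from the enum above (model id, prompt, and step count are placeholders):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# "flash_hub" is the new backend value added in this PR; set_attention_backend
# is assumed to accept it like the existing backends.
pipe.transformer.set_attention_backend("flash_hub")

image = pipe("a photo of a cat holding a sign that says hello", num_inference_steps=28).images[0]
image.save("flash_hub_test.png")
```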
Tip: Works with `torch.compile` fullgraph compatibility.

I have tested the code on H100 and A100, and it works.