
memory_efficient_attention faster than flash attention 2 backend? #1180

Open
asahni04 opened this issue Dec 19, 2024 · 0 comments

asahni04 commented Dec 19, 2024

❓ Questions and Help

I expected it to be the other way around. What is the fastest kernel I can use here?

    from xformers.ops import memory_efficient_attention

    q = q.to(dtype)  # xFormers needs manual casting of the operands
    k = k.to(dtype)
    v = v.to(dtype)
    # xFormers expects (batch, seq_len, heads, head_dim) inputs
    x = memory_efficient_attention(
        q,
        k,
        v,
        p=self.attn_drop.p if self.training else 0.0,
        op=self.efficient_attention_ops,
    )
vs.

    from einops import rearrange
    from torch.nn.attention import SDPBackend, sdpa_kernel
    from torch.nn.functional import scaled_dot_product_attention

    # SDPA expects (batch, heads, seq_len, head_dim) inputs
    q, k, v = map(lambda t: rearrange(t, "b n h d -> b h n d", d=self.head_dim), (q, k, v))
    with sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):
        x = scaled_dot_product_attention(
            q, k, v, dropout_p=self.attn_drop.p if self.training else 0.0
        )  # the scale is computed automatically by the torch implementation


            