CUDA: faster tile FA (Pascal/AMD), headsize 256 #15769

JohannesGaessler · 2025-09-03T14:03:19Z

This PR refactors and deduplicates the CUDA "tile" FlashAttention kernels, adds support for head size 256, and improves performance on Pascal and AMD GPUs without FP16 mma. Fast FP16 math is now used where available, but the KQ accumulation and the softmax are always done in FP32 precision, as these seem to be the numerically problematic parts of the kernel. The kernel now has a more flexible parameterization; I tuned kq_stride and kq_nbatch for P40/RX 6800/Mi 50. It is also possible to use a warp_size of 64 rather than 32, but I was not able to get better performance this way. I'm keeping the functionality since I may be missing something and may want to revisit this in the future.
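As a rough illustration of why a long accumulation is usually kept in FP32 even when the inputs are FP16, here is a minimal CPU-side sketch. It is not the kernel code from this PR: the helper names (`to_fp16`, `to_fp32`, `dot_fp16_accumulator`, `dot_fp32_accumulator`) are hypothetical, and rounding is emulated with Python's `struct` module rather than GPU intrinsics.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (FP32) value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_fp16_accumulator(q, k):
    """Dot product where both the products and the running sum are FP16."""
    acc = 0.0
    for a, b in zip(q, k):
        prod = to_fp16(to_fp16(a) * to_fp16(b))
        acc = to_fp16(acc + prod)  # running sum is rounded to FP16 each step
    return acc


def dot_fp32_accumulator(q, k):
    """Dot product with FP16 inputs but an FP32 running sum."""
    acc = 0.0
    for a, b in zip(q, k):
        prod = to_fp16(a) * to_fp16(b)  # inputs rounded to FP16 first
        acc = to_fp32(acc + prod)  # running sum is rounded to FP32 each step
    return acc


if __name__ == "__main__":
    # Toy input, deliberately long so that the FP16 running sum saturates.
    q = [1.0] * 4096
    k = [1.0] * 4096
    print("FP16 accumulator:", dot_fp16_accumulator(q, k))
    print("FP32 accumulator:", dot_fp32_accumulator(q, k))
```

With this toy input the FP16 running sum stops growing once adding 1.0 no longer changes the rounded value, while the FP32 running sum stays exact. Real attention inputs are not all ones, so the size of the error in practice will differ.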

Performance changes

GPU	Model	Microbatch size	Test	t/s master	t/s `c9a318a`	Speedup
MI50	llama 1B Q4_0	16	pp16384	188.69	682.28	3.62
MI50	llama 1B Q4_0	32	pp16384	184.60	956.19	5.18
MI50	llama 1B Q4_0	512	pp16384	185.61	1510.04	8.14
MI50	llama 8B Q4_0	16	pp16384	37.08	194.09	5.23
MI50	llama 8B Q4_0	32	pp16384	38.41	163.29	4.25
MI50	llama 8B Q4_0	512	pp16384	38.35	203.44	5.31
RX 6800	llama 1B Q4_0	16	pp16384	378.27	645.38	1.71
RX 6800	llama 1B Q4_0	32	pp16384	275.62	659.44	2.39
RX 6800	llama 1B Q4_0	512	pp16384	332.24	898.23	2.70
RX 6800	llama 8B Q4_0	16	pp16384	55.59	172.44	3.10
RX 6800	llama 8B Q4_0	32	pp16384	71.36	174.42	2.44
RX 6800	llama 8B Q4_0	512	pp16384	84.39	231.21	2.74
P40	llama 1B Q4_0	16	pp16384	1109.57	1189.42	1.07
P40	llama 1B Q4_0	32	pp16384	1500.00	1686.71	1.12
P40	llama 1B Q4_0	512	pp16384	2052.54	2449.76	1.19
P40	llama 8B Q4_0	16	pp16384	279.17	296.43	1.06
P40	llama 8B Q4_0	32	pp16384	357.32	356.78	1.00
P40	llama 8B Q4_0	512	pp16384	497.47	499.94	1.00
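For reference, the Speedup column is consistent with the ratio of the two throughput columns, rounded to two decimals. A small sketch that recomputes it for a few rows copied from the table above (the `speedup` helper and the row layout are illustrative only):

```python
# (GPU, model, microbatch size, t/s master, t/s PR) copied from the table above.
rows = [
    ("MI50", "llama 1B Q4_0", 16, 188.69, 682.28),
    ("MI50", "llama 1B Q4_0", 512, 185.61, 1510.04),
    ("RX 6800", "llama 8B Q4_0", 16, 55.59, 172.44),
    ("P40", "llama 8B Q4_0", 32, 357.32, 356.78),
]


def speedup(master_tps: float, pr_tps: float) -> float:
    """Throughput of the PR relative to master, rounded to two decimals."""
    return round(pr_tps / master_tps, 2)


if __name__ == "__main__":
    for gpu, model, ubatch, master_tps, pr_tps in rows:
        print(f"{gpu}\t{model}\t{ubatch}\t{speedup(master_tps, pr_tps):.2f}")
```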

JohannesGaessler · 2025-09-03T15:06:45Z

FA on vs. off

GPU	model	n_ubatch	fa	test	t/s
Mi 50	llama 1B Q4_0	16	0	pp16384	536.59
Mi 50	llama 1B Q4_0	16	1	pp16384	680.67
Mi 50	llama 1B Q4_0	32	0	pp16384	828.74
Mi 50	llama 1B Q4_0	32	1	pp16384	955.58
Mi 50	llama 1B Q4_0	512	0	pp16384	1633.99
Mi 50	llama 1B Q4_0	512	1	pp16384	1508.16
Mi 50	llama 8B Q4_0	16	0	pp16384	119.80
Mi 50	llama 8B Q4_0	16	1	pp16384	193.42
Mi 50	llama 8B Q4_0	32	0	pp16384	184.17
Mi 50	llama 8B Q4_0	32	1	pp16384	163.32
Mi 50	llama 8B Q4_0	512	0	pp16384	340.08
Mi 50	llama 8B Q4_0	512	1	pp16384	202.60
Mi 50	gemma 2B Q4_0	16	0	pp16384	522.88
Mi 50	gemma 2B Q4_0	16	1	pp16384	330.48
Mi 50	gemma 2B Q4_0	32	0	pp16384	754.68
Mi 50	gemma 2B Q4_0	32	1	pp16384	313.09
Mi 50	gemma 2B Q4_0	512	0	pp16384	1937.73
Mi 50	gemma 2B Q4_0	512	1	pp16384	404.83
P40	llama 1B Q4_0	16	0	pp16384	194.19
P40	llama 1B Q4_0	16	1	pp16384	1190.37
P40	llama 1B Q4_0	32	0	pp16384	408.35
P40	llama 1B Q4_0	32	1	pp16384	1685.89
P40	llama 1B Q4_0	512	0	pp16384	1278.13
P40	llama 1B Q4_0	512	1	pp16384	2459.82
P40	llama 8B Q4_0	16	0	pp16384	69.48
P40	llama 8B Q4_0	16	1	pp16384	296.45
P40	llama 8B Q4_0	32	0	pp16384	132.42
P40	llama 8B Q4_0	32	1	pp16384	356.64
P40	llama 8B Q4_0	512	0	pp16384	426.63
P40	llama 8B Q4_0	512	1	pp16384	499.74
P40	gemma 2B Q4_0	16	0	pp16384	307.76
P40	gemma 2B Q4_0	16	1	pp16384	791.60
P40	gemma 2B Q4_0	32	0	pp16384	564.02
P40	gemma 2B Q4_0	32	1	pp16384	1061.81
P40	gemma 2B Q4_0	512	0	pp16384	1715.35
P40	gemma 2B Q4_0	512	1	pp16384	1347.84
RX 6800	llama 1B Q4_0	16	0	pp16384	444.98
RX 6800	llama 1B Q4_0	16	1	pp16384	645.99
RX 6800	llama 1B Q4_0	32	0	pp16384	687.48
RX 6800	llama 1B Q4_0	32	1	pp16384	659.64
RX 6800	llama 1B Q4_0	512	0	pp16384	1060.44
RX 6800	llama 1B Q4_0	512	1	pp16384	898.50
RX 6800	llama 8B Q4_0	16	0	pp16384	90.17
RX 6800	llama 8B Q4_0	16	1	pp16384	172.26
RX 6800	llama 8B Q4_0	32	0	pp16384	145.16
RX 6800	llama 8B Q4_0	32	1	pp16384	174.27
RX 6800	llama 8B Q4_0	512	0	pp16384	240.45
RX 6800	llama 8B Q4_0	512	1	pp16384	231.19
RX 6800	gemma 2B Q4_0	16	0	pp16384	480.66
RX 6800	gemma 2B Q4_0	16	1	pp16384	399.05
RX 6800	gemma 2B Q4_0	32	0	pp16384	738.28
RX 6800	gemma 2B Q4_0	32	1	pp16384	313.55
RX 6800	gemma 2B Q4_0	512	0	pp16384	1275.12
RX 6800	gemma 2B Q4_0	512	1	pp16384	568.60
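One way to read the table above is to divide the `fa=1` throughput by the `fa=0` throughput for each configuration; a ratio below 1 means FlashAttention is slower for that case. A minimal sketch over a few rows copied from the table (the `fa_ratio` helper and the dictionary layout are illustrative only):

```python
# (GPU, model, n_ubatch) -> (t/s with fa=0, t/s with fa=1), copied from the table above.
measurements = {
    ("Mi 50", "llama 8B Q4_0", 16): (119.80, 193.42),
    ("Mi 50", "gemma 2B Q4_0", 512): (1937.73, 404.83),
    ("P40", "llama 1B Q4_0", 16): (194.19, 1190.37),
    ("RX 6800", "gemma 2B Q4_0", 512): (1275.12, 568.60),
}


def fa_ratio(tps_fa_off: float, tps_fa_on: float) -> float:
    """Throughput with FlashAttention on, relative to FlashAttention off."""
    return tps_fa_on / tps_fa_off


# Configurations where enabling FlashAttention lowers throughput.
slower_with_fa = [
    key for key, (off, on) in measurements.items() if fa_ratio(off, on) < 1.0
]

if __name__ == "__main__":
    for key, (off, on) in measurements.items():
        print(key, f"{fa_ratio(off, on):.2f}")
    print("slower with FA:", slower_with_fa)
```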

IMbackK · 2025-09-06T17:18:03Z

I'm currently traveling and won't be able to look at anything until the 13th.

Dampfinchen · 2025-09-07T13:34:17Z

Hmm. I was excited for the headsize 256 support but Flash Attention + Partial Offloading + Quantized KV Cache still destroys prompt processing performance for Gemma 3 12B (but not 27B).

JohannesGaessler · 2025-09-07T13:38:07Z

The support for the combination of head size 256 + quantized KV cache still has other issues and requires a refactor of the "vector" kernels.

CUDA: faster tile FA (Pascal/AMD), headsize 256

97d804e

github-actions bot added the "Nvidia GPU" (issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) labels Sep 3, 2025

slaren requested a review from IMbackK September 6, 2025 14:09

slaren approved these changes Sep 6, 2025

JohannesGaessler merged commit 79bc429 into ggml-org:master Sep 6, 2025
48 checks passed

walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025

CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769)

6b4a425

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 9, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

8da70a1

…)" This reverts commit 79bc429.

njsyw1997 pushed a commit to aizip/llama.cpp that referenced this pull request Sep 10, 2025

CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769)

a59b4fc

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 13, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

3853003

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 14, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

c7db17a

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 19, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

54b3dff

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 23, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

73ab678

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 24, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

f708f1b

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 25, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

6d1e531

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 25, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

76e7455

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 26, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

1a264ac

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 27, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

676356b

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 29, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

e777469

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 30, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

f9964e9

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 1, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

427bc5f

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 2, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

389c68c

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 2, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

906afa1

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 3, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

0a0fa64

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 4, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

f6792ac

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 5, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

81ba979

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 7, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

73ffbca

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 7, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

57e948e

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 9, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

817be31

…)" This reverts commit 79bc429.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 9, 2025

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769…

6cebbdd

…)" This reverts commit 79bc429.
