Skip to content

Conversation

JohannesGaessler
Copy link
Collaborator

This PR refactors and deduplicates the CUDA "tile" FlashAttention kernels, adds support for head size 256, and improves performance for Pascal/AMD without FP16 mma. If fast FP16 math is available it is now used, but the KQ accumulation and the softmax are always down with FP32 precision (as these seem to be the numerically problematic parts of the kernel). The kernel now has a more flexible parameterization, I tuned kq_stride and kq_nbatch for P40/RX 6800/Mi 50. It is also possible to use a warp_size of 64 rather than 32 but I was not able to get better performance this way; I'm keeping the functionality since it's possible I'm currently ignorant about something and may want to revisit this in the future.

Performance changes
GPU Model Microbatch size Test t/s master t/s c9a318a Speedup
MI50 llama 1B Q4_0 16 pp16384 188.69 682.28 3.62
MI50 llama 1B Q4_0 32 pp16384 184.60 956.19 5.18
MI50 llama 1B Q4_0 512 pp16384 185.61 1510.04 8.14
MI50 llama 8B Q4_0 16 pp16384 37.08 194.09 5.23
MI50 llama 8B Q4_0 32 pp16384 38.41 163.29 4.25
MI50 llama 8B Q4_0 512 pp16384 38.35 203.44 5.31
RX 6800 llama 1B Q4_0 16 pp16384 378.27 645.38 1.71
RX 6800 llama 1B Q4_0 32 pp16384 275.62 659.44 2.39
RX 6800 llama 1B Q4_0 512 pp16384 332.24 898.23 2.70
RX 6800 llama 8B Q4_0 16 pp16384 55.59 172.44 3.10
RX 6800 llama 8B Q4_0 32 pp16384 71.36 174.42 2.44
RX 6800 llama 8B Q4_0 512 pp16384 84.39 231.21 2.74
P40 llama 1B Q4_0 16 pp16384 1109.57 1189.42 1.07
P40 llama 1B Q4_0 32 pp16384 1500.00 1686.71 1.12
P40 llama 1B Q4_0 512 pp16384 2052.54 2449.76 1.19
P40 llama 8B Q4_0 16 pp16384 279.17 296.43 1.06
P40 llama 8B Q4_0 32 pp16384 357.32 356.78 1.00
P40 llama 8B Q4_0 512 pp16384 497.47 499.94 1.00

@JohannesGaessler
Copy link
Collaborator Author

FA on vs. off
GPU model n_ubatch fa test t/s
Mi 50 llama 1B Q4_0 16 0 pp16384 536.59
Mi 50 llama 1B Q4_0 16 1 pp16384 680.67
Mi 50 llama 1B Q4_0 32 0 pp16384 828.74
Mi 50 llama 1B Q4_0 32 1 pp16384 955.58
Mi 50 llama 1B Q4_0 512 0 pp16384 1633.99
Mi 50 llama 1B Q4_0 512 1 pp16384 1508.16
Mi 50 llama 8B Q4_0 16 0 pp16384 119.80
Mi 50 llama 8B Q4_0 16 1 pp16384 193.42
Mi 50 llama 8B Q4_0 32 0 pp16384 184.17
Mi 50 llama 8B Q4_0 32 1 pp16384 163.32
Mi 50 llama 8B Q4_0 512 0 pp16384 340.08
Mi 50 llama 8B Q4_0 512 1 pp16384 202.60
Mi 50 gemma 2B Q4_0 16 0 pp16384 522.88
Mi 50 gemma 2B Q4_0 16 1 pp16384 330.48
Mi 50 gemma 2B Q4_0 32 0 pp16384 754.68
Mi 50 gemma 2B Q4_0 32 1 pp16384 313.09
Mi 50 gemma 2B Q4_0 512 0 pp16384 1937.73
Mi 50 gemma 2B Q4_0 512 1 pp16384 404.83
P40 llama 1B Q4_0 16 0 pp16384 194.19
P40 llama 1B Q4_0 16 1 pp16384 1190.37
P40 llama 1B Q4_0 32 0 pp16384 408.35
P40 llama 1B Q4_0 32 1 pp16384 1685.89
P40 llama 1B Q4_0 512 0 pp16384 1278.13
P40 llama 1B Q4_0 512 1 pp16384 2459.82
P40 llama 8B Q4_0 16 0 pp16384 69.48
P40 llama 8B Q4_0 16 1 pp16384 296.45
P40 llama 8B Q4_0 32 0 pp16384 132.42
P40 llama 8B Q4_0 32 1 pp16384 356.64
P40 llama 8B Q4_0 512 0 pp16384 426.63
P40 llama 8B Q4_0 512 1 pp16384 499.74
P40 gemma 2B Q4_0 16 0 pp16384 307.76
P40 gemma 2B Q4_0 16 1 pp16384 791.60
P40 gemma 2B Q4_0 32 0 pp16384 564.02
P40 gemma 2B Q4_0 32 1 pp16384 1061.81
P40 gemma 2B Q4_0 512 0 pp16384 1715.35
P40 gemma 2B Q4_0 512 1 pp16384 1347.84
RX 6800 llama 1B Q4_0 16 0 pp16384 444.98
RX 6800 llama 1B Q4_0 16 1 pp16384 645.99
RX 6800 llama 1B Q4_0 32 0 pp16384 687.48
RX 6800 llama 1B Q4_0 32 1 pp16384 659.64
RX 6800 llama 1B Q4_0 512 0 pp16384 1060.44
RX 6800 llama 1B Q4_0 512 1 pp16384 898.50
RX 6800 llama 8B Q4_0 16 0 pp16384 90.17
RX 6800 llama 8B Q4_0 16 1 pp16384 172.26
RX 6800 llama 8B Q4_0 32 0 pp16384 145.16
RX 6800 llama 8B Q4_0 32 1 pp16384 174.27
RX 6800 llama 8B Q4_0 512 0 pp16384 240.45
RX 6800 llama 8B Q4_0 512 1 pp16384 231.19
RX 6800 gemma 2B Q4_0 16 0 pp16384 480.66
RX 6800 gemma 2B Q4_0 16 1 pp16384 399.05
RX 6800 gemma 2B Q4_0 32 0 pp16384 738.28
RX 6800 gemma 2B Q4_0 32 1 pp16384 313.55
RX 6800 gemma 2B Q4_0 512 0 pp16384 1275.12
RX 6800 gemma 2B Q4_0 512 1 pp16384 568.60

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Sep 3, 2025
@slaren slaren requested a review from IMbackK September 6, 2025 14:09
@IMbackK
Copy link
Collaborator

IMbackK commented Sep 6, 2025

Im currently traveling and wont be able to look at anything until the 13th.

@JohannesGaessler JohannesGaessler merged commit 79bc429 into ggml-org:master Sep 6, 2025
48 checks passed
@Dampfinchen
Copy link

Hmm. I was excited for the headsize 256 support but Flash Attention + Partial Offloading + Quantized KV Cache still destroys prompt processing performance for Gemma 3 12B (but not 27B).

@JohannesGaessler
Copy link
Collaborator Author

The support for the combination of head size 256 + quantized KV cache still has other issues and requires a refactor of the "vector" kernels.

walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 9, 2025
njsyw1997 pushed a commit to aizip/llama.cpp that referenced this pull request Sep 10, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 13, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 14, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 19, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 23, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 24, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 25, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 25, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 26, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 27, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 29, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 30, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 30, 2025
)"

This reverts commit 75a3a6c.

d

Update cudart64_12.dll

Revert "Cudart 12.9"

This reverts commit f79c687.

Revert "Allow compile exe, pdf features off"

This reverts commit 5e1c154.

Update fattn.cu

Update set-rows.cu

batches

Revert "try fix fattn again, porting some older code. the cc detection is not working well, so its hacky"

This reverts commit 7b04191.

Update ggml-cuda.cu

Update fattn.cu

Update fattn.cu

Update fattn.cu

Add option to disable MMA support on Turing

Author : pt13762104

GGML_CUDA_NO_PEER_COPY to try to fix a crash on Gemma 3

Deactivate SWA when Fast Forwarding, commented

Wrench Fix for the SWA I borked

Clean-up quantkv algo

comment warp sizes for now in IQ_K MMQ Kernels

KV 24 -> KV 31

Add a readme.

ngxson's commented hack

Try some hack for gpt-oss

Update llama-vocab.cpp

Bump Windows max open files from 512 to 2048

Author : Thireus

CLI - Specify GGML_TYPE to quantize for the main tensors. (#91)

To complement the token_embd.weight and output.weight :

attn_v.weight
attn_k.weight.
attn_q_weight
attn_output.weight
attn_qkv.weight
ffn_gate
ffn_down
ffn_up

EsoCroK naming

v1.99430_b6645-6_Q6-IO2346_RMv1.17.99m

Disable I2_K cpu quantization.

To allow compilation.

MMQ code adaptation

Update mmq.cuh

MMQ Initial code for IQ2,3,4,5,6_K

IQ_K quants first gen (4, 5, 6)

Some logs back

Batches

Croco Bench.

Double the anti-abuse limits

Allow compile exe, pdf features off

Revert "Allow compile exe, pdf features off"

This reverts commit 5e2451f129f0bca326f74aae24df475c0410cdbf.

Update koboldcpp.py

Revert "Allow compile exe, pdf features off"

This reverts commit 2a7e9e004e8578a05fb67967d09cf36263867b9b.

Revert "Allow compile exe, pdf features off"

This reverts commit b4fd7809a4f77ff18bd415fcfb2d5f435e3b63a3.

quantization tweaks

iq3_ks quantization tweaks

Minor iq3_k tweak

q2_K tweaks

q3_K tweaks

q4_K tweaks

q5_K tweaks

GGUF v14 attempt of second fix.

loosen gguf restrictions.

Quantization improvements #295 and #302, GGML part only

Improved IQ2_XS quantization #312

Improved IQ1_M quantization #327

ggml_row_size accounting fix for GGUF v14

Credits : @ikawrakow

Fighting with cmake #279

Drop the GGML count limitation limit

Old markings

Customize KCPP.py

Croco additional chat adapters andtemplates

Reinstate "skip barrier of noop"

Allow q8_0 KV cache for head size 256 #330

Up FA KV modes

256 candidates (1024 with Grammar)

Adapt q6_0 MMQ to llama.cpp mainline

Q6_0 MMQ Kernel attempt

MMQ for Q6_0 authored by Ikawrakow

Add Q6_0 MMQ to template generator authored by Ikawrakow

Q6_0 KVQ for KCPP/Croco -> KV22

For release.

fix a few lazy-cuts and hiccups left during the merge of IQ4_NL.

dequantize for q6_0 and related cpy

Enable q6_0 for flash attention

As with IQ4_NL, just for head size of 128 for now. Without GGML_CUDA_FA_ALL_QUANTS set, only Q6_0 + Q5_0 and Q8_0 + Q6_0 are included. With this the VRAM poor have better options for selecting the best possible (as allowed by VRAM, model size, context length) quantized KV-cache.

PR by Ikawrakow on ik_llama.cpp

Adding Q6_0 (#77) Rev 20240807

* Adding q6_0 - basics + AVX2/Zen4 working

* Adding q6_0: CUDA dequantize works, but not mmvq

* Adding q6_0: CUDA mmvq works

* Adding q6_0: CUDA cpy, so Q6_0 can be used for KV-cache

* Add q6_0 to CPU flash attention

Disappointing result: for LlaMA-3.2-1B, q6_0 K- and V-cache
gives about the same PPL as q8_0 K-cache and q4_0 V-cache,
while needing the exact same RAM.
I.e., what was the point?

* q6_0: slightly better kv-cache result

Better than q8_0+q4_0, but not as good as q8_0+iq4_nl

* q6_0: works on ARM_NEON

* q6_0: dequantize works on Metal, but not vector dot product

* q6_0: it now works on Metal

Outperforms q5_0 by a significant margin. E.g.
| model                          |       size |     params | backend    | ngl | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| llama 8B Q6_0                  |   6.08 GiB |     8.03 B | Metal      | 100 |       4 |         tg128 |     44.02 ± 0.08 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Metal      | 100 |       4 |         tg128 |     40.13 ± 0.12 |
| llama 8B Q6_0                  |   6.08 GiB |     8.03 B | Metal      | 100 |       4 |         pp512 |    500.55 ± 0.32 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Metal      | 100 |       4 |         pp512 |    448.02 ± 0.27 |

* q6_0: can now be used for kv-cache on Metal -> skipped.

---------

Adaptation to mainline by me!

IQ4_NL KVQ for KCPP/Croco

missing templates instances for KVQ IQ4_NL
Update fattn.cu for KVQ IQ4_NL
Update fattn-vec-f16.cuh for KVQ IQ4_NL
Update fattn-vec-f32.cuh for KVQ IQ4_NL
CML and Makefile FOR IQ4_NL

KV_IQ4_NL uncommenting VEC16 cases
KV_IQ4_NL uncommenting VEC32 cases

Enable IQ4_NL for V-cache in token generation

Add IQ4_NL + IQ4_NL to FA

This is a better alternative than Q4_0 + Q4_0 for the VRAM poor.

Comment unwanted add-in in makefile

iq4_nl: faster quantization (#76)

CUDA: faster float -> iq4_nl conversion (#73)

* iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2

PP-512 performance for LLaMA-3.1-8B goes to 162.6 t/s up
from 133.2 t/s.

Default Blas Batch Size = 128

Quant KV and Draft QKV, 24 modes

With customizable QKV for the draft as well.
And reduced Blas Batch Size for the draft model.

Default Draft Amount = 4

Bench context size

Max contextsize and steps

Croco CML

SCHED_MAX_COPIES = 1

And Croco usual additions to the CMakeList

Cudart 12.9

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769)"

This reverts commit 79bc429.

Revert "HIP: use v_dot2_f32_f16 instruction for FA (ggml-org#15884)"

This reverts commit 17bc5a8.

Revert "CUDA: larger SRAM reads for tile FA, AMD FP16 dot (ggml-org#15927)"

This reverts commit 0e6ff00.

Revert "CUDA: fix FA occupancy, optimize tile kernel (ggml-org#15982)"

This reverts commit c959b67.

Revert "CUDA: fix compilation on CC 6.0 (ggml-org#16091)"

This reverts commit 368560a.

Co-Authored-By: Kawrakow <iwankawrakow@gmail.com>
Co-Authored-By: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 1, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 2, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 2, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 3, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 4, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 5, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 7, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 7, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 9, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ggml changes relating to the ggml tensor library for machine learning Nvidia GPU Issues specific to Nvidia GPUs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants