Conversation

@jeffbolznv
Collaborator

  • Increase tile size for k-quants, to match non-k-quants
  • Choose more carefully between large and medium tiles, considering how it interacts with split_k
  • Allow larger/non-power of two split_k, and make the splits a multiple of 256
  • Use split_k==3 when more than 1/2 and at most 2/3 of the SMs would have been used

Perf results on 4070 and 5090 below. The split_k and medium/large tile selection improvements particularly benefit the 5090, where the GPU often wasn't being fully occupied.
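The split_k heuristics from the bullets above can be sketched roughly as follows. This is a simplified illustration, not the actual llama.cpp code: the function and parameter names are hypothetical, and the exact occupancy thresholds and tile accounting in the real backend are more involved.

```cpp
#include <cstdint>

// Hypothetical sketch: pick a split_k factor based on how many workgroups
// (output tiles) the matmul produces relative to the number of SMs.
// Thresholds follow the PR description; names are illustrative only.
static uint32_t choose_split_k(uint32_t workgroups, uint32_t sm_count) {
    if (workgroups >= sm_count) {
        return 1;  // GPU already filled, splitting K would only add overhead
    }
    // Between 1/2 and 2/3 of the SMs occupied: a split of 3 fills the GPU
    // without jumping all the way to 4.
    if (2 * workgroups > sm_count && 3 * workgroups <= 2 * sm_count) {
        return 3;
    }
    // Otherwise choose the smallest (possibly non-power-of-two) split
    // that brings the workgroup count up to the SM count.
    return (sm_count + workgroups - 1) / workgroups;
}

// Each split of the K dimension is rounded up to a multiple of 256,
// per the "make the splits a multiple of 256" bullet.
static uint32_t split_size(uint32_t k, uint32_t split_k) {
    uint32_t per_split = (k + split_k - 1) / split_k;  // ceil(k / split_k)
    return ((per_split + 255) / 256) * 256;            // round up to 256
}
```

For example, with 70 workgroups on a 128-SM GPU, more than half but at most 2/3 of the SMs would be used, so the sketch returns split_k==3; a K dimension of 4096 split three ways then rounds up to splits of 1536.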

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 0 -p 512 -r 50 --prio 1 -m C:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m C:\models\gemma-3n-E4B-it-Q4_K_M.gguf -m C:\models\llama-3.2-3b-instruct-q8_0.gguf -m C:\models\meta-llama-3-8b-instruct.Q4_K_M.gguf -m C:\models\Phi-3-mini-4k-instruct-q4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           pp512 |       1920.60 ± 1.79 |
| gemma3n E4B Q4_K - Medium      |   4.22 GiB |     6.87 B | Vulkan     |  99 |  1 |           pp512 |      2609.35 ± 41.34 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           pp512 |     8254.77 ± 121.91 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           pp512 |       3765.21 ± 8.59 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           pp512 |      6759.22 ± 30.63 |

after:

ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           pp512 |       2239.43 ± 3.60 |
| gemma3n E4B Q4_K - Medium      |   4.22 GiB |     6.87 B | Vulkan     |  99 |  1 |           pp512 |      2694.36 ± 46.88 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           pp512 |     8178.23 ± 134.56 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           pp512 |       3990.27 ± 5.69 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           pp512 |      6636.85 ± 13.82 |

before:

ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           pp512 |      5469.64 ± 45.64 |
| gemma3n E4B Q4_K - Medium      |   4.22 GiB |     6.87 B | Vulkan     |  99 |  1 |           pp512 |      4763.36 ± 70.06 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           pp512 |    24341.22 ± 246.44 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           pp512 |     9577.41 ± 143.94 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           pp512 |    15541.69 ± 134.82 |

after:

ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           pp512 |      6351.47 ± 30.99 |
| gemma3n E4B Q4_K - Medium      |   4.22 GiB |     6.87 B | Vulkan     |  99 |  1 |           pp512 |     5170.98 ± 170.61 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           pp512 |   26591.76 ± 1879.27 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           pp512 |    11046.12 ± 215.04 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           pp512 |    20919.16 ± 788.06 |

@jeffbolznv jeffbolznv requested a review from 0cc4m as a code owner July 29, 2025 04:46
@github-actions github-actions bot added the Vulkan and ggml labels Jul 29, 2025
@0cc4m (Collaborator) left a comment

LGTM

@0cc4m 0cc4m merged commit 4cb208c into ggml-org:master Aug 2, 2025
47 checks passed