Conversation

@jeffbolznv
Collaborator

  • Increase tile size for k-quants, to match non-k-quants
  • Choose more carefully between large and medium tiles, considering how it interacts with split_k
  • Allow larger/non-power of two split_k, and make the splits a multiple of 256
  • Use split_k==3 when more than 1/2 and at most 2/3 of the SMs would have been used

Perf results on 4070 and 5090 below. The split_k and medium/large tile selection improvements particularly benefit the 5090, where the GPU often wasn't being fully occupied.
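The split_k heuristics from the bullets above can be sketched roughly as follows. This is a simplified illustration, not the actual llama.cpp code: the function and parameter names are hypothetical, and the exact occupancy thresholds and tile accounting in the real backend are more involved.

```cpp
#include <cstdint>

// Hypothetical sketch: pick a split_k factor based on how many workgroups
// (output tiles) the matmul produces relative to the number of SMs.
// Thresholds follow the PR description; names are illustrative only.
static uint32_t choose_split_k(uint32_t workgroups, uint32_t sm_count) {
    if (workgroups >= sm_count) {
        return 1;  // GPU already filled, splitting K would only add overhead
    }
    // Between 1/2 and 2/3 of the SMs occupied: a split of 3 fills the GPU
    // without jumping all the way to 4.
    if (2 * workgroups > sm_count && 3 * workgroups <= 2 * sm_count) {
        return 3;
    }
    // Otherwise choose the smallest (possibly non-power-of-two) split
    // that brings the workgroup count up to the SM count.
    return (sm_count + workgroups - 1) / workgroups;
}

// Each split of the K dimension is rounded up to a multiple of 256,
// per the "make the splits a multiple of 256" bullet.
static uint32_t split_size(uint32_t k, uint32_t split_k) {
    uint32_t per_split = (k + split_k - 1) / split_k;  // ceil(k / split_k)
    return ((per_split + 255) / 256) * 256;            // round up to 256
}
```

For example, with 70 workgroups on a 128-SM GPU, more than half but at most 2/3 of the SMs would be used, so the sketch returns split_k==3; a K dimension of 4096 split three ways then rounds up to splits of 1536.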

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 0 -p 512 -r 50 --prio 1 -m C:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m C:\models\gemma-3n-E4B-it-Q4_K_M.gguf -m C:\models\llama-3.2-3b-instruct-q8_0.gguf -m C:\models\meta-llama-3-8b-instruct.Q4_K_M.gguf -m C:\models\Phi-3-mini-4k-instruct-q4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           pp512 |       1920.60 ± 1.79 |
| gemma3n E4B Q4_K - Medium      |   4.22 GiB |     6.87 B | Vulkan     |  99 |  1 |           pp512 |      2609.35 ± 41.34 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           pp512 |     8254.77 ± 121.91 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           pp512 |       3765.21 ± 8.59 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           pp512 |      6759.22 ± 30.63 |

after:

ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           pp512 |       2239.43 ± 3.60 |
| gemma3n E4B Q4_K - Medium      |   4.22 GiB |     6.87 B | Vulkan     |  99 |  1 |           pp512 |      2694.36 ± 46.88 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           pp512 |     8178.23 ± 134.56 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           pp512 |       3990.27 ± 5.69 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           pp512 |      6636.85 ± 13.82 |

before:

ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           pp512 |      5469.64 ± 45.64 |
| gemma3n E4B Q4_K - Medium      |   4.22 GiB |     6.87 B | Vulkan     |  99 |  1 |           pp512 |      4763.36 ± 70.06 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           pp512 |    24341.22 ± 246.44 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           pp512 |     9577.41 ± 143.94 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           pp512 |    15541.69 ± 134.82 |

after:

ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           pp512 |      6351.47 ± 30.99 |
| gemma3n E4B Q4_K - Medium      |   4.22 GiB |     6.87 B | Vulkan     |  99 |  1 |           pp512 |     5170.98 ± 170.61 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           pp512 |   26591.76 ± 1879.27 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           pp512 |    11046.12 ± 215.04 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           pp512 |    20919.16 ± 788.06 |

@jeffbolznv jeffbolznv requested a review from 0cc4m as a code owner July 29, 2025 04:46
@github-actions github-actions bot added the Vulkan and ggml labels Jul 29, 2025
@0cc4m (Collaborator) left a comment

LGTM

@0cc4m 0cc4m merged commit 4cb208c into ggml-org:master Aug 2, 2025
47 checks passed