vulkan: Increase workgroup size for GLU, for performance #14345


Open · jeffbolznv wants to merge 2 commits into cisc/unary-reglu-geglu-swiglu

Conversation

jeffbolznv (Collaborator):

@CISC @0cc4m I noticed Vulkan performance was much worse for token generation (tg) in #14158 due to the small workgroup size. This change restores the performance:

before:
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m c:\models\glm-4-9b-chat-Q4_0.gguf -fa 1 -n 128 -p 512 --prio 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           pp512 |      3369.87 ± 10.71 |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           tg128 |         57.09 ± 0.21 |

build: ab46d11d (5752)

after:
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m c:\models\glm-4-9b-chat-Q4_0.gguf -fa 1 -n 128 -p 512 --prio 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           pp512 |      3404.32 ± 11.38 |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           tg128 |         73.71 ± 0.24 |

build: 065b990f (5753)
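
For readers unfamiliar with the shader in question: the fix raises the compute shader's local workgroup size so each row's work is spread over enough threads to keep the GPU busy during token generation. Below is a minimal GLSL sketch of the row-per-workgroup scheme, assuming a fused up/gate row layout; the buffer names, push constants, and the local size of 512 are illustrative, not the exact values from this PR:

```glsl
// Illustrative sketch, not the actual ggml-vulkan source.
// One workgroup per row; threads stride across the columns.
// With a small local_size_x, token generation (few, short rows)
// leaves most of the GPU idle; raising it restores occupancy.
#version 450

layout(local_size_x = 512, local_size_y = 1, local_size_z = 1) in;

layout(std430, binding = 0) readonly  buffer A { float data_a[]; };
layout(std430, binding = 1) writeonly buffer D { float data_d[]; };

layout(push_constant) uniform P { uint ncols; } p;

void main() {
    const uint row = gl_WorkGroupID.x;
    // Assumed layout: each input row holds ncols "up" values followed
    // by ncols "gate" values (hypothetical; the real op has variants).
    for (uint col = gl_LocalInvocationID.x; col < p.ncols; col += gl_WorkGroupSize.x) {
        const float x = data_a[row * 2u * p.ncols + col];
        const float g = data_a[row * 2u * p.ncols + p.ncols + col];
        // swiglu-style gating: silu(x) * gate
        data_d[row * p.ncols + col] = (x / (1.0f + exp(-x))) * g;
    }
}
```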

jeffbolznv requested review from CISC and 0cc4m on June 23, 2025 at 13:33
The github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jun 23, 2025
CISC (Collaborator) left a comment:

I honestly don't know anything about the Vulkan backend, but if you say so I'm sure this is good. :)

CISC (Collaborator) commented on Jun 23, 2025:

Out of curiosity, is there a similar tg boost for models with split up/gate?

jeffbolznv (Collaborator, Author):

This was fixing a regression vs what's in master, so it's just recovering the performance we already had. I've only tested this one model.

CISC (Collaborator) commented on Jun 23, 2025:

> This was fixing a regression vs what's in master, so it's just recovering the performance we already had. I've only tested this one model.

I understood; what I was asking is whether you could check if there was a similar regression for split up/gate too?

jeffbolznv (Collaborator, Author):

There very likely was. Can you suggest a model to test?

CISC (Collaborator) commented on Jun 23, 2025:

Qwen3 or something?

jeffbolznv (Collaborator, Author):

Yes, there is a similar issue with Qwen3, which this mostly fixes. But it's still 1-2% slower. I think I need to change the shader to do one element per thread rather than a row per workgroup. I'll push another commit later today.
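
A sketch of the one-element-per-thread layout described above, under the same illustrative assumptions as the earlier sketch (fused up/gate halves per row; the names, push constants, and dispatch math are hypothetical, not the PR's actual code):

```glsl
// Illustrative sketch: flatten rows x columns into one index so each
// invocation computes exactly one output element. The host would then
// dispatch ceil(nelems / 512) workgroups regardless of row length,
// so small-row token-generation workloads still fill the GPU.
#version 450

layout(local_size_x = 512, local_size_y = 1, local_size_z = 1) in;

layout(std430, binding = 0) readonly  buffer A { float data_a[]; };
layout(std430, binding = 1) writeonly buffer D { float data_d[]; };

layout(push_constant) uniform P { uint ncols; uint nelems; } p;

void main() {
    const uint i = gl_GlobalInvocationID.x;
    if (i >= p.nelems) return;   // guard the ragged last workgroup
    const uint row = i / p.ncols;
    const uint col = i % p.ncols;
    const float x = data_a[row * 2u * p.ncols + col];
    const float g = data_a[row * 2u * p.ncols + p.ncols + col];
    data_d[i] = (x / (1.0f + exp(-x))) * g;
}
```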

jeffbolznv (Collaborator, Author):

tg perf with Qwen3 is now marginally faster than with master.
