Performance of llama.cpp with Vulkan #10879
-
AMD FirePro W8100
-
AMD RX 470
-
Ubuntu 24.04; Vulkan and CUDA installed from official APT packages, vs CUDA on the same build/setup.
build: 4da69d1 (4351)
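A comparison like this typically builds both backends from the same checkout; a minimal sketch, assuming current cmake flag names (GGML_VULKAN, GGML_CUDA) and the thread's model file:

```shell
# build Vulkan and CUDA backends side by side from one tree
cmake -B build-vulkan -DGGML_VULKAN=ON && cmake --build build-vulkan --config Release -j
cmake -B build-cuda   -DGGML_CUDA=ON   && cmake --build build-cuda   --config Release -j

# run the identical bench on each
./build-vulkan/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99
./build-cuda/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99
```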
-
MacBook Air M2 on Asahi Linux
ggml_vulkan: Found 1 Vulkan devices:
-
Gentoo Linux on ROG Ally (2023), Ryzen Z1 Extreme
ggml_vulkan: Found 1 Vulkan devices:
-
ggml_vulkan: Found 4 Vulkan devices:
-
build: 0d52a69 (4439)
NVIDIA GeForce RTX 3090 (NVIDIA)
AMD Radeon RX 6800 XT (RADV NAVI21) (radv)
AMD Radeon (TM) Pro VII (RADV VEGA20) (radv)
Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver)
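With four devices found, per-GPU results like these are easiest to collect one device at a time; a hedged example, assuming the Vulkan backend's GGML_VK_VISIBLE_DEVICES environment variable:

```shell
# bench only the first Vulkan device; comma-separated indices select several
GGML_VK_VISIBLE_DEVICES=0 ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99
```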
-
@netrunnereve Some of the tg results here are a little low; I think they might be debug builds. The cmake step (at least on Linux) might require
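Presumably this refers to explicitly selecting an optimized build; a hedged sketch (the exact flag the comment had in mind is my assumption):

```shell
# hypothetical completion: force an optimized build so tg numbers aren't debug-build artifacts
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```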
-
Build: 8d59d91 (4450)
Lack of proper Xe coopmat support in the ANV driver is a setback honestly.
edit: retested both with the default batch size.
-
Here's something exotic: an AMD FirePro S10000 dual GPU from 2012 with 2x 3GB GDDR5.
build: 914a82d (4452)
-
Latest Arch. For the sake of consistency I run every bit in a script and also build every target from scratch. Each run is wrapped like this:
kill -STOP -1
timeout 240s $COMMAND
kill -CONT -1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
build: ff3fcab (4459)
This build seems to underutilise both GPU and CPU in real conditions.
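A fuller sketch of such a wrapper, with the model path and -ngl value assumed:

```shell
#!/bin/sh
# pause every other process this user can signal, so the bench runs undisturbed
kill -STOP -1
# run the benchmark under a hard 240 s cap
timeout 240s ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99
# resume everything that was paused
kill -CONT -1
```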
-
Intel ARC A770 on Windows:
build: ba8a1f9 (4460)
-
Single GPU Vulkan:
Radeon Instinct MI25
ggml_vulkan: 0 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Radeon Pro VII
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Multi GPU Vulkan:
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Single GPU ROCm:
Device 0: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Multi GPU ROCm:
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Layer split
build: 2739a71 (4461)
Row split
build: 2739a71 (4461)
Single GPU speed is decent, but multi GPU trails ROCm by a wide margin, especially with large models, due to the lack of row split.
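The layer/row split runs above would come from llama-bench's split-mode flag; a minimal sketch for the ROCm build (model path assumed):

```shell
# distribute whole layers across both GPUs
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -sm layer
# split individual tensors row-wise across both GPUs
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -sm row
```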
-
AMD Radeon RX 5700 XT on Arch using mesa-git and setting a higher GPU power limit compared to the stock card.
I also think it could be interesting to add flash attention results to the scoreboard (even if support for it still isn't as mature as CUDA's).
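llama-bench already exposes a flash-attention toggle, so FA rows could be collected in the same run; a minimal sketch (model path assumed):

```shell
# run every test twice, with flash attention off and on
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
```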
-
I tried, but there was nothing after an hour, ok, maybe 40 minutes... Anyway, I ran llama-cli for a sample eval...
Meanwhile, OpenBLAS:
-
Linux 6.12.9, Radeon RX 6900 XT, build 44e18ef (4503), AMDGPU-PRO (6.3.0)
Mesa RADV 23.3.3
AMD HIP (6.2.4) for comparison
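Switching between RADV and AMDGPU-PRO for runs like these is usually done by pointing the Vulkan loader at a single ICD manifest; a hedged sketch (the PRO manifest path varies by install and is an assumption):

```shell
# force the Mesa RADV driver for this run
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99
# force the AMDGPU-PRO driver (check your install for the actual manifest path)
VK_ICD_FILENAMES=/opt/amdgpu-pro/etc/vulkan/icd.d/amd_icd64.json ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99
```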
-
Windows 11, AMD RX 6900 XT, AMD Ryzen 7 5800X 8-core processor @ 3.80 GHz
.\llama-bench.exe -m "...\llama-2-7b.Q4_0.gguf" -ngl 99
-
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
-
FreeBSD 14.2
build: 2e3a7d3c (4650)
macOS Ventura 13.7.2
build: d2fe216 (4667)
-
build: aaa5505 (4673)
llama.cpp is built using
-
Ryzen 8840HS with Radeon 780M
build: 19d3c82 (4677)
-
I have a pair of AMD Instinct MI50s that I'd like to add test results for, but I can't get Vulkan to recognize them so llama.cpp can use them. I'm using the amdgpu installer script like this:
-
Ryzen 3400G
build: 0893e01 (4682)
48 GB DDR4-3000, Debian 12, amdgpu driver from kernel 6.12.9, Mesa 24.2.8. The '0' rows ran with 256 MB dedicated VRAM but forced host memory allocation (GGML_VK_PREFER_HOST_MEMORY=1). The 'CPU?' rows were simply run with '-ngl 0', but it looks like llama-bench disables GPU prompt processing in that case?
-
MacBook M3 Pro with 36 GB of RAM, running macOS 15.3 with MoltenVK patched with spec constants fixed (KhronosGroup/MoltenVK#2441).
build: b9ab0a4 (4687)
-
Benchmarks for the Intel Iris Plus Graphics G7 iGPU on an Intel Core i7-1065G7 are below. 16 GB system RAM, but vulkaninfo reports around 8 GB of available VRAM. CPU vs Vulkan below: Vulkan is much faster on pp512 (almost 2x) but slightly slower on tg128. Not sure if I can do anything to speed up token generation.
./llama-bench -m models/llama-2-7b.Q4_0.gguf
GGML_VK_PREFER_HOST_MEMORY=1 llama-bench -m models/llama-2-7b.Q4_0.gguf -ngl 0,10,20,30,33
-
I did some testing on my Samsung Galaxy A34 (Mali-G68 MC4); here are the results. build: 73e2ed3 (4735)
Vulkan:
Same results with
CPU:
I was already aware that the Vulkan implementation for mobile GPUs is not there yet, but I'm really surprised by the big difference between a CPU build and a GPU build with -ngl 0. Theoretically they should be about the same, no?
-
AMD Radeon RX 7800 XT on Linux
build: ee02ad0 (4749)
-
Integrated GPU of a Core Ultra 7 165U, with the Intel GPU driver installed.
./llama.cpp/build-20250217-Vulkan/build/bin$ ./llama-bench -m ./models/llama-2-7b.Q4_0.gguf
build: b9ab0a4 (4687)
./llama.cpp/build-20250211-sycl/build/bin$ ./llama-bench -m ./models/llama-2-7b.Q4_0.gguf
build: b9ab0a4 (4687)
AMD external GPU (7600M XT) connected via Thunderbolt 4 USB-C:
./llama.cpp/build-20250217-Vulkan/build/bin$ ./llama-bench -m ./models/llama-2-7b.Q4_0.gguf
./llama.cpp/build-20250213-hip/build/bin$ ./llama-bench -m ./models/llama-2-7b.Q4_0.gguf
build: b9ab0a4 (4687)
For me, the Vulkan version is not my first choice, whether using the external GPU or not.
-
This is similar to the Apple Silicon benchmark thread, but for Vulkan! Many improvements have been made to the Vulkan backend and I think it's good to consolidate and discuss our results here.
We'll be testing the Llama 2 7B model like the other thread to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4GB GPU. You can download it here.
Instructions
Either run the commands below (see the sketch after these instructions) or download one of our Vulkan releases.
Share your llama-bench results along with the git hash and Vulkan info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.
If multiple entries are posted for the same device the one with the highest tg128 score will be used. Performance may vary depending on driver, operating system, board manufacturer, etc. even if the chip is the same.
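A typical Vulkan build and bench invocation looks roughly like this (a sketch, assuming a Linux machine with the Vulkan SDK and drivers installed):

```shell
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf
```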
Vulkan Scoreboard for Llama 2 7B, Q4_0 (no FA)
Vulkan Scoreboard for Llama 2 7B, Q4_0 (with FA)