
How to run in AMD GPU with macos (with mps)? #2965

Closed
sukualam opened this issue Sep 2, 2023 · 9 comments


sukualam commented Sep 2, 2023

My RX 560 is actually supported in macOS (mine is a Hackintosh running macOS Ventura 13.4), but when I try to run llama.cpp it can't utilize MPS. I already compiled it with LLAMA_METAL=1 make, but when I run this command:

./main -m ./models/falcon-7b-Q4_0-GGUF.gguf -n 128 -ngl 1

it fails with this error:

ggml_metal_init: loaded kernel_mul_mat_q5_K_f32            0x7fcb478145f0 | th_max =  768 | th_width =   64
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32            0x7fcb47814dd0 | th_max = 1024 | th_width =   64
ggml_metal_init: loaded kernel_mul_mm_f16_f32                         0x0 | th_max =    0 | th_width =    0
ggml_metal_init: load pipeline error: Error Domain=CompilerError Code=2 "SC compilation failure
There is a call to an undefined label" UserInfo={NSLocalizedDescription=SC compilation failure
There is a call to an undefined label}
llama_new_context_with_model: ggml_metal_init() failed
llama_init_from_gpt_params: error: failed to create context with model './models/falcon-7b-Q4_0-GGUF.gguf'
main: error: unable to load model

Please make it compatible, because I can run Stable Diffusion on macOS with this GPU (MPS works there).


mfchiz commented Sep 4, 2023

I'm getting the same error.

AMD Radeon Pro 5500M 4 GB
Intel UHD Graphics 630 1536 MB
2.3 GHz 8-Core Intel Core i9
Ventura 13.2.1


sukualam commented Sep 4, 2023

It seems to work only on specific Metal GPUs, i.e. M1/M2 only. I hope there will be a fallback mode like PyTorch has, or some workaround.
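
(Until the Metal kernels compile on non-Apple GPUs, the most reliable workaround appears to be a CPU-only build: a plain `make` without `LLAMA_METAL=1` never calls `ggml_metal_init` at all. Depending on the llama.cpp version, running a Metal build with `-ngl 0` may also keep all layers on the CPU, but since the Metal context can still be initialized in some versions, the CPU-only build is the safer bet.)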


rlanday commented Sep 5, 2023

I am having the same issue, also on the Radeon Pro 5500M, but with 8 GB of VRAM:

AMD Radeon Pro 5500M:

Chipset Model: AMD Radeon Pro 5500M
Type: GPU
Bus: PCIe
PCIe Lane Width: x16
VRAM (Total): 8 GB
Vendor: AMD (0x1002)
Device ID: 0x7340
Revision ID: 0x0040
ROM Revision: 113-D3220E-190
VBIOS Version: 113-D32206U1-019
Option ROM Version: 113-D32206U1-019
EFI Driver Version: 01.A1.190
Automatic Graphics Switching: Supported
gMux Version: 5.0.0
Metal Support: Metal 3

I was able to successfully use the Metal backend on an earlier version of llama.cpp (albeit not with clearly improved performance over using the CPU, if I recall correctly).


knweiss commented Sep 5, 2023

As mentioned in #2407, the following kernels seem to be the culprit, as they don't compile. In other words, if I comment them out, the remaining kernels compile on my AMD Radeon Pro 5500M (8 GB VRAM):

+        // GGML_METAL_ADD_KERNEL(mul_mm_f16_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q4_0_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q8_0_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q4_1_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q2_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q3_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q4_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q5_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q6_K_f32);
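
(For context: the mul_mm kernels appear to rely on Metal simdgroup matrix operations that only Apple-family GPUs support, which is why the AMD/Intel shader compiler rejects them with the "call to an undefined label" error. A minimal sketch of how the registration could be guarded at runtime instead of commenting the lines out, assuming the GGML_METAL_ADD_KERNEL macro and the ctx->device handle already present in ggml-metal.m; the supports_mm variable is illustrative, not existing code:

        // simdgroup matrix multiply is only available on Apple-family GPUs
        // (MTLGPUFamilyApple7 and newer); AMD/Intel GPUs fail to compile the
        // mul_mm kernels, so only create those pipeline states when supported.
        const bool supports_mm = [ctx->device supportsFamily:MTLGPUFamilyApple7];
        if (supports_mm) {
            GGML_METAL_ADD_KERNEL(mul_mm_f32_f32);
            GGML_METAL_ADD_KERNEL(mul_mm_f16_f32);
            GGML_METAL_ADD_KERNEL(mul_mm_q4_0_f32);
            // ... and the remaining mul_mm_* kernels ...
        }

The matrix-multiplication dispatch code would then also need to check the same flag and fall back to the existing mul_mat kernels when it is false.)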


knweiss commented Sep 5, 2023

There’s now also #3000.

ZacharyDK commented:

Hearted. In the same boat. Trying to get Metal to work with llama.cpp on:

AMD Radeon Pro 5600M 8 GB.
Intel UHD Graphics 630 1536 MB.
16 GB RAM, 2.3 GHz 8-core Intel i9, Ventura.

No idea where to even begin searching on forums. I've seen people have success with M1/M2 chips, but literally nothing with AMD.

I wish we could just compile the shaders, cache them, and then have the shaders handle all the work that needs to be done on the GPU.

I come from the Unreal Engine side of things. We have a material editor that lets us easily design functions so that the GPU does what we want, regardless of M1 vs. M2, AMD or NVIDIA. A universal pipeline with all the instructions --> some universal shader code --> GPU-specific output should be doable.

pudepiedj commented, quoting @ZacharyDK's message above:


I have an AMD Radeon Pro 5500M with only 4 GB on a MacBook Pro 2.3 GHz with 16 GB and an Intel 8-core i9 with UHD 630. It will just about run `./build/bin/simple ./models/llama-2-7b/ggml-model-q4_0.gguf`, but the required GPU memory exceeds what is available, so it runs incredibly slowly and eventually reports a segmentation fault, even though it produces some output, apparently by using CPU RAM. Anything bigger than the q4_0 and it refuses even to try. For some reason it switches to German halfway through the response!

llama_new_context_with_model: n_ctx      = 1024
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =  512.00 MB
llama_new_context_with_model: compute buffer total size = 95.88 MB
llama_new_context_with_model: max tensor size =   102.54 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  3060.00 MB, offs =            0
ggml_metal_add_buffer: allocated 'data            ' buffer, size =   691.12 MB, offs =   3101118464, ( 3751.92 /  4080.00)
ggml_metal_add_buffer: allocated 'kv      ' buffer, size =   514.00 MB, ( 4265.92 /  4080.00)
ggml_metal_add_buffer: allocated 'alloc   ' buffer, size =    90.01 MB, ( 4355.93 /  4080.00)

main: n_len = 32, n_ctx = 1024, n_kv_req = 32

 Hello my name is Katie and I am a 20 year old student from the UK. Unterscheidung zwischen „Katie“ und „Kathy

main: decoded 27 tokens in 22.26 s, speed: 1.21 t/s

llama_print_timings:        load time =  5257.04 ms
llama_print_timings:      sample time =     2.09 ms /    28 runs   (    0.07 ms per token, 13397.13 tokens per second)
llama_print_timings: prompt eval time =  4245.90 ms /     5 tokens (  849.18 ms per token,     1.18 tokens per second)
llama_print_timings:        eval time = 22240.42 ms /    27 runs   (  823.72 ms per token,     1.21 tokens per second)
llama_print_timings:       total time = 27512.82 ms

ggml_metal_free: deallocating
zsh: segmentation fault  ./build/bin/simple ./models/llama-2-7b/ggml-model-q4_0.gguf "Hello my name is"


kovyrin commented Oct 15, 2023

I have an older iMac (Retina 5K, 27-inch, 2017) with a Radeon Pro 570 4 GB inside. I tried to run whisper from main and it failed with the "SC compilation failure" error. Removing all add-kernel lines with the "mul_mm_" prefix and recompiling solved the issue:

diff --git a/ggml-metal.m b/ggml-metal.m
index 1139ee3..06a436c 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -251,16 +251,16 @@ @implementation GGMLMetalClass
         GGML_METAL_ADD_KERNEL(mul_mat_q4_K_f32);
         GGML_METAL_ADD_KERNEL(mul_mat_q5_K_f32);
         GGML_METAL_ADD_KERNEL(mul_mat_q6_K_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_f32_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_f16_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_q4_0_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_q8_0_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_q4_1_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_q2_K_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_q3_K_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_q4_K_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_q5_K_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_q6_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_f32_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_f16_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q4_0_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q8_0_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q4_1_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q2_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q3_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q4_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q5_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q6_K_f32);
         GGML_METAL_ADD_KERNEL(rope);
         GGML_METAL_ADD_KERNEL(alibi_f32);
         GGML_METAL_ADD_KERNEL(cpy_f32_f16);
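
(After an edit like this, a full rebuild with Metal still enabled is needed for it to take effect. Since whisper.cpp vendors the same ggml-metal.m as llama.cpp, the identical change should apply to both, which matches the report above. Note that it only skips the matrix-matrix (mul_mm) pipelines; the regular mul_mat kernels can still run on the GPU, so this is a workaround rather than a complete fix.)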


github-actions bot commented Apr 5, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Apr 5, 2024.