
How to run in AMD GPU with macos (with mps)? #2965

Closed
sukualam opened this issue Sep 2, 2023 · 9 comments


sukualam commented Sep 2, 2023

My RX 560 is actually supported in macOS (mine is a Hackintosh running macOS Ventura 13.4), but when I try to run llama.cpp it can't utilize MPS. I already compiled it with LLAMA_METAL=1 make, but when I run this command:

./main -m ./models/falcon-7b-Q4_0-GGUF.gguf -n 128 -ngl 1

it fails with this error:

ggml_metal_init: loaded kernel_mul_mat_q5_K_f32            0x7fcb478145f0 | th_max =  768 | th_width =   64
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32            0x7fcb47814dd0 | th_max = 1024 | th_width =   64
ggml_metal_init: loaded kernel_mul_mm_f16_f32                         0x0 | th_max =    0 | th_width =    0
ggml_metal_init: load pipeline error: Error Domain=CompilerError Code=2 "SC compilation failure
There is a call to an undefined label" UserInfo={NSLocalizedDescription=SC compilation failure
There is a call to an undefined label}
llama_new_context_with_model: ggml_metal_init() failed
llama_init_from_gpt_params: error: failed to create context with model './models/falcon-7b-Q4_0-GGUF.gguf'
main: error: unable to load model

Please make it compatible, because I can run Stable Diffusion on macOS with this GPU (MPS works there).


mfchiz commented Sep 4, 2023

I'm getting the same error.

AMD Radeon Pro 5500M 4 GB
Intel UHD Graphics 630 1536 MB
2.3 GHz 8-Core Intel Core i9
Ventura 13.2.1


sukualam commented Sep 4, 2023

It seems to work only on specific Metal GPUs, i.e. M1/M2 only. I hope there will be a fallback mode like PyTorch has, or some workaround.
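
(Until the Metal kernels compile on non-Apple GPUs, the most reliable workaround appears to be a CPU-only build: a plain `make` without `LLAMA_METAL=1` never calls `ggml_metal_init` at all. Depending on the llama.cpp version, running a Metal build with `-ngl 0` may also keep all layers on the CPU, but since the Metal context can still be initialized in some versions, the CPU-only build is the safer bet.)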


rlanday commented Sep 5, 2023

I am having the same issue, also on the Radeon Pro 5500M, but with 8 GB of VRAM:

AMD Radeon Pro 5500M:

Chipset Model: AMD Radeon Pro 5500M
Type: GPU
Bus: PCIe
PCIe Lane Width: x16
VRAM (Total): 8 GB
Vendor: AMD (0x1002)
Device ID: 0x7340
Revision ID: 0x0040
ROM Revision: 113-D3220E-190
VBIOS Version: 113-D32206U1-019
Option ROM Version: 113-D32206U1-019
EFI Driver Version: 01.A1.190
Automatic Graphics Switching: Supported
gMux Version: 5.0.0
Metal Support: Metal 3

I was able to successfully use the Metal backend on an earlier version of llama.cpp (albeit not with clearly improved performance over using the CPU, if I recall correctly).


knweiss commented Sep 5, 2023

As mentioned in #2407, the following kernels seem to be the culprit, as they don't compile. In other words, if I comment them out, the remaining kernels compile on my AMD Radeon Pro 5500M (8 GB VRAM):

+        // GGML_METAL_ADD_KERNEL(mul_mm_f16_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q4_0_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q8_0_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q4_1_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q2_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q3_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q4_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q5_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q6_K_f32);
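
(For context: the mul_mm kernels appear to rely on Metal simdgroup matrix operations that only Apple-family GPUs support, which is why the AMD/Intel shader compiler rejects them with the "call to an undefined label" error. A minimal sketch of how the registration could be guarded at runtime instead of commenting the lines out, assuming the GGML_METAL_ADD_KERNEL macro and the ctx->device handle already present in ggml-metal.m; the supports_mm variable is illustrative, not existing code:

        // simdgroup matrix multiply is only available on Apple-family GPUs
        // (MTLGPUFamilyApple7 and newer); AMD/Intel GPUs fail to compile the
        // mul_mm kernels, so only create those pipeline states when supported.
        const bool supports_mm = [ctx->device supportsFamily:MTLGPUFamilyApple7];
        if (supports_mm) {
            GGML_METAL_ADD_KERNEL(mul_mm_f32_f32);
            GGML_METAL_ADD_KERNEL(mul_mm_f16_f32);
            GGML_METAL_ADD_KERNEL(mul_mm_q4_0_f32);
            // ... and the remaining mul_mm_* kernels ...
        }

The matrix-multiplication dispatch code would then also need to check the same flag and fall back to the existing mul_mat kernels when it is false.)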


knweiss commented Sep 5, 2023

There’s now also #3000.

ZacharyDK commented:

Hearted. In the same boat. Trying to get Metal to work with llama.cpp on:

AMD Radeon Pro 5600M 8 GB.
Intel UHD Graphics 630 1536 MB.
16 GB RAM, 2.3 GHz 8-core Intel i9, Ventura.

No idea where to even begin searching on forums. I've seen people have success with M1/M2 chips, but literally nothing with AMD.

I wish we could just compile the shaders, cache them, and then have the shaders handle all the work that needs to be done on the GPU.

I come from the Unreal Engine side of things. We have a material editor that lets us easily design functions so that the GPU does what we want, regardless of M1 vs. M2, AMD or NVIDIA. A universal pipeline with all the instructions --> some universal shader code --> GPU-specific output should be doable.

pudepiedj commented, quoting @ZacharyDK's message above:


I have an AMD Radeon Pro 5500M with only 4 GB on a MacBook Pro 2.3 GHz with 16 GB and an Intel 8-core i9 with UHD 630. It will just about run `./build/bin/simple ./models/llama-2-7b/ggml-model-q4_0.gguf`, but the required GPU memory exceeds what is available, so it runs incredibly slowly and eventually reports a segmentation fault, even though it produces some output, apparently by using CPU RAM. Anything bigger than the q4_0 and it refuses even to try. For some reason it switches to German halfway through the response!

llama_new_context_with_model: n_ctx      = 1024
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =  512.00 MB
llama_new_context_with_model: compute buffer total size = 95.88 MB
llama_new_context_with_model: max tensor size =   102.54 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  3060.00 MB, offs =            0
ggml_metal_add_buffer: allocated 'data            ' buffer, size =   691.12 MB, offs =   3101118464, ( 3751.92 /  4080.00)
ggml_metal_add_buffer: allocated 'kv      ' buffer, size =   514.00 MB, ( 4265.92 /  4080.00)
ggml_metal_add_buffer: allocated 'alloc   ' buffer, size =    90.01 MB, ( 4355.93 /  4080.00)

main: n_len = 32, n_ctx = 1024, n_kv_req = 32

 Hello my name is Katie and I am a 20 year old student from the UK. Unterscheidung zwischen „Katie“ und „Kathy

main: decoded 27 tokens in 22.26 s, speed: 1.21 t/s

llama_print_timings:        load time =  5257.04 ms
llama_print_timings:      sample time =     2.09 ms /    28 runs   (    0.07 ms per token, 13397.13 tokens per second)
llama_print_timings: prompt eval time =  4245.90 ms /     5 tokens (  849.18 ms per token,     1.18 tokens per second)
llama_print_timings:        eval time = 22240.42 ms /    27 runs   (  823.72 ms per token,     1.21 tokens per second)
llama_print_timings:       total time = 27512.82 ms

ggml_metal_free: deallocating
zsh: segmentation fault  ./build/bin/simple ./models/llama-2-7b/ggml-model-q4_0.gguf "Hello my name is"


kovyrin commented Oct 15, 2023

I have an older iMac (Retina 5K, 27-inch, 2017) with a Radeon Pro 570 4 GB inside. I tried to run whisper from main and it failed with the "SC compilation failure" error. Removing all add-kernel lines with the "mul_mm_" prefix and recompiling solved the issue:

diff --git a/ggml-metal.m b/ggml-metal.m
index 1139ee3..06a436c 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -251,16 +251,16 @@ @implementation GGMLMetalClass
         GGML_METAL_ADD_KERNEL(mul_mat_q4_K_f32);
         GGML_METAL_ADD_KERNEL(mul_mat_q5_K_f32);
         GGML_METAL_ADD_KERNEL(mul_mat_q6_K_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_f32_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_f16_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_q4_0_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_q8_0_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_q4_1_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_q2_K_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_q3_K_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_q4_K_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_q5_K_f32);
-        GGML_METAL_ADD_KERNEL(mul_mm_q6_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_f32_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_f16_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q4_0_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q8_0_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q4_1_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q2_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q3_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q4_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q5_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q6_K_f32);
         GGML_METAL_ADD_KERNEL(rope);
         GGML_METAL_ADD_KERNEL(alibi_f32);
         GGML_METAL_ADD_KERNEL(cpy_f32_f16);
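
(After an edit like this, a full rebuild with Metal still enabled is needed for it to take effect. Since whisper.cpp vendors the same ggml-metal.m as llama.cpp, the identical change should apply to both, which matches the report above. Note that it only skips the matrix-matrix (mul_mm) pipelines; the regular mul_mat kernels can still run on the GPU, so this is a workaround rather than a complete fix.)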


github-actions bot commented Apr 5, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Apr 5, 2024.