distributed inference is very slow with Mac m2 ultra #1233
Comments
Hi, can you provide where you got the 17 TPS figure?
The result seems close to running the 32B distilled model on some hardware. Please let me know if I've missed anything.
As a temporary workaround, you can restart the 64GB machine, which should be the remote RPC server. That should help. We are still investigating this.
@gaord What's the OS version of your Mac Studio? We are still trying to reproduce and debug this issue. In an environment that had similar problems, the issue no longer occurred after upgrading to the latest macOS 15.3.1.
macOS 15.2 and macOS 14.2. I wonder why OS versions would make a difference in inference; that may not be the root cause. I will restart the remote server and give it a try. Many thanks.
Refer to this: https://gist.github.com/awni/ec071fd27940698edd14a4191855bba6?permalink_comment_id=5415441#gistcomment-5415441. There is a feature called the residency set which may impact inference performance.
We used the following command to construct the testing environment: we tested the cases below, including single requests more than 10 times and batch concurrency more than 10 times, but we could not reproduce a significant TPS degradation.
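For reference, a single-request and concurrent load test like that can be driven against any OpenAI-compatible endpoint. Below is a rough sketch; the base URL, API key, and model name are placeholders rather than the actual test setup:

```python
# Rough sketch: measure tokens/s for repeated single requests and for a
# concurrent batch against an OpenAI-compatible endpoint.
# BASE_URL, API_KEY, and MODEL are placeholders, not the actual test values.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://gpustack-host/v1"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"              # placeholder key
MODEL = "deepseek-r1"                 # placeholder model name


def tokens_per_second(prompt: str = "Explain the KV cache in one paragraph.") -> float:
    """Send one chat completion and derive tokens/s from the usage field."""
    start = time.time()
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
        timeout=600,
    )
    resp.raise_for_status()
    completion_tokens = resp.json()["usage"]["completion_tokens"]
    return completion_tokens / (time.time() - start)


# Single requests, repeated 10+ times.
singles = [tokens_per_second() for _ in range(10)]
print("avg single-request TPS:", sum(singles) / len(singles))

# Batch concurrency: 10 requests in flight at once.
with ThreadPoolExecutor(max_workers=10) as pool:
    batch = list(pool.map(lambda _: tokens_per_second(), range(10)))
print("avg per-request TPS under concurrency:", sum(batch) / len(batch))
```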
With llama-box v0.0.120, we could only get 5-6 TPS when deploying DeepSeek-R1-UD-Q2_K_XL across two Apple M2 Ultra boxes. We also observed that, with the residency set enabled, when the main server actively interrupts during the transmission of tensors (the top graph), it causes the memory of the RPC server to leak (the bottom graph). So we disabled the residency set for the RPC server in v0.0.120+. We are not sure how much efficiency this costs, but judging from the llama.cpp test, it does not seem to improve much: ggml-org/llama.cpp#11427.
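On the memory-leak observation: one way to confirm that behavior independently is to sample the RPC server process's resident memory over time. A minimal sketch, assuming psutil is available and that the process name contains something like "llama-box" (adjust the name for your setup):

```python
# Rough sketch: poll the RPC server's resident memory to check for a leak.
# The process-name substring "llama-box" is an assumption; adjust it for your setup.
import time

import psutil

PROCESS_NAME = "llama-box"  # assumed substring of the RPC server process name


def find_rpc_server():
    """Return the first process whose name contains PROCESS_NAME, or None."""
    for proc in psutil.process_iter(["name"]):
        name = proc.info.get("name") or ""
        if PROCESS_NAME in name:
            return proc
    return None


proc = find_rpc_server()
if proc is None:
    raise SystemExit(f"no process matching {PROCESS_NAME!r} found")

# A steadily growing RSS across repeated requests (or while idle after an
# interrupted transfer) would be consistent with the leak described above.
while True:
    rss_gib = proc.memory_info().rss / 1024**3
    print(f"{time.strftime('%H:%M:%S')}  RSS = {rss_gib:.2f} GiB")
    time.sleep(10)
```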
@gaord I can only find a post about DeepSeek-R1 output on different hardware, but at 17 TPS it could be data from the 32B distilled model. Can you provide the data source for 17 TPS on the DeepSeek-R1-UD model?
Describe the bug
With two Mac Studio M2 Ultra machines (192GB and 64GB), create a GPU cluster. The Resources page displays two workers ready. Deploy DeepSeek-R1-UD-IQ1_S.gguf (131GB) locally as one big file with the following distribution configuration:
Result
Inference is very slow: 0.69 tokens/s.
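To narrow down where the time goes with a result like this, it can help to separate time-to-first-token (prompt processing plus tensor transfer) from overall decode throughput via a streaming request. A rough sketch, with the endpoint URL and model name as placeholders:

```python
# Rough sketch: use a streaming request to separate time-to-first-token from
# overall decode throughput. BASE_URL and MODEL are placeholders.
import json
import time

import requests

BASE_URL = "http://gpustack-host/v1"  # placeholder endpoint
MODEL = "deepseek-r1"                 # placeholder model name

start = time.time()
first_token_at = None
tokens = 0

with requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Briefly explain MoE routing."}],
        "max_tokens": 128,
        "stream": True,
    },
    stream=True,
    timeout=600,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        if not chunk.get("choices"):
            continue
        if chunk["choices"][0]["delta"].get("content"):
            tokens += 1  # roughly one token per streamed content chunk
            if first_token_at is None:
                first_token_at = time.time()

elapsed = time.time() - start
ttft = (first_token_at or time.time()) - start
print(f"time to first token: {ttft:.2f} s")
print(f"overall throughput:  {tokens / elapsed:.2f} tokens/s")
```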
Expected behavior
Commonly, the same hardware can provide 17 tokens/s with the Ollama or llama.cpp backend. GPUStack should be able to catch up with this.
Environment