
distributed inference is very slow with Mac m2 ultra #1233


Open
gaord opened this issue Feb 17, 2025 · 8 comments
Labels: rpc server (llama-box RPC server issues)

Comments

@gaord

gaord commented Feb 17, 2025

Describe the bug

With two Mac Studio M2 Ultra machines (192GB and 64GB), I created a GPU cluster; the resource page shows both workers as ready. I deployed DeepSeek-R1-UD-IQ1_S.gguf (131GB) locally as a single file with the following distribution configuration:

(screenshots: model distribution configuration)

Result
inference is very slow: 0.69 tokens/s
(screenshot: inference speed)

Expected behavior

The same hardware commonly provides around 17 tokens/s with the Ollama or llama.cpp backend, so GPUStack should be able to catch up with this.
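
For reference, the baseline figure would come from something like the following llama.cpp RPC setup; this is only a rough sketch, with placeholder paths, hostnames, and ports, and flag names may differ across llama.cpp versions:

```
# On the 64GB Mac Studio: start the llama.cpp RPC worker
# (assumes llama.cpp was built with -DGGML_RPC=ON).
./build/bin/rpc-server --host 0.0.0.0 --port 50052

# On the 192GB Mac Studio: run the model, offloading across both machines.
./build/bin/llama-cli \
  -m ./DeepSeek-R1-UD-IQ1_S.gguf \
  --rpc 192.168.1.11:50052 \
  -ngl 99 \
  -p "Hello"
```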

Environment

  • GPUStack version: 0.5.1
  • OS: macOS 14/15
  • GPU: Mac Studio M2 Ultra
@thxCode
Contributor

thxCode commented Feb 17, 2025

Hi, can you share where you got the 17 TPS figure?

@thxCode
Contributor

thxCode commented Feb 17, 2025

The result seems close to running the 32B distilled model on some hardware. Please let me know if I've missed anything.

@pengjiang80
Contributor

pengjiang80 commented Feb 17, 2025

As a temporary workaround, you can restart the 64GB machine, which should be the remote RPC server. That should help. We are still investigating this.
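
A quick sanity check after the restart could be to confirm the remote RPC port is reachable again; the host and port below are placeholders for the 64GB node's address and its actual RPC port:

```
# Placeholder host/port for the 64GB worker's RPC endpoint.
nc -vz 192.168.1.11 50052
```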

@gitlawr gitlawr added this to the v0.6.0 milestone Feb 18, 2025
@gitlawr gitlawr added the P1 High Priority / Should Have label Feb 18, 2025
@pengjiang80
Contributor

@gaord What's the OS version of your Mac Studio? We are still trying to reproduce and debug this issue. In an environment that had similar problems, the issue no longer occurs after upgrading to the latest macOS 15.3.1.

@thxCode thxCode self-assigned this Feb 19, 2025
@gitlawr gitlawr added the rpc server llama-box RPC server issues label Feb 19, 2025
@gaord
Author

gaord commented Feb 20, 2025

macOS 15.2 and macOS 14.2. I wonder why the OS version would make a difference in inference; it may not be the root cause. I will restart the remote server and give it a try. Many thanks.

@pengjiang80
Contributor

Refer to this: https://gist.github.com/awni/ec071fd27940698edd14a4191855bba6?permalink_comment_id=5415441#gistcomment-5415441. There is a feature called residency sets that may impact inference performance.

@thxCode
Contributor

thxCode commented Mar 2, 2025

We used the following command to construct the testing environment:

llama-box --host 0.0.0.0 --embeddings --gpu-layers 99 --parallel 4 --port 8080 \
    --model /Users/seal/DeepSeek-R1/UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    --alias DeepSeek-R1-UD-Q2_K_XL --no-mmap --no-warmup \
    --rpc ${RPC_HOST_AND_PORT} --tensor-split 1,1 -c 12288

We tested the cases below, including running single requests more than 10 times and batched concurrent requests more than 10 times, but could not observe a significant TPS degradation (a rough way to time a single request is sketched after the case list).

case 1: main server macOS 15.3 (with residency set), RPC server macOS 14.7 (without residency set);
case 2: main server macOS 14.7 (without residency set), RPC server macOS 15.3 (with residency set);
case 3: main server macOS 14.7 (without residency set), RPC server macOS 15.3 (residency set disabled).
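
For illustration, a single request against the llama-box instance above could be timed roughly like this; the endpoint path and request fields assume an OpenAI-compatible /v1/chat/completions API on the --port given in the command, so adjust if your build differs:

```
# Time one completion request against the llama-box instance started above.
# Endpoint path and request fields follow the OpenAI-compatible API (assumption).
time curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-R1-UD-Q2_K_XL",
        "messages": [{"role": "user", "content": "Write a short haiku about the sea."}],
        "max_tokens": 256,
        "stream": false
      }'
# TPS ≈ usage.completion_tokens in the response / wall-clock time reported by `time`.
```

Running several of these in parallel (for example with & in a shell loop) approximates the batch-concurrency case.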

With llama-box v0.0.120, we could only get 5-6 TPS when deploying DeepSeek-R1-UD-Q2_K_XL across two Apple M2 Ultra boxes.

We also observed that, with the residency set enabled, the RPC server's memory leaks (the bottom graph) when the main server actively interrupts the transmission of tensors (the top graph). So we disabled residency-set usage on the RPC server in v0.0.120+. We are not sure how much efficiency degradation this brings, but judging from the llama.cpp tests it does not seem to improve much: ggml-org/llama.cpp#11427.

(screenshots: main-server tensor transmission interruption, top; RPC server memory growth, bottom)
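
For anyone trying to reproduce the leak, a minimal way to watch the RPC server's resident memory on the remote Mac; the pgrep pattern is an assumption, so adjust it to match the actual RPC server process name:

```
# Log the RPC server's RSS every 10 seconds; steady growth during interrupted
# tensor transfers would indicate the leak described above.
PID=$(pgrep -f llama-box | head -n1)
while sleep 10; do
  echo "$(date +%T) rss_kb=$(ps -o rss= -p "$PID")"
done
```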

@thxCode
Contributor

thxCode commented Mar 2, 2025


@gaord I can only find a post about DS-R1 output on different hardware; at 17 TPS, it could be data from the 32B distilled model. Can you provide the source of the 17 TPS figure for the DS-R1-UD model?

@thxCode thxCode removed the P1 High Priority / Should Have label Mar 2, 2025
@thxCode thxCode removed this from the v0.6.0 milestone Mar 2, 2025