distributed inference is very slow with Mac m2 ultra #1233
Comments
Hi, can you provide where you got the 17 TPS figure?
The result seems close to running the 32B distilled model on some hardware. Please let me know if I've missed anything.
As a temporary workaround, you can restart the 64GB machine, which should be the remote RPC server. That should help. We are still investigating this.
@gaord What's the OS version of your Mac Studio? We are still trying to reproduce and debug this issue. In an environment that had similar problems, the issue no longer occurred after upgrading to the latest macOS 15.3.1.
macOS 15.2 and macOS 14.2. I wonder why OS versions would make a difference in inference; that may not be the root cause. I will restart the remote server and give it a try. Many thanks.
Refer to this: https://gist.github.com/awni/ec071fd27940698edd14a4191855bba6?permalink_comment_id=5415441#gistcomment-5415441. There is a feature called the residency set which may impact inference performance.
We used the following command to construct the testing environment: we tested the cases below, including single requests more than 10 times and batch concurrency more than 10 times, but we could not reproduce a significant TPS degradation.
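For reference, a single-request and concurrent load test like that can be driven against any OpenAI-compatible endpoint. Below is a rough sketch; the base URL, API key, and model name are placeholders rather than the actual test setup:

```python
# Rough sketch: measure tokens/s for repeated single requests and for a
# concurrent batch against an OpenAI-compatible endpoint.
# BASE_URL, API_KEY, and MODEL are placeholders, not the actual test values.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://gpustack-host/v1"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"              # placeholder key
MODEL = "deepseek-r1"                 # placeholder model name


def tokens_per_second(prompt: str = "Explain the KV cache in one paragraph.") -> float:
    """Send one chat completion and derive tokens/s from the usage field."""
    start = time.time()
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
        timeout=600,
    )
    resp.raise_for_status()
    completion_tokens = resp.json()["usage"]["completion_tokens"]
    return completion_tokens / (time.time() - start)


# Single requests, repeated 10+ times.
singles = [tokens_per_second() for _ in range(10)]
print("avg single-request TPS:", sum(singles) / len(singles))

# Batch concurrency: 10 requests in flight at once.
with ThreadPoolExecutor(max_workers=10) as pool:
    batch = list(pool.map(lambda _: tokens_per_second(), range(10)))
print("avg per-request TPS under concurrency:", sum(batch) / len(batch))
```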
With llama-box v0.0.120, we could only get 5-6 TPS when deploying DeepSeek-R1-UD-Q2_K_XL across two Apple M2 Ultra boxes. We also observed that, with the residency set enabled, when the main server actively interrupts during the transmission of tensors (the top graph), it causes the memory of the RPC server to leak (the bottom graph). So we disabled the residency set for the RPC server in v0.0.120+. We are not sure how much efficiency this costs, but judging from the llama.cpp test, it does not seem to improve much: ggml-org/llama.cpp#11427.
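On the memory-leak observation: one way to confirm that behavior independently is to sample the RPC server process's resident memory over time. A minimal sketch, assuming psutil is available and that the process name contains something like "llama-box" (adjust the name for your setup):

```python
# Rough sketch: poll the RPC server's resident memory to check for a leak.
# The process-name substring "llama-box" is an assumption; adjust it for your setup.
import time

import psutil

PROCESS_NAME = "llama-box"  # assumed substring of the RPC server process name


def find_rpc_server():
    """Return the first process whose name contains PROCESS_NAME, or None."""
    for proc in psutil.process_iter(["name"]):
        name = proc.info.get("name") or ""
        if PROCESS_NAME in name:
            return proc
    return None


proc = find_rpc_server()
if proc is None:
    raise SystemExit(f"no process matching {PROCESS_NAME!r} found")

# A steadily growing RSS across repeated requests (or while idle after an
# interrupted transfer) would be consistent with the leak described above.
while True:
    rss_gib = proc.memory_info().rss / 1024**3
    print(f"{time.strftime('%H:%M:%S')}  RSS = {rss_gib:.2f} GiB")
    time.sleep(10)
```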
@gaord I can only find a post about DeepSeek-R1 output on different hardware, but at 17 TPS it could be data from the 32B distilled model. Can you provide the data source for 17 TPS on the DeepSeek-R1-UD model?
Describe the bug
With two Mac Studio M2 Ultra machines (192GB and 64GB), create a GPU cluster. The Resources page displays two workers ready. Deploy DeepSeek-R1-UD-IQ1_S.gguf (131GB) locally as one big file with the following distribution configuration:
Result
Inference is very slow: 0.69 tokens/s.
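To narrow down where the time goes with a result like this, it can help to separate time-to-first-token (prompt processing plus tensor transfer) from overall decode throughput via a streaming request. A rough sketch, with the endpoint URL and model name as placeholders:

```python
# Rough sketch: use a streaming request to separate time-to-first-token from
# overall decode throughput. BASE_URL and MODEL are placeholders.
import json
import time

import requests

BASE_URL = "http://gpustack-host/v1"  # placeholder endpoint
MODEL = "deepseek-r1"                 # placeholder model name

start = time.time()
first_token_at = None
tokens = 0

with requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Briefly explain MoE routing."}],
        "max_tokens": 128,
        "stream": True,
    },
    stream=True,
    timeout=600,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        if not chunk.get("choices"):
            continue
        if chunk["choices"][0]["delta"].get("content"):
            tokens += 1  # roughly one token per streamed content chunk
            if first_token_at is None:
                first_token_at = time.time()

elapsed = time.time() - start
ttft = (first_token_at or time.time()) - start
print(f"time to first token: {ttft:.2f} s")
print(f"overall throughput:  {tokens / elapsed:.2f} tokens/s")
```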
Expected behavior
Commonly, the same hardware can provide 17 tokens/s with the Ollama or llama.cpp backend. GPUStack should be able to catch up with this.
Environment