Can't we use multiple GPUs independently? #2165
Comments
I won't implement it, but feel free to make a PR yourself if you want that feature.
I am working on refactoring the CUDA implementation, and one of the goals is to remove the global state. After that, associating a device or set of devices with a ggml-cuda context should be fairly straightforward.
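For illustration, here is a minimal sketch (in C, against the CUDA runtime) of what moving that global state into a per-backend context could look like. The `ggml_cuda_context` struct and the function names below are hypothetical, not the actual ggml-cuda code:

```c
// Hypothetical sketch only -- not the real ggml-cuda code. It illustrates the
// idea of replacing a process-wide device id with state owned by a context.

#include <cuda_runtime.h>
#include <stdlib.h>

// Before: one global decides the device for every call in the process.
// static int g_main_gpu = 0;

// After: each backend context carries its own device and its own stream.
typedef struct ggml_cuda_context {
    int          device;   // CUDA device this context is bound to
    cudaStream_t stream;   // stream owned by this context, no shared state
} ggml_cuda_context;

static ggml_cuda_context * ggml_cuda_context_new(int device) {
    ggml_cuda_context * ctx = malloc(sizeof(*ctx));
    ctx->device = device;
    cudaSetDevice(device);          // bind the calling thread to this GPU
    cudaStreamCreate(&ctx->stream); // work submitted through ctx stays on this device
    return ctx;
}

static void ggml_cuda_context_free(ggml_cuda_context * ctx) {
    cudaSetDevice(ctx->device);
    cudaStreamDestroy(ctx->stream);
    free(ctx);
}
```

With state held like this, two contexts created with different device ids could be driven from different threads without stepping on each other.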
@slaren is there an estimate for when it will be implemented? A week or two, or some months?
I hope to open a draft PR sometime this week, but it will still take some time (weeks) until it is ready to merge. If you need this now, I would suggest hacking it yourself in the meantime.
Couldn't you run multiple processes, with each using a different model on a different GPU? Or did I not understand correctly?
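For reference, a minimal sketch of that workaround in C: a launcher that forks one server process per GPU and hides the other devices from each child via `CUDA_VISIBLE_DEVICES`. The `./llama-server` binary name, its flags, and the model file names are placeholders for whatever frontend and models are actually used:

```c
// Sketch of the multi-process workaround: one child process per GPU, each
// restricted to a single device with CUDA_VISIBLE_DEVICES, so no in-process
// multi-GPU support is needed. Binary name, flags, and model files are placeholders.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    const char * models[] = { "model-a.bin", "model-b.bin" }; // placeholder model files
    const int n_gpus = 2;

    for (int gpu = 0; gpu < n_gpus; gpu++) {
        pid_t pid = fork();
        if (pid == 0) {
            char dev[8];
            snprintf(dev, sizeof(dev), "%d", gpu);
            setenv("CUDA_VISIBLE_DEVICES", dev, 1);   // child sees only this GPU
            execlp("./llama-server", "llama-server",
                   "-m", models[gpu], (char *) NULL); // placeholder command line
            perror("execlp");
            _exit(1);
        }
    }

    // Wait for all children; each serves its own model on its own GPU.
    while (wait(NULL) > 0) { }
    return 0;
}
```

Each child only ever sees one GPU, so nothing inside llama.cpp has to change; the cost is one process (and one copy of the runtime) per model.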
@SlyEcho In my project, llama.cpp is integrated into a Golang server, so it is much easier to initialize llama.cpp once and work with different contexts after that. It works fine with the CPU, and it's possible to use the CPU and a GPU at the same time too. It just doesn't work with multiple GPUs, and that's a shame. llama.cpp always assumes we are going to split one model between them, which is not the case when we want independent inference on each GPU.
How is it going? Multi-GPU parallel independent inference would be very useful for cloud LLM farms.
Multiple GPUs can help render frames much faster: higher FPS in games, improved multitasking, 4K gaming becomes a reality, and it might also enable a multi-monitor setup.
@slaren any movement on this feature?
We are getting closer; you can track the progress in the ggml repository. ggerganov/ggml#586 is the most recent update to the framework that will allow us to support this and other features, and it was just merged yesterday. It will still take a few weeks before this is ready to be used in llama.cpp, but we are working on it.
This issue was closed because it has been inactive for 14 days since being marked as stale. |
I'm trying to use llama.cpp as a backend for scalable inference, and it seems the current architecture just doesn't allow multiple GPUs to work in parallel with different models.
From what I understand from reading the code, it always assumes we are going to SPLIT the same model between multiple GPUs, not use 1 .. N models on 1 .. N GPUs.
There are global vars like g_main_gpu, etc., and from my POV this should be set within the context, thus allowing inference on GPU0 from CTX0 and on GPU1 from CTX1, all at the same time.
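To make the request concrete, here is a hypothetical sketch of the desired behaviour: two contexts, each pinned to its own GPU, running inference in parallel from a single process. The `main_gpu` field and the `llama.h` entry points are written from memory and should be treated as assumptions; more importantly, the per-context isolation shown here is the feature being requested, not what the current code guarantees:

```c
// Hypothetical sketch of the requested behaviour -- not how llama.cpp works today.
// Each context is pinned to one GPU and the two models run fully independently.

#include "llama.h"
#include <pthread.h>

typedef struct {
    const char * model; // placeholder model file
    int          gpu;   // device this context should be pinned to
} job_t;

static void * run_model(void * arg) {
    const job_t * job = (const job_t *) arg;

    struct llama_context_params params = llama_context_default_params();
    params.main_gpu     = job->gpu; // desired: honoured per context, not globally
    params.n_gpu_layers = 99;       // offload all layers to that one GPU

    // Desired: this context only ever allocates on and computes with job->gpu.
    struct llama_context * ctx = llama_init_from_file(job->model, params);
    // ... tokenize and evaluate with this context here ...
    llama_free(ctx);
    return NULL;
}

int main(void) {
    job_t jobs[2] = {
        { "model-a.bin", 0 },
        { "model-b.bin", 1 },
    };
    pthread_t threads[2];
    for (int i = 0; i < 2; i++) {
        pthread_create(&threads[i], NULL, run_model, &jobs[i]);
    }
    for (int i = 0; i < 2; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}
```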