
Can't we use multiple GPUs independently? #2165

Closed
gotzmann opened this issue Jul 10, 2023 · 11 comments

@gotzmann

gotzmann commented Jul 10, 2023

I'm trying to use llama.cpp as a backend for scalable inference, and it seems the current architecture just doesn't allow using multiple GPUs in parallel with different models.

From what I understand from reading the code, it always assumes we are going to SPLIT the same model between multiple GPUs, not use 1 .. N models on 1 .. N GPUs.

There are global vars like g_main_gpu, etc., and from my POV this should be set within the context, thus allowing inference on GPU0 from CTX0 and GPU1 from CTX1 - all at the same time.
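
A rough sketch of the usage I mean (just a sketch - the field and function names are what I see in llama.h, and the model paths are placeholders). Today the per-context main_gpu value ends up in backend-global state, so two live contexts cannot target different devices at the same time:

```cpp
// Desired: two independent models, each bound to its own GPU in its own context.
// Illustrative sketch only; "model-a.bin" / "model-b.bin" are placeholder paths.
#include "llama.h"

int main() {
    llama_context_params p0 = llama_context_default_params();
    p0.n_gpu_layers = 99;  // offload all layers
    p0.main_gpu     = 0;   // model A should run entirely on GPU 0

    llama_context_params p1 = p0;
    p1.main_gpu     = 1;   // model B should run entirely on GPU 1

    llama_context * ctx0 = llama_init_from_file("model-a.bin", p0);
    llama_context * ctx1 = llama_init_from_file("model-b.bin", p1);

    // ... serve requests on ctx0 and ctx1 concurrently from separate threads ...

    llama_free(ctx1);
    llama_free(ctx0);
    return 0;
}
```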

@JohannesGaessler
Collaborator

I won't implement it but feel free to make a PR yourself if you want that feature.

@slaren
Collaborator

slaren commented Jul 10, 2023

I am working on refactoring the CUDA implementation, and one of the goals is to remove the global state. After that, associating a device or set of devices to a ggml-cuda context should be fairly straightforward.

@gotzmann
Author

@slaren is there an estimate for when it will be implemented? A week or two .. or some months?

@slaren
Collaborator

slaren commented Jul 10, 2023

I hope to open a draft PR sometime this week, but it will still take some time (weeks) until it is ready to merge. If you need this now, I would suggest hacking it yourself in the meanwhile.

@SlyEcho
Collaborator

SlyEcho commented Jul 13, 2023

Couldn't you run multiple processes with each using a different model on a different GPU? Or did I not understand correctly?
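
For example, launch one llama.cpp process per card and hide the other cards from it with CUDA_VISIBLE_DEVICES. A minimal sketch of such a launcher (the server binary name, ports, and model paths are just placeholders):

```cpp
// Rough sketch: start one worker process per GPU, each pinned to a single card
// via CUDA_VISIBLE_DEVICES, and let the frontend route requests by port.
// Placeholder binary/model names; a POSIX shell is assumed by std::system.
#include <cstdlib>
#include <string>

int main() {
    const char * models[] = {"model-a.bin", "model-b.bin"};
    for (int gpu = 0; gpu < 2; ++gpu) {
        std::string cmd =
            "CUDA_VISIBLE_DEVICES=" + std::to_string(gpu) +
            " ./server -m " + models[gpu] +
            " --port " + std::to_string(8080 + gpu) + " &";
        std::system(cmd.c_str());  // each child only sees its own GPU as device 0
    }
    return 0;
}
```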

@gotzmann
Author

@SlyEcho In my project llama.cpp is integrated into a Golang server, so it's much easier to init llama.cpp once and work with different contexts after that. It works fine with the CPU. It's possible to use CPU + GPU at the same time too. It just doesn't work with multiple GPUs, and that's a shame. llama.cpp always assumes we are going to split one model between them, not the case where we are going to run independent inference on each GPU.

@gotzmann
Author

gotzmann commented Aug 3, 2023

How is it going? Multi-GPU parallel independent inference would be very useful for cloud LLM farms.

@Rajansharma44

Multiple GPUs can help render frames much faster: higher FPS in games, improved multitasking, 4K gaming becomes a reality, and it might also enable a multi-monitor setup.

@daddydrac

@slaren any movement on this feature?

@slaren
Collaborator

slaren commented Oct 31, 2023

We are getting closer, you can track the progress in the ggml repository. ggerganov/ggml#586 is the most recent update of the framework that will allow us to support this and other features, and it was just merged yesterday. It is still going to take a few weeks before this is ready to be used in llama.cpp, but we are working on it.

@github-actions github-actions bot added the stale label Mar 25, 2024
Contributor

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 9, 2024