Question regarding distributed computing... #946

Closed
snapo opened this issue Apr 13, 2023 · 8 comments


snapo commented Apr 13, 2023

I currently have access to 20 old computers, each with 32 GB RAM, 4 cores, a 256 GB SSD, and a 1 Gbit network connection to a 48-port switch. (I could get a lot more computers, but I don't currently have enough electricity.)
Would it somehow be possible to distribute the LLaMA model with llama.cpp across the 20 computers, so that the 65B model could run at a moderate speed?
What would I have to do to distribute the model across many computers and run it on CPU?
I am only interested in inference, not training; for training I can rent cloud GPUs.

Thanks for any input, recommendations, or warnings about problems.

What I see as a problem is how to split the model (or models, in case I use other models) efficiently so that network bandwidth isn't the limiting factor.


Loufe commented Apr 14, 2023

Consider the discussion in this PR. They're discussing limiting even much more capable, high-core-count CPUs to only 8 (or 4) threads, since more cores do not seem to correlate with better performance. I might be misunderstanding, but I think you need faster threads, not more of them.

jon-chuang (Contributor) commented Apr 14, 2023

> more cores do not seem to correlate with better performance

This is somewhat false. The issue in #934 was about the interference from hyperthreaded logical "cores" and efficiency cores (E-cores) on Apple M1 and recent Intel chips (Alder Lake and above).

> What would I have to do to distribute the model across many computers and run it on CPU?

I think it's a better idea to stick to a single node. Distributed inference has high overhead and is generally a bad idea unless you have an HPC setup. I would suggest sticking to a model that fits on a single node with 32 GB RAM (e.g. 30B, 4-bit quantized) and then load-balancing your requests across those nodes.
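For illustration only, a minimal sketch of that load-balancing idea (hypothetical setup: each machine runs its own independent llama.cpp instance behind some endpoint, and a dispatcher just spreads prompts across them round-robin; the node addresses and the transport are placeholders, not an existing API):

```cpp
// Round-robin dispatcher sketch: pick the next node for each incoming request.
// In a real setup, the prompt would then be forwarded to that node (e.g. over
// HTTP or a raw socket) and the node runs inference fully independently.
#include <atomic>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

struct Node {
    std::string address;   // e.g. "192.168.1.10:8080" (placeholder)
};

class RoundRobinDispatcher {
public:
    explicit RoundRobinDispatcher(std::vector<Node> nodes) : nodes_(std::move(nodes)) {}

    // Pick the next node in round-robin order; thread-safe.
    const Node & pick() {
        const size_t i = next_.fetch_add(1, std::memory_order_relaxed) % nodes_.size();
        return nodes_[i];
    }

private:
    std::vector<Node> nodes_;
    std::atomic<size_t> next_{0};
};

int main() {
    RoundRobinDispatcher dispatcher({{"node01:8080"}, {"node02:8080"}, {"node03:8080"}});

    for (int request = 0; request < 6; ++request) {
        printf("request %d -> %s\n", request, dispatcher.pick().address.c_str());
    }
    return 0;
}
```

The point of this design is that every node holds the full (quantized) model, so no activations ever cross the network; only prompts and generated text do.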


snapo commented Apr 14, 2023

I understand the single-node inference... but wouldn't it be possible to distribute it across 20 computers?
I mean putting each layer on a single computer that runs it on 4 threads (because there are 4 cores).
The connections between the layers then carry only the transformer block output (even if it means upgrading the disks on all 20 computers so they each hold the full network).

(Transformer architecture diagram from Wikipedia)

What I mean is, for example: PC 1 provides the input embedding, the last PC provides the softmax output and decoding, and all PCs in between each run one or more transformer blocks.

Network-wise, this way only layer-to-layer transfers would happen (at least from my newbie understanding), and those are very small (just the input and output of a transformer block).

I understand there is no speedup for a single request, but if that works, I could run thousands of requests in parallel (which speeds up total throughput).

On the 65B model, for example, there should be around 10 trillion calculations required per token, so a single output token can at best be as fast as those operations and the read speed of the disk allow.

But what the multi-computer setup allows is building an API where we can let multiple "Auto-GPT" instances run, or even distributing it like a SETI@home system where a huge number of requests are processed in parallel.

Even assuming 1 token takes 5 seconds, if 20 computers can process 5000 requests in parallel, that means 1000 tokens/s across the batch, which is pretty fast. But each individual request then takes approximately 10 minutes to complete.

Just my 2 cents on why I think it would be nice to have.
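To make the proposed layer split concrete, here is a tiny sketch of the partitioning arithmetic (assumptions: the 65B model's 80 transformer blocks and 8192-wide hidden state; the node count and the fp16 activation dtype are illustrative):

```cpp
// Prints which transformer blocks each node would own in a 20-way pipeline
// split of an 80-block model, and how much activation data would cross the
// wire per token per hop.
#include <cstdio>

int main() {
    const int n_layers   = 80;     // transformer blocks in the 65B model
    const int n_nodes    = 20;     // one pipeline stage per machine
    const int n_embd     = 8192;   // hidden state width of the 65B model
    const int bytes_elem = 2;      // assume fp16 activations on the wire

    const int layers_per_node = n_layers / n_nodes;  // 4 blocks per machine

    for (int node = 0; node < n_nodes; ++node) {
        const int first = node * layers_per_node;
        const int last  = first + layers_per_node - 1;
        printf("node %2d: blocks %2d..%2d\n", node, first, last);
    }

    // Per token, each hop only needs to forward that token's hidden state.
    const double kib_per_token = double(n_embd) * bytes_elem / 1024.0;
    printf("activation per token per hop: %.1f KiB\n", kib_per_token);  // ~16 KiB
    return 0;
}
```

At roughly 16 KiB of activations per token per hop, a 1 Gbit link would not be the limiting factor for this kind of pipeline; per-hop latency and each node's compute and memory bandwidth are the bigger concerns.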

jon-chuang (Contributor) commented

There already exist many ways to distribute across tensors and operators. See e.g. https://alpa.ai/index.html

I believe this is out of scope for llama.cpp


snapo commented Apr 14, 2023

Thank you very much, I will check out alpa.ai and see whether it fits my needs :-)

ggerganov (Owner) commented Apr 14, 2023

From the ggml point of view, such distributed computing is entirely possible. You simply have to partition your transformer the way you like and load the respective tensors on the respective nodes. You then create the partial compute graphs and should be ready to compute.

The main thing to solve is making the nodes communicate with each other, for example over the network.
This is something that will likely never be part of ggml or even llama.cpp, since it would bring in third-party dependencies. So a distributed computing example will likely have to be demonstrated in a separate repository / fork.

Unless you find a very elegant way to pass and queue messages between the nodes that fits in a few hundred lines of C/C++ code. In that case, this could become a llama.cpp example, and I think it would be of great interest, even if it only works on Linux, for example.
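As a rough illustration only (nothing like this exists in ggml or llama.cpp), a sketch of what such node-to-node message passing could look like with plain POSIX sockets on Linux; the length-prefixed framing and function names are made up for this example:

```cpp
// Each pipeline stage would recv_activations() from the previous node over a
// connected socket, evaluate its partial compute graph, and then
// send_activations() to the next node.
#include <cstdint>
#include <sys/socket.h>
#include <unistd.h>
#include <vector>

// Send a length-prefixed float buffer over a connected socket.
static bool send_activations(int fd, const std::vector<float> & data) {
    uint64_t n = data.size();
    if (send(fd, &n, sizeof(n), 0) != (ssize_t) sizeof(n)) return false;
    const char * p = reinterpret_cast<const char *>(data.data());
    size_t remaining = n * sizeof(float);
    while (remaining > 0) {
        ssize_t sent = send(fd, p, remaining, 0);
        if (sent <= 0) return false;
        p += sent;
        remaining -= size_t(sent);
    }
    return true;
}

// Receive a length-prefixed float buffer from a connected socket.
static bool recv_activations(int fd, std::vector<float> & data) {
    uint64_t n = 0;
    if (recv(fd, &n, sizeof(n), MSG_WAITALL) != (ssize_t) sizeof(n)) return false;
    data.resize(n);
    char * p = reinterpret_cast<char *>(data.data());
    size_t remaining = n * sizeof(float);
    while (remaining > 0) {
        ssize_t got = recv(fd, p, remaining, MSG_WAITALL);
        if (got <= 0) return false;
        p += got;
        remaining -= size_t(got);
    }
    return true;
}
```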

jon-chuang (Contributor) commented

> Unless you find a very elegant way to pass and queue messages between the nodes that fits in a few hundred lines of C/C++ code. In that case, this could become a llama.cpp example, and I think it would be of great interest, even if it only works on Linux, for example.

If you accept MPI as a dependency, this is actually very possible.

The test should be written using multiple processes to simulate multiple nodes.
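A minimal sketch of that idea (hypothetical: each MPI rank plays one pipeline stage, receives the previous stage's activations, stands in for evaluating its partial graph, and forwards the result; build with mpicxx and run several processes on one machine to simulate several nodes):

```cpp
// Tiny MPI pipeline sketch: rank 0 produces a dummy hidden state, every other
// rank receives it from the previous rank, "processes" it, and forwards it.
// Run with e.g. `mpirun -np 4 ./pipeline_sketch`.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char ** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_embd = 8192;                 // illustrative hidden state width
    std::vector<float> hidden(n_embd, 0.0f);

    if (rank > 0) {
        // Receive activations from the previous pipeline stage.
        MPI_Recv(hidden.data(), n_embd, MPI_FLOAT, rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    // Placeholder for evaluating this stage's partial compute graph.
    for (float & x : hidden) x += 1.0f;

    if (rank < size - 1) {
        // Forward activations to the next pipeline stage.
        MPI_Send(hidden.data(), n_embd, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD);
    } else {
        printf("rank %d (last stage): hidden[0] = %.1f after %d stages\n",
               rank, hidden[0], size);
    }

    MPI_Finalize();
    return 0;
}
```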


This issue was closed because it has been inactive for 14 days since being marked as stale.
