
distributed llama without gpu, using only cpu #1235

Closed
hiqsociety opened this issue Apr 29, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

@hiqsociety

Is it possible to do a distributed llama without a GPU, using only CPUs? 64 GB of RAM is kind of pricey in most cases, but 2x 32 GB of RAM is cheaply available almost everywhere.

Possible to put distributed llama on the roadmap?

@mikeggh

mikeggh commented Apr 29, 2023

I'm sure it would require something like RDMA to be efficient, and even then it may not be worth it. GPU distribution is nice because GPU matrix operations are so much faster than CPU ones. Are you saying 2x 32 GB cloud or dedicated instances vs. 1x 64 GB? Or are you saying 2x 32 GB RAM sticks in one single machine? If you are speaking of individual DIMMs, then the operating system will pair them properly and handle all the mappings...

@hiqsociety
Author

2x 32 GB VPS instances.
Instead of RDMA, over a 10 Gbps line. Will that help?
Anyway, 4x 32 GB is still cheaper than a 1x 128 GB instance, so yes, even if it's a bit "slower", at least I don't need to quantize it.

I read somewhere that the more you quantize, the lower the quality gets. So why should people go for 4-bit just because it can run on a PC now? The output is really terrible if you ask me. I'd rather go for something a bit better and usable.

@mikeggh

mikeggh commented Apr 29, 2023

I believe that when it's split across GPUs it's acceptable because of the speed multiplier of GPU vs. CPU matrix instructions; with CPUs the gain would dwindle significantly. I don't know of any VPS providers that offer RDMA, as it's usually reserved for distributed tasks in labs built for it.

The 4-bit question is beside the point here, because the gains/losses of distributing would be similar across all models. It would just be faster to run on a single machine than to design an entire framework for distribution specifically for the operation you have in mind.

I assume the majority of people are using 4-bit, and it will allow you to use the bigger models. Nonetheless, your concept of distribution would incur high losses transferring memory between two machines, since the logits and layers would, I think, essentially have to be handled in sequence on the same framework. I would suspect the main reason for distribution is to handle multiple clients/separate sessions, and in that case you could use a round-robin HTTP proxy in front of the web servers, or a custom protocol that keeps track of the usage of each backend server.
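
For illustration only, here's a minimal round-robin dispatcher sketch in Python (the backend URLs and the /completion endpoint are hypothetical placeholders, not an existing llama.cpp API):

```python
import itertools

import requests  # assumes the requests package is available

# Hypothetical backends, each running its own complete single-machine llama server.
BACKENDS = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]
_next_backend = itertools.cycle(BACKENDS)

def complete(prompt: str) -> str:
    """Send the prompt to the next backend in round-robin order."""
    backend = next(_next_backend)
    resp = requests.post(f"{backend}/completion", json={"prompt": prompt}, timeout=300)
    resp.raise_for_status()
    return resp.json()["content"]
```

Each request still runs entirely on one machine; the proxy only spreads separate sessions across servers.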

Good luck.. hope it works out.

@mikeggh

mikeggh commented Apr 29, 2023

BTW, if you have any questions regarding splitting sessions across multiple machines, that is something I can try to help with, but that's different from modifying llama to distribute its essential functionality. =]

@hiqsociety
Author

Distributed AI should be the next step then.

Distributed AI like BLOOM's Petals is what we really need. I haven't tried Petals, but I think LLaMA is better than BLOOM now, so we'll see.

Thanks for the feedback.

P.S.: leaving this issue open so some genius who can build a Petals equivalent for llama can provide some solutions here. Thanks!

@mikeggh

mikeggh commented Apr 30, 2023

Nice, thanks for the comment, I'll read up on Petals. =] Missed that entire press release.

If the layers of LLaMA and other similar models can be split easily, then you could even perform tasks on machines visiting websites, the way in-browser Bitcoin miners do. There were already some JavaScript (GPT in the browser) projects popping up on GitHub trending. I know the saved session file for 13B is 1.6 GB, so even 1/3 of that is quite a bit to send through sockets per token.

However, the Petals concept of running a regular session with something like llama-cpp-python and a Flask/REST API, or some other wrapper for llama.cpp, could actually work well, if you kept all layers on a single distributed client helping provide inference.
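
Something like this rough sketch (model path, port, and route name are just placeholder assumptions; it only uses llama-cpp-python and Flask):

```python
from flask import Flask, jsonify, request
from llama_cpp import Llama

app = Flask(__name__)
# Placeholder model path; every distributed client keeps all layers locally.
llm = Llama(model_path="./models/13B/ggml-model-q4_0.bin")

@app.route("/generate", methods=["POST"])
def generate():
    body = request.get_json()
    # Inference happens entirely on this node; only the generated text goes over the wire.
    out = llm(body["prompt"], max_tokens=body.get("max_tokens", 128))
    return jsonify({"text": out["choices"][0]["text"]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```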

@gjmulder added the enhancement (New feature or request) label on May 2, 2023
@SlyEcho
Sponsor Collaborator

SlyEcho commented Jul 5, 2023

I think #2099 solves this.
