
distributed llama without gpu, using only cpu #1235

Closed
hiqsociety opened this issue Apr 29, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

@hiqsociety

Is it possible to do a distributed llama without a GPU, using only CPUs? 64 GB of RAM is kind of pricey in most cases, but 2x 32 GB of RAM is cheaply available almost everywhere.

Possible to put distributed llama on the roadmap?

@mikeggh

mikeggh commented Apr 29, 2023

I'm sure it would require something like RDMA to be efficient, and even then it may not be worth it. GPU distribution is nice because GPU matrix operations are so much faster than CPU ones. Are you saying 2x 32 GB cloud or dedicated instances vs. 1x 64 GB? Or are you saying 2x 32 GB RAM sticks in one single machine? If you are speaking of individual DIMMs, then the operating system will pair them properly and handle all the mappings...

@hiqsociety
Author

2x 32 GB VPS instances.
Instead of RDMA, over a 10 Gbps line. Will that help?
Anyway, 4x 32 GB is still cheaper than a 1x 128 GB instance, so yes, even if it's a bit "slower", at least I don't need to quantize it.

I read somewhere that the more you quantize, the lower the quality gets. So why should people go for 4-bit just because it can run on a PC now? The output is really terrible if you ask me. I'd rather go for something a bit better and usable.

@mikeggh

mikeggh commented Apr 29, 2023

I believe that when it's split across GPUs it's acceptable because of the speed multiplier of GPU vs. CPU matrix instructions; with CPUs the gain would dwindle significantly. I don't know of any VPS providers that offer RDMA, as it's usually reserved for distributed tasks in labs built for it.

The 4-bit question is beside the point here, because the gains/losses of distributing would be similar across all models. It would just be faster to run on a single machine than to design an entire framework for distribution specifically for the operation you have in mind.

I assume the majority of people are using 4-bit, and it will allow you to use the bigger models. Nonetheless, your concept of distribution would incur high losses transferring memory between two machines, since the logits and layers would, I think, essentially have to be handled in sequence on the same framework. I would suspect the main reason for distribution is to handle multiple clients/separate sessions, and in that case you could use a round-robin HTTP proxy in front of the web servers, or a custom protocol that keeps track of the usage of each backend server.
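
For illustration only, here's a minimal round-robin dispatcher sketch in Python (the backend URLs and the /completion endpoint are hypothetical placeholders, not an existing llama.cpp API):

```python
import itertools

import requests  # assumes the requests package is available

# Hypothetical backends, each running its own complete single-machine llama server.
BACKENDS = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]
_next_backend = itertools.cycle(BACKENDS)

def complete(prompt: str) -> str:
    """Send the prompt to the next backend in round-robin order."""
    backend = next(_next_backend)
    resp = requests.post(f"{backend}/completion", json={"prompt": prompt}, timeout=300)
    resp.raise_for_status()
    return resp.json()["content"]
```

Each request still runs entirely on one machine; the proxy only spreads separate sessions across servers.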

Good luck.. hope it works out.

@mikeggh

mikeggh commented Apr 29, 2023

BTW, if you have any questions regarding splitting sessions across multiple machines, that is something I can try to help with, but that's different from modifying llama to distribute its essential functionality. =]

@hiqsociety
Author

Distributed AI should be the next step then.

Distributed AI like BLOOM's Petals is what we really need. I haven't tried Petals, but I think LLaMA is better than BLOOM now, so we'll see.

Thanks for the feedback.

P.S.: leaving this issue open so some genius who can build a Petals equivalent for llama can provide some solutions here. Thanks!

@mikeggh

mikeggh commented Apr 30, 2023

Nice, thanks for the comment, I'll read up on Petals. =] Missed that entire press release.

If the layers of LLaMA and other similar models can be split easily, then you could even perform tasks on machines visiting websites, the way in-browser Bitcoin miners do. There were already some JavaScript (GPT in the browser) projects popping up on GitHub trending. I know the saved session file for 13B is 1.6 GB, so even 1/3 of that is quite a bit to send through sockets per token.

However, the Petals concept of running a regular session with something like llama-cpp-python and a Flask/REST API, or some other wrapper for llama.cpp, could actually work well, if you kept all layers on a single distributed client helping provide inference.
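
Something like this rough sketch (model path, port, and route name are just placeholder assumptions; it only uses llama-cpp-python and Flask):

```python
from flask import Flask, jsonify, request
from llama_cpp import Llama

app = Flask(__name__)
# Placeholder model path; every distributed client keeps all layers locally.
llm = Llama(model_path="./models/13B/ggml-model-q4_0.bin")

@app.route("/generate", methods=["POST"])
def generate():
    body = request.get_json()
    # Inference happens entirely on this node; only the generated text goes over the wire.
    out = llm(body["prompt"], max_tokens=body.get("max_tokens", 128))
    return jsonify({"text": out["choices"][0]["text"]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```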

@gjmulder added the enhancement (New feature or request) label on May 2, 2023
@SlyEcho
Sponsor Collaborator

SlyEcho commented Jul 5, 2023

I think #2099 solves this.
