distributed llama without gpu, using only cpu #1235
Comments
I'm sure it would require something like RDMA to be efficient, and even then it may not be worth it. GPU distribution is attractive because GPU matrix operations are so much faster than on a CPU. Are you saying 2x 32GB cloud or dedicated instances vs. 1x 64GB? Or are you saying 2x 32GB RAM sticks in one single machine? If you mean individual DIMMs, the operating system will pair them properly and handle all the mappings.
2x 32GB VPS instances. I read somewhere that the more you quantize, the lower the quality gets. So why should people go for 4-bit just because it can run on a PC now? The output is really terrible if you ask me. I'd rather go for something a bit better and more usable.
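For context on why 4-bit is popular despite the quality hit, here is a rough back-of-envelope sketch (my own numbers, not from the thread) of how weight memory scales with bits per parameter:

```python
# Rough estimate of raw weight memory only; KV cache and runtime overhead
# are not included, and real quantized formats add some per-block metadata.
def weight_gib(params_billion: float, bits: float) -> float:
    return params_billion * 1e9 * bits / 8 / 1024**3

for bits in (16, 8, 4):
    print(f"13B @ {bits:>2}-bit: ~{weight_gib(13, bits):.1f} GiB   "
          f"65B @ {bits:>2}-bit: ~{weight_gib(65, bits):.1f} GiB")
# 65B at 4-bit is roughly 30 GiB of weights, which is why it can fit in 64 GB
# of RAM, while 16-bit would need well over 120 GiB.
```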
I believe splitting across GPUs is acceptable because of the speed multiplier of GPU vs. CPU matrix instructions; across CPUs the gain would dwindle significantly. I don't know of any VPS that offers RDMA, as it's usually found in labs built for distributed tasks. The 4-bit question is insignificant here, because the gains/losses of distributing would be similar across all models. It would just be faster to use a single machine than to design an entire distribution framework specifically for the operation you have in mind. I assume the majority of people are using 4-bit, and it does let you run the bigger models, but nonetheless your concept of distribution would incur high losses transferring memory between two machines, since the logits and layers would, I think, essentially have to be handled in sequence within the same framework. I suspect the main reason for distribution would be handling multiple clients/separate sessions, and in that case you could use a round-robin HTTP proxy in front of the web servers, or a custom protocol that tracks usage for each backend server (a rough sketch of that round-robin idea follows below). Good luck, hope it works out.
By the way, if you have any questions about splitting sessions across multiple machines, that is something I can try to help with, but that's different from modifying llama.cpp to distribute its core functionality. =]
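A minimal sketch of the round-robin idea from the comment above: spread independent sessions across several backend machines instead of splitting one model across them. It assumes each backend runs some HTTP server in front of llama.cpp (for example the bundled server example or a llama-cpp-python wrapper); the URLs, endpoint path, and JSON field names are placeholders, not a fixed API.

```python
# Round-robin dispatcher: each new request goes to the next backend in turn.
import itertools
import requests  # pip install requests

BACKENDS = itertools.cycle([
    "http://10.0.0.11:8080/completion",  # machine 1 running a llama.cpp server (hypothetical)
    "http://10.0.0.12:8080/completion",  # machine 2 running a llama.cpp server (hypothetical)
])

def complete(prompt: str, n_predict: int = 128) -> str:
    url = next(BACKENDS)
    resp = requests.post(url, json={"prompt": prompt, "n_predict": n_predict}, timeout=600)
    resp.raise_for_status()
    # Field name depends on the backend's response format.
    return resp.json().get("content", "")

if __name__ == "__main__":
    print(complete("Explain RDMA in one sentence."))
```

Each session keeps its whole model (all layers) on one machine, so nothing model-internal crosses the network; only prompts and generated text do.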
Distributed AI should be the next step then. Something like BLOOM's Petals is what we really need. I haven't tried Petals, but I think LLaMA is better than BLOOM now, so we'll see. Thanks for the feedback. P.S.: leaving this issue open so some genius who can build a Petals equivalent for llama.cpp can provide a solution here. Thanks!
Nice, thanks for the comment, I'll read up on Petals. =] I missed that entire announcement. If the layers of LLaMA and other similar models can be split easily, then you could even perform tasks on machines visiting websites, the way in-browser Bitcoin miners do. There were already some JavaScript (GPT-in-the-browser) projects popping up on GitHub trending. I know the saved session file for 13B is 1.6 GB, so even a third of that is quite a bit to send through sockets per token. However, the Petals concept could actually work well for running a regular session with something like llama-cpp-python plus a Flask/REST API, or some other wrapper around llama.cpp, if you kept all the layers on a single distributed client helping provide inference. (A minimal sketch of that wrapper idea follows below.)
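As a sketch of that wrapper idea, assuming llama-cpp-python and Flask are installed: one machine holds all the layers and exposes a tiny REST endpoint that clients (or a round-robin proxy) can call. The model path, port, and route name are placeholders.

```python
# Minimal REST wrapper around a local llama.cpp model via llama-cpp-python.
from flask import Flask, jsonify, request   # pip install flask
from llama_cpp import Llama                 # pip install llama-cpp-python

app = Flask(__name__)
llm = Llama(model_path="./models/13B/ggml-model-q4_0.bin", n_ctx=2048)  # placeholder path

@app.route("/generate", methods=["POST"])
def generate():
    body = request.get_json(force=True)
    out = llm(body["prompt"], max_tokens=int(body.get("max_tokens", 128)))
    return jsonify({"text": out["choices"][0]["text"]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A client would POST `{"prompt": "...", "max_tokens": 64}` to `/generate`; since only text crosses the wire, the per-token socket cost the comment worries about never comes into play.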
I think #2099 solves this. |
Is it possible to do distributed llama.cpp without a GPU, using only CPUs? 64 GB of RAM is kind of pricey in most cases, but 2x 32 GB of RAM is cheaply available almost everywhere.
Is it possible to put distributed llama.cpp on the roadmap?