Add example for llama.cpp #174
Conversation
(force-pushed from a3b66d4 to 333d338)
Thanks @justinsb for the contribution! Could you add a README for this example, like this one? (https://github.com/kubernetes-sigs/lws/blob/main/docs/examples/vllm/README.md)
Not (yet) using the leader functionality, just going direct to a worker.
Previously we weren't actually running on multiple pods; now we are.
(force-pushed from 333d338 to 3565efe)
# GGML_RPC=ON: Builds RPC support
# BUILD_SHARED_LIBS=OFF: Don't rely on shared libraries like libggml
RUN cmake . -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF -DGGML_DEBUG=1
RUN cmake --build . --config Release --parallel 8
is the parallel here "tensor parallelism" or "pipeline parallelism"?
This is just running the cmake build in parallel. Just a slightly faster docker build, no runtime effect :-)
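For anyone reproducing the build outside Docker, a minimal sketch of the same two steps (same flags as the Dockerfile above; $(nproc) is just an illustrative substitute for the fixed 8):

# Configure with RPC support and static ggml, matching the Dockerfile.
cmake . -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF -DGGML_DEBUG=1
# --parallel only controls how many compile jobs run at once; it changes
# build time, not the behavior of the resulting binaries.
cmake --build . --config Release --parallel "$(nproc)"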
llama.cpp began as a project to support CPU-only inference on a single node, but has
since expanded to support accelerators and distributed inference.

l.md)
is this added accidentally?
Yes, I'll remove it ... I'm thinking I should add GPU support next as well, so maybe I'll do that at the same time!
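(For reference, a rough sketch of how the configure step might change for that GPU follow-up, assuming a CUDA-capable base image and llama.cpp's GGML_CUDA CMake option; this is not part of this PR:)

# Hypothetical GPU-enabled configure: CUDA offload on top of RPC support.
RUN cmake . -DGGML_RPC=ON -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF
RUN cmake --build . --config Release --parallel 8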
Thanks @justinsb! Very nice example of using CPU for multi-node inference!
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: justinsb, liurupeng

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
In the first commit we just bring up llama.cpp, not really using LWS.
In the second commit we really use LeaderWorkerSet, leveraging llama.cpp's RPC support.
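As a rough sketch of how those pieces fit together at runtime (binary names come from a GGML_RPC=ON build of llama.cpp; hostnames, ports, and the model path are illustrative rather than the exact commands in this example):

# On each worker pod: expose the local ggml backend over RPC.
./rpc-server -p 50052

# On the leader pod: serve the model, offloading work across the workers' RPC endpoints.
./llama-server -m /models/model.gguf --rpc worker-0:50052,worker-1:50052 --host 0.0.0.0 --port 8080

LWS gives the pods in each group stable, predictable hostnames, which is what lets the leader enumerate worker RPC endpoints like these.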