Conversation
added 2 commits
February 20, 2025 16:09
Fix async/threading issues and refactor server endpoints. (This part was mostly cherry-picked, with small adaptations, from llama-cpp-python upstream. Thanks!) Remove unused endpoints. Remove the request interruption feature completely.
Implemented example load balancer for NekkoAPI
This commit implements a simple load balancer capable of (somewhat) efficiently routing requests to the best worker nodes in the k8s cluster context.
Load balancer features:
- discovers worker nodes in a cluster based on app label
- performs reverse proxying of requests to the worker nodes (including SSE responses)
- tracks "busy" status of all workers
- tracks simulated kv-cache state of all workers
- routes requests according to: 1) model availability, 2) busy status, 3) (simulated) kv-cache match
- exposes the `/v1/models` endpoint and lists all models

Missing/broken parts:
- model names are hardcoded until we have a proper model registry and alias->digest mapping (this will be implemented in the control plane)
- proper logging
- auth
- documentation
- tests
- good error handling
- everything else (a lot of it)
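The routing policy above (model availability, then busy status, then kv-cache match) might be sketched roughly as follows. This is an illustrative sketch, not the actual implementation: the worker fields, the model name, and the longest-common-prefix scoring of the simulated kv-cache are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    url: str
    models: set            # models this worker can serve
    busy: bool = False     # whether a request is currently in flight
    cache: list = field(default_factory=list)  # simulated kv-cache (token ids)

def prefix_len(a, b):
    # Length of the common prefix of two token sequences.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(workers, model, tokens):
    # 1) model availability, 2) busy status, 3) (simulated) kv-cache match.
    candidates = [w for w in workers if model in w.models]
    if not candidates:
        return None
    idle = [w for w in candidates if not w.busy]
    pool = idle or candidates
    return max(pool, key=lambda w: prefix_len(w.cache, tokens))

workers = [
    Worker("http://w1", {"smollm"}, busy=False, cache=[1, 2, 3]),
    Worker("http://w2", {"smollm"}, busy=False, cache=[1, 2, 3, 4]),
    Worker("http://w3", {"smollm"}, busy=True,  cache=[1, 2, 3, 4, 5]),
]
best = route(workers, "smollm", [1, 2, 3, 4, 5])  # -> the w2 worker
```

Note that the busy worker loses out even though its cache match is longer; whether an idle worker should always beat a better cache match is exactly the kind of policy question left open here.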
Add k8s manifests to run the load balancer in the nekko example k8s cluster.
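Label-based worker discovery could look something like the following. This is a simplified stand-in that filters plain dicts of pod metadata; the real balancer would query the Kubernetes API, and the `nekko-worker` label value is an assumption.

```python
def discover_workers(pods, app_label="nekko-worker"):
    # Keep only running pods whose "app" label matches and that have a pod IP.
    # (app_label "nekko-worker" is a hypothetical value, not the actual label.)
    return [
        p["ip"]
        for p in pods
        if p.get("labels", {}).get("app") == app_label
        and p.get("phase") == "Running"
        and p.get("ip")
    ]

pods = [
    {"ip": "10.0.0.1", "labels": {"app": "nekko-worker"}, "phase": "Running"},
    {"ip": "10.0.0.2", "labels": {"app": "nekko-worker"}, "phase": "Pending"},
    {"ip": "10.0.0.3", "labels": {"app": "other"}, "phase": "Running"},
]
ips = discover_workers(pods)  # -> ["10.0.0.1"]
```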
The code is an ugly hack mostly intended as a proof-of-concept. Good parts are to be reused in the control plane.
Also removed the llama model from the examples; during development, the smaller the better.
Additionally, this PR fixes a threading/async bug:
Fix async/threading issues and refactor server endpoints. (This part was mostly cherry-picked, with small adaptations, from llama-cpp-python upstream. Thanks!)
Remove unused endpoints.
Remove request interruption feature completely.
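The usual shape of this kind of fix is to move blocking inference calls off the event loop so they stop starving other coroutines. A minimal sketch of the pattern, where `generate` is a hypothetical stand-in for the blocking llama.cpp call (not the actual upstream code):

```python
import asyncio
import time

def generate(prompt):
    # Hypothetical stand-in for a blocking llama.cpp inference call.
    time.sleep(0.01)
    return f"completion for: {prompt}"

async def handle_request(prompt):
    # Offload the blocking call to a worker thread so the event loop
    # stays free to serve other requests (and stream SSE) concurrently.
    return await asyncio.to_thread(generate, prompt)

result = asyncio.run(handle_request("hello"))
```

Calling `generate` directly inside an async endpoint would block every other in-flight request for the duration of the inference; `asyncio.to_thread` keeps the loop responsive.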