Conversation
added 2 commits
February 20, 2025 16:09
Fix async/threading issues and refactor server endpoints. (This part was mostly cherry-picked, with small adaptations, from llama-cpp-python upstream. Thanks!) Remove unused endpoints. Remove the request interruption feature completely.
Implemented example load balancer for NekkoAPI
This commit implements a simple load balancer capable of (somewhat) efficiently routing requests to the best worker nodes in the k8s cluster context.
Load balancer features:
- discovers worker nodes in a cluster based on app label
- performs reverse proxying of requests to the worker nodes (including SSE responses)
- tracks "busy" status of all workers
- tracks simulated kv-cache state of all workers
- routes requests according to: 1) model availability, 2) busy status, 3) (simulated) kv-cache match
- exposes the `/v1/models` endpoint and lists all models

Missing/broken parts:
- model names are hardcoded until we have a proper model registry and alias->digest mapping (this will be implemented in the control plane)
- proper logging
- auth
- documentation
- tests
- good error handling
- everything else (a lot of it)
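The routing policy above (model availability, then busy status, then kv-cache match) might be sketched roughly as follows. This is an illustrative sketch, not the actual implementation: the worker fields, the model name, and the longest-common-prefix scoring of the simulated kv-cache are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    url: str
    models: set            # models this worker can serve
    busy: bool = False     # whether a request is currently in flight
    cache: list = field(default_factory=list)  # simulated kv-cache (token ids)

def prefix_len(a, b):
    # Length of the common prefix of two token sequences.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(workers, model, tokens):
    # 1) model availability, 2) busy status, 3) (simulated) kv-cache match.
    candidates = [w for w in workers if model in w.models]
    if not candidates:
        return None
    idle = [w for w in candidates if not w.busy]
    pool = idle or candidates
    return max(pool, key=lambda w: prefix_len(w.cache, tokens))

workers = [
    Worker("http://w1", {"smollm"}, busy=False, cache=[1, 2, 3]),
    Worker("http://w2", {"smollm"}, busy=False, cache=[1, 2, 3, 4]),
    Worker("http://w3", {"smollm"}, busy=True,  cache=[1, 2, 3, 4, 5]),
]
best = route(workers, "smollm", [1, 2, 3, 4, 5])  # -> the w2 worker
```

Note that the busy worker loses out even though its cache match is longer; whether an idle worker should always beat a better cache match is exactly the kind of policy question left open here.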
Add k8s manifests to run the load balancer in the nekko example k8s cluster.
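Label-based worker discovery could look something like the following. This is a simplified stand-in that filters plain dicts of pod metadata; the real balancer would query the Kubernetes API, and the `nekko-worker` label value is an assumption.

```python
def discover_workers(pods, app_label="nekko-worker"):
    # Keep only running pods whose "app" label matches and that have a pod IP.
    # (app_label "nekko-worker" is a hypothetical value, not the actual label.)
    return [
        p["ip"]
        for p in pods
        if p.get("labels", {}).get("app") == app_label
        and p.get("phase") == "Running"
        and p.get("ip")
    ]

pods = [
    {"ip": "10.0.0.1", "labels": {"app": "nekko-worker"}, "phase": "Running"},
    {"ip": "10.0.0.2", "labels": {"app": "nekko-worker"}, "phase": "Pending"},
    {"ip": "10.0.0.3", "labels": {"app": "other"}, "phase": "Running"},
]
ips = discover_workers(pods)  # -> ["10.0.0.1"]
```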
The code is an ugly hack mostly intended as a proof-of-concept. Good parts are to be reused in the control plane.
Also removed the llama model from the examples; during development, the smaller the better.
Additionally, this PR fixes a threading/async bug:
Fix async/threading issues and refactor server endpoints. (This part was mostly cherry-picked, with small adaptations, from llama-cpp-python upstream. Thanks!)
Remove unused endpoints.
Remove request interruption feature completely.
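The usual shape of this kind of fix is to move blocking inference calls off the event loop so they stop starving other coroutines. A minimal sketch of the pattern, where `generate` is a hypothetical stand-in for the blocking llama.cpp call (not the actual upstream code):

```python
import asyncio
import time

def generate(prompt):
    # Hypothetical stand-in for a blocking llama.cpp inference call.
    time.sleep(0.01)
    return f"completion for: {prompt}"

async def handle_request(prompt):
    # Offload the blocking call to a worker thread so the event loop
    # stays free to serve other requests (and stream SSE) concurrently.
    return await asyncio.to_thread(generate, prompt)

result = asyncio.run(handle_request("hello"))
```

Calling `generate` directly inside an async endpoint would block every other in-flight request for the duration of the inference; `asyncio.to_thread` keeps the loop responsive.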