Implement a simple example Load Balancer #102

Merged
vidas merged 2 commits into main from lb
Feb 20, 2025
@vidas vidas commented Feb 20, 2025

Implemented an example load balancer for NekkoAPI

This commit implements a simple load balancer
capable of (somewhat) efficiently routing
requests to the best worker nodes in the k8s
cluster context.

Load balancer features:

  • discovers worker nodes in a cluster based on app label
  • performs reverse proxying of requests to the worker nodes
    (including SSE responses)
  • tracks "busy" status of all workers
  • tracks simulated kv-cache state of all workers
  • routes requests according to:
    1. model availability
    2. busy status
    3. (simulated) kv-cache match
  • exposes /v1/models endpoint and lists all models
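
The routing priority listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Worker` class, its fields, `common_prefix_len`, and `pick_worker` are all hypothetical names, and the simulated kv-cache is assumed here to be nothing more than the last prompt sent to each worker.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Worker:
    """Hypothetical view of one worker node as seen by the balancer."""
    name: str
    models: set[str]
    busy: bool = False
    # Assumption: the simulated kv-cache is just the last prompt seen.
    cached_prompt: str = ""


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(workers: list[Worker], model: str, prompt: str) -> Optional[Worker]:
    """Pick a worker by: 1) model availability, 2) busy status, 3) cache match."""
    # 1. Only workers that serve the requested model are eligible.
    candidates = [w for w in workers if model in w.models]
    if not candidates:
        return None
    # 2. Idle workers sort first (False < True).
    # 3. Among equals, prefer the longest simulated kv-cache prefix match.
    return min(
        candidates,
        key=lambda w: (w.busy, -common_prefix_len(w.cached_prompt, prompt)),
    )
```

In this sketch a busy worker is still returned when no idle worker serves the model; whether the real balancer queues, rejects, or forwards in that case is not stated in this PR.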

Missing/broken parts:

  • model names are hardcoded until we have a proper model
    registry and alias->digest mapping (this
    will be implemented in the control plane)
  • proper logging
  • auth
  • documentation
  • tests
  • good error handling
  • everything else (a lot of it)

Add k8s manifests to run load balancer in nekko
example k8s cluster.

The code is an ugly hack, mostly intended
as a proof of concept. The good parts are to be reused in the
control plane.

Also removed the llama model from the examples;
during development, the smaller the better.

Additionally, this PR fixes a threading/async bug:

Fix async/threading issues and refactor server endpoints.
(This part was mostly cherry-picked, with small adaptations,
from llama-cpp-python upstream. Thanks!)

Remove unused endpoints.

Remove request interruption feature completely.

Vidas added 2 commits February 20, 2025 16:09
@vidas vidas merged commit dfe18ab into main Feb 20, 2025
1 check passed
@vidas vidas deleted the lb branch February 20, 2025 14:44
@vidas vidas self-assigned this Feb 20, 2025