# LLM Instance Gateway

<!-- toc -->
- [Summary](#summary)
  - [Ownership](#ownership)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Gateway](#gateway)
    - [CRDs](#crds)
    - [Envoy Solution](#envoy-solution)
    - [Model Server Protocol](#model-server-protocol)
- [PoC Design Details](#poc-design-details)
  - [Overview](#overview)
  - [Request Flow](#request-flow)
  - [Pod selection algorithm in PoC](#pod-selection-algorithm-in-poc)
- [Artifacts](#artifacts)
<!-- /toc -->

## Summary

As presented in the [demo](https://youtu.be/NUBZg_uqqXk?si=v681EeYdGUGEVqQQ&t=1458), and building further upon the [joint proposal](https://docs.google.com/document/d/1sFNHQqUWm1DIzC9GxXp3cKRm8cUtTcGuwZYkjkOkUqk/edit?tab=t.0#heading=h.9brozdsx9dqo), we propose that a gateway focused on multiplexing use cases onto shared hardware has distinct advantages in enabling efficient and fair use of multiple use cases over a shared pool of compute.

### Ownership

As the joint proposal indicates, this effort is primarily being led by Google and ByteDance.

With *project owners* from each org:

- [kfswain](https://github.com/kfswain)
- [varungup90](https://github.com/varungup90)

And *stakeholders* from each org:

- [smarterclayton](https://github.com/smarterclayton)
- [Jeffwan](https://github.com/Jeffwan)

## Motivation

Novel advancements in fine-tuning like [LoRA](https://arxiv.org/abs/2106.09685) and [Multi-LoRA](https://arxiv.org/abs/2310.18547) have enabled multiple distinct use cases to share accelerators. As this new technology is adopted, the Day 1/Day 2 operational concerns quickly become pressing.

Kubernetes has long been a standard in easing and automating the operational tasks of workloads. A mechanism (gateway) within the K8s ecosystem is a reasonable and expected way for a user to support multiple LLM use cases on shared accelerators.

### Goals

#### Proposal Goals

- Create an Inference Gateway project group for wg-serving collaboration, including: a chat channel & a dedicated repo (sponsored by sig-network)

#### Gateway Goals

- Fast reconfiguration - New use cases (including LoRA adapters or client configuration) can be rolled out / back in seconds to clients, without waiting for a new model server to start.
- Efficient accelerator sharing - Use cases can use less than an accelerator or temporarily burst without needing to start a new model server, leading to fewer wasted accelerators and better pooling of shared capacity.
- Operational resilience - Use cases share available accelerators fairly and can have distinct priorities, latency objectives, and failure policies.
- Standardized LoRA - Simple recommended patterns for deploying and loading LoRA adapters into model servers across a wide range of Kubernetes environments.
- Composability - The approach should be composable with:
  - the K8s Gateway API
  - other gateway features and projects, including high-level LLM gateways
  - existing deployment tools like KServe or Kaito
  - different model servers

### Non-Goals

- Replacing the features of pre-existing Gateways
- Defining how serving workloads must be deployed

## Proposal

We will create an LLM serving API consistent with the needs of multiplexed LLM use cases, composable with existing Gateway APIs, and backed by an OSS implementation on top of Envoy. The project will start by targeting the LLM use case, as it has the highest potential for optimization and the widest potential use, and will deliver a usable, minimal component as quickly as possible. After that, we will consider expanding the scope to deeper optimizations, additional features, and the needs of other generative AI use cases (such as diffusion models or long generative queueing). We recognize the space is rapidly changing, and we will therefore remain open to lessons learned - which requires us to bring workloads to production quickly and to experiment safely without breaking early adopters. The gateway we deliver should be suitable for extension and reuse for algorithmic improvements, and composable with other gateway tools.

### Gateway
#### CRD(s)

To adequately achieve the above goals, we propose the addition of one or more CRDs to express:

- The boundaries of a compute pool that shares a base model
  - Including the deployment of a routing solution (PoC details below)
- A specific use case upon one or more backend pools
  - The objectives that this use case needs to achieve

API specifics will be handled in proposals in the project repo.
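
To make the shape of such an API concrete, here is a minimal Go sketch of what these CRD types could look like. Every name and field below (`LLMServerPool`, `LLMUseCase`, `latencyObjectiveMs`, and so on) is a hypothetical illustration, not the proposed API.

```go
// Illustrative only: hypothetical CRD types for a shared pool and a use case.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// LLMServerPool marks the boundary of a compute pool whose model servers
// share a base model.
type LLMServerPool struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              LLMServerPoolSpec `json:"spec"`
}

type LLMServerPoolSpec struct {
	// BaseModel shared by every model server in the pool.
	BaseModel string `json:"baseModel"`
	// Selector matches the model server pods that form the pool.
	Selector map[string]string `json:"selector"`
}

// LLMUseCase places a specific use case (e.g. a LoRA adapter) on one or
// more backend pools, together with the objectives it needs to achieve.
type LLMUseCase struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              LLMUseCaseSpec `json:"spec"`
}

type LLMUseCaseSpec struct {
	// Adapter served for this use case.
	Adapter string `json:"adapter"`
	// PoolRefs names the backend pools this use case runs on.
	PoolRefs []string `json:"poolRefs"`
	// Objectives: relative priority and a latency target.
	Priority           int32 `json:"priority,omitempty"`
	LatencyObjectiveMs int64 `json:"latencyObjectiveMs,omitempty"`
}
```
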
#### Envoy Solution

Any gateway solution *must* be compatible with Envoy Proxy, and must have a plan for how to integrate these features into the Envoy ecosystem over the long term.

Envoy was chosen as the default due to its flexibility and broad adoption within the ecosystem.

#### Model Server Protocol

In the PoC investigation we discovered the need for certain control and data to be exposed by the model server. In order for a model server to work properly with this LLM Instance Gateway, the model server would need to implement this protocol.

Key requirements would roughly look like:

- A method, or set of methods, to dynamically update the available LoRA catalog on a model server
- Metrics, shared in a networking-friendly way (as a header on response data, or some other similarly lightweight mechanism, just not in the body), for data like the following (see the sketch after this list):
  - Adapter state
    - Available catalog
    - Queue data (per adapter)

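As a sketch of what consuming such a signal could look like on the gateway side, the snippet below parses a hypothetical `x-model-server-metrics` response header. The header name and payload shape are assumptions for illustration, not part of the protocol.

```go
package metrics

import (
	"encoding/json"
	"net/http"
)

// ServerMetrics is a hypothetical payload a model server could attach to
// each response; the fields mirror the requirements listed above.
type ServerMetrics struct {
	AvailableAdapters []string       `json:"availableAdapters"` // available catalog
	QueueDepth        map[string]int `json:"queueDepth"`        // pending requests per adapter
}

// FromResponse extracts metrics from the assumed x-model-server-metrics
// header. ok=false means the server reported nothing usable.
func FromResponse(resp *http.Response) (m ServerMetrics, ok bool) {
	raw := resp.Header.Get("x-model-server-metrics")
	if raw == "" {
		return m, false
	}
	if err := json.Unmarshal([]byte(raw), &m); err != nil {
		return m, false
	}
	return m, true
}
```
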
## PoC Design Details

From the proof of concept, we believe the following architecture is a starting point for this proposal:

- Envoy Proxy
  - An OSS starting point that is generally accepted and used
- Ext proc
  - A necessary tool to extend the capabilities of Envoy to allow for routing based on the OpenAI model field (within the body; see the sketch after this list)
  - An agile tool for development of novel LLM Instance Gateway features
- CRD/K8s API interface
- Model server modifications
  - Necessary to extend existing tooling to provide the proper routing data to Envoy
  - Potentially extended further to support [ORCA](https://github.com/envoyproxy/envoy/issues/6614) headers as a method of metrics transfer

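As a sketch of the body-based routing mentioned above: OpenAI-style requests carry the model name as a top-level `model` field in the JSON body, so the ext proc only needs to decode that one field to make a routing decision. The helper below is illustrative, not the PoC code.

```go
package extproc

import (
	"encoding/json"
	"errors"
)

// modelFromBody pulls the "model" field out of an OpenAI-style JSON
// request body so the gateway can route on it; the rest of the body
// is deliberately left undecoded.
func modelFromBody(body []byte) (string, error) {
	var req struct {
		Model string `json:"model"`
	}
	if err := json.Unmarshal(body, &req); err != nil {
		return "", err
	}
	if req.Model == "" {
		return "", errors.New("request body has no model field")
	}
	return req.Model, nil
}
```
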
### Overview

Our very high-level diagram of how this looked:

![high level design](./images/high_level_design.png)

To briefly describe how the components work together:

- When an `LLMRoute` is defined, our gateway recognizes this new service and allows traffic for the specified adapter to be admitted to the backend pool.
  - We support and expect the OpenAI API spec as the default when reading the adapter.
- Incoming traffic for a validated service is then routed to ExtProc, where routing and fairness decisions are made (a skeleton of this hook is sketched below).
- We attempt to route to a model server that has the adapter already loaded, so long as there is batch capacity.

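A minimal skeleton of that ExtProc hook, using Envoy's external processing gRPC API, might look as follows. This is a sketch, not the PoC implementation: the actual pod selection, header mutation, and error handling are elided.

```go
package main

import (
	"io"
	"log"
	"net"

	extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
	"google.golang.org/grpc"
)

// server handles the bidirectional ext proc stream from Envoy.
type server struct {
	extProcPb.UnimplementedExternalProcessorServer
}

func (s *server) Process(srv extProcPb.ExternalProcessor_ProcessServer) error {
	for {
		req, err := srv.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		resp := &extProcPb.ProcessingResponse{}
		switch req.Request.(type) {
		case *extProcPb.ProcessingRequest_RequestBody:
			// The routing decision would happen here: read the model field
			// from the body, consult the metrics cache, pick a pod, and
			// attach the chosen destination as a header mutation.
			resp.Response = &extProcPb.ProcessingResponse_RequestBody{
				RequestBody: &extProcPb.BodyResponse{},
			}
		default:
			resp.Response = &extProcPb.ProcessingResponse_RequestHeaders{
				RequestHeaders: &extProcPb.HeadersResponse{},
			}
		}
		if err := srv.Send(resp); err != nil {
			return err
		}
	}
}

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatal(err)
	}
	g := grpc.NewServer()
	extProcPb.RegisterExternalProcessorServer(g, &server{})
	log.Fatal(g.Serve(lis))
}
```
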
### Request Flow

Below is an example of the life of a request using this design:

![request flow](./images/flow_diagram.png)

> Notes:
>
> 1. Ext Proc: External processing calls an external gRPC service to process HTTP requests and responses.
>
> 2. Original Dst: An original destination cluster can be used when incoming connections are redirected to Envoy either via an iptables REDIRECT or TPROXY target, or with Proxy Protocol. In these cases, requests routed to an original destination cluster are forwarded to upstream hosts as addressed by the redirection metadata, without any explicit host configuration or upstream host discovery. We implemented this using the bootstrap feature of Envoy Gateway.

### Pod selection algorithm in PoC

Metrics stored in the Ext Proc cache:

- Active adapters in each pod
- Number of pending requests for each adapter in each pod

Given a request, selection proceeds as follows (a Go sketch follows this list):

1. Read the relevant metrics from the cache and find which pods have the requested LoRA adapter loaded.
2. Out of the pods that (a) have the adapter loaded and (b) have a number of pending requests for that adapter below a threshold, pick the one with the most pending requests (we pick the most to prevent flopping).
3. If no pods satisfy both (a) and (b), pick a pod with (in the following priority):
   1. Least number of active adapters
   2. Least total pending requests

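A compact Go rendering of this selection logic, with simplified stand-in types for the ext proc cache (the real PoC code lives in the repo linked under [Artifacts](#artifacts)):

```go
package scheduler

// Pod is a simplified stand-in for a cache entry about one model server.
type Pod struct {
	Name           string
	ActiveAdapters map[string]int // adapter name -> pending requests
}

// totalPending sums pending requests across all adapters on the pod.
func totalPending(p Pod) int {
	n := 0
	for _, q := range p.ActiveAdapters {
		n += q
	}
	return n
}

// pick implements the PoC selection: among pods that have the adapter
// loaded and are under the per-adapter queue threshold, choose the one
// with the MOST pending requests (to prevent flopping); otherwise fall
// back to fewest active adapters, then fewest total pending requests.
func pick(pods []Pod, adapter string, threshold int) *Pod {
	var best *Pod
	for i := range pods {
		q, loaded := pods[i].ActiveAdapters[adapter]
		if loaded && q < threshold {
			if best == nil || q > best.ActiveAdapters[adapter] {
				best = &pods[i]
			}
		}
	}
	if best != nil {
		return best
	}
	// Fallback: least active adapters, ties broken by least total pending.
	for i := range pods {
		p := &pods[i]
		if best == nil ||
			len(p.ActiveAdapters) < len(best.ActiveAdapters) ||
			(len(p.ActiveAdapters) == len(best.ActiveAdapters) &&
				totalPending(*p) < totalPending(*best)) {
			best = p
		}
	}
	return best
}
```
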

### Artifacts

- [Ext-proc/Envoy/Benchmarking repo](https://github.com/tomatillo-and-multiverse/lora-inference-gateway)
  - The repo we used to develop the ext proc image used in the PoC
  - Also contains the manifests required to deploy the gateway
- [vLLM fork](https://github.com/kaushikmitr/vllm)
- Presentation:
  - [Slides](https://docs.google.com/presentation/d/1I1XDf6fQQEtHxJtZxFdIaUcUA3lLBC7neW823diWS78/edit?usp=sharing)
  - [Recording](https://youtu.be/NUBZg_uqqXk?si=v681EeYdGUGEVqQQ&t=1458)
- [PoC Design & Experimentation data](https://docs.google.com/document/d/17wB0BgeV8JrGtccxZqkOqFyNC4gPBNqdKg8Oe9xMkio/edit#heading=h.eeeqp85g68qy)