wg-serving/proposals/llm_instance_gateway/README.md
## Motivation

Novel advancements in fine-tuning like [LoRA](https://arxiv.org/abs/2106.09685) and [Multi-LoRA](https://arxiv.org/abs/2310.18547) have enabled multiple distinct use cases to share accelerators. As this new tech is adopted, the Day 1/Day 2 operational concerns quickly become pressing.

Kubernetes has long been a standard in easing and automating the operational tasks of workloads. A mechanism (a gateway) within the K8s ecosystem is a reasonable and expected way for a user to support multiple LLM use cases on shared accelerators.
#### Gateway Goals

- Fast reconfiguration - New use cases (including LoRA adapters or client configuration) can be rolled out / rolled back in seconds for clients, without waiting for a new model server to start.
- Efficient accelerator sharing - Use cases can use less than a full accelerator, or temporarily burst, without needing to start a new model server, leading to fewer wasted accelerators and better pooling of shared capacity.
- Standardized LoRA - Simple recommended patterns for deploying and loading LoRA adapters into model servers across a wide range of Kubernetes environments.
- Composability - The approach should be composable with:
  - the K8s Gateway API
  - other gateway features and projects, including high-level LLM gateways
  - existing deployment tools like kserve or kaito
  - different model servers

### Non-Goals
To briefly describe how the components work together:

- When an `LLMRoute` is defined, our gateway recognizes this new service, and allows traffic for the specified adapter to be admitted to the backend pool.
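As a rough sketch of what such a resource could look like, the manifest below is purely illustrative: the API group, version, and every field name (`modelName`, `poolRef`) are assumptions for this proposal, not a finalized schema.

```yaml
# Hypothetical LLMRoute manifest — group, version, and field names are
# illustrative assumptions, not the finalized API.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: LLMRoute
metadata:
  name: chatbot-adapter
spec:
  # Name of the LoRA adapter this route admits traffic for (assumed field).
  modelName: chatbot-lora-v2
  # Backend pool of model server pods sharing the accelerator (assumed field).
  poolRef:
    name: shared-llm-pool
```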
- We support and expect the OpenAI API spec as the default when reading the adapter.
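Under that spec, a client would address a specific adapter through the standard `model` field of a completions request; the gateway can then route on that name. A minimal sketch, assuming a hypothetical adapter name and an OpenAI-style `/v1/completions` body (the helper below is illustrative, not part of the proposal):

```python
import json

def completion_request(adapter: str, prompt: str, max_tokens: int = 16) -> str:
    """Build an OpenAI-style /v1/completions request body.

    The gateway is assumed to route on the "model" field, which names
    the LoRA adapter rather than a dedicated model server.
    """
    return json.dumps({
        "model": adapter,        # adapter name the gateway routes on
        "prompt": prompt,
        "max_tokens": max_tokens,
    })

# Example body a client might POST through the gateway:
body = completion_request("chatbot-lora-v2", "Hello")
```

Because adapter selection rides on a field clients already send, no client-side changes are needed when an adapter is rolled out or back.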
- Incoming traffic for a validated service is then routed to ExtProc, where