Commit ab2942e

formatting fixes
1 parent e6906ab commit ab2942e

1 file changed (+10 -18 lines changed)

wg-serving/proposals/llm_instance_gateway/README.md

Lines changed: 10 additions & 18 deletions
@@ -26,10 +26,8 @@ use cases upon shared hardware has distinct advantages in enabling efficient and
 
 ## Motivation
 
-Novel advancements in fine-tuning like [LoRA](https://arxiv.org/abs/2106.09685) and [Multi-LoRA](https://arxiv.org/abs/2310.18547) have enabled
-multiple distinct use cases to share accelerators. As this new tech is adopted,
-the Day1/2 operational
-concerns quickly become necessary.
+Novel advancements in fine-tuning like [LoRA](https://arxiv.org/abs/2106.09685) and [Multi-LoRA](https://arxiv.org/abs/2310.18547) have enabled multiple distinct use cases to share accelerators. As this new tech is adopted, the Day1/2 operational concerns quickly become necessary.
+
 Kubernetes as long been a standard in easing and automating operational tasks of
 workloads. A mechanism (gateway) within the K8s ecosystem is a
 reasonable, and expected way for a user to support multiple LLM use cases on shared
@@ -44,11 +42,9 @@ accelerators.
 
 #### Gateway Goals
 
-- Fast reconfiguration - New use cases (including LoRA adapters
-or client configuration) can be rolled out / back in seconds to clients without
-
-waiting for a new
-model server to start.
+- Fast reconfiguration - New use cases (including LoRA adapters or client
+configuration) can be rolled out / back in seconds to clients without waiting for
+a new model server to start.
 - Efficient accelerator sharing - Use cases can use less than an accelerator
 or temporarily burst without needing to start a new model server leading to
 fewer wasted accelerators and better pooling of shared capacity.
@@ -57,12 +53,10 @@ or client configuration) can be rolled out / back in seconds to clients witho
 - Standardized LoRA - Simple recommended patterns for deploying and loading
 LoRA adapters on a wide range of Kubernetes environments into model servers.
 - Composability - Approach should be composable with:
-- K8s Gateway
-API
-- Other gateway features and projects, including high level LLM gateways
-- existing deployment tools like kserve or kaito
-
-- different model servers
+- K8s Gateway API
+- Other gateway features and projects, including high level LLM gateways
+- existing deployment tools like kserve or kaito
+- different model servers
 
 ### Non-Goals
 
@@ -161,9 +155,7 @@ To briefly describe how the components work together:
 
 - When an `LLMRoute` is defined, our gateway recognizes this new service, and
 allows traffic for the specified adapter to be admitted to the backend pool.
-
-- We support and expect Open AI API spec as the default when
-reading the
+- We support and expect Open AI API spec as the default when reading the
 adapter.
 
 - Incoming traffic for a validated service is then routed to ExtProc, where
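The last hunk touches the request-path description: once an `LLMRoute` is defined, traffic for that adapter is admitted, and the adapter is read from the request following the OpenAI API spec by default. Below is a minimal sketch of that admission check. It assumes the adapter name travels in the top-level `model` field of an OpenAI-style JSON request body. The names `ADMITTED_ADAPTERS`, `adapter_from_openai_request`, and `admit`, plus the adapter names, are hypothetical and do not come from the proposal.

```python
import json

# Hypothetical set of adapter names admitted by LLMRoute objects. A real
# gateway would populate this from the Kubernetes API, not a literal.
ADMITTED_ADAPTERS = {"sql-lora", "tweet-summary-lora"}


def adapter_from_openai_request(body: bytes) -> str:
    """Return the adapter name carried in an OpenAI-style request body.

    Assumption: the adapter is named by the top-level "model" field, as in
    the OpenAI Completions and Chat Completions request schemas.
    """
    payload = json.loads(body)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(payload, dict):
        raise ValueError("request body is not a JSON object")
    model = payload.get("model")
    if not isinstance(model, str) or not model:
        raise ValueError("request body has no usable 'model' field")
    return model


def admit(body: bytes) -> bool:
    """Admit a request only if its adapter is covered by some LLMRoute."""
    try:
        return adapter_from_openai_request(body) in ADMITTED_ADAPTERS
    except ValueError:
        # Malformed bodies and unexpected shapes are rejected, not defaulted.
        return False


if __name__ == "__main__":
    ok = json.dumps({"model": "sql-lora", "prompt": "Write a query"}).encode()
    print(admit(ok))                       # True
    print(admit(b'{"model": "unknown"}'))  # False
```

In this sketch, requests with a missing or malformed `model` field are rejected instead of being sent to a default adapter. That is only one possible policy. The excerpt does not say how the gateway handles such requests.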
