diff --git a/keps/94-engine-runtime/README.md b/keps/94-engine-runtime/README.md
new file mode 100644
index 00000000..c10d9585
--- /dev/null
+++ b/keps/94-engine-runtime/README.md
@@ -0,0 +1,260 @@
+# KEP-94 Engine Runtime API
+
+- [Motivation](#motivation)
+- [Proposal](#proposal)
+  - [User Stories (Optional)](#user-stories-optional)
+- [Design Details](#design-details)
+  - [Engine Runtime API](#engine-runtime-api)
+  - [Architecture](#architecture)
+  - [Implementation Details for Engine Runtime Sidecar](#implementation-details-for-engine-runtime-sidecar)
+    - [Inference Engine (Base Class)](#inference-engine-base-class)
+    - [Topology Manager](#topology-manager)
+  - [Test Plan](#test-plan)
+    - [Unit Tests](#unit-tests)
+    - [Integration tests](#integration-tests)
+    - [End to End Tests](#end-to-end-tests)
+- [Alternatives](#alternatives)
+
+## Motivation
+
+1. Running inference engines in a Kubernetes cluster usually requires network topology information about the inference engine Pods. A Prefill-Decode disaggregation deployment typically consists of three roles: router, prefill worker, and decode worker. The router maintains the network topology of the active prefill and decode worker Pods (e.g., Pod IPs or Pod domain names). Whenever these workers scale in or out, the router must update its topology promptly so that request scheduling stays accurate.
+
+2. To maximize GPU utilization, inference engines support modifying their internal state without downtime. Such capabilities include dynamic LoRA updates ([SGLang](https://docs.sglang.ai/advanced_features/lora.html#Dynamic-LoRA-loading), [vLLM](https://docs.vllm.ai/en/stable/features/lora.html#dynamically-serving-lora-adapters)) as well as memory release and restore ([SGLang](https://github.com/fzyzcjy/torch_memory_saver)). When the inference engine runs in a container, these state changes are invisible to the Kubernetes cluster, which makes them hard to orchestrate with Kubernetes resources.
+
+## Proposal
+
+### User Stories (Optional)
+
+1. As an AI engineer who deploys inference services, I expect the deployed Prefill-Decode disaggregation inference service to support autoscaling. The scaling process should not affect the availability of the inference service. Furthermore, once a scaling operation completes, the routing layer (e.g., sgl-router) should update the inference service's topology as quickly as possible so that the new capacity actually serves traffic.
+
+2. As an AI engineer who deploys inference services, I'd like to load and unload LoRA adapters dynamically without restarting an inference engine. The operator, gateway, and other Kubernetes components should also adjust their behavior in response to the LoRA changes.
+
+3. As an AI engineer who deploys inference services, I want to debug, profile, or audit an inference engine's execution.
+
+## Design Details
+
+### Engine Runtime API
+
+A new cluster-scoped CRD called `ClusterEngineRuntimeProfile` will be introduced. It describes an engine runtime sidecar container spec that can be reused across the role specs of any RBG. A short example:
+
+```yaml
+apiVersion: workloads.x-k8s.io/v1alpha1
+kind: ClusterEngineRuntimeProfile
+metadata:
+  name: sglang-pd-runtime
+spec:
+  containers:
+  - image:
+    imagePullPolicy: Always
+    name: engine-runtime
+    env:
+    - name: TOPO_TYPE
+      value: "sglang"
+    - name: SGL_ROUTER_ROLE_NAME
+      value: "router"
+  updateStrategy: NoUpdate
+```
+
+If an RBG needs the engine runtime sidecar container, reference the `ClusterEngineRuntimeProfile` by name in the RBG's role spec. The containers specified under the role's `engineRuntimes` entry are merged with the container spec defined in the `ClusterEngineRuntimeProfile`.
+
+```yaml
+roles:
+- name: prefill
+  ...
+  engineRuntimes:
+  - profileName: sglang-pd-runtime
+    containers:
+    - name: patio-runtime
+      args:
+      - --instance-info={"data":{"port":8000,"worker_type":"prefill", "bootstrap_port":34000}}
+      env:
+      - name: SGL_ROUTER_PORT
+        value: "8000"
+```
+
+### Architecture
+
+```mermaid
+%%{init: {'theme': 'neutral'}}%%
+graph TD;
+    CT["Controller"]
+    subgraph "RBG (myrbg)"
+        subgraph "Role A (myrbg-A)"
+            subgraph "Pod (myrbg-A-0)"
+                direction LR
+                D[Engine Runtime Sidecar]
+                E[App Container]
+                D <--> E
+            end
+        end
+
+        subgraph "Role B (myrbg-B)"
+            subgraph "Pod (myrbg-B-0)"
+                direction LR
+                F[Engine Runtime Sidecar]
+                G[App Container]
+                F <--> G
+            end
+        end
+    end
+    CT --> F
+    CT --> D
+```
+
+### Implementation Details for Engine Runtime Sidecar
+
+#### Inference Engine (Base Class)
+
+```python
+from abc import ABC
+from dataclasses import dataclass
+from typing import Optional, Union
+
+
+@dataclass
+class InferenceEngine(ABC):
+    name: str
+    version: str
+    endpoint: str
+    headers: Optional[dict] = None
+
+    async def ready(self) -> bool:
+        # The engine is considered ready once its HTTP endpoint responds.
+        return _is_url_ready(self.endpoint) if self.endpoint else True
+
+    async def load_lora_adapter(
+        self, request: LoadLoraAdapterRequest
+    ) -> Union[ErrorResponse, str]:
+        return not_implemented_error(
+            f"Inference engine {self.name} (version {self.version}) does not support loading LoRA adapters")
+
+    async def unload_lora_adapter(
+        self, request: UnLoadLoraAdapterRequest
+    ) -> Union[ErrorResponse, str]:
+        return not_implemented_error(
+            f"Inference engine {self.name} (version {self.version}) does not support unloading LoRA adapters")
+```
+
+The `InferenceEngine` class currently covers LoRA-related load/unload operations. For each supported inference engine (vLLM or SGLang), a concrete subclass sends the appropriate LoRA load or unload request to the engine's HTTP endpoint. Additional operations can be added to `InferenceEngine` in the future.
+
+#### Topology Manager
+
+```python
+import signal
+from abc import ABC, abstractmethod
+
+
+class TopoManager(ABC):
+    def __init__(self, engine: InferenceEngine, worker_info: dict):
+        self.engine = engine
+        # Block until the engine's HTTP endpoint is reachable, then register.
+        wait_until_engine_ready(self.engine, timeout=180)
+        self.register(worker_info)
+        self.worker_info = worker_info
+        # Unregister gracefully when the Pod is terminated.
+        signal.signal(signal.SIGTERM, gracefully_stop_handler)
+        signal.signal(signal.SIGINT, gracefully_stop_handler)
+
+    @abstractmethod
+    def register(self, worker_info: dict):
+        raise NotImplementedError
+
+    @abstractmethod
+    def unregister(self):
+        raise NotImplementedError
+
+    @abstractmethod
+    def refresh_worker_info(self, new_worker_info: dict):
+        raise NotImplementedError
+```
+
+- `register`: register a worker with the scheduler/router role.
+- `unregister`: unregister a worker from the scheduler/router role.
+- `refresh_worker_info`: refresh the worker's info, e.g. to update its request scheduling priority or weight.
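+
+To make these two extension points more concrete, the sketch below shows what an SGLang-backed engine and an sgl-router topology manager could look like. It is illustrative only and not part of the proposed API: the SGLang `/load_lora_adapter` and `/unload_lora_adapter` payloads, the sgl-router `/add_worker` and `/remove_worker` endpoints, the `SGL_ROUTER_URL` environment variable, the `worker_info` keys (`ip`, `port`), and the request field names are all assumptions that must be verified against the actual engine and router versions.
+
+```python
+import os
+from typing import Union
+
+import httpx
+
+
+class SGLangEngine(InferenceEngine):
+    """Illustrative SGLang engine; endpoint paths and payloads are assumptions."""
+
+    async def load_lora_adapter(
+        self, request: LoadLoraAdapterRequest
+    ) -> Union[ErrorResponse, str]:
+        async with httpx.AsyncClient(headers=self.headers) as client:
+            # Assumed SGLang payload shape: {"lora_name": ..., "lora_path": ...}.
+            resp = await client.post(
+                f"{self.endpoint}/load_lora_adapter",
+                json={"lora_name": request.lora_name, "lora_path": request.lora_path},
+            )
+            resp.raise_for_status()
+            return f"LoRA adapter {request.lora_name} loaded"
+
+    async def unload_lora_adapter(
+        self, request: UnLoadLoraAdapterRequest
+    ) -> Union[ErrorResponse, str]:
+        async with httpx.AsyncClient(headers=self.headers) as client:
+            resp = await client.post(
+                f"{self.endpoint}/unload_lora_adapter",
+                json={"lora_name": request.lora_name},
+            )
+            resp.raise_for_status()
+            return f"LoRA adapter {request.lora_name} unloaded"
+
+
+class SGLRouterTopoManager(TopoManager):
+    """Illustrative topology manager that registers this worker with sgl-router."""
+
+    # Assumption: the router address is injected by the controller, e.g. via env.
+    router_url = os.environ.get("SGL_ROUTER_URL", "http://router:8000")
+
+    def register(self, worker_info: dict):
+        # Assumption: sgl-router accepts POST /add_worker?url=<worker_url>.
+        worker_url = f"http://{worker_info['ip']}:{worker_info['port']}"
+        httpx.post(f"{self.router_url}/add_worker", params={"url": worker_url}).raise_for_status()
+
+    def unregister(self):
+        # Mirror of register; removes this worker from the router's topology.
+        worker_url = f"http://{self.worker_info['ip']}:{self.worker_info['port']}"
+        httpx.post(f"{self.router_url}/remove_worker", params={"url": worker_url}).raise_for_status()
+
+    def refresh_worker_info(self, new_worker_info: dict):
+        # Simplest possible refresh: re-register with the new info.
+        self.unregister()
+        self.register(new_worker_info)
+        self.worker_info = new_worker_info
+```
+
+In this sketch, the `gracefully_stop_handler` installed by `TopoManager.__init__` is expected to call `unregister()`, so the router stops sending traffic to a terminating worker before the engine container exits.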
+
+### Test Plan
+
+#### Unit Tests
+
+#### Integration tests
+
+#### End to End Tests
+
+## Alternatives
+
+
\ No newline at end of file
diff --git a/keps/94-engine-runtime/kep.yaml b/keps/94-engine-runtime/kep.yaml
new file mode 100644
index 00000000..ef0e8fba
--- /dev/null
+++ b/keps/94-engine-runtime/kep.yaml
@@ -0,0 +1,51 @@
+title: Engine Runtime API
+kep-number: 94
+authors:
+  - "@jane.doe"
+owning-sig: sig-xyz
+participating-sigs:
+  - sig-aaa
+  - sig-bbb
+status: provisional|implementable|implemented|deferred|rejected|withdrawn|replaced
+creation-date: yyyy-mm-dd
+reviewers:
+  - TBD
+  - "@alice.doe"
+approvers:
+  - TBD
+  - "@oscar.doe"
+
+see-also:
+  - "/keps/sig-aaa/1234-we-heard-you-like-keps"
+  - "/keps/sig-bbb/2345-everyone-gets-a-kep"
+replaces:
+  - "/keps/sig-ccc/3456-replaced-kep"
+
+# The target maturity stage in the current dev cycle for this KEP.
+# If the purpose of this KEP is to deprecate a user-visible feature
+# and a Deprecated feature gate is added, it should be deprecated|disabled|removed.
+stage: alpha|beta|stable
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v1.19"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  alpha: "v1.19"
+  beta: "v1.20"
+  stable: "v1.22"
+
+# The following PRR answers are required at alpha release
+# List the feature gate name and the components for which it must be enabled
+feature-gates:
+  - name: MyFeature
+    components:
+      - kube-apiserver
+      - kube-controller-manager
+disable-supported: true
+
+# The following PRR answers are required at beta release
+metrics:
+  - my_feature_metric
\ No newline at end of file