kep: add engine runtime API support #94
Open
TrafalgarZZZ
wants to merge
2
commits into
sgl-project:main
from
TrafalgarZZZ:doc/engine_runtime_kep
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,260 @@ | ||
| # KEP-92 Engine Runtime API | ||
|
|
||
| <!-- | ||
| This is the title of your KEP. Keep it short, simple, and descriptive. A good | ||
| title can help communicate what the KEP is and should be considered as part of | ||
| any review. | ||
| --> | ||
|
|
||
| <!-- | ||
| A table of contents is helpful for quickly jumping to sections of a KEP and for | ||
| highlighting any additional information provided beyond the standard KEP | ||
| template. | ||
|
|
||
| Ensure the TOC is wrapped with | ||
| <code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code> | ||
| tags, and then generate with `hack/update-toc.sh`. | ||
| --> | ||
|
|
||
| <!-- toc --> | ||
| - [Motivation](#motivation) | ||
| - [Proposal](#proposal) | ||
| - [User Stories (Optional)](#user-stories-optional) | ||
| - [Story 1](#story-1) | ||
| - [Risks and Mitigations](#risks-and-mitigations) | ||
| - [Design Details](#design-details) | ||
| - [Implementation](#implementation) | ||
| - [Test Plan](#test-plan) | ||
| - [Unit Tests](#unit-tests) | ||
| - [Integration tests](#integration-tests) | ||
| - [End to End Tests](#end-to-end-tests) | ||
| <!-- /toc --> | ||
|
|
||
| ## Motivation | ||
|
|
||
| <!-- | ||
| This section is for explicitly listing the motivation, goals, and non-goals of | ||
| this KEP. Describe why the change is important and the benefits to users. The | ||
| motivation section can optionally provide links to [experience reports] to | ||
| demonstrate the interest in a KEP within the wider Kubernetes community. | ||
|
|
||
| [experience reports]: https://github.com/golang/go/wiki/ExperienceReports | ||
| --> | ||
|
|
||
| 1. Running inference engines in a Kubernetes cluster usually requires network topology information about the inference engine Pods. A Prefill-Decode disaggregated deployment typically consists of three roles: router, prefill worker, and decode worker. The router maintains the network topology of the active prefill and decode worker Pods (e.g., Pod IPs or Pod domain names). When any of them scales in or out, the router must update its topology so that request scheduling reflects the current set of workers. | ||
|
|
||
| 2. To maximize GPU utilization, inference engines support modifying their internal state without downtime. Such capabilities include dynamic LoRA updates ([SGLang](https://docs.sglang.ai/advanced_features/lora.html#Dynamic-LoRA-loading), [vLLM](https://docs.vllm.ai/en/stable/features/lora.html#dynamically-serving-lora-adapters)) as well as memory release and restore ([SGLang](https://github.com/fzyzcjy/torch_memory_saver)). When the inference engine runs in a container, these state changes are invisible to the Kubernetes cluster, which makes them hard to orchestrate with Kubernetes resources. | ||
|
|
||
|
|
||
| ## Proposal | ||
|
|
||
| <!-- | ||
| This is where we get down to the specifics of what the proposal actually is. | ||
| This should have enough detail that reviewers can understand exactly what | ||
| you're proposing, but should not include things like API designs or | ||
| implementation. What is the desired outcome and how do we measure success?. | ||
| The "Design Details" section below is for the real | ||
| nitty-gritty. | ||
| --> | ||
|
|
||
| ### User Stories (Optional) | ||
|
|
||
| <!-- | ||
| Detail the things that people will be able to do if this KEP is implemented. | ||
| Include as much detail as possible so that people can understand the "how" of | ||
| the system. The goal here is to make this feel real for users without getting | ||
| bogged down. | ||
| --> | ||
|
|
||
| 1. As an AI engineer who deploys inference services, I expect the deployed Prefill-Decode disaggregated inference service to support autoscaling. The scaling process should not affect the availability of the inference service. Furthermore, once a scaling operation completes, the routing layer (e.g., sgl-router) should update the inference service's topology as quickly as possible so that the scaling actually takes effect. | ||
|
|
||
| 2. As an AI engineer who deploys inference services, I'd like to load and unload LoRA adapters dynamically without restarting an inference engine. The operator, gateway, and other Kubernetes components should also adjust their behavior in response to LoRA changes. | ||
|
|
||
| 3. As an AI engineer who deploys inference services, I want to debug, profile, or audit an inference engine's execution. | ||
|
|
||
| ## Design Details | ||
|
|
||
| <!-- | ||
| This section should contain enough information that the specifics of your | ||
| change are understandable. This may include API specs (though not always | ||
| required) or even code snippets. If there's any ambiguity about HOW your | ||
| proposal will be implemented, this is the place to discuss them. | ||
| --> | ||
| ### Engine Runtime API | ||
|
|
||
| A new CRD called ClusterEngineRuntimeProfile will be introduced. It describes a cluster-scoped engine runtime sidecar container spec that can be reused anywhere in RBG's role specs. A short example: | ||
|
|
||
| ```yaml | ||
| apiVersion: workloads.x-k8s.io/v1alpha1 | ||
| kind: ClusterEngineRuntimeProfile | ||
| metadata: | ||
|   name: sglang-pd-runtime | ||
| spec: | ||
|   containers: | ||
|     - image: <IMG> | ||
|       imagePullPolicy: Always | ||
|       name: engine-runtime | ||
|       env: | ||
|         - name: TOPO_TYPE | ||
|           value: "sglang" | ||
|         - name: SGL_ROUTER_ROLE_NAME | ||
|           value: "router" | ||
|   updateStrategy: NoUpdate | ||
| ``` | ||
|
|
||
| If an RBG needs the engine runtime sidecar container, specify the ClusterEngineRuntimeProfile's name in the RBG's role spec. The engine runtime sidecar container spec given in the role is merged into the container spec defined in the ClusterEngineRuntimeProfile. | ||
|
|
||
| ```yaml | ||
| roles: | ||
|   - name: prefill | ||
|     ... | ||
|     engineRuntimes: | ||
|       - profileName: sglang-pd-runtime | ||
|         containers: | ||
|           - name: patio-runtime | ||
|             args: | ||
|               - --instance-info={"data":{"port":8000,"worker_type":"prefill", "bootstrap_port":34000}} | ||
|             env: | ||
|               - name: SGL_ROUTER_PORT | ||
|                 value: "8000" | ||
| ``` | ||
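The KEP does not spell out the merge rules, so the following is only a minimal sketch of one plausible interpretation: scalar fields from the role override win, `env` entries are merged by name, and `args` are appended. The function names `merge_named_list` and `merge_container` are hypothetical, and how a role container is paired with a profile container is left out of scope.

```python
from copy import deepcopy


def merge_named_list(base: list, override: list) -> list:
    """Merge two lists of {"name": ...} dicts; override entries win by name."""
    merged = {item["name"]: deepcopy(item) for item in base}
    for item in override:
        merged[item["name"]] = {**merged.get(item["name"], {}), **deepcopy(item)}
    return list(merged.values())


def merge_container(profile: dict, override: dict) -> dict:
    """Overlay a role-level container override on one profile container.

    Assumed rules (not confirmed by the KEP): env merges by name,
    args are appended, every other field is replaced by the override.
    """
    result = deepcopy(profile)
    for key, value in override.items():
        if key == "env":
            result["env"] = merge_named_list(result.get("env", []), value)
        elif key == "args":
            result["args"] = result.get("args", []) + list(value)
        else:
            result[key] = deepcopy(value)
    return result
```

Under these assumptions, merging the profile container above with the `prefill` role override would keep the profile's image and `TOPO_TYPE`, and add the role's `args` and `SGL_ROUTER_PORT`.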
|
|
||
| ### Architecture | ||
|
|
||
| ```mermaid | ||
| %%{init: {'theme': 'neutral'}}%% | ||
| graph TD; | ||
| CT["Controller"] | ||
| subgraph "RBG (myrbg)" | ||
| subgraph "Role A (myrbg-A)" | ||
| subgraph "Pod (myrbg-A-0)" | ||
| direction LR | ||
| D[Engine Runtime Sidecar] | ||
| E[App Container] | ||
| D <--> E | ||
| end | ||
| end | ||
|
|
||
| subgraph "Role B (myrbg-B)" | ||
| subgraph "Pod (myrbg-B-0)" | ||
| direction LR | ||
| F[Engine Runtime Sidecar] | ||
| G[App Container] | ||
| F <--> G | ||
| end | ||
| end | ||
| end | ||
| CT --> F | ||
| CT --> D | ||
| ``` | ||
|
|
||
| ## Implementation Details for Engine Runtime Sidecar | ||
|
|
||
| ### Inference Engine (Base Class) | ||
|
|
||
| ```python | ||
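| from abc import ABC | ||
| from dataclasses import dataclass | ||
| from typing import Optional, Union | ||
|
|
||
| # LoadLoraAdapterRequest, UnLoadLoraAdapterRequest, ErrorResponse, | ||
| # not_implemented_error and _is_url_ready are assumed to be defined | ||
| # elsewhere in the sidecar. | ||
|
|
||
| @dataclass | ||
| class InferenceEngine(ABC): | ||
|     name: str | ||
|     version: str | ||
|     endpoint: str | ||
|     headers: Optional[dict] = None | ||
|
|
||
|     async def ready(self) -> bool: | ||
|         # An engine without an endpoint is treated as always ready. | ||
|         return _is_url_ready(self.endpoint) if self.endpoint else True | ||
|
|
||
|     async def load_lora_adapter( | ||
|         self, request: LoadLoraAdapterRequest | ||
|     ) -> Union[ErrorResponse, str]: | ||
|         # Default: unsupported. Engine-specific subclasses override this. | ||
|         return not_implemented_error( | ||
|             f"Inference engine {self.name} with version {self.version} does not support loading LoRA adapters") | ||
|
|
||
|     async def unload_lora_adapter( | ||
|         self, request: UnLoadLoraAdapterRequest | ||
|     ) -> Union[ErrorResponse, str]: | ||
|         # Default: unsupported. Engine-specific subclasses override this. | ||
|         return not_implemented_error( | ||
|             f"Inference engine {self.name} with version {self.version} does not support unloading LoRA adapters") | ||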
| ``` | ||
|
|
||
| The InferenceEngine class currently supports LoRA-related load/unload operations. For each inference engine (vLLM or SGLang), InferenceEngine sends the appropriate LoRA load or unload request to the engine's HTTP endpoint. | ||
| In the future, InferenceEngine can be extended with additional operations. | ||
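As an illustration of the per-engine dispatch described here, the sketch below shows how engine-specific classes might differ only in the route they post to. It is not the sidecar's implementation: `EngineSketch`, `SGLangEngineSketch`, `VLLMEngineSketch`, the `PostFn` transport, and `fake_post` are hypothetical names, the request type is a simplified stand-in, and the routes and payload keys are assumptions that should be checked against the SGLang and vLLM documentation linked in the Motivation section.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional

# Hypothetical transport: POST a JSON payload and return the response body.
PostFn = Callable[[str, dict, Optional[dict]], Awaitable[str]]


@dataclass
class LoadLoraAdapterRequest:
    # Simplified stand-in for the sidecar's request type.
    lora_name: str
    lora_path: str


@dataclass
class EngineSketch:
    # Trimmed stand-in for InferenceEngine with an injectable transport.
    name: str
    endpoint: str
    post: PostFn
    headers: Optional[dict] = None
    load_route: str = ""  # empty means no LoRA loading route is known

    async def load_lora_adapter(self, request: LoadLoraAdapterRequest) -> str:
        if not self.load_route:
            return f"Inference engine {self.name} does not support loading LoRA adapters"
        url = self.endpoint.rstrip("/") + self.load_route
        payload = {"lora_name": request.lora_name, "lora_path": request.lora_path}
        return await self.post(url, payload, self.headers)


@dataclass
class SGLangEngineSketch(EngineSketch):
    load_route: str = "/load_lora_adapter"  # assumed SGLang route


@dataclass
class VLLMEngineSketch(EngineSketch):
    load_route: str = "/v1/load_lora_adapter"  # assumed vLLM route


async def fake_post(url: str, payload: dict, headers: Optional[dict]) -> str:
    # Sends nothing; echoes the target so the dispatch can be inspected.
    return f"POST {url} {payload['lora_name']}"
```

Injecting the transport keeps the sketch testable without a running engine; a real sidecar would pass an HTTP client call instead of `fake_post`.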
|
|
||
|
|
||
| ### Topology Manager | ||
|
|
||
| ```python | ||
| import signal | ||
| from abc import ABC, abstractmethod | ||
|
|
||
| # wait_until_engine_ready and gracefully_stop_handler are assumed to be | ||
| # defined elsewhere in the sidecar. | ||
|
|
||
| class TopoManager(ABC): | ||
|     def __init__(self, engine: InferenceEngine, worker_info: dict): | ||
|         self.engine = engine | ||
|         # Store worker_info first so register() implementations can rely on it. | ||
|         self.worker_info = worker_info | ||
|         wait_until_engine_ready(self.engine, timeout=180) | ||
|         self.register(worker_info) | ||
|         signal.signal(signal.SIGTERM, gracefully_stop_handler) | ||
|         signal.signal(signal.SIGINT, gracefully_stop_handler) | ||
|
|
||
|     @abstractmethod | ||
|     def register(self, worker_info: dict): | ||
|         raise NotImplementedError | ||
|
|
||
|     @abstractmethod | ||
|     def unregister(self): | ||
|         raise NotImplementedError | ||
|
|
||
|     @abstractmethod | ||
|     def refresh_worker_info(self, new_worker_info: dict): | ||
|         raise NotImplementedError | ||
| ``` | ||
| - `register`: registers a worker with the scheduler/router role. | ||
| - `unregister`: unregisters a worker from the scheduler/router role. | ||
| - `refresh_worker_info`: refreshes worker info, e.g. to update request scheduling priority or weight. | ||
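To make the three hooks concrete, here is a minimal, self-contained sketch of a topology manager that keeps the router's view in a plain dictionary. It is illustrative only: `TopoManagerSketch` omits the readiness wait and signal handling of `TopoManager`, and `InMemoryTopoManager`, the `registry` dict, and the `"url"` key in `worker_info` are hypothetical.

```python
from abc import ABC, abstractmethod


class TopoManagerSketch(ABC):
    """Trimmed stand-in for TopoManager: no readiness wait, no signal handlers."""

    def __init__(self, worker_info: dict):
        self.worker_info = worker_info
        self.register(worker_info)

    @abstractmethod
    def register(self, worker_info: dict) -> None: ...

    @abstractmethod
    def unregister(self) -> None: ...

    @abstractmethod
    def refresh_worker_info(self, new_worker_info: dict) -> None: ...


class InMemoryTopoManager(TopoManagerSketch):
    """Keeps the router's view in a dict keyed by a hypothetical "url" field."""

    def __init__(self, registry: dict, worker_info: dict):
        self.registry = registry  # must exist before the base class registers
        super().__init__(worker_info)

    def register(self, worker_info: dict) -> None:
        self.registry[worker_info["url"]] = dict(worker_info)

    def unregister(self) -> None:
        self.registry.pop(self.worker_info["url"], None)

    def refresh_worker_info(self, new_worker_info: dict) -> None:
        # Replace the old entry so a changed URL does not leave a stale record.
        self.unregister()
        self.worker_info = new_worker_info
        self.register(new_worker_info)
```

A router-backed implementation would replace the dictionary operations with calls to the router, while keeping the same register, refresh, unregister lifecycle.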
|
|
||
| ### Test Plan | ||
|
|
||
| <!-- | ||
| **Note:** *Not required until targeted at a release.* | ||
| The goal is to ensure that we don't accept enhancements with inadequate testing. | ||
|
|
||
| All code is expected to have adequate tests (eventually with coverage | ||
| expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines] | ||
| when drafting this test plan. | ||
|
|
||
| [testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md | ||
| --> | ||
|
|
||
| #### Unit Tests | ||
|
|
||
| <!-- | ||
| In principle every added code should have complete unit test coverage, so providing | ||
| the exact set of tests will not bring additional value. | ||
| However, if complete unit test coverage is not possible, explain the reason of it | ||
| together with explanation why this is acceptable. | ||
| --> | ||
|
|
||
| <!-- | ||
| Additionally, try to enumerate the core package you will be touching | ||
| to implement this enhancement and provide the current unit coverage for those | ||
| in the form of: | ||
| - <package>: <date> - <current test coverage> | ||
|
|
||
| This can inform certain test coverage improvements that we want to do before | ||
| extending the production code to implement this enhancement. | ||
| --> | ||
|
|
||
| #### Integration tests | ||
|
|
||
| <!-- | ||
| Describe what tests will be added to ensure proper quality of the enhancement. | ||
|
|
||
| After the implementation PR is merged, add the names of the tests here. | ||
| --> | ||
|
|
||
| #### End to End Tests | ||
|
|
||
|
|
||
| ## Alternatives | ||
|
|
||
| <!-- | ||
| What other approaches did you consider, and why did you rule them out? These do | ||
| not need to be as detailed as the proposal, but should include enough | ||
| information to express the idea and why it was not acceptable. | ||
| --> |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,51 @@ | ||
| title: KEP Template | ||
| kep-number: NNNN | ||
| authors: | ||
| - "@jane.doe" | ||
| owning-sig: sig-xyz | ||
| participating-sigs: | ||
| - sig-aaa | ||
| - sig-bbb | ||
| status: provisional|implementable|implemented|deferred|rejected|withdrawn|replaced | ||
| creation-date: yyyy-mm-dd | ||
| reviewers: | ||
| - TBD | ||
| - "@alice.doe" | ||
| approvers: | ||
| - TBD | ||
| - "@oscar.doe" | ||
|
|
||
| see-also: | ||
| - "/keps/sig-aaa/1234-we-heard-you-like-keps" | ||
| - "/keps/sig-bbb/2345-everyone-gets-a-kep" | ||
| replaces: | ||
| - "/keps/sig-ccc/3456-replaced-kep" | ||
|
|
||
| # The target maturity stage in the current dev cycle for this KEP. | ||
| # If the purpose of this KEP is to deprecate a user-visible feature | ||
| # and a Deprecated feature gates are added, they should be deprecated|disabled|removed. | ||
| stage: alpha|beta|stable | ||
|
|
||
| # The most recent milestone for which work toward delivery of this KEP has been | ||
| # done. This can be the current (upcoming) milestone, if it is being actively | ||
| # worked on. | ||
| latest-milestone: "v1.19" | ||
|
|
||
| # The milestone at which this feature was, or is targeted to be, at each stage. | ||
| milestone: | ||
|   alpha: "v1.19" | ||
|   beta: "v1.20" | ||
|   stable: "v1.22" | ||
|
|
||
| # The following PRR answers are required at alpha release | ||
| # List the feature gate name and the components for which it must be enabled | ||
| feature-gates: | ||
|   - name: MyFeature | ||
|     components: | ||
|       - kube-apiserver | ||
|       - kube-controller-manager | ||
| disable-supported: true | ||
|
|
||
| # The following PRR answers are required at beta release | ||
| metrics: | ||
| - my_feature_metric | ||
Please update this file to keep in track.