Adapt LeaderWorkerSet to implement multi-node distributed inference #4001

JesseStutler · 2025-02-07T13:59:58Z

What is the problem you're trying to solve

Background

Large language models are being developed and adopted at an explosive pace, and open-source models such as DeepSeek-R1 keep emerging, which drives demand among developers for deploying large models in local environments. However, as parameter counts keep growing, the memory of a single device is no longer enough to hold a complete model. Some inference frameworks have begun actively exploring multi-node distributed inference solutions.

New API for multi-node distributed inference

LeaderWorkerSet

A Kubernetes SIG has designed a new API for the multi-node distributed inference scenario, called LeaderWorkerSet:
https://github.com/kubernetes-sigs/lws

KServe ServingRuntime/ClusterServingRuntime WorkerSpec

KServe has also modified its serving API, adding a new field called WorkerSpec to implement multi-node distributed inference.

After discussing with @Monokaix and @hwdef, we think it is better to implement LeaderWorkerSet support first and gather end users' feedback.

Describe the solution you'd like

LeaderWorkerSet is designed around a logical PodGroup concept, corresponding to 1 Leader + n Workers. Volcano needs to keep this logical PodGroup consistent with Volcano's own PodGroup. The replicas in a LeaderWorkerSet represent the number of Volcano PodGroups to be created. Within each PodGroup, one task is the Leader Pod, with a replica count of 1, and the other task is the Workers. The following tasks therefore need to be done:

Add a LeaderWorkerSet controller that reconciles to create PodGroups for lws: [WIP]Support LeaderWorkerSet #4043
Implement network topology aware scheduling for worker pods
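
To make the replicas-to-PodGroup mapping described above concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not the controller in #4043: the class names, the `<lws-name>-<index>` naming scheme, and the choice of setting the PodGroup's minimum member count to the full group size (1 + n) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LwsSketch:
    """Hypothetical, simplified view of a LeaderWorkerSet (not the real API type)."""
    name: str
    replicas: int           # number of logical groups, i.e. PodGroups to create
    workers_per_group: int  # the "n" in "1 Leader + n Workers"


@dataclass
class PodGroupPlan:
    """Hypothetical description of one Volcano PodGroup to be created."""
    name: str
    leader_pods: int
    worker_pods: int
    min_member: int  # assumption: gang-schedule the whole group together


def plan_pod_groups(lws: LwsSketch) -> List[PodGroupPlan]:
    """Return one PodGroup plan per LeaderWorkerSet replica."""
    if lws.replicas < 0 or lws.workers_per_group < 0:
        raise ValueError("replicas and workers_per_group must be non-negative")
    plans = []
    for index in range(lws.replicas):
        plans.append(
            PodGroupPlan(
                name=f"{lws.name}-{index}",  # naming scheme is an assumption
                leader_pods=1,               # the leader task always has 1 replica
                worker_pods=lws.workers_per_group,
                min_member=1 + lws.workers_per_group,
            )
        )
    return plans


if __name__ == "__main__":
    for plan in plan_pod_groups(LwsSketch(name="llm", replicas=2, workers_per_group=3)):
        print(plan)
```

With `replicas=2` and three workers per group, the sketch yields two plans, each covering one leader and three workers. Whether the real controller should require the whole group before scheduling is a design question for the implementation.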

In the future, if users would like Volcano to provide a native API such as vcserve for online services like multi-node inference, we may design one, but for now it is fine to follow lws.

Additional context

My own research notes on multi-node distributed inference: https://docs.google.com/document/d/19Z0-hCdjKiL8AGA59NjZ-tj-ijDX8cFCKItZ2QqqJpY/edit?usp=sharing

JesseStutler · 2025-02-07T14:01:31Z

Targeting milestone v1.12; we may need to start implementing this soon.

hwdef · 2025-02-08T02:52:44Z

This is very useful.
We need to adapt to lws, and I think Volcano will also need to implement a serving API in the future, such as vcserve.

I'm not sure whether the following is needed, because we could implement the requirements with a StatefulSet and avoid introducing too many third-party packages:

Add a LeaderWorkerSet controller, reconcile to create podgroups for lws

JesseStutler · 2025-02-08T04:36:30Z

> This is very useful. We need to adapt to lws, and I think volcano also needs to implement a serve API in the future, such as vcserve
>
> I'm not sure if the following is needed, because we can implement the requirements through statefulset to avoid introducing too many third-party packages
> Add a LeaderWorkerSet controller, reconcile to create podgroups for lws

I think it's okay. lws is also an API pushed by a Kubernetes SIG; things like KServe are what is really third-party. If we don't add the lws controller, we'll need to add special-case checks to the PodGroup controller, which isn't very readable, and I'm not sure what kind of problems that would cause. We also need the RestartPolicy field in lws.

JesseStutler added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 7, 2025

Monokaix mentioned this issue Feb 12, 2025

How to support Volcano scheduling when using LeaderWorkerSet? #4005

Open

JesseStutler mentioned this issue Feb 25, 2025

[WIP]Support LeaderWorkerSet #4043

Open
