LWS generates an incorrect number of replicas #390

Closed · Fixed by #394
lklkxcxc opened this issue Feb 17, 2025 · 12 comments

@lklkxcxc

I used Kubeflow Arena to generate a distributed-serving job with one master and one worker, but it created 12 pods:

arena serve distributed \
 --name=vllm \
 --version=alpha \
 --restful-port=5000 \
 --image=vllm/vllm-openai:latest \
 --data=pvc-model:/mnt/models \
 --masters=1 \
 --replicas=1 \
 --max-surge=1 \
 --master-gpus=1 \
 --master-command="ray start --head --port=6379; vllm serve /mnt/models/Qwen2-1.5B --port 5000 --dtype half --pipeline-parallel-size 2" \
 --workers=1 \
 --worker-gpus=1 \
 --worker-command="ray start --address=\$(MASTER_ADDR):6379 --block" \
 --share-memory=4Gi \
 --startup-probe-action=httpGet \
 --startup-probe-action-option="path: /health" \
 --startup-probe-action-option="port: 5000" \
 --startup-probe-option="periodSeconds: 60" \
 --startup-probe-option="failureThreshold: 5"

kubectl get pod output:

vllm-alpha-distributed-serving-0                             0/1     Running             0          10s
vllm-alpha-distributed-serving-0-0                           1/1     Running             0          10s
vllm-alpha-distributed-serving-0-0-0                         0/1     ContainerCreating   0          10s
vllm-alpha-distributed-serving-0-0-0-0                       0/1     ContainerCreating   0          10s
vllm-alpha-distributed-serving-0-0-0-0-0                     0/1     ContainerCreating   0          10s
vllm-alpha-distributed-serving-0-0-0-0-0-0                   0/1     ContainerCreating   0          10s
vllm-alpha-distributed-serving-0-0-0-0-0-0-0                 0/1     ContainerCreating   0          10s
vllm-alpha-distributed-serving-0-0-0-0-0-0-0-0               0/1     Pending             0          10s
vllm-alpha-distributed-serving-0-0-0-0-0-0-0-0-0             0/1     Pending             0          10s
vllm-alpha-distributed-serving-0-0-0-0-0-0-0-0-0-0           0/1     Pending             0          10s
vllm-alpha-distributed-serving-0-0-0-0-0-0-0-0-0-0-0         0/1     Pending             0          9s
vllm-alpha-distributed-serving-0-0-0-0-0-0-0-0-0-0-0-0       0/1     Pending             0          9s

kubectl get LeaderWorkerSet vllm-alpha-distributed-serving -o yaml output:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  annotations:
    helm.sh/created: "1739759061"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"leaderworkerset.x-k8s.io/v1","kind":"LeaderWorkerSet","metadata":{"annotations":{"helm.sh/created":"1739759061"},"labels":{"app":"distributed-serving","arena.kubeflow.org/uid":"3399d840e8b371ed7ca45dda29debeb1","chart":"distributed-serving-0.1.0","heritage":"Helm","release":"vllm-alpha","serviceName":"vllm","servingName":"vllm","servingType":"distributed-serving","servingVersion":"alpha"},"name":"vllm-alpha-distributed-serving","namespace":"default"},"spec":{"leaderWorkerTemplate":{"leaderTemplate":{"metadata":{"annotations":{"arena.kubeflow.org/username":"kubecfg:certauth:admin"},"labels":{"app":"distributed-serving","arena.kubeflow.org/uid":"3399d840e8b371ed7ca45dda29debeb1","chart":"distributed-serving-0.1.0","heritage":"Helm","release":"vllm-alpha","role":"master","serviceName":"vllm","servingName":"vllm","servingType":"distributed-serving","servingVersion":"alpha"}},"spec":{"containers":[{"command":["sh","-c","ray start --head --port=6379; vllm serve /mnt/models/Qwen2-1.5B --port 5000 --dtype half --pipeline-parallel-size 2"],"env":[{"name":"MASTER_ADDR","value":"$(LWS_LEADER_ADDRESS)"},{"name":"WORLD_SIZE","valueFrom":{"fieldRef":{"fieldPath":"metadata.annotations['leaderworkerset.sigs.k8s.io/size']"}}},{"name":"POD_NAME","valueFrom":{"fieldRef":{"fieldPath":"metadata.name"}}},{"name":"POD_INDEX","valueFrom":{"fieldRef":{"fieldPath":"metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']"}}},{"name":"GROUP_INDEX","valueFrom":{"fieldRef":{"fieldPath":"metadata.labels['leaderworkerset.sigs.k8s.io/group-index']"}}},{"name":"GPU_COUNT","value":"1"},{"name":"HOSTFILE","value":"/etc/hostfile"},{"name":"ROLE","value":"master"}],"image":"harbor.hzxingzai.cn/tools/vllm/vllm-openai:latest","imagePullPolicy":"IfNotPresent","name":"distributed-serving-master","ports":[{"containerPort":5000,"name":"restful","protocol":"TCP"}],"resources":{"limits":{"nvidia.com/gpu":1}},"startupProbe":{"failureThreshold":5,"httpGet":{"path":"/health","port":5000},"periodSeconds":60},"volumeMounts":[{"mountPath":"/dev/shm","name":"dshm"},{"mountPath":"/mnt/models","name":"pvc-model"},{"mountPath":"/etc/hostfile","name":"vllm-alpha-cm","subPathExpr":"hostfile-$(GROUP_INDEX)"}]}],"volumes":[{"configMap":{"items":[{"key":"hostfile-0","mode":438,"path":"hostfile-0"}],"name":"vllm-alpha-cm"},"name":"vllm-alpha-cm"},{"emptyDir":{"medium":"Memory","sizeLimit":"4Gi"},"name":"dshm"},{"name":"pvc-model","persistentVolumeClaim":{"claimName":"pvc-model"}}]}},"restartPolicy":"RecreateGroupOnPodRestart","size":2,"workerTemplate":{"metadata":{"annotations":{"arena.kubeflow.org/username":"kubecfg:certauth:admin"},"labels":{"app":"distributed-serving","chart":"distributed-serving-0.1.0","heritage":"Helm","release":"vllm-alpha","role":"worker","serviceName":"vllm","servingName":"vllm","servingType":"distributed-serving","servingVersion":"alpha"}},"spec":{"containers":[{"command":["sh","-c","ray start --address=$(MASTER_ADDR):6379 
--block"],"env":[{"name":"MASTER_ADDR","value":"$(LWS_LEADER_ADDRESS)"},{"name":"WORLD_SIZE","valueFrom":{"fieldRef":{"fieldPath":"metadata.annotations['leaderworkerset.sigs.k8s.io/size']"}}},{"name":"POD_NAME","valueFrom":{"fieldRef":{"fieldPath":"metadata.name"}}},{"name":"POD_INDEX","valueFrom":{"fieldRef":{"fieldPath":"metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']"}}},{"name":"GROUP_INDEX","valueFrom":{"fieldRef":{"fieldPath":"metadata.labels['leaderworkerset.sigs.k8s.io/group-index']"}}},{"name":"GPU_COUNT","value":"1"},{"name":"HOSTFILE","value":"/etc/hostfile"},{"name":"ROLE","value":"worker"}],"image":"harbor.hzxingzai.cn/tools/vllm/vllm-openai:latest","imagePullPolicy":"IfNotPresent","name":"distributed-serving-worker","resources":{"limits":{"nvidia.com/gpu":1}},"volumeMounts":[{"mountPath":"/dev/shm","name":"dshm"},{"mountPath":"/mnt/models","name":"pvc-model"},{"mountPath":"/etc/hostfile","name":"vllm-alpha-cm","subPathExpr":"hostfile-$(GROUP_INDEX)"}]}],"volumes":[{"configMap":{"items":[{"key":"hostfile-0","mode":438,"path":"hostfile-0"}],"name":"vllm-alpha-cm"},"name":"vllm-alpha-cm"},{"emptyDir":{"medium":"Memory","sizeLimit":"4Gi"},"name":"dshm"},{"name":"pvc-model","persistentVolumeClaim":{"claimName":"pvc-model"}}]}}},"replicas":1,"rolloutStrategy":{"rollingUpdateConfiguration":{"maxSurge":1}}}}
  creationTimestamp: "2025-02-17T02:24:23Z"
  generation: 1
  labels:
    app: distributed-serving
    arena.kubeflow.org/uid: 3399d840e8b371ed7ca45dda29debeb1
    chart: distributed-serving-0.1.0
    heritage: Helm
    release: vllm-alpha
    serviceName: vllm
    servingName: vllm
    servingType: distributed-serving
    servingVersion: alpha
  name: vllm-alpha-distributed-serving
  namespace: default
  resourceVersion: "236644878"
  uid: 4f73c906-ef17-4ec4-9a0b-50ef3c9076a5
spec:
  leaderWorkerTemplate:
    leaderTemplate:
      metadata:
        annotations:
          arena.kubeflow.org/username: kubecfg:certauth:admin
        labels:
          app: distributed-serving
          arena.kubeflow.org/uid: 3399d840e8b371ed7ca45dda29debeb1
          chart: distributed-serving-0.1.0
          heritage: Helm
          release: vllm-alpha
          role: master
          serviceName: vllm
          servingName: vllm
          servingType: distributed-serving
          servingVersion: alpha
      spec:
        containers:
        - command:
          - sh
          - -c
          - ray start --head --port=6379; vllm serve /mnt/models/Qwen2-1.5B --port
            5000 --dtype half --pipeline-parallel-size 2
          env:
          - name: MASTER_ADDR
            value: $(LWS_LEADER_ADDRESS)
          - name: WORLD_SIZE
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['leaderworkerset.sigs.k8s.io/size']
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: POD_INDEX
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
          - name: GROUP_INDEX
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/group-index']
          - name: GPU_COUNT
            value: "1"
          - name: HOSTFILE
            value: /etc/hostfile
          - name: ROLE
            value: master
          image: vllm/vllm-openai:latest
          imagePullPolicy: IfNotPresent
          name: distributed-serving-master
          ports:
          - containerPort: 5000
            name: restful
            protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
          startupProbe:
            failureThreshold: 5
            httpGet:
              path: /health
              port: 5000
            periodSeconds: 60
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          - mountPath: /mnt/models
            name: pvc-model
          - mountPath: /etc/hostfile
            name: vllm-alpha-cm
            subPathExpr: hostfile-$(GROUP_INDEX)
        volumes:
        - configMap:
            items:
            - key: hostfile-0
              mode: 438
              path: hostfile-0
            name: vllm-alpha-cm
          name: vllm-alpha-cm
        - emptyDir:
            medium: Memory
            sizeLimit: 4Gi
          name: dshm
        - name: pvc-model
          persistentVolumeClaim:
            claimName: pvc-model
    restartPolicy: RecreateGroupOnPodRestart
    size: 2
    workerTemplate:
      metadata:
        annotations:
          arena.kubeflow.org/username: kubecfg:certauth:admin
        labels:
          app: distributed-serving
          chart: distributed-serving-0.1.0
          heritage: Helm
          release: vllm-alpha
          role: worker
          serviceName: vllm
          servingName: vllm
          servingType: distributed-serving
          servingVersion: alpha
      spec:
        containers:
        - command:
          - sh
          - -c
          - ray start --address=$(MASTER_ADDR):6379 --block
          env:
          - name: MASTER_ADDR
            value: $(LWS_LEADER_ADDRESS)
          - name: WORLD_SIZE
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['leaderworkerset.sigs.k8s.io/size']
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: POD_INDEX
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
          - name: GROUP_INDEX
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/group-index']
          - name: GPU_COUNT
            value: "1"
          - name: HOSTFILE
            value: /etc/hostfile
          - name: ROLE
            value: worker
          image: vllm/vllm-openai:latest
          imagePullPolicy: IfNotPresent
          name: distributed-serving-worker
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          - mountPath: /mnt/models
            name: pvc-model
          - mountPath: /etc/hostfile
            name: vllm-alpha-cm
            subPathExpr: hostfile-$(GROUP_INDEX)
        volumes:
        - configMap:
            items:
            - key: hostfile-0
              mode: 438
              path: hostfile-0
            name: vllm-alpha-cm
          name: vllm-alpha-cm
        - emptyDir:
            medium: Memory
            sizeLimit: 4Gi
          name: dshm
        - name: pvc-model
          persistentVolumeClaim:
            claimName: pvc-model
  networkConfig:
    subdomainPolicy: Shared
  replicas: 1
  rolloutStrategy:
    rollingUpdateConfiguration:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  startupPolicy: LeaderCreated
status:
  conditions:
  - lastTransitionTime: "2025-02-17T02:24:31Z"
    message: Replicas are progressing
    reason: GroupsProgressing
    status: "True"
    type: Progressing
  - lastTransitionTime: "2025-02-17T02:24:23Z"
    message: Rolling Upgrade is in progress
    reason: GroupsUpdating
    status: "False"
    type: UpdateInProgress
  - lastTransitionTime: "2025-02-17T02:24:30Z"
    message: All replicas are ready
    reason: AllGroupsReady
    status: "False"
    type: Available
  hpaPodSelector: leaderworkerset.sigs.k8s.io/name=vllm-alpha-distributed-serving,leaderworkerset.sigs.k8s.io/worker-index=0
  readyReplicas: 5
  replicas: 1
  updatedReplicas: 12
@yankay (Member) commented Feb 17, 2025

Hi @lklkxcxc, it seems to be related to HPA as well. Should we also submit an issue to https://github.com/kubeflow/arena?

@lklkxcxc (Author)

@yankay thanks

@WeiZhang555 commented Feb 17, 2025

I met the same problem. Was this fixed in v0.5.1? @lklkxcxc

@yankay (Member) commented Feb 17, 2025

> I met the same problem. Was this fixed in v0.5.1? @lklkxcxc

Could you please try to reproduce this problem directly with an LWS manifest instead of going through Kubeflow? That would help us fix it faster.
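
For reference, a minimal LeaderWorkerSet manifest along the lines of the arena-generated spec above might look like the sketch below (the name and image are placeholders I chose, not values from the report); on an affected cluster it should reproduce the problem without Kubeflow:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: lws-repro            # placeholder name
spec:
  replicas: 1                # one leader/worker group, as in the report
  leaderWorkerTemplate:
    size: 2                  # one leader + one worker per group
    leaderTemplate:
      spec:
        containers:
        - name: leader
          image: nginx:1.25  # placeholder image
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: nginx:1.25  # placeholder image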

@WeiZhang555

@yankay I didn't use Kubeflow. I just installed LWS with Helm and tried the minimal example, and it fell into an endless loop that only stopped once it had exhausted all of the cluster's resources, which is quite dangerous and definitely unacceptable.

The root cause is quite simple: the pod webhook keeps marking each pod as a leader pod, which creates a StatefulSet that in turn creates another child leader pod, and so on until all resources are gone.

This patch fixes it for me, though I'm not sure whether it has any side effects:

diff --git a/pkg/webhooks/pod_webhook.go b/pkg/webhooks/pod_webhook.go
index adbafa8..391d601 100644
--- a/pkg/webhooks/pod_webhook.go
+++ b/pkg/webhooks/pod_webhook.go
@@ -141,7 +141,7 @@ func (p *PodWebhook) Default(ctx context.Context, obj runtime.Object) error {
                if workerIndex == -1 {
                        return fmt.Errorf("parsing pod ordinal for pod %s", pod.Name)
                }
-               pod.Labels[leaderworkerset.WorkerIndexLabelKey] = fmt.Sprint(workerIndex)
+               pod.Labels[leaderworkerset.WorkerIndexLabelKey] = fmt.Sprint(workerIndex + 1)
                subGroupSize, foundSubGroupSize := pod.Annotations[leaderworkerset.SubGroupSizeAnnotationKey]
                if foundSubGroupSize && pod.Labels[leaderworkerset.SubGroupIndexLabelKey] == "" {
                        subGroupSizeInt, err := strconv.Atoi(subGroupSize)
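
To spell out the failure mode described above, here is a small self-contained sketch (my own illustration, not LWS code) of the assumption the patch relies on: the webhook derives the worker index from the pod's StatefulSet ordinal, and index 0 is what identifies a leader pod. If the worker StatefulSet cannot start its ordinals at 1, its first pod therefore looks like another leader, and the controller creates yet another worker StatefulSet for it.

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// ordinalFromPodName extracts the trailing ordinal from a StatefulSet pod
// name, e.g. "vllm-alpha-distributed-serving-0-1" -> 1; returns -1 if absent.
func ordinalFromPodName(name string) int {
    idx := strings.LastIndex(name, "-")
    if idx < 0 {
        return -1
    }
    n, err := strconv.Atoi(name[idx+1:])
    if err != nil {
        return -1
    }
    return n
}

func main() {
    pods := []string{
        "vllm-alpha-distributed-serving-0-0", // first worker when ordinals start at 0 (k8s < 1.27)
        "vllm-alpha-distributed-serving-0-1", // first worker when ordinals start at 1 (k8s >= 1.27)
    }
    for _, pod := range pods {
        workerIndex := ordinalFromPodName(pod)
        // Assumption: worker-index 0 is what marks a pod as a leader.
        fmt.Printf("%s -> worker-index=%d, treated as leader: %v\n",
            pod, workerIndex, workerIndex == 0)
    }
}

With a start ordinal of 1 the worker StatefulSet never produces an index-0 pod, which is why either the +1 shift in the patch or running on a cluster that honors the start ordinal stops the recursion.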

@WeiZhang555

By the way, this issue can cause a BIG problem that makes the project dangerous to deploy on any cluster! Please fix it with the highest priority.

@yankay (Member) commented Feb 17, 2025

Hi @WeiZhang555, thanks for the issue report. Would you please tell us the lws image version in your environment?

@WeiZhang555

> Hi @WeiZhang555, thanks for the issue report. Would you please tell us the lws image version in your environment?

I tried 'main' and 'v0.5.0'; neither works. @yankay

@yankay (Member) commented Feb 17, 2025

This seems the same as #391, but I cannot reproduce it yet.

@yankay (Member) commented Feb 17, 2025

Hi @WeiZhang555, would you please share the kubectl version and the output of kubectl get statefulset -o yaml, and upload kubectl cluster-info dump --all-namespaces > cluster-dump.json? It would be a great help :-)

@yankay (Member) commented Feb 17, 2025

Hi @WeiZhang555 @lklkxcxc
Updating Kubernetes to 1.27 or newer may solve the issue :-)
The StatefulSet start-ordinal feature is only supported from Kubernetes 1.27 onward, as detailed here: https://kubernetes.io/blog/2023/04/28/statefulset-start-ordinal/.
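
For context, the feature from that blog post adds a spec.ordinals.start field to StatefulSets (feature gate StatefulSetStartOrdinal, enabled by default from 1.27), which lets a StatefulSet number its pods starting at 1 instead of 0. A purely illustrative example, not the LWS-generated object:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ordinals-example       # illustrative name
spec:
  ordinals:
    start: 1                   # pods are named ordinals-example-1, -2, ... instead of -0
  replicas: 2
  serviceName: ordinals-example
  selector:
    matchLabels:
      app: ordinals-example
  template:
    metadata:
      labels:
        app: ordinals-example
    spec:
      containers:
      - name: app
        image: nginx:1.25      # placeholder image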

@WeiZhang555

> Hi @WeiZhang555 @lklkxcxc Updating Kubernetes to 1.27 or newer may solve the issue :-) The StatefulSet start-ordinal feature is only supported from Kubernetes 1.27 onward, as detailed here: https://kubernetes.io/blog/2023/04/28/statefulset-start-ordinal/.

@yankay I think I got your point.

WithOrdinals(appsapplyv1.StatefulSetOrdinals().WithStart(1)).

LWS relies on the start ordinal feature, which only became generally available in v1.31. I am using Kubernetes v1.26.5.

To make LWS work in my environment, I would need to remove the dependency on this feature.
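
One quick check, with the worker StatefulSet name as a placeholder: on clusters where the feature gate is unavailable the API server drops the field, so an empty result from the command below would line up with the worker pods starting at ordinal 0 and being mistaken for leaders.

kubectl get statefulset <worker-statefulset-name> -o jsonpath='{.spec.ordinals}'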
