LWS generates an incorrect number of replicas #390

Closed · Fixed by #394
lklkxcxc opened this issue Feb 17, 2025 · 12 comments

@lklkxcxc

I used Kubeflow Arena to generate a distributed-serving job with one master and one worker, but it created 12 pods:

arena serve distributed \
 --name=vllm \
 --version=alpha \
 --restful-port=5000 \
 --image=vllm/vllm-openai:latest \
 --data=pvc-model:/mnt/models \
 --masters=1 \
 --replicas=1 \
 --max-surge=1 \
 --master-gpus=1 \
 --master-command="ray start --head --port=6379; vllm serve /mnt/models/Qwen2-1.5B --port 5000 --dtype half --pipeline-parallel-size 2" \
 --workers=1 \
 --worker-gpus=1 \
 --worker-command="ray start --address=\$(MASTER_ADDR):6379 --block" \
 --share-memory=4Gi \
 --startup-probe-action=httpGet \
 --startup-probe-action-option="path: /health" \
 --startup-probe-action-option="port: 5000" \
 --startup-probe-option="periodSeconds: 60" \
 --startup-probe-option="failureThreshold: 5"

kubectl get pod output:

vllm-alpha-distributed-serving-0                             0/1     Running             0          10s
vllm-alpha-distributed-serving-0-0                           1/1     Running             0          10s
vllm-alpha-distributed-serving-0-0-0                         0/1     ContainerCreating   0          10s
vllm-alpha-distributed-serving-0-0-0-0                       0/1     ContainerCreating   0          10s
vllm-alpha-distributed-serving-0-0-0-0-0                     0/1     ContainerCreating   0          10s
vllm-alpha-distributed-serving-0-0-0-0-0-0                   0/1     ContainerCreating   0          10s
vllm-alpha-distributed-serving-0-0-0-0-0-0-0                 0/1     ContainerCreating   0          10s
vllm-alpha-distributed-serving-0-0-0-0-0-0-0-0               0/1     Pending             0          10s
vllm-alpha-distributed-serving-0-0-0-0-0-0-0-0-0             0/1     Pending             0          10s
vllm-alpha-distributed-serving-0-0-0-0-0-0-0-0-0-0           0/1     Pending             0          10s
vllm-alpha-distributed-serving-0-0-0-0-0-0-0-0-0-0-0         0/1     Pending             0          9s
vllm-alpha-distributed-serving-0-0-0-0-0-0-0-0-0-0-0-0       0/1     Pending             0          9s

kubectl get LeaderWorkerSet vllm-alpha-distributed-serving -o yaml output:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  annotations:
    helm.sh/created: "1739759061"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"leaderworkerset.x-k8s.io/v1","kind":"LeaderWorkerSet","metadata":{"annotations":{"helm.sh/created":"1739759061"},"labels":{"app":"distributed-serving","arena.kubeflow.org/uid":"3399d840e8b371ed7ca45dda29debeb1","chart":"distributed-serving-0.1.0","heritage":"Helm","release":"vllm-alpha","serviceName":"vllm","servingName":"vllm","servingType":"distributed-serving","servingVersion":"alpha"},"name":"vllm-alpha-distributed-serving","namespace":"default"},"spec":{"leaderWorkerTemplate":{"leaderTemplate":{"metadata":{"annotations":{"arena.kubeflow.org/username":"kubecfg:certauth:admin"},"labels":{"app":"distributed-serving","arena.kubeflow.org/uid":"3399d840e8b371ed7ca45dda29debeb1","chart":"distributed-serving-0.1.0","heritage":"Helm","release":"vllm-alpha","role":"master","serviceName":"vllm","servingName":"vllm","servingType":"distributed-serving","servingVersion":"alpha"}},"spec":{"containers":[{"command":["sh","-c","ray start --head --port=6379; vllm serve /mnt/models/Qwen2-1.5B --port 5000 --dtype half --pipeline-parallel-size 2"],"env":[{"name":"MASTER_ADDR","value":"$(LWS_LEADER_ADDRESS)"},{"name":"WORLD_SIZE","valueFrom":{"fieldRef":{"fieldPath":"metadata.annotations['leaderworkerset.sigs.k8s.io/size']"}}},{"name":"POD_NAME","valueFrom":{"fieldRef":{"fieldPath":"metadata.name"}}},{"name":"POD_INDEX","valueFrom":{"fieldRef":{"fieldPath":"metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']"}}},{"name":"GROUP_INDEX","valueFrom":{"fieldRef":{"fieldPath":"metadata.labels['leaderworkerset.sigs.k8s.io/group-index']"}}},{"name":"GPU_COUNT","value":"1"},{"name":"HOSTFILE","value":"/etc/hostfile"},{"name":"ROLE","value":"master"}],"image":"harbor.hzxingzai.cn/tools/vllm/vllm-openai:latest","imagePullPolicy":"IfNotPresent","name":"distributed-serving-master","ports":[{"containerPort":5000,"name":"restful","protocol":"TCP"}],"resources":{"limits":{"nvidia.com/gpu":1}},"startupProbe":{"failureThreshold":5,"httpGet":{"path":"/health","port":5000},"periodSeconds":60},"volumeMounts":[{"mountPath":"/dev/shm","name":"dshm"},{"mountPath":"/mnt/models","name":"pvc-model"},{"mountPath":"/etc/hostfile","name":"vllm-alpha-cm","subPathExpr":"hostfile-$(GROUP_INDEX)"}]}],"volumes":[{"configMap":{"items":[{"key":"hostfile-0","mode":438,"path":"hostfile-0"}],"name":"vllm-alpha-cm"},"name":"vllm-alpha-cm"},{"emptyDir":{"medium":"Memory","sizeLimit":"4Gi"},"name":"dshm"},{"name":"pvc-model","persistentVolumeClaim":{"claimName":"pvc-model"}}]}},"restartPolicy":"RecreateGroupOnPodRestart","size":2,"workerTemplate":{"metadata":{"annotations":{"arena.kubeflow.org/username":"kubecfg:certauth:admin"},"labels":{"app":"distributed-serving","chart":"distributed-serving-0.1.0","heritage":"Helm","release":"vllm-alpha","role":"worker","serviceName":"vllm","servingName":"vllm","servingType":"distributed-serving","servingVersion":"alpha"}},"spec":{"containers":[{"command":["sh","-c","ray start --address=$(MASTER_ADDR):6379 
--block"],"env":[{"name":"MASTER_ADDR","value":"$(LWS_LEADER_ADDRESS)"},{"name":"WORLD_SIZE","valueFrom":{"fieldRef":{"fieldPath":"metadata.annotations['leaderworkerset.sigs.k8s.io/size']"}}},{"name":"POD_NAME","valueFrom":{"fieldRef":{"fieldPath":"metadata.name"}}},{"name":"POD_INDEX","valueFrom":{"fieldRef":{"fieldPath":"metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']"}}},{"name":"GROUP_INDEX","valueFrom":{"fieldRef":{"fieldPath":"metadata.labels['leaderworkerset.sigs.k8s.io/group-index']"}}},{"name":"GPU_COUNT","value":"1"},{"name":"HOSTFILE","value":"/etc/hostfile"},{"name":"ROLE","value":"worker"}],"image":"harbor.hzxingzai.cn/tools/vllm/vllm-openai:latest","imagePullPolicy":"IfNotPresent","name":"distributed-serving-worker","resources":{"limits":{"nvidia.com/gpu":1}},"volumeMounts":[{"mountPath":"/dev/shm","name":"dshm"},{"mountPath":"/mnt/models","name":"pvc-model"},{"mountPath":"/etc/hostfile","name":"vllm-alpha-cm","subPathExpr":"hostfile-$(GROUP_INDEX)"}]}],"volumes":[{"configMap":{"items":[{"key":"hostfile-0","mode":438,"path":"hostfile-0"}],"name":"vllm-alpha-cm"},"name":"vllm-alpha-cm"},{"emptyDir":{"medium":"Memory","sizeLimit":"4Gi"},"name":"dshm"},{"name":"pvc-model","persistentVolumeClaim":{"claimName":"pvc-model"}}]}}},"replicas":1,"rolloutStrategy":{"rollingUpdateConfiguration":{"maxSurge":1}}}}
  creationTimestamp: "2025-02-17T02:24:23Z"
  generation: 1
  labels:
    app: distributed-serving
    arena.kubeflow.org/uid: 3399d840e8b371ed7ca45dda29debeb1
    chart: distributed-serving-0.1.0
    heritage: Helm
    release: vllm-alpha
    serviceName: vllm
    servingName: vllm
    servingType: distributed-serving
    servingVersion: alpha
  name: vllm-alpha-distributed-serving
  namespace: default
  resourceVersion: "236644878"
  uid: 4f73c906-ef17-4ec4-9a0b-50ef3c9076a5
spec:
  leaderWorkerTemplate:
    leaderTemplate:
      metadata:
        annotations:
          arena.kubeflow.org/username: kubecfg:certauth:admin
        labels:
          app: distributed-serving
          arena.kubeflow.org/uid: 3399d840e8b371ed7ca45dda29debeb1
          chart: distributed-serving-0.1.0
          heritage: Helm
          release: vllm-alpha
          role: master
          serviceName: vllm
          servingName: vllm
          servingType: distributed-serving
          servingVersion: alpha
      spec:
        containers:
        - command:
          - sh
          - -c
          - ray start --head --port=6379; vllm serve /mnt/models/Qwen2-1.5B --port
            5000 --dtype half --pipeline-parallel-size 2
          env:
          - name: MASTER_ADDR
            value: $(LWS_LEADER_ADDRESS)
          - name: WORLD_SIZE
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['leaderworkerset.sigs.k8s.io/size']
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: POD_INDEX
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
          - name: GROUP_INDEX
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/group-index']
          - name: GPU_COUNT
            value: "1"
          - name: HOSTFILE
            value: /etc/hostfile
          - name: ROLE
            value: master
          image: vllm/vllm-openai:latest
          imagePullPolicy: IfNotPresent
          name: distributed-serving-master
          ports:
          - containerPort: 5000
            name: restful
            protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
          startupProbe:
            failureThreshold: 5
            httpGet:
              path: /health
              port: 5000
            periodSeconds: 60
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          - mountPath: /mnt/models
            name: pvc-model
          - mountPath: /etc/hostfile
            name: vllm-alpha-cm
            subPathExpr: hostfile-$(GROUP_INDEX)
        volumes:
        - configMap:
            items:
            - key: hostfile-0
              mode: 438
              path: hostfile-0
            name: vllm-alpha-cm
          name: vllm-alpha-cm
        - emptyDir:
            medium: Memory
            sizeLimit: 4Gi
          name: dshm
        - name: pvc-model
          persistentVolumeClaim:
            claimName: pvc-model
    restartPolicy: RecreateGroupOnPodRestart
    size: 2
    workerTemplate:
      metadata:
        annotations:
          arena.kubeflow.org/username: kubecfg:certauth:admin
        labels:
          app: distributed-serving
          chart: distributed-serving-0.1.0
          heritage: Helm
          release: vllm-alpha
          role: worker
          serviceName: vllm
          servingName: vllm
          servingType: distributed-serving
          servingVersion: alpha
      spec:
        containers:
        - command:
          - sh
          - -c
          - ray start --address=$(MASTER_ADDR):6379 --block
          env:
          - name: MASTER_ADDR
            value: $(LWS_LEADER_ADDRESS)
          - name: WORLD_SIZE
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['leaderworkerset.sigs.k8s.io/size']
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: POD_INDEX
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
          - name: GROUP_INDEX
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/group-index']
          - name: GPU_COUNT
            value: "1"
          - name: HOSTFILE
            value: /etc/hostfile
          - name: ROLE
            value: worker
          image: vllm/vllm-openai:latest
          imagePullPolicy: IfNotPresent
          name: distributed-serving-worker
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          - mountPath: /mnt/models
            name: pvc-model
          - mountPath: /etc/hostfile
            name: vllm-alpha-cm
            subPathExpr: hostfile-$(GROUP_INDEX)
        volumes:
        - configMap:
            items:
            - key: hostfile-0
              mode: 438
              path: hostfile-0
            name: vllm-alpha-cm
          name: vllm-alpha-cm
        - emptyDir:
            medium: Memory
            sizeLimit: 4Gi
          name: dshm
        - name: pvc-model
          persistentVolumeClaim:
            claimName: pvc-model
  networkConfig:
    subdomainPolicy: Shared
  replicas: 1
  rolloutStrategy:
    rollingUpdateConfiguration:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  startupPolicy: LeaderCreated
status:
  conditions:
  - lastTransitionTime: "2025-02-17T02:24:31Z"
    message: Replicas are progressing
    reason: GroupsProgressing
    status: "True"
    type: Progressing
  - lastTransitionTime: "2025-02-17T02:24:23Z"
    message: Rolling Upgrade is in progress
    reason: GroupsUpdating
    status: "False"
    type: UpdateInProgress
  - lastTransitionTime: "2025-02-17T02:24:30Z"
    message: All replicas are ready
    reason: AllGroupsReady
    status: "False"
    type: Available
  hpaPodSelector: leaderworkerset.sigs.k8s.io/name=vllm-alpha-distributed-serving,leaderworkerset.sigs.k8s.io/worker-index=0
  readyReplicas: 5
  replicas: 1
  updatedReplicas: 12
@yankay (Member) commented Feb 17, 2025

Hi @lklkxcxc, it seems to be related to HPA as well. Should we also submit an issue to https://github.com/kubeflow/arena?

@lklkxcxc (Author)

@yankay thanks

@WeiZhang555 commented Feb 17, 2025

I met the same problem. Was this fixed in v0.5.1? @lklkxcxc

@yankay (Member) commented Feb 17, 2025

> I met the same problem. Was this fixed in v0.5.1? @lklkxcxc

Could you please try to reproduce this problem directly with an LWS manifest instead of going through Kubeflow? That would help us fix it faster.
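
For reference, a minimal LeaderWorkerSet manifest along the lines of the arena-generated spec above might look like the sketch below (the name and image are placeholders I chose, not values from the report); on an affected cluster it should reproduce the problem without Kubeflow:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: lws-repro            # placeholder name
spec:
  replicas: 1                # one leader/worker group, as in the report
  leaderWorkerTemplate:
    size: 2                  # one leader + one worker per group
    leaderTemplate:
      spec:
        containers:
        - name: leader
          image: nginx:1.25  # placeholder image
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: nginx:1.25  # placeholder image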

@WeiZhang555

@yankay I didn't use Kubeflow. I just installed LWS with Helm and tried the minimal example, and it fell into an endless loop that only stopped once it had exhausted all of the cluster's resources, which is quite dangerous and definitely unacceptable.

The root cause is quite simple: the pod webhook keeps marking each pod as a leader pod, which creates a StatefulSet that in turn creates another child leader pod, and so on until all resources are gone.

This patch fixes it for me, though I'm not sure whether it has any side effects:

diff --git a/pkg/webhooks/pod_webhook.go b/pkg/webhooks/pod_webhook.go
index adbafa8..391d601 100644
--- a/pkg/webhooks/pod_webhook.go
+++ b/pkg/webhooks/pod_webhook.go
@@ -141,7 +141,7 @@ func (p *PodWebhook) Default(ctx context.Context, obj runtime.Object) error {
                if workerIndex == -1 {
                        return fmt.Errorf("parsing pod ordinal for pod %s", pod.Name)
                }
-               pod.Labels[leaderworkerset.WorkerIndexLabelKey] = fmt.Sprint(workerIndex)
+               pod.Labels[leaderworkerset.WorkerIndexLabelKey] = fmt.Sprint(workerIndex + 1)
                subGroupSize, foundSubGroupSize := pod.Annotations[leaderworkerset.SubGroupSizeAnnotationKey]
                if foundSubGroupSize && pod.Labels[leaderworkerset.SubGroupIndexLabelKey] == "" {
                        subGroupSizeInt, err := strconv.Atoi(subGroupSize)
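
To spell out the failure mode described above, here is a small self-contained sketch (my own illustration, not LWS code) of the assumption the patch relies on: the webhook derives the worker index from the pod's StatefulSet ordinal, and index 0 is what identifies a leader pod. If the worker StatefulSet cannot start its ordinals at 1, its first pod therefore looks like another leader, and the controller creates yet another worker StatefulSet for it.

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// ordinalFromPodName extracts the trailing ordinal from a StatefulSet pod
// name, e.g. "vllm-alpha-distributed-serving-0-1" -> 1; returns -1 if absent.
func ordinalFromPodName(name string) int {
    idx := strings.LastIndex(name, "-")
    if idx < 0 {
        return -1
    }
    n, err := strconv.Atoi(name[idx+1:])
    if err != nil {
        return -1
    }
    return n
}

func main() {
    pods := []string{
        "vllm-alpha-distributed-serving-0-0", // first worker when ordinals start at 0 (k8s < 1.27)
        "vllm-alpha-distributed-serving-0-1", // first worker when ordinals start at 1 (k8s >= 1.27)
    }
    for _, pod := range pods {
        workerIndex := ordinalFromPodName(pod)
        // Assumption: worker-index 0 is what marks a pod as a leader.
        fmt.Printf("%s -> worker-index=%d, treated as leader: %v\n",
            pod, workerIndex, workerIndex == 0)
    }
}

With a start ordinal of 1 the worker StatefulSet never produces an index-0 pod, which is why either the +1 shift in the patch or running on a cluster that honors the start ordinal stops the recursion.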

@WeiZhang555

By the way, this issue can cause a BIG problem that makes the project dangerous to deploy on any cluster! Please fix it with the highest priority.

@yankay (Member) commented Feb 17, 2025

Hi @WeiZhang555, thanks for the issue report. Would you please tell us the lws image version in your environment?

@WeiZhang555

> Hi @WeiZhang555, thanks for the issue report. Would you please tell us the lws image version in your environment?

I tried 'main' and 'v0.5.0'; neither works. @yankay

@yankay (Member) commented Feb 17, 2025

This seems the same as #391, but I cannot reproduce it yet.

@yankay (Member) commented Feb 17, 2025

Hi @WeiZhang555, would you please share the kubectl version and the output of kubectl get statefulset -o yaml, and upload kubectl cluster-info dump --all-namespaces > cluster-dump.json? It would be a great help :-)

@yankay (Member) commented Feb 17, 2025

Hi @WeiZhang555 @lklkxcxc
Updating Kubernetes to 1.27 or newer may solve the issue :-)
The StatefulSet start-ordinal feature is only supported from Kubernetes 1.27 onward, as detailed here: https://kubernetes.io/blog/2023/04/28/statefulset-start-ordinal/.
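
For context, the feature from that blog post adds a spec.ordinals.start field to StatefulSets (feature gate StatefulSetStartOrdinal, enabled by default from 1.27), which lets a StatefulSet number its pods starting at 1 instead of 0. A purely illustrative example, not the LWS-generated object:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ordinals-example       # illustrative name
spec:
  ordinals:
    start: 1                   # pods are named ordinals-example-1, -2, ... instead of -0
  replicas: 2
  serviceName: ordinals-example
  selector:
    matchLabels:
      app: ordinals-example
  template:
    metadata:
      labels:
        app: ordinals-example
    spec:
      containers:
      - name: app
        image: nginx:1.25      # placeholder image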

@WeiZhang555

> Hi @WeiZhang555 @lklkxcxc Updating Kubernetes to 1.27 or newer may solve the issue :-) The StatefulSet start-ordinal feature is only supported from Kubernetes 1.27 onward, as detailed here: https://kubernetes.io/blog/2023/04/28/statefulset-start-ordinal/.

@yankay I think I got your point.

WithOrdinals(appsapplyv1.StatefulSetOrdinals().WithStart(1)).

LWS relies on the start ordinal feature, which only became generally available in v1.31. I am using Kubernetes v1.26.5.

To make LWS work in my environment, I would need to remove the dependency on this feature.
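
One quick check, with the worker StatefulSet name as a placeholder: on clusters where the feature gate is unavailable the API server drops the field, so an empty result from the command below would line up with the worker pods starting at ordinal 0 and being mistaken for leaders.

kubectl get statefulset <worker-statefulset-name> -o jsonpath='{.spec.ordinals}'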
