This is similar to #1630.
The PaddleJob hangs because its pods are never created.
The job YAML looks like this:
apiVersion: kubeflow.org/v1
kind: PaddleJob
metadata:
  creationTimestamp: "2023-01-17T08:34:04Z"
  generation: 1
  labels:
    job.baai.ac.cn/creator: elrond
    job.baai.ac.cn/creator-id: "215305"
    job.baai.ac.cn/queue-id: 506ef7b1-1943-4d6a-aac5-e93c23cff768
    job.baai.ac.cn/type: batch
  name: job-e9e13362-dbab-4709-9954-bd38d298cf59
  namespace: airs
  resourceVersion: "36758105"
  uid: 34cfd47a-bf64-465f-9b7e-35d23afde375
spec:
  paddleReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          annotations:
            airs-center-endpoint: airs-center.airs-citest.svc.cluster.local:6080
            proj.baai.ac.cn/id: dbe1edf4-d12d-406b-a64f-2f71f15ed613
            projset.baai.ac.cn/id: 00ffe2f5-2cf0-47ef-8631-bee744da1069
            volcano.sh/preemptable: "false"
          labels:
            job.baai.ac.cn/creator: elrond
            job.baai.ac.cn/creator-id: "215305"
            job.baai.ac.cn/name: job-e9e13362-dbab-4709-9954-bd38d298cf59
            job.baai.ac.cn/type: batch
            pod.baai.ac.cn/role: Worker
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: machine.baai.ac.cn/accelerator
                    operator: In
                    values:
                    - NVIDIA_T4
          containers:
          - command:
            - /bin/sh
            - -c
            - echo "PATH=/client-tools:$PATH" >> ~/.bashrc;env >> /etc/environment;/usr/sbin/sshd -f /etc/configmap/sshd_config;python -m paddle.distributed.launch run_check
            env:
            - name: TZ
              value: Asia/Shanghai
            image: harbor-dev.platform.baai-inner.ac.cn/library/paddle:gpu
            imagePullPolicy: Always
            name: paddle
            resources:
              limits:
                cpu: "10"
                memory: 30Gi
                nvidia.com/gpu: "2"
                rdma/mlnx_shared: "2"
              requests:
                cpu: "10"
                memory: 30Gi
                nvidia.com/gpu: "2"
                rdma/mlnx_shared: "2"
            securityContext:
              capabilities:
                add:
                - IPC_LOCK
            volumeMounts:
            - mountPath: /dev/shm
              name: shm-volume
            - mountPath: /home/elrond
              name: storage-volume0
            - mountPath: /etc/localtime
              name: localtime
            - mountPath: /etc/downwardapi
              name: downward-api
              readOnly: true
            - mountPath: /etc/configmap
              name: sshd-config
              readOnly: true
            - mountPath: /etc/pub
              name: sshproxy-keys-config
              readOnly: true
            - mountPath: /client-tools
              mountPropagation: HostToContainer
              name: client-tools
              readOnly: true
          imagePullSecrets:
          - name: harbor-platform-readonly-secret
          schedulerName: volcano
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 15Gi
            name: shm-volume
          - hostPath:
              path: /mnt/airs-business/airs/sharefs/00ffe2f5-2cf0-47ef-8631-bee744da1069_dbe1edf4-d12d-406b-a64f-2f71f15ed613/215305
            name: storage-volume0
          - downwardAPI:
              items:
              - fieldRef:
                  fieldPath: metadata.labels
                path: labels
              - fieldRef:
                  fieldPath: metadata.annotations
                path: annotations
            name: downward-api
          - hostPath:
              path: /etc/localtime
            name: localtime
          - configMap:
              name: sshd-config
            name: sshd-config
          - configMap:
              items:
              - key: id_rsa.pub
                path: id_rsa.pub
              name: sshproxy-keys-config
            name: sshproxy-keys-config
          - hostPath:
              path: /mnt/airs-business/client-tools/tools/bin
            name: client-tools
  runPolicy:
    cleanPodPolicy: Running
    schedulingPolicy:
      priorityClass: high-priority
      queue: 506ef7b1-1943-4d6a-aac5-e93c23cff768
    ttlSecondsAfterFinished: 120
status:
  conditions:
  - lastTransitionTime: "2023-01-17T08:34:05Z"
    lastUpdateTime: "2023-01-17T08:34:05Z"
    message: PaddleJob job-e9e13362-dbab-4709-9954-bd38d298cf59 is created.
    reason: PaddleJobCreated
    status: "True"
    type: Created
  lastReconcileTime: "2023-01-17T08:34:05Z"
  replicaStatuses: {}
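For anyone trying to reproduce or diagnose this, a minimal set of checks (a sketch, assuming the Volcano CRDs are installed and using the namespace and labels from the YAML above):

```sh
# No worker pods exist for the hung job
kubectl -n airs get pods -l job.baai.ac.cn/name=job-e9e13362-dbab-4709-9954-bd38d298cf59

# Check whether a Volcano PodGroup for the job exists and was scheduled
kubectl -n airs get podgroup

# The PaddleJob only carries the Created condition; nothing progresses after that
kubectl -n airs describe paddlejob job-e9e13362-dbab-4709-9954-bd38d298cf59
```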
But this only happens with the Paddle controller; in the same environment TF and PyTorch jobs work fine.
If I restart the training-operator pod, the job moves to the Running status.
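The restart workaround, for reference (a sketch; it assumes the operator is deployed as a Deployment named training-operator in the kubeflow namespace, so adjust to your installation):

```sh
# Restarting the operator unblocks the hung job: once the new pod
# comes up, the worker pods are created and the job goes to Running.
kubectl -n kubeflow rollout restart deployment/training-operator
kubectl -n kubeflow rollout status deployment/training-operator
```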
cc @shinytang6 Can you tell whether there is any difference in how the Paddle scenario works with Volcano?
/assign
This is a bug caused by predicates.
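If you want to confirm this on your side, the operator logs around gang scheduling and PodGroup handling for the stuck job are worth checking (a sketch, assuming the same Deployment name and namespace as above):

```sh
# Look for gang-scheduling / PodGroup messages for the stuck PaddleJob
kubectl -n kubeflow logs deploy/training-operator | grep -iE 'podgroup|gang|job-e9e13362'
```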
@nkflash That bug has probably been fixed. If you face the same error, feel free to reopen this issue.
I have verified the fix; it works well.