This is similar to #1630.
The PaddleJob hangs because its pods are never created.
The job YAML looks like this:
apiVersion: kubeflow.org/v1
kind: PaddleJob
metadata:
  creationTimestamp: "2023-01-17T08:34:04Z"
  generation: 1
  labels:
    job.baai.ac.cn/creator: elrond
    job.baai.ac.cn/creator-id: "215305"
    job.baai.ac.cn/queue-id: 506ef7b1-1943-4d6a-aac5-e93c23cff768
    job.baai.ac.cn/type: batch
  name: job-e9e13362-dbab-4709-9954-bd38d298cf59
  namespace: airs
  resourceVersion: "36758105"
  uid: 34cfd47a-bf64-465f-9b7e-35d23afde375
spec:
  paddleReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          annotations:
            airs-center-endpoint: airs-center.airs-citest.svc.cluster.local:6080
            proj.baai.ac.cn/id: dbe1edf4-d12d-406b-a64f-2f71f15ed613
            projset.baai.ac.cn/id: 00ffe2f5-2cf0-47ef-8631-bee744da1069
            volcano.sh/preemptable: "false"
          labels:
            job.baai.ac.cn/creator: elrond
            job.baai.ac.cn/creator-id: "215305"
            job.baai.ac.cn/name: job-e9e13362-dbab-4709-9954-bd38d298cf59
            job.baai.ac.cn/type: batch
            pod.baai.ac.cn/role: Worker
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: machine.baai.ac.cn/accelerator
                    operator: In
                    values:
                    - NVIDIA_T4
          containers:
          - command:
            - /bin/sh
            - -c
            - echo "PATH=/client-tools:$PATH" >> ~/.bashrc;env >> /etc/environment;/usr/sbin/sshd -f /etc/configmap/sshd_config;python -m paddle.distributed.launch run_check
            env:
            - name: TZ
              value: Asia/Shanghai
            image: harbor-dev.platform.baai-inner.ac.cn/library/paddle:gpu
            imagePullPolicy: Always
            name: paddle
            resources:
              limits:
                cpu: "10"
                memory: 30Gi
                nvidia.com/gpu: "2"
                rdma/mlnx_shared: "2"
              requests:
                cpu: "10"
                memory: 30Gi
                nvidia.com/gpu: "2"
                rdma/mlnx_shared: "2"
            securityContext:
              capabilities:
                add:
                - IPC_LOCK
            volumeMounts:
            - mountPath: /dev/shm
              name: shm-volume
            - mountPath: /home/elrond
              name: storage-volume0
            - mountPath: /etc/localtime
              name: localtime
            - mountPath: /etc/downwardapi
              name: downward-api
              readOnly: true
            - mountPath: /etc/configmap
              name: sshd-config
              readOnly: true
            - mountPath: /etc/pub
              name: sshproxy-keys-config
              readOnly: true
            - mountPath: /client-tools
              mountPropagation: HostToContainer
              name: client-tools
              readOnly: true
          imagePullSecrets:
          - name: harbor-platform-readonly-secret
          schedulerName: volcano
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 15Gi
            name: shm-volume
          - hostPath:
              path: /mnt/airs-business/airs/sharefs/00ffe2f5-2cf0-47ef-8631-bee744da1069_dbe1edf4-d12d-406b-a64f-2f71f15ed613/215305
            name: storage-volume0
          - downwardAPI:
              items:
              - fieldRef:
                  fieldPath: metadata.labels
                path: labels
              - fieldRef:
                  fieldPath: metadata.annotations
                path: annotations
            name: downward-api
          - hostPath:
              path: /etc/localtime
            name: localtime
          - configMap:
              name: sshd-config
            name: sshd-config
          - configMap:
              items:
              - key: id_rsa.pub
                path: id_rsa.pub
              name: sshproxy-keys-config
            name: sshproxy-keys-config
          - hostPath:
              path: /mnt/airs-business/client-tools/tools/bin
            name: client-tools
  runPolicy:
    cleanPodPolicy: Running
    schedulingPolicy:
      priorityClass: high-priority
      queue: 506ef7b1-1943-4d6a-aac5-e93c23cff768
    ttlSecondsAfterFinished: 120
status:
  conditions:
  - lastTransitionTime: "2023-01-17T08:34:05Z"
    lastUpdateTime: "2023-01-17T08:34:05Z"
    message: PaddleJob job-e9e13362-dbab-4709-9954-bd38d298cf59 is created.
    reason: PaddleJobCreated
    status: "True"
    type: Created
  lastReconcileTime: "2023-01-17T08:34:05Z"
  replicaStatuses: {}
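For anyone trying to reproduce or diagnose this, a minimal set of checks (a sketch, assuming the Volcano CRDs are installed and using the namespace and labels from the YAML above):

```sh
# No worker pods exist for the hung job
kubectl -n airs get pods -l job.baai.ac.cn/name=job-e9e13362-dbab-4709-9954-bd38d298cf59

# Check whether a Volcano PodGroup for the job exists and was scheduled
kubectl -n airs get podgroup

# The PaddleJob only carries the Created condition; nothing progresses after that
kubectl -n airs describe paddlejob job-e9e13362-dbab-4709-9954-bd38d298cf59
```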
But this only happens with the Paddle controller; in the same environment TF and PyTorch jobs work fine.
If I restart the training-operator pod, the job moves to the Running status.
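The restart workaround, for reference (a sketch; it assumes the operator is deployed as a Deployment named training-operator in the kubeflow namespace, so adjust to your installation):

```sh
# Restarting the operator unblocks the hung job: once the new pod
# comes up, the worker pods are created and the job goes to Running.
kubectl -n kubeflow rollout restart deployment/training-operator
kubectl -n kubeflow rollout status deployment/training-operator
```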
cc @shinytang6 Can you tell whether there is any difference in how the Paddle scenario works with Volcano?
/assign
This is a bug caused by predicates.
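If you want to confirm this on your side, the operator logs around gang scheduling and PodGroup handling for the stuck job are worth checking (a sketch, assuming the same Deployment name and namespace as above):

```sh
# Look for gang-scheduling / PodGroup messages for the stuck PaddleJob
kubectl -n kubeflow logs deploy/training-operator | grep -iE 'podgroup|gang|job-e9e13362'
```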
@nkflash That bug has probably been fixed. If you face the same error, feel free to reopen this issue.
I have verified the fix; it works well.