I hit a problem similar to issue 18 when creating a pod. Please help analyze it.
Warning UnexpectedAdmissionError 16m kubelet, ai-1080ti-62 Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod test3(test3), pick up:/dev/nvidia6 predicate: /dev/nvidia1, which is unexpected.
- test3.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test3
  namespace: danlu-efficiency
spec:
  restartPolicy: Never
  schedulerName: gpu-admission
  containers:
  - image: danlu/tensorflow:tf1.9.0_py2_gpu_v0.1
    name: test3
    command:
    - /bin/bash
    - -c
    - sleep 100000000
    resources:
      requests:
        tencent.com/vcuda-core: 10
        tencent.com/vcuda-memory: 40
      limits:
        tencent.com/vcuda-core: 10
        tencent.com/vcuda-memory: 40
- kubectl describe pods test3 -n danlu-efficiency
Name: test3
Namespace: danlu-efficiency
Priority: 0
PriorityClassName: <none>
Node: ai-1080ti-62/
Start Time: Wed, 15 Jul 2020 14:54:42 +0800
Labels: <none>
Annotations: tencent.com/gpu-assigned: false
tencent.com/predicate-gpu-idx-0: 1
tencent.com/predicate-node: ai-1080ti-62
tencent.com/predicate-time: 1594796082180123795
Status: Failed
Reason: UnexpectedAdmissionError
Message: Pod Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod test3(test3), pick up:/dev/nvidia6 predicate: /dev/nvidia1, which is unexpected.
IP:
Containers:
test3:
Image: danlu/tensorflow:tf1.9.0_py2_gpu_v0.1
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
sleep 100000000
Limits:
tencent.com/vcuda-core: 10
tencent.com/vcuda-memory: 40
Requests:
tencent.com/vcuda-core: 10
tencent.com/vcuda-memory: 40
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-p6lfp (ro)
Volumes:
default-token-p6lfp:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-p6lfp
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 17m gpu-admission Successfully assigned danlu-efficiency/test3 to ai-1080ti-62
Warning FailedScheduling 17m gpu-admission pod test3 had been predicated!
Warning UnexpectedAdmissionError 17m kubelet, ai-1080ti-62 Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod test3(test3), pick up:/dev/nvidia6 predicate: /dev/nvidia1, which is unexpected.
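To make the mismatch easier to see, here is a minimal sketch that pulls the two device indices out of the kubelet message and compares them with the `tencent.com/predicate-gpu-idx-0` annotation from the describe output. The regular expression is only an assumption based on the wording of the message in this report, and `parse_mismatch` is a hypothetical helper, not part of gpu-manager or gpu-admission.

```python
import re

# Assumed pattern, derived only from the message text quoted in this report.
MISMATCH_RE = re.compile(
    r"pick up:\s*/dev/nvidia(?P<picked>\d+)\s+predicate:\s*/dev/nvidia(?P<predicate>\d+)"
)


def parse_mismatch(message):
    """Return (picked_index, predicate_index), or None if the message does not match."""
    m = MISMATCH_RE.search(message)
    if m is None:
        return None
    return int(m.group("picked")), int(m.group("predicate"))


# Message and annotation copied from the `kubectl describe` output.
message = (
    "Update plugin resources failed due to rpc error: code = Unknown desc = "
    "Nvidia node mismatch for pod test3(test3), pick up:/dev/nvidia6 "
    "predicate: /dev/nvidia1, which is unexpected."
)
annotations = {"tencent.com/predicate-gpu-idx-0": "1"}

picked, predicate = parse_mismatch(message)
annotated = int(annotations["tencent.com/predicate-gpu-idx-0"])

print("picked by device plugin:", picked)        # 6
print("predicate in message:", predicate)        # 1
print("predicate annotation on pod:", annotated)  # 1
```

So the annotation written at scheduling time (index 1) agrees with the predicate in the message, while the device actually picked on the node is index 6.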
- Node information for ai-1080ti-62
Name: ai-1080ti-62
Roles: nvidia418
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
hardware=NVIDIAGPU
hardware-type=NVIDIAGPU
kubernetes.io/hostname=ai-1080ti-62
node-role.kubernetes.io/nvidia418=nvidia418
nvidia-device-enable=enable
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 10.90.1.131/24
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 29 May 2019 18:02:54 +0800
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 15 Jul 2020 15:14:58 +0800 Wed, 15 Jul 2020 11:30:46 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 15 Jul 2020 15:14:58 +0800 Wed, 15 Jul 2020 11:30:46 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 15 Jul 2020 15:14:58 +0800 Wed, 15 Jul 2020 11:30:46 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 15 Jul 2020 15:14:58 +0800 Wed, 15 Jul 2020 11:30:46 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.90.1.131
Hostname: ai-1080ti-62
Capacity:
cpu: 56
ephemeral-storage: 1152148172Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 264029984Ki
nvidia.com/gpu: 8
pods: 110
tencent.com/vcuda-core: 800
tencent.com/vcuda-memory: 349
Allocatable:
cpu: 53
ephemeral-storage: 1040344917078
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 251344672Ki
nvidia.com/gpu: 8
pods: 110
tencent.com/vcuda-core: 800
tencent.com/vcuda-memory: 349
System Info:
Machine ID: bf90cb25500346cb8178be49909651e4
System UUID: 00000000-0000-0000-0000-ac1f6b93483c
Boot ID: 97927469-0e92-4816-880c-243a64ef293a
Kernel Version: 4.19.0-0.bpo.8-amd64
OS Image: Debian GNU/Linux 9 (stretch)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.6.2
Kubelet Version: v1.13.5
Kube-Proxy Version: v1.13.5
PodCIDR: 192.168.20.0/24
Non-terminated Pods: (58 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
......
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 51210m (96%) 97100m (183%)
memory 105732569856 (41%) 250822036Ki (99%)
ephemeral-storage 0 (0%) 0 (0%)
nvidia.com/gpu 8 8
tencent.com/vcuda-core 60 60
tencent.com/vcuda-memory 30 30
Events: <none>
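As a quick sanity check on the numbers above, this sketch subtracts the allocated vcuda resources from the allocatable ones and tests whether the test3 request fits. It only looks at node-level totals taken from this report; it is an illustration, not how gpu-admission decides, and it says nothing about whether a single GPU has enough free capacity. The helper names `remaining` and `fits` are hypothetical.

```python
# Node-level figures copied from the node description.
allocatable = {"tencent.com/vcuda-core": 800, "tencent.com/vcuda-memory": 349}
allocated = {"tencent.com/vcuda-core": 60, "tencent.com/vcuda-memory": 30}

# Request from test3.yaml.
requested = {"tencent.com/vcuda-core": 10, "tencent.com/vcuda-memory": 40}


def remaining(allocatable, allocated):
    """Free amount per resource: allocatable minus what is already allocated."""
    return {name: allocatable[name] - allocated.get(name, 0) for name in allocatable}


def fits(requested, free):
    """True if every requested amount is covered by the free amount."""
    return all(free.get(name, 0) >= amount for name, amount in requested.items())


free = remaining(allocatable, allocated)
print(free)                   # {'tencent.com/vcuda-core': 740, 'tencent.com/vcuda-memory': 319}
print(fits(requested, free))  # True
```

By these totals the node is not short of vcuda resources, so the failure does not look like a simple capacity problem.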