Closed
Description
I got an error when running a pod. The error is:
Nvidia node mismatch for pod example0(example0), pick up:/dev/nvidia1 predicate: /dev/nvidia0, which is unexpected.
It seems that gpu-admission assigned /dev/nvidia0 on the node, but gpu-manager picked /dev/nvidia1 on the same node. The two values are not equal, so the pod fails admission.
I have located the code that reports this mismatch.
Please help analyze.
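The kind of check that appears to fail here can be sketched as follows. This is only an illustration in Python, not gpu-manager's actual code: the function names `device_path` and `check_predicate_match` are hypothetical, and the mapping from the annotation index N to `/dev/nvidiaN` is an assumption based on the error message and the pod annotations shown below.

```python
def device_path(index: int) -> str:
    """Map a GPU index to a device node (assumed naming: /dev/nvidiaN)."""
    return f"/dev/nvidia{index}"


def check_predicate_match(annotations: dict, picked: str, container_idx: int = 0) -> None:
    """Raise if the device picked on the node differs from the predicated one.

    Hypothetical helper: it only mirrors the comparison implied by the error text.
    """
    key = f"tencent.com/predicate-gpu-idx-{container_idx}"
    predicated = device_path(int(annotations[key]))
    if picked != predicated:
        raise RuntimeError(
            f"node mismatch: pick up:{picked} predicate: {predicated}"
        )


# Annotation values copied from the failing pod.
annotations = {
    "tencent.com/predicate-gpu-idx-0": "0",
    "tencent.com/predicate-node": "node3",
}

try:
    check_predicate_match(annotations, picked="/dev/nvidia1")
except RuntimeError as err:
    print(err)
```

With the values from this pod, the predicated index 0 maps to /dev/nvidia0 while the picked device is /dev/nvidia1, so the comparison fails.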
- example0.yaml
... ...
resources:
  requests:
    tencent.com/vcuda-core: 60
    tencent.com/vcuda-memory: 25
  limits:
    tencent.com/vcuda-core: 60
    tencent.com/vcuda-memory: 25
... ...
See below for more information:
- kubectl describe pod example0
[root@node3 truetest]# kubectl describe pods example0
Name: example0
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: node3/
Start Time: Tue, 14 Apr 2020 16:15:03 +0800
Labels: <none>
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"example0","namespace":"default"},"spec":{"containers":[{"env":[{"name...
tencent.com/gpu-assigned: false
tencent.com/predicate-gpu-idx-0: 0
tencent.com/predicate-node: node3
tencent.com/predicate-time: 1586852103661396020
Status: Failed
Reason: UnexpectedAdmissionError
Message: Pod Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod example0(example0), pick up:/dev/nvidia1 predicate: /dev/nvidia0, which is unexpected.
IP:
Containers:
example0:
Image: test_gpu:v6.6
Port: <none>
Host Port: <none>
Limits:
tencent.com/vcuda-core: 60
tencent.com/vcuda-memory: 25
Requests:
tencent.com/vcuda-core: 60
tencent.com/vcuda-memory: 25
Environment:
LD_LIBRARY_PATH: /usr/local/cuda-10.0/lib64:/usr/local/nvidia/lib64
LOGGER_LEVEL: 5
Mounts:
/usr/local/cuda-10.0 from cuda-lib (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-6jbrl (ro)
Volumes:
cuda-lib:
Type: HostPath (bare host directory volume)
Path: /usr/local/cuda-10.0
HostPathType:
default-token-6jbrl:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-6jbrl
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 20s default-scheduler 0/3 nodes are available: 3 Insufficient tencent.com/vcuda-core, 3 Insufficient tencent.com/vcuda-memory.
Normal Scheduled 20s default-scheduler Successfully assigned default/example0 to node3
Warning UnexpectedAdmissionError 20s kubelet, node3 Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod example0(example0), pick up:/dev/nvidia1 predicate: /dev/nvidia0, which is unexpected.
Warning FailedMount 4s (x6 over 20s) kubelet, node3 MountVolume.SetUp failed for volume "default-token-6jbrl" : object "default"/"default-token-6jbrl" not registered
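To compare the two devices across many failed pods, the message can be parsed mechanically. The following is a minimal sketch; the regular expression is written against the exact message text above, and `parse_mismatch` is a hypothetical helper, not part of kubectl or gpu-manager. The meaning of the value in parentheses after the pod name is not confirmed here, so it is captured under the neutral name `detail`.

```python
import re

# Pattern written against the message text seen in this issue.
MISMATCH_RE = re.compile(
    r"Nvidia node mismatch for pod (?P<pod>[^(\s]+)\((?P<detail>[^)]*)\), "
    r"pick up:(?P<picked>\S+) predicate: (?P<predicated>[^,\s]+)"
)


def parse_mismatch(message: str):
    """Return the pod name and both device paths, or None if no match."""
    match = MISMATCH_RE.search(message)
    return match.groupdict() if match else None


message = (
    "Pod Update plugin resources failed due to rpc error: code = Unknown desc = "
    "Nvidia node mismatch for pod example0(example0), pick up:/dev/nvidia1 "
    "predicate: /dev/nvidia0, which is unexpected."
)

info = parse_mismatch(message)
print(info["pod"], info["picked"], info["predicated"])
```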
At this time, the GPU usage status of node3 is:
[root@node3 test]# nvidia-smi
Tue Apr 14 16:18:28 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.34 Driver Version: 430.34 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:15:00.0 Off | N/A |
| 22% 38C P8 23W / 250W | 0MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:21:00.0 Off | N/A |
| 23% 42C P8 11W / 250W | 0MiB / 10997MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
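Both GPUs are idle, so either one looks able to hold the request, which may be why the two components could reach different choices. A rough sketch of that arithmetic follows. It rests on two assumptions that should be checked against the gpu-manager documentation: 100 vcuda-core units correspond to one whole GPU, and one vcuda-memory unit is 256 MiB. The `fits` helper is hypothetical and ignores shares already allocated to other pods.

```python
CORE_UNITS_PER_GPU = 100  # assumption: 100 vcuda-core units == one whole GPU
MEMORY_UNIT_MIB = 256     # assumption: one vcuda-memory unit == 256 MiB

# Request copied from example0.yaml.
request = {"vcuda-core": 60, "vcuda-memory": 25}

# Total and used memory per GPU in MiB, copied from the nvidia-smi output.
gpus = {
    0: {"total_mib": 11019, "used_mib": 0},
    1: {"total_mib": 10997, "used_mib": 0},
}


def fits(gpu: dict, req: dict) -> bool:
    """Check whether a single GPU could hold the request under the assumptions."""
    needed_mib = req["vcuda-memory"] * MEMORY_UNIT_MIB
    free_mib = gpu["total_mib"] - gpu["used_mib"]
    return req["vcuda-core"] <= CORE_UNITS_PER_GPU and needed_mib <= free_mib


candidates = [idx for idx, gpu in gpus.items() if fits(gpu, request)]
print(candidates)
```

Under these assumptions the request needs 6400 MiB and 60 core units, and both GPU 0 and GPU 1 qualify.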
Environment
- Kubernetes version: v1.14.3
- TensorFlow version: tensorflow_1.14_py3_gpu_cuda10.0:latest
