
Nvidia node mismatch for pod, pick up:/dev/nvidia1 predicate: /dev/nvidia0, which is unexpected. #18

@goversion

Description

I got an error when running a pod; the error is:

Nvidia node mismatch for pod example0(example0), pick up:/dev/nvidia1  predicate: /dev/nvidia0, which is unexpected.

It seems that gpu-admission assigned /dev/nvidia0 on the node, but gpu-manager picked /dev/nvidia1 on the same node. The two values are not equal, so the pod is rejected.

I located the code in gpu-manager that produces this error (screenshot of the source not reproduced here).
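
To make the question concrete, here is a rough sketch in Go of the kind of consistency check that seems to be failing. This is not the actual gpu-manager source, and the function and variable names are hypothetical; it only illustrates my understanding that gpu-admission records its GPU choice as a pod annotation (tencent.com/predicate-gpu-idx-0, visible in the describe output below) and gpu-manager compares the device it actually allocated against that prediction:

package main

import (
	"fmt"
	"strconv"
)

// checkPredicate is a hypothetical illustration (not the real gpu-manager code)
// of the check that appears to fail: gpu-admission writes its GPU choice into a
// pod annotation, and gpu-manager compares the device it allocated against it.
func checkPredicate(podName string, annotations map[string]string, pickedDevice string) error {
	idxStr, ok := annotations["tencent.com/predicate-gpu-idx-0"]
	if !ok {
		return fmt.Errorf("no predicate annotation found for pod %s", podName)
	}
	idx, err := strconv.Atoi(idxStr)
	if err != nil {
		return fmt.Errorf("bad predicate index %q: %v", idxStr, err)
	}
	predicted := fmt.Sprintf("/dev/nvidia%d", idx)
	if pickedDevice != predicted {
		// Same shape as the error reported by the kubelet in my case.
		return fmt.Errorf("Nvidia node mismatch for pod %s(%s), pick up:%s predicate: %s, which is unexpected",
			podName, podName, pickedDevice, predicted)
	}
	return nil
}

func main() {
	// Reproduce the situation from the describe output below:
	// the annotation says GPU index 0, but the allocator picked /dev/nvidia1.
	ann := map[string]string{"tencent.com/predicate-gpu-idx-0": "0"}
	if err := checkPredicate("example0", ann, "/dev/nvidia1"); err != nil {
		fmt.Println(err)
	}
}

So the question seems to be why gpu-manager ends up picking /dev/nvidia1 when the predicate annotation says index 0.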

Please help analyze this.

  • example0.yaml
    ... ... 
    resources:
      requests:
        tencent.com/vcuda-core: 60
        tencent.com/vcuda-memory: 25
      limits:
        tencent.com/vcuda-core: 60
        tencent.com/vcuda-memory: 25
    ... ... 
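
For reference, my understanding of the resource units (an assumption on my part, not confirmed anywhere in this issue) is that vcuda-core is expressed in 1/100 of a physical GPU and vcuda-memory in 256 MiB slices, so the request above is a fractional share of a single card that either GPU on node3 could host:

package main

import "fmt"

func main() {
	// Assumed gpu-manager units (please correct me if this is wrong):
	// vcuda-core is 1/100 of a physical GPU, vcuda-memory is a 256 MiB slice.
	const (
		vcudaCore   = 60 // from example0.yaml above
		vcudaMemory = 25 // from example0.yaml above
	)
	fmt.Printf("requested compute: %d%% of one GPU\n", vcudaCore) // 60% of one card
	fmt.Printf("requested memory:  %d MiB\n", vcudaMemory*256)    // 6400 MiB
	// Either RTX 2080 Ti on node3 (~11000 MiB each, see nvidia-smi below) can
	// fit this, so both GPU indexes are valid placements and gpu-admission and
	// gpu-manager have to agree on which one to use.
}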

See below for more information:

  • kubectl describe pods example0
[root@node3 truetest]# kubectl describe pods example0
Name:               example0
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               node3/
Start Time:         Tue, 14 Apr 2020 16:15:03 +0800
Labels:             <none>
Annotations:        kubectl.kubernetes.io/last-applied-configuration:
                      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"example0","namespace":"default"},"spec":{"containers":[{"env":[{"name...
                    tencent.com/gpu-assigned: false
                    tencent.com/predicate-gpu-idx-0: 0
                    tencent.com/predicate-node: node3
                    tencent.com/predicate-time: 1586852103661396020
Status:             Failed
Reason:             UnexpectedAdmissionError
Message:            Pod Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod example0(example0), pick up:/dev/nvidia1  predicate: /dev/nvidia0, which is unexpected.
IP:                 
Containers:
  example0:
    Image:      test_gpu:v6.6
    Port:       <none>
    Host Port:  <none>
    Limits:
      tencent.com/vcuda-core:    60
      tencent.com/vcuda-memory:  25
    Requests:
      tencent.com/vcuda-core:    60
      tencent.com/vcuda-memory:  25
    Environment:
      LD_LIBRARY_PATH:  /usr/local/cuda-10.0/lib64:/usr/local/nvidia/lib64
      LOGGER_LEVEL:     5
    Mounts:
      /usr/local/cuda-10.0 from cuda-lib (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6jbrl (ro)
Volumes:
  cuda-lib:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/cuda-10.0
    HostPathType:  
  default-token-6jbrl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6jbrl
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                    Age               From               Message
  ----     ------                    ----              ----               -------
  Warning  FailedScheduling          20s               default-scheduler  0/3 nodes are available: 3 Insufficient tencent.com/vcuda-core, 3 Insufficient tencent.com/vcuda-memory.
  Normal   Scheduled                 20s               default-scheduler  Successfully assigned default/example0 to node3
  Warning  UnexpectedAdmissionError  20s               kubelet, node3  Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod example0(example0), pick up:/dev/nvidia1  predicate: /dev/nvidia0, which is unexpected.
  Warning  FailedMount               4s (x6 over 20s)  kubelet, node3  MountVolume.SetUp failed for volume "default-token-6jbrl" : object "default"/"default-token-6jbrl" not registered
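
For completeness, the tencent.com/* annotations above can also be read programmatically; the following is a small client-go sketch (assuming a recent client-go version and a kubeconfig in the default location; the pod name and namespace are the ones from this example) that just prints what gpu-admission recorded:

package main

import (
	"context"
	"fmt"
	"path/filepath"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Load the kubeconfig from the default location (adjust the path if needed).
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Fetch the failing pod and print the scheduler/admission annotations.
	pod, err := client.CoreV1().Pods("default").Get(context.TODO(), "example0", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for k, v := range pod.Annotations {
		if strings.HasPrefix(k, "tencent.com/") {
			fmt.Printf("%s = %s\n", k, v)
		}
	}
}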

At this time, the GPU usage status of node3 was:

[root@node3 test]# nvidia-smi
Tue Apr 14 16:18:28 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.34       Driver Version: 430.34       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:15:00.0 Off |                  N/A |
| 22%   38C    P8    23W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:21:00.0 Off |                  N/A |
| 23%   42C    P8    11W / 250W |      0MiB / 10997MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Environment

  • Kubernetes version: v1.14.3
  • TensorFlow image: tensorflow_1.14_py3_gpu_cuda10.0:latest
