Nvidia node mismatch for pod, pick up:/dev/nvidia6 predicate: /dev/nvidia1, which is unexpected. #37

Open
@fighterhit

I hit a problem similar to issue 18 when creating a pod. Please help analyze it.

  Warning  UnexpectedAdmissionError  16m   kubelet, ai-1080ti-62  Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod test3(test3), pick up:/dev/nvidia6  predicate: /dev/nvidia1, which is unexpected.
  • test3.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test3
  namespace: danlu-efficiency
spec:
  restartPolicy: Never
  schedulerName: gpu-admission
  containers:
  - image: danlu/tensorflow:tf1.9.0_py2_gpu_v0.1
    name: test3
    command:
    - /bin/bash
    - -c
    - sleep 100000000
    resources:
      requests:
        tencent.com/vcuda-core: 10
        tencent.com/vcuda-memory: 40
      limits:
        tencent.com/vcuda-core: 10
        tencent.com/vcuda-memory: 40
  • kubectl describe pods test3 -n danlu-efficiency
Name:               test3
Namespace:          danlu-efficiency
Priority:           0
PriorityClassName:  <none>
Node:               ai-1080ti-62/
Start Time:         Wed, 15 Jul 2020 14:54:42 +0800
Labels:             <none>
Annotations:        tencent.com/gpu-assigned: false
                    tencent.com/predicate-gpu-idx-0: 1
                    tencent.com/predicate-node: ai-1080ti-62
                    tencent.com/predicate-time: 1594796082180123795
Status:             Failed
Reason:             UnexpectedAdmissionError
Message:            Pod Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod test3(test3), pick up:/dev/nvidia6  predicate: /dev/nvidia1, which is unexpected.
IP:                 
Containers:
  test3:
    Image:      danlu/tensorflow:tf1.9.0_py2_gpu_v0.1
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
      -c
      sleep 100000000
    Limits:
      tencent.com/vcuda-core:    10
      tencent.com/vcuda-memory:  40
    Requests:
      tencent.com/vcuda-core:    10
      tencent.com/vcuda-memory:  40
    Environment:                 <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-p6lfp (ro)
Volumes:
  default-token-p6lfp:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-p6lfp
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                    Age   From                   Message
  ----     ------                    ----  ----                   -------
  Normal   Scheduled                 17m   gpu-admission          Successfully assigned danlu-efficiency/test3 to ai-1080ti-62
  Warning  FailedScheduling          17m   gpu-admission          pod test3 had been predicated!
  Warning  UnexpectedAdmissionError  17m   kubelet, ai-1080ti-62  Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod test3(test3), pick up:/dev/nvidia6  predicate: /dev/nvidia1, which is unexpected.
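To make the mismatch explicit, here is a minimal diagnostic sketch that pulls the two device indices out of the error message and compares the predicated one with the pod's `tencent.com/predicate-gpu-idx-0` annotation. The helper name `parse_mismatch` and the regular expression are hypothetical, inferred only from the single message shown above; they are not part of gpu-manager or gpu-admission.

```python
import re

# Message copied verbatim from the kubelet event above.
MESSAGE = (
    "Update plugin resources failed due to rpc error: code = Unknown desc = "
    "Nvidia node mismatch for pod test3(test3), pick up:/dev/nvidia6  "
    "predicate: /dev/nvidia1, which is unexpected."
)

# Hypothetical pattern, inferred from this one message only.
MISMATCH_RE = re.compile(
    r"pick up:\s*/dev/nvidia(?P<picked>\d+)\s+"
    r"predicate:\s*/dev/nvidia(?P<predicated>\d+)"
)


def parse_mismatch(message):
    """Return (picked_index, predicated_index), or None if nothing matches."""
    match = MISMATCH_RE.search(message)
    if match is None:
        return None
    return int(match.group("picked")), int(match.group("predicated"))


picked, predicated = parse_mismatch(MESSAGE)

# Value of the tencent.com/predicate-gpu-idx-0 annotation on pod test3.
annotation_idx = int("1")

print("picked by device plugin:", picked)
print("predicated by scheduler:", predicated)
print("predicate matches annotation:", predicated == annotation_idx)
```

With the values from this pod, the scheduler-side index (1) agrees with the annotation, while the device picked on the node (6) does not, so the disagreement appears to arise at allocation time on the node rather than in the annotation itself.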

  • Information for node ai-1080ti-62
Name:               ai-1080ti-62
Roles:              nvidia418
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    hardware=NVIDIAGPU
                    hardware-type=NVIDIAGPU
                    kubernetes.io/hostname=ai-1080ti-62
                    node-role.kubernetes.io/nvidia418=nvidia418
                    nvidia-device-enable=enable
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 10.90.1.131/24
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 29 May 2019 18:02:54 +0800
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 15 Jul 2020 15:14:58 +0800   Wed, 15 Jul 2020 11:30:46 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 15 Jul 2020 15:14:58 +0800   Wed, 15 Jul 2020 11:30:46 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 15 Jul 2020 15:14:58 +0800   Wed, 15 Jul 2020 11:30:46 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 15 Jul 2020 15:14:58 +0800   Wed, 15 Jul 2020 11:30:46 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  10.90.1.131
  Hostname:    ai-1080ti-62
Capacity:
 cpu:                       56
 ephemeral-storage:         1152148172Ki
 hugepages-1Gi:             0
 hugepages-2Mi:             0
 memory:                    264029984Ki
 nvidia.com/gpu:            8
 pods:                      110
 tencent.com/vcuda-core:    800
 tencent.com/vcuda-memory:  349
Allocatable:
 cpu:                       53
 ephemeral-storage:         1040344917078
 hugepages-1Gi:             0
 hugepages-2Mi:             0
 memory:                    251344672Ki
 nvidia.com/gpu:            8
 pods:                      110
 tencent.com/vcuda-core:    800
 tencent.com/vcuda-memory:  349
System Info:
 Machine ID:                                           bf90cb25500346cb8178be49909651e4
 System UUID:                                          00000000-0000-0000-0000-ac1f6b93483c
 Boot ID:                                              97927469-0e92-4816-880c-243a64ef293a
 Kernel Version:                                       4.19.0-0.bpo.8-amd64
 OS Image:                                             Debian GNU/Linux 9 (stretch)
 Operating System:                                     linux
 Architecture:                                         amd64
 Container Runtime Version:                            docker://18.6.2
 Kubelet Version:                                      v1.13.5
 Kube-Proxy Version:                                   v1.13.5
PodCIDR:                                               192.168.20.0/24
Non-terminated Pods:                                   (58 in total)
  Namespace                                            Name                                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                                            ----                                                               ------------  ----------  ---------------  -------------  ---

......

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                  Requests            Limits
  --------                  --------            ------
  cpu                       51210m (96%)        97100m (183%)
  memory                    105732569856 (41%)  250822036Ki (99%)
  ephemeral-storage         0 (0%)              0 (0%)
  nvidia.com/gpu            8                   8
  tencent.com/vcuda-core    60                  60
  tencent.com/vcuda-memory  30                  30
Events:                     <none>
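As a quick sanity check on the numbers above, the following sketch computes the node-wide headroom for the two vcuda resources and compares it with what test3 requests. The per-GPU figures assume the capacity is split evenly across the 8 GPUs reported under `nvidia.com/gpu`; that even split is an assumption for illustration, not something the output states, and the sketch says nothing about how the already allocated units are distributed across cards.

```python
# Values copied from the node description and test3.yaml above.
CAPACITY = {"tencent.com/vcuda-core": 800, "tencent.com/vcuda-memory": 349}
ALLOCATED = {"tencent.com/vcuda-core": 60, "tencent.com/vcuda-memory": 30}
REQUEST = {"tencent.com/vcuda-core": 10, "tencent.com/vcuda-memory": 40}
GPU_COUNT = 8  # from nvidia.com/gpu capacity


def headroom(capacity, allocated):
    """Node-wide free units per resource (capacity minus allocated)."""
    return {name: capacity[name] - allocated[name] for name in capacity}


def per_gpu_share(capacity, gpu_count):
    """Whole units per GPU, ASSUMING capacity is split evenly across GPUs."""
    return {name: value // gpu_count for name, value in capacity.items()}


free = headroom(CAPACITY, ALLOCATED)
share = per_gpu_share(CAPACITY, GPU_COUNT)

for name in CAPACITY:
    print(
        name,
        "free on node:", free[name],
        "per GPU (assumed even split):", share[name],
        "requested:", REQUEST[name],
        "fits node-wide:", REQUEST[name] <= free[name],
    )
```

Node-wide, the request fits comfortably (740 cores and 319 memory units are free). Under the even-split assumption, though, a request for 40 memory units takes almost a whole card's share of 43, so only a nearly empty GPU could hold it.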
