
Nvidia node mismatch for pod, pick up:/dev/nvidia6 predicate: /dev/nvidia1, which is unexpected. #37

Open
fighterhit opened this issue Jul 15, 2020 · 16 comments

Comments

@fighterhit (Contributor)

I got a similar problem when I created a pod, like issue #18. Please help analyze it.

  Warning  UnexpectedAdmissionError  16m   kubelet, ai-1080ti-62  Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod test3(test3), pick up:/dev/nvidia6  predicate: /dev/nvidia1, which is unexpected.
  • test3.yaml
apiVersion: v1
kind: Pod 
metadata:
  name: test3
  namespace: danlu-efficiency
spec:
  restartPolicy: Never
  schedulerName: gpu-admission
  containers:
  - image: danlu/tensorflow:tf1.9.0_py2_gpu_v0.1
    name: test3
    command:
    - /bin/bash
    - -c
    - sleep 100000000
    resources:
      requests:
        tencent.com/vcuda-core: 10
        tencent.com/vcuda-memory: 40
      limits:
        tencent.com/vcuda-core: 10
        tencent.com/vcuda-memory: 40
  • kubectl describe pods test3 -n danlu-efficiency
Name:               test3
Namespace:          danlu-efficiency
Priority:           0
PriorityClassName:  <none>
Node:               ai-1080ti-62/
Start Time:         Wed, 15 Jul 2020 14:54:42 +0800
Labels:             <none>
Annotations:        tencent.com/gpu-assigned: false
                    tencent.com/predicate-gpu-idx-0: 1
                    tencent.com/predicate-node: ai-1080ti-62
                    tencent.com/predicate-time: 1594796082180123795
Status:             Failed
Reason:             UnexpectedAdmissionError
Message:            Pod Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod test3(test3), pick up:/dev/nvidia6  predicate: /dev/nvidia1, which is unexpected.
IP:                 
Containers:
  test3:
    Image:      danlu/tensorflow:tf1.9.0_py2_gpu_v0.1
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
      -c
      sleep 100000000
    Limits:
      tencent.com/vcuda-core:    10
      tencent.com/vcuda-memory:  40
    Requests:
      tencent.com/vcuda-core:    10
      tencent.com/vcuda-memory:  40
    Environment:                 <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-p6lfp (ro)
Volumes:
  default-token-p6lfp:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-p6lfp
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                    Age   From                   Message
  ----     ------                    ----  ----                   -------
  Normal   Scheduled                 17m   gpu-admission          Successfully assigned danlu-efficiency/test3 to ai-1080ti-62
  Warning  FailedScheduling          17m   gpu-admission          pod test3 had been predicated!
  Warning  UnexpectedAdmissionError  17m   kubelet, ai-1080ti-62  Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod test3(test3), pick up:/dev/nvidia6  predicate: /dev/nvidia1, which is unexpected.

  • Information about the ai-1080ti-62 node
Name:               ai-1080ti-62
Roles:              nvidia418
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    hardware=NVIDIAGPU
                    hardware-type=NVIDIAGPU
                    kubernetes.io/hostname=ai-1080ti-62
                    node-role.kubernetes.io/nvidia418=nvidia418
                    nvidia-device-enable=enable
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 10.90.1.131/24
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 29 May 2019 18:02:54 +0800
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 15 Jul 2020 15:14:58 +0800   Wed, 15 Jul 2020 11:30:46 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 15 Jul 2020 15:14:58 +0800   Wed, 15 Jul 2020 11:30:46 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 15 Jul 2020 15:14:58 +0800   Wed, 15 Jul 2020 11:30:46 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 15 Jul 2020 15:14:58 +0800   Wed, 15 Jul 2020 11:30:46 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  10.90.1.131
  Hostname:    ai-1080ti-62
Capacity:
 cpu:                       56
 ephemeral-storage:         1152148172Ki
 hugepages-1Gi:             0
 hugepages-2Mi:             0
 memory:                    264029984Ki
 nvidia.com/gpu:            8
 pods:                      110
 tencent.com/vcuda-core:    800
 tencent.com/vcuda-memory:  349
Allocatable:
 cpu:                       53
 ephemeral-storage:         1040344917078
 hugepages-1Gi:             0
 hugepages-2Mi:             0
 memory:                    251344672Ki
 nvidia.com/gpu:            8
 pods:                      110
 tencent.com/vcuda-core:    800
 tencent.com/vcuda-memory:  349
System Info:
 Machine ID:                                           bf90cb25500346cb8178be49909651e4
 System UUID:                                          00000000-0000-0000-0000-ac1f6b93483c
 Boot ID:                                              97927469-0e92-4816-880c-243a64ef293a
 Kernel Version:                                       4.19.0-0.bpo.8-amd64
 OS Image:                                             Debian GNU/Linux 9 (stretch)
 Operating System:                                     linux
 Architecture:                                         amd64
 Container Runtime Version:                            docker://18.6.2
 Kubelet Version:                                      v1.13.5
 Kube-Proxy Version:                                   v1.13.5
PodCIDR:                                               192.168.20.0/24
Non-terminated Pods:                                   (58 in total)
  Namespace                                            Name                                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                                            ----                                                               ------------  ----------  ---------------  -------------  ---

......

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                  Requests            Limits
  --------                  --------            ------
  cpu                       51210m (96%)        97100m (183%)
  memory                    105732569856 (41%)  250822036Ki (99%)
  ephemeral-storage         0 (0%)              0 (0%)
  nvidia.com/gpu            8                   8
  tencent.com/vcuda-core    60                  60
  tencent.com/vcuda-memory  30                  30
Events:                     <none>

@mYmNeo (Contributor) commented Jul 16, 2020

It's a defensive mechanism in gpu-manager. gpu-admission tries to assign a pod to a single card to avoid fragmentation, but for some reason (pod terminated, pod failed, etc.) gpu-admission's scheduling information may not be as up to date as what gpu-manager knows. gpu-manager validates whether the card it picks is the same one gpu-admission predicated; if not, gpu-manager rejects the pod to keep a consistent allocation view.
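To illustrate the check being described, here is a minimal, self-contained Go sketch of the idea. The type and helper names below are hypothetical, not gpu-manager's actual API; conceptually, the real check compares the device picked by the allocator with the index recorded by gpu-admission in the tencent.com/predicate-gpu-idx-0 annotation.

package main

import "fmt"

// allocation is a hypothetical stand-in for the state being compared:
// the card index gpu-admission wrote into the pod annotation
// "tencent.com/predicate-gpu-idx-0" at scheduling time, and the card the
// allocator actually picked at admission time.
type allocation struct {
	predicateGPUIndex int
	pickedGPUIndex    int
}

// validate mirrors the defensive mechanism: if the picked card differs from
// the predicated one, reject the pod so both components keep the same
// allocation view.
func validate(podName string, a allocation) error {
	if a.predicateGPUIndex != a.pickedGPUIndex {
		return fmt.Errorf("Nvidia node mismatch for pod %s, pick up:/dev/nvidia%d predicate: /dev/nvidia%d",
			podName, a.pickedGPUIndex, a.predicateGPUIndex)
	}
	return nil
}

func main() {
	// The shape of the error in this issue: gpu-admission predicated card 1,
	// but the allocator picked card 6.
	if err := validate("test3", allocation{predicateGPUIndex: 1, pickedGPUIndex: 6}); err != nil {
		fmt.Println(err)
	}
}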

@mYmNeo (Contributor) commented Jul 17, 2020

Besides, your situation may be a different scenario. We're working on a fix.

@fighterhit (Contributor Author) commented Aug 26, 2020

Today I tried to reproduce the problem. First, I created 7 NVIDIA GPU Pods, each occupying 1 GPU.

  • The NVIDIA GPU Pod description
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nvidia-gpu-test-app-time-cost
  namespace: danlu-efficiency
spec:
  replicas: 7
  selector:
    matchLabels:
      app: nvidia-gpu-test-app-time-cost
  template:
      metadata:
        labels:
          app: nvidia-gpu-test-app-time-cost
      spec:
        schedulerName: gpu-admission
        restartPolicy: Always
        containers:
          - name: nvidia-gpu-test-app-time-cost
            image: xxx:gpu-test-app-time-cost
            resources:
                    #requests:
                    #tencent.com/vcuda-core: "20"
                    #tencent.com/vcuda-memory: "10"
              limits:
                nvidia.com/gpu: 1
                      #tencent.com/vcuda-core: "20"
                      #tencent.com/vcuda-memory: "10"
        imagePullSecrets:
          - name: gpu
  • GPU information. We can see that there is an idle GPU#4.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 23%   39C    P2    54W / 250W |  10661MiB / 11178MiB |     11%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:05:00.0 Off |                  N/A |
| 23%   42C    P2    56W / 250W |  10661MiB / 11178MiB |      7%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:08:00.0 Off |                  N/A |
| 24%   43C    P2    56W / 250W |  10661MiB / 11178MiB |     11%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 24%   44C    P2    55W / 250W |  10661MiB / 11178MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 108...  Off  | 00000000:85:00.0 Off |                  N/A |
| 23%   26C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 108...  Off  | 00000000:86:00.0 Off |                  N/A |
| 23%   41C    P2    54W / 250W |  10661MiB / 11178MiB |      8%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 108...  Off  | 00000000:89:00.0 Off |                  N/A |
| 23%   38C    P2    54W / 250W |  10661MiB / 11178MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 108...  Off  | 00000000:8A:00.0 Off |                  N/A |
| 23%   39C    P2    54W / 250W |  10661MiB / 11178MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     28425      C   python                                     10651MiB |
|    1     32035      C   python                                     10651MiB |
|    2     27356      C   python                                     10651MiB |
|    3     30741      C   python                                     10651MiB |
|    5     26997      C   python                                     10651MiB |
|    6     27601      C   python                                     10651MiB |
|    7     31145      C   python                                     10651MiB |
+-----------------------------------------------------------------------------+

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity
GPU0     X      PIX     PHB     PHB     SYS     SYS     SYS     SYS     0-13,28-41
GPU1    PIX      X      PHB     PHB     SYS     SYS     SYS     SYS     0-13,28-41
GPU2    PHB     PHB      X      PIX     SYS     SYS     SYS     SYS     0-13,28-41
GPU3    PHB     PHB     PIX      X      SYS     SYS     SYS     SYS     0-13,28-41
GPU4    SYS     SYS     SYS     SYS      X      PIX     PHB     PHB     14-27,42-55
GPU5    SYS     SYS     SYS     SYS     PIX      X      PHB     PHB     14-27,42-55
GPU6    SYS     SYS     SYS     SYS     PHB     PHB      X      PIX     14-27,42-55
GPU7    SYS     SYS     SYS     SYS     PHB     PHB     PIX      X      14-27,42-55

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

Then I created 1 Tencent GPU Pod, requesting 1/5 of a GPU's cores and 1/4 of its memory. I hit the problem again, and the pod keeps cycling between the Pending and UnexpectedAdmissionError states.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tencent-gpu-test-app-time-cost
  namespace: danlu-efficiency
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tencent-gpu-test-app-time-cost
  template:
      metadata:
        labels:
          app: tencent-gpu-test-app-time-cost
      spec:
        schedulerName: gpu-admission
        restartPolicy: Always
        containers:
          - name: tencent-gpu-test-app-time-cost
            image: xxx:gpu-test-app-time-cost
            resources:
              requests:
                tencent.com/vcuda-core: "20"
                tencent.com/vcuda-memory: "10"
              limits:
                tencent.com/vcuda-core: "20"
                tencent.com/vcuda-memory: "10"
        imagePullSecrets:
          - name: gpu
  • kubectl describe po ... output
Name:               tencent-gpu-test-app-time-cost-7fc956cd5f-2f98c
Namespace:          danlu-efficiency
Priority:           0
PriorityClassName:  <none>
Node:               ai-1080ti-25/
Start Time:         Wed, 26 Aug 2020 21:10:56 +0800
Labels:             app=tencent-gpu-test-app-time-cost
                    pod-template-hash=7fc956cd5f
Annotations:        tencent.com/gpu-assigned: false
                    tencent.com/predicate-gpu-idx-0: 0
                    tencent.com/predicate-node: ai-1080ti-25
                    tencent.com/predicate-time: 1598447456624461570
Status:             Failed
Reason:             UnexpectedAdmissionError
Message:            Pod Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod tencent-gpu-test-app-time-cost-7fc956cd5f-2f98c(tencent-gpu-test-app-time-cost), pick up:/dev/nvidia4  predicate: /dev/nvidia0, which is unexpected.
IP:
Controlled By:      ReplicaSet/tencent-gpu-test-app-time-cost-7fc956cd5f
Containers:
  tencent-gpu-test-app-time-cost:
    Image:      xxx:gpu-test-app-time-cost
    Port:       <none>
    Host Port:  <none>
    Limits:
      tencent.com/vcuda-core:    20
      tencent.com/vcuda-memory:  10
    Requests:
      tencent.com/vcuda-core:    20
      tencent.com/vcuda-memory:  10
    Environment:                 <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-b78wm (ro)
Volumes:
  default-token-b78wm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-b78wm
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                    Age   From                   Message
  ----     ------                    ----  ----                   -------
  Warning  FailedScheduling          28s   gpu-admission          pod tencent-gpu-test-app-time-cost-7fc956cd5f-2f98c had been predicated!
  Normal   Scheduled                 28s   gpu-admission          Successfully assigned danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-2f98c to ai-1080ti-25
  Warning  UnexpectedAdmissionError  28s   kubelet, ai-1080ti-25  Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod tencent-gpu-test-app-time-cost-7fc956cd5f-2f98c(tencent-gpu-test-app-time-cost), pick up:/dev/nvidia4  predicate: /dev/nvidia0, which is unexpected.

So I am confused about why gpu-manager chooses GPU#4. Shouldn't it choose GPU#0 in terms of resource utilization? Is GPU topology considered here? And why consider topology at all? This test program has nothing to do with the other programs.

@mYmNeo (Contributor) commented Aug 27, 2020

(Quoting @fighterhit's reproduction above:)

So I am confused about why gpu-manager chooses GPU#4. Shouldn't it choose GPU#0 in terms of resource utilization? Is GPU topology considered here? And why consider topology at all? This test program has nothing to do with the other programs.

gpu-manager only considers pods and their specified resources, even if your GPU card is occupied by other programs. Topology is considered because some programs do peer-to-peer data transfer between cards (for example via NCCL), and the link between two cards can affect the data transfer speed.

PS: can you provide the log of the allocation result in your situation?
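As a rough illustration of what topology- and usage-aware selection can mean in practice, here is a small, self-contained Go sketch. This is not gpu-manager's code; the struct, the busyPIDs field, and the link costs are illustrative assumptions based on the nvidia-smi topo legend above (PIX is a closer link than PHB, and SYS crosses NUMA nodes).

package main

import (
	"fmt"
	"sort"
)

// Rough cost per link type from the nvidia-smi topo matrix: smaller is closer.
var linkCost = map[string]int{"PIX": 1, "PXB": 2, "PHB": 3, "NODE": 4, "SYS": 5}

// gpu is an illustrative candidate card, not a gpu-manager type.
type gpu struct {
	id         int
	busyPIDs   int    // processes already running on the card
	linkToBusy string // link type to the nearest already-busy card
}

func main() {
	// GPU0 already has a running process; GPU4 is idle but far away (SYS link).
	cards := []gpu{
		{id: 0, busyPIDs: 1, linkToBusy: "PIX"},
		{id: 4, busyPIDs: 0, linkToBusy: "SYS"},
	}
	// A policy that prefers emptier cards (and only then closer links) picks
	// GPU4 here, while gpu-admission, which does not see those processes,
	// predicated GPU0 in the report above.
	sort.Slice(cards, func(i, j int) bool {
		if cards[i].busyPIDs != cards[j].busyPIDs {
			return cards[i].busyPIDs < cards[j].busyPIDs
		}
		return linkCost[cards[i].linkToBusy] < linkCost[cards[j].linkToBusy]
	})
	fmt.Printf("picked /dev/nvidia%d\n", cards[0].id) // prints /dev/nvidia4
}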

@fighterhit (Contributor Author)

Hi @mYmNeo, thanks for your quick answer. Actually, in order not to affect the current k8s environment, we created a new scheduler and ran gpu-admission as its scheduler extender. Here is its description.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-admission
  namespace: danlu-efficiency
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-admission-cluster-admin
  namespace: danlu-efficiency
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    namespace: danlu-efficiency
    name: gpu-admission
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-admission-config
  namespace: danlu-efficiency
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1alpha1
    kind: KubeSchedulerConfiguration
    schedulerName: gpu-admission
    algorithmSource:
      policy:
        configMap:
          namespace: danlu-efficiency
          name: gpu-admission-policy
    leaderElection:
      leaderElect: true
      lockObjectName: gpu-admission
      lockObjectNamespace: danlu-efficiency
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-admission-policy
  namespace: danlu-efficiency
data:
  policy.cfg : |
    {
      "kind" : "Policy",
      "apiVersion" : "v1",
      "predicates" : [
        {"name" : "CheckNodeUnschedulable"},
        {"name" : "GeneralPredicates"},
        {"name" : "HostName"},
        {"name" : "PodFitsHostPorts"},
        {"name" : "MatchNodeSelector"},
        {"name" : "PodFitsResources"},
        {"name" : "NoDiskConflict"},
        {"name" : "PodToleratesNodeTaints"},
        {"name" : "MaxEBSVolumeCount"},
        {"name" : "MaxGCEPDVolumeCount"},
        {"name" : "MaxAzureDiskVolumeCount"},
        {"name" : "CheckVolumeBinding"},
        {"name" : "NoVolumeZoneConflict"},
        {"name" : "MatchInterPodAffinity"}
       ],
       "priorities" : [
         {"name" : "EqualPriority", "weight" : 1},
         {"name" : "MostRequestedPriority", "weight" : 1},
         {"name" : "RequestedToCapacityRatioPriority", "weight" : 1},
         {"name" : "SelectorSpreadPriority", "weight" : 1},
         {"name" : "ServiceSpreadingPriority", "weight" : 1},
         {"name" : "InterPodAffinityPriority", "weight" : 1},
         {"name" : "LeastRequestedPriority", "weight" : 1},
         {"name" : "BalancedResourceAllocation", "weight" : 1},
         {"name" : "NodePreferAvoidPodsPriority", "weight" : 1},
         {"name" : "NodeAffinityPriority", "weight" : 1},
         {"name" : "TaintTolerationPriority", "weight" : 1},
         {"name" : "ImageLocalityPriority", "weight" : 1}
       ],
      "extenders" : [
       {
               "urlPrefix": "http://localhost:3456/scheduler",
          "filterVerb": "predicates",
          "enableHttps": false,
          "nodeCacheCapable": false
       }
      ],
      "hardPodAffinitySymmetricWeight" : 10
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-admission
  namespace: danlu-efficiency
  labels:
    app: gpu-admission
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-admission
  template:
    metadata:
      labels:
        app: gpu-admission
    spec:
      serviceAccountName: gpu-admission
      volumes:
        - name: gpu-admission-config
          configMap:
            name: gpu-admission-config
      containers:
        - name: gpu-admission-ctr
          image: gcr.io/google_containers/hyperkube:v1.13.4
          imagePullPolicy: IfNotPresent
          args:
            - kube-scheduler
            - --config=/gpu-admission/config.yaml
            - -v=4
          volumeMounts:
            - name: gpu-admission-config
              mountPath: /gpu-admission
        - name: gpu-admission-extender-ctr
          image: xxx:gpu-admission-v0.1
          imagePullPolicy: Always
          livenessProbe:
            httpGet:
              path: /version
              port: 3456
          readinessProbe:
            httpGet:
              path: /version
              port: 3456
          ports:
            - containerPort: 3456
      imagePullSecrets:
      - name: regcred

This is the gpu-admission-ctr container log.

I0827 09:18:30.335120       1 scheduler.go:525] Attempting to schedule pod: danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-9gttf
E0827 09:18:30.338224       1 factory.go:1519] Error scheduling danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-9gttf: pod tencent-gpu-test-app-time-cost-7fc956cd5f-9gttf had been predicated!; retrying
I0827 09:18:30.338289       1 factory.go:1613] Updating pod condition for danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-9gttf to (PodScheduled==False)
E0827 09:18:30.339964       1 scheduler.go:546] error selecting node for pod: pod tencent-gpu-test-app-time-cost-7fc956cd5f-9gttf had been predicated!
I0827 09:18:31.465811       1 reflector.go:357] k8s.io/client-go/informers/factory.go:132: Watch close - *v1.ReplicaSet total 114 items received
I0827 09:18:39.729290       1 factory.go:1392] About to try and schedule pod danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92
I0827 09:18:39.729311       1 scheduler.go:525] Attempting to schedule pod: danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92
I0827 09:18:39.736815       1 scheduler_binder.go:207] AssumePodVolumes for pod "danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92", node "ai-1080ti-25"
I0827 09:18:39.736841       1 scheduler_binder.go:217] AssumePodVolumes for pod "danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92", node "ai-1080ti-25": all PVCs bound and nothing to do
I0827 09:18:39.736901       1 factory.go:1604] Attempting to bind tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92 to ai-1080ti-25
I0827 09:18:39.736909       1 factory.go:1392] About to try and schedule pod danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92
I0827 09:18:39.736923       1 scheduler.go:525] Attempting to schedule pod: danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92
E0827 09:18:39.739886       1 factory.go:1519] Error scheduling danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92: pod tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92 had been predicated!; retrying
I0827 09:18:39.739927       1 factory.go:1613] Updating pod condition for danlu-efficiency/tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92 to (PodScheduled==False)
E0827 09:18:39.742164       1 scheduler.go:546] error selecting node for pod: pod tencent-gpu-test-app-time-cost-7fc956cd5f-7ww92 had been predicated!

This is the gpu-admission-extender-ctr container log.

W0827 08:49:06.183834       1 reflector.go:302] k8s.io/client-go@v0.0.0-20190816231410-2d3c76f9091b/tools/cache/reflector.go:98: watch of *v1.ConfigMap ended with: too old resource version: 472556199 (472557230)
W0827 09:02:18.188971       1 reflector.go:302] k8s.io/client-go@v0.0.0-20190816231410-2d3c76f9091b/tools/cache/reflector.go:98: watch of *v1.ConfigMap ended with: too old resource version: 472562446 (472562769)
W0827 09:18:30.195060       1 reflector.go:302] k8s.io/client-go@v0.0.0-20190816231410-2d3c76f9091b/tools/cache/reflector.go:98: watch of *v1.ConfigMap ended with: too old resource version: 472567868 (472569885)

I am eager to know whether this problem is related to mixing NVIDIA GPU (nvidia.com/gpu) Pods with vcuda Pods, because it determines whether we can use this in our production environment. What puzzles me is that some mixed-use scenarios run correctly. Looking forward to your reply.

@mYmNeo (Contributor) commented Aug 28, 2020

gpu-admission doesn't have a view of your nvidia.com/gpu pods, so it thinks card 0 is the best fit for your pod, but gpu-manager has the real card usage and it picks card 4. A cluster should have only one controller managing the GPU cards.

@CoolDarran commented Sep 3, 2020

gpu-admission doesn't have a view of your nvidia.com/gpu pods, so it thinks card 0 is the best fit for your pod, but gpu-manager has the real card usage and it picks card 4. A cluster should have only one controller managing the GPU cards.

I don't have NVIDIA/k8s-device-plugin installed, but I also got this error.

@qifengz commented Jan 22, 2021

I met this issue too. I added some debug logs, as shown below:

I0122 04:41:31.480076 13774 tree.go:119] Update device information
I0122 04:41:31.486222 13774 tree.go:135] node 0, pid: [], memory: 0, utilization: 0, pendingReset: false
I0122 04:41:31.492388 13774 tree.go:135] node 1, pid: [], memory: 0, utilization: 0, pendingReset: false
I0122 04:41:31.492898 13774 allocator.go:375] Tree graph: ROOT:2
|---PHB (aval: 2, pids: [], usedMemory: 0, totalMemory: 12456230912, allocatableCores: 0, allocatableMemory: 0)
| |---GPU0 (pids: [], usedMemory: 0, totalMemory: 6233391104, allocatableCores: 100, allocatableMemory: 6233391104)
| |---GPU1 (pids: [], usedMemory: 0, totalMemory: 6222839808, allocatableCores: 100, allocatableMemory: 6222839808)
I0122 04:41:31.492918 13774 allocator.go:386] Try allocate for 15465015-367d-4f84-9610-0d220b917f99(nvidia), vcore 50, vmemory 3221225472
I0122 04:41:31.492943 13774 share.go:58] Pick up 1 mask 10, cores: 100, memory: 6222839808
I0122 04:41:31.493003 13774 allocator.go:445] devStr: /dev/nvidia0
I0122 04:41:31.493019 13774 allocator.go:447] predicateNode: GPU0
I0122 04:41:31.493043 13774 allocator.go:448] nodes[0]: GPU1
E0122 04:41:31.493056 13774 allocator.go:736] Nvidia node mismatch for pod vcuda(nvidia), pick up:/dev/nvidia1 predicate: /dev/nvidia0

I wonder why nodes[0] is used specifically when there are many cards?

if predicateNode.MinorName() != nodes[0].MinorName() {

@qifengz commented Jan 22, 2021

After I removed these four lines, it works normally!

if predicateNode.MinorName() != nodes[0].MinorName() {
	return nil, fmt.Errorf("Nvidia node mismatch for pod %s(%s), pick up:%s predicate: %s",
		pod.Name, container.Name, nodes[0].MinorName(), predicateNode.MinorName())
}

@zwpaper commented Feb 20, 2021

@qifengz deleting that error-checking code is not recommended. It only looks like it works because gpu-manager can actually use nodes[0], but gpu-admission would then mistakenly think the pod is using predicateNode.

You can check this with nvidia-smi and the pod annotations.

@fighterhit (Contributor Author) commented Feb 23, 2021

Hi @qifengz, @zwpaper has pointed out the reason. If you read the code, you will find that gpu-manager actually sorts GPUs according to topology and the number of processes running on each GPU, but gpu-admission DOES NOT KNOW this information.

sorter = linkSort(nvidia.ByType, nvidia.ByAvailable, nvidia.ByMemory, nvidia.ByPids, nvidia.ByID)

sorter = shareModeSort(nvidia.ByAllocatableCores, nvidia.ByAllocatableMemory, nvidia.ByPids, nvidia.ByID)

  • nvidia.ByType: sort by GPU topology (ref nvidia-smi topo --matrix)
  • nvidia.ByPids: sort by the number of processes running on the GPU

https://github.com/tkestack/gpu-admission/blob/47d56ae99ef7f24f2c9c4d33d17567e2e52f3ba2/pkg/algorithm/exclusive.go#L48-L51

https://github.com/tkestack/gpu-admission/blob/47d56ae99ef7f24f2c9c4d33d17567e2e52f3ba2/pkg/algorithm/share.go#L47

Therefore, my approach is to delete these two sorting criteria (nvidia.ByType and nvidia.ByPids) so that gpu-manager is consistent with gpu-admission's sorting algorithm. Although this loses a key feature of gpu-manager, it minimizes the probability of conflict. Hope it helps. :)
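For concreteness, the two sorter lines quoted above would then look like this. This is only a sketch of the described change, not a tested patch; it simply restates those lines with nvidia.ByType and nvidia.ByPids removed.

// before (as quoted above):
//   sorter = linkSort(nvidia.ByType, nvidia.ByAvailable, nvidia.ByMemory, nvidia.ByPids, nvidia.ByID)
//   sorter = shareModeSort(nvidia.ByAllocatableCores, nvidia.ByAllocatableMemory, nvidia.ByPids, nvidia.ByID)
// after removing the topology- and process-count-based criteria:
sorter = linkSort(nvidia.ByAvailable, nvidia.ByMemory, nvidia.ByID)
sorter = shareModeSort(nvidia.ByAllocatableCores, nvidia.ByAllocatableMemory, nvidia.ByID)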

@zxt620 commented Mar 3, 2021

(Quoting @fighterhit's workaround above, i.e. deleting the nvidia.ByType and nvidia.ByPids sorting criteria.)

After deleting them, does it cause any other problems?

@fighterhit (Contributor Author)

(Quoting @zxt620: "After deleting them, does it cause any other problems?")

Performance was not very satisfactory in my tests.

@HeroBcat (Contributor) commented Mar 5, 2021

@qifengz Your case is the same as mine.

|---PHB (aval: 2, pids: [], usedMemory: 0, totalMemory: 12456230912, allocatableCores: 0, allocatableMemory: 0)
| |---GPU0 (pids: [], usedMemory: 0, totalMemory: 6233391104, allocatableCores: 100, allocatableMemory: 6233391104)
| |---GPU1 (pids: [], usedMemory: 0, totalMemory: 6222839808, allocatableCores: 100, allocatableMemory: 6222839808)

If you look carefully, you will find that the memory of GPU1 (6222839808) is less than that of GPU0 (6233391104).
So even when neither GPU is allocated, gpu-manager will pick GPU1 while gpu-admission predicates GPU0, which leads to the mismatch.

The code that caused this issue is in

ByAllocatableMemory = func(p1, p2 *NvidiaNode) bool {
	return p1.AllocatableMeta.Memory < p2.AllocatableMeta.Memory
}

#74 fixed it.
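To make the failure mode concrete, here is a small, self-contained Go illustration (not the project's code) of how an ascending sort on allocatable memory orders the two idle cards from the tree dump above, picking GPU1 while gpu-admission had predicated GPU0:

package main

import (
	"fmt"
	"sort"
)

// node is an illustrative stand-in for the two idle cards in the tree dump.
type node struct {
	id                int
	allocatableMemory uint64
}

func main() {
	nodes := []node{
		{id: 0, allocatableMemory: 6233391104}, // GPU0
		{id: 1, allocatableMemory: 6222839808}, // GPU1 reports slightly less memory
	}
	// Ascending comparison on allocatable memory, as in the quoted
	// ByAllocatableMemory function, puts GPU1 first.
	sort.Slice(nodes, func(i, j int) bool {
		return nodes[i].allocatableMemory < nodes[j].allocatableMemory
	})
	fmt.Printf("gpu-manager picks /dev/nvidia%d\n", nodes[0].id) // /dev/nvidia1
	fmt.Println("gpu-admission predicated /dev/nvidia0")         // hence the mismatch
}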

@qifengz commented Mar 12, 2021

@fighterhit @HeroBcat Got it, that's helpful!

@lynnfi commented May 24, 2024

I fixed the problem by deleting that code, and it works fine. Because gpu-admission gets the usage from k8s, even without this check it can still pick up the final scheduling info from k8s every 30s.
If you want to use the fixed code, you can fork it from my GitHub (I also fixed some build errors).

https://github.com/lynnfi/gpu-manager

