
MPS Functionality Not Working Correctly in k8s-device-plugin Versions v0.15 to v0.17 #1094

Open
haitwang-cloud opened this issue Dec 10, 2024 · 1 comment

haitwang-cloud commented Dec 10, 2024

We are experiencing issues with the MPS (Multi-Process Service) functionality in k8s-device-plugin versions v0.15 through v0.17: GPU sharing via MPS does not work in our cluster. Could you please help identify the breaking change?

Related issues include:

Additionally, I am referring to this document for MPS: https://docs.google.com/document/d/1H-ddA11laPQf_1olwXRjEDbzNihxprjPr74pZ4Vdf2M/edit?tab=t.0
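
For context, we enable MPS through the device plugin's sharing configuration. The following is a minimal sketch of the kind of config we apply, assuming the documented sharing.mps schema for v0.15+; the resource name and replica count are illustrative rather than our exact values:

# Sketch of a device plugin config enabling MPS sharing
# (resource name and replica count below are illustrative).
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu   # extended resource to be shared via MPS
      replicas: 4            # number of MPS shares advertised per physical GPU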

Symptoms:

  • When attempting to run the image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04 as a test workload, the MPS servers spawned by the mps-control-daemon repeatedly encounter a fatal exception and shut down (an illustrative test pod spec follows this list).
  • Upon checking the GPU status using nvidia-smi, we do not see any nvidia-cuda-mps-server processes running, even though other CUDA applications are utilizing the GPUs.
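
An illustrative test pod spec is shown below; the pod name is hypothetical, and it simply requests one share of an MPS-partitioned GPU:

# Illustrative test pod (the name is hypothetical).
apiVersion: v1
kind: Pod
metadata:
  name: mps-vectoradd-test
spec:
  restartPolicy: Never
  containers:
  - name: vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1   # one share of an MPS-partitioned GPU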

Logs:

device-plugin-mps-control-daemon Logs:

[2024-12-10 06:38:01.120 Control    64] Accepting connection...
[2024-12-10 06:38:01.120 Control    64] User did not send valid credentials
[2024-12-10 06:38:01.120 Control    64] Accepting connection...
[2024-12-10 06:38:01.120 Control    64] NEW CLIENT 0 from user 1001: Server is not ready, push client to pending list
[2024-12-10 06:38:01.120 Control    64] Starting new server 92 for user 1001
[2024-12-10 06:38:01.127 Control    64] Accepting connection...
[2024-12-10 06:38:01.158 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.170 Control    64] Server 92 exited with status 1
[2024-12-10 06:38:01.170 Control    64] Starting new server 95 for user 1001
[2024-12-10 06:38:01.177 Control    64] Accepting connection...
[2024-12-10 06:38:01.210 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.222 Control    64] Server 95 exited with status 1
[2024-12-10 06:38:01.223 Control    64] Starting new server 98 for user 1001
[2024-12-10 06:38:01.229 Control    64] Accepting connection...
[2024-12-10 06:38:01.260 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.272 Control    64] Server 98 exited with status 1
[2024-12-10 06:38:01.272 Control    64] Starting new server 101 for user 1001
[2024-12-10 06:38:01.279 Control    64] Accepting connection...
[2024-12-10 06:38:01.311 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.323 Control    64] Server 101 exited with status 1
[2024-12-10 06:38:01.324 Control    64] Starting new server 104 for user 1001
[2024-12-10 06:38:01.330 Control    64] Accepting connection...
[2024-12-10 06:38:01.361 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.373 Control    64] Server 104 exited with status 1
[2024-12-10 06:38:01.373 Control    64] Starting new server 107 for user 1001
[2024-12-10 06:38:01.377 Control    64] Accepting connection...
[2024-12-10 06:38:01.406 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.418 Control    64] Server 107 exited with status 1
[2024-12-10 06:38:01.418 Control    64] Removed Shm file at 
[2024-12-10 06:38:15.078 Control    64] Accepting connection...
[2024-12-10 06:38:15.078 Control    64] NEW UI
[2024-12-10 06:38:15.078 Control    64] Cmd:get_default_active_thread_percentage
[2024-12-10 06:38:15.078 Control    64] 25.0
[2024-12-10 06:38:15.078 Control    64] UI closed

nvidia-smi Output:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-16GB           Off | 00000000:3B:00.0 Off |                    0 |
| N/A   33C    P0              38W / 250W |   4574MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE-16GB           Off | 00000000:AF:00.0 Off |                    0 |
| N/A   33C    P0              39W / 250W |   3290MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE-16GB           Off | 00000000:D8:00.0 Off |                    0 |
| N/A   65C    P0             248W / 250W |   5624MiB / 16384MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   3623482      C   /usr/local/bin/python3.11                  4570MiB |
|    1   N/A  N/A   3623489      C   /usr/local/bin/python3.11                  3286MiB |
|    2   N/A  N/A   3623708      C   /usr/local/bin/python3.11                  3286MiB |
|    2   N/A  N/A   3628468      C   python                                     2334MiB |
+---------------------------------------------------------------------------------------+

Expected Behavior:

  • The mps-control-daemon should start and manage the nvidia-cuda-mps-server processes without encountering fatal exceptions.
  • We expect to see nvidia-cuda-mps-server processes in the nvidia-smi output while MPS-shared workloads are running.

Additional Information:

  • Kubernetes Version: v1.29.9
  • OS Image: Garden Linux 1443.10
  • Kernel Version: 6.6.41-amd64
  • Container Runtime: containerd://1.6.24
  • GPU Operator Version: v24.9
  • k8s-device-plugin Version: v0.17.0
  • CUDA Version: 12.2
  • Driver Version: 535.86.10
chipzoller (Contributor) commented

I do not see this issue when I test the same version of the device plugin (deployed via the GPU Operator) on an EKS node with a T4 GPU and driver version 550.127.08. I also used a more recent tag of the nvcr.io/nvidia/k8s/cuda-sample image, vectoradd-cuda12.5.0-ubuntu22.04. The MPS control daemon runs, accepts a client, and continues to run; a rough sketch of how sharing was configured in my test is below.
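
This is roughly how MPS sharing gets enabled when the device plugin is deployed through the GPU Operator, using the operator's devicePlugin.config mechanism (ClusterPolicy is pointed at a named config in a ConfigMap). The ConfigMap name, namespace, config key, and replica count below are illustrative, not my exact setup:

# Illustrative sharing config delivered to the device plugin via the GPU Operator;
# ClusterPolicy's spec.devicePlugin.config references this ConfigMap by name.
apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config   # illustrative name
  namespace: gpu-operator      # illustrative namespace
data:
  mps-any: |
    version: v1
    sharing:
      mps:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

The test pod itself just requests nvidia.com/gpu: 1 against the vectoradd-cuda12.5.0-ubuntu22.04 image, i.e. the same shape as the reproducer above apart from the tag.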
