We are experiencing issues with MPS (Multi-Process Service) in k8s-device-plugin versions v0.15 through v0.17: it does not function properly. Could you please help identify the breaking change?
Related issues include:
Additionally, I am referring to this document for MPS: https://docs.google.com/document/d/1H-ddA11laPQf_1olwXRjEDbzNihxprjPr74pZ4Vdf2M/edit?tab=t.0
Symptoms:
When testing with the image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04, the MPS servers started by the mps-control-daemon repeatedly encounter a fatal exception and shut down.
When checking the GPU status with nvidia-smi, we do not see any nvidia-cuda-mps-server processes running, even though other CUDA applications are using the GPUs.
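For reference, a minimal sketch of the kind of test Pod in question, assuming the MPS-shared GPUs are still advertised under the default nvidia.com/gpu resource name (the pod and container names are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: vectoradd-mps-test        # illustrative name
spec:
  restartPolicy: OnFailure
  containers:
    - name: vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1       # assumes the default (non-renamed) shared resource name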
Logs:
device-plugin-mps-control-daemon Logs:
[2024-12-10 06:38:01.120 Control 64] Accepting connection...
[2024-12-10 06:38:01.120 Control 64] User did not send valid credentials
[2024-12-10 06:38:01.120 Control 64] Accepting connection...
[2024-12-10 06:38:01.120 Control 64] NEW CLIENT 0 from user 1001: Server is not ready, push client to pending list
[2024-12-10 06:38:01.120 Control 64] Starting new server 92 for user 1001
[2024-12-10 06:38:01.127 Control 64] Accepting connection...
[2024-12-10 06:38:01.158 Control 64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.170 Control 64] Server 92 exited with status 1
[2024-12-10 06:38:01.170 Control 64] Starting new server 95 for user 1001
[2024-12-10 06:38:01.177 Control 64] Accepting connection...
[2024-12-10 06:38:01.210 Control 64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.222 Control 64] Server 95 exited with status 1
[2024-12-10 06:38:01.223 Control 64] Starting new server 98 for user 1001
[2024-12-10 06:38:01.229 Control 64] Accepting connection...
[2024-12-10 06:38:01.260 Control 64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.272 Control 64] Server 98 exited with status 1
[2024-12-10 06:38:01.272 Control 64] Starting new server 101 for user 1001
[2024-12-10 06:38:01.279 Control 64] Accepting connection...
[2024-12-10 06:38:01.311 Control 64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.323 Control 64] Server 101 exited with status 1
[2024-12-10 06:38:01.324 Control 64] Starting new server 104 for user 1001
[2024-12-10 06:38:01.330 Control 64] Accepting connection...
[2024-12-10 06:38:01.361 Control 64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.373 Control 64] Server 104 exited with status 1
[2024-12-10 06:38:01.373 Control 64] Starting new server 107 for user 1001
[2024-12-10 06:38:01.377 Control 64] Accepting connection...
[2024-12-10 06:38:01.406 Control 64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.418 Control 64] Server 107 exited with status 1
[2024-12-10 06:38:01.418 Control 64] Removed Shm file at
[2024-12-10 06:38:15.078 Control 64] Accepting connection...
[2024-12-10 06:38:15.078 Control 64] NEW UI
[2024-12-10 06:38:15.078 Control 64] Cmd:get_default_active_thread_percentage
[2024-12-10 06:38:15.078 Control 64] 25.0
[2024-12-10 06:38:15.078 Control 64] UI closed
nvidia-smi Output:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-PCIE-16GB Off | 00000000:3B:00.0 Off | 0 |
| N/A 33C P0 38W / 250W | 4574MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE-16GB Off | 00000000:AF:00.0 Off | 0 |
| N/A 33C P0 39W / 250W | 3290MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 Tesla V100-PCIE-16GB Off | 00000000:D8:00.0 Off | 0 |
| N/A 65C P0 248W / 250W | 5624MiB / 16384MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3623482 C /usr/local/bin/python3.11 4570MiB |
| 1 N/A N/A 3623489 C /usr/local/bin/python3.11 3286MiB |
| 2 N/A N/A 3623708 C /usr/local/bin/python3.11 3286MiB |
| 2 N/A N/A 3628468 C python 2334MiB |
+---------------------------------------------------------------------------------------+
Expected Behavior:
The mps-control-daemon should start and manage the nvidia-cuda-mps-server processes without encountering fatal exceptions.
We expect to see nvidia-cuda-mps-server processes running when using the MPS-enabled images.
Additional Information:
Kubernetes Version: v1.29.9
OS Image: Garden Linux 1443.10
Kernel Version: 6.6.41-amd64
Container Runtime: containerd://1.6.24
GPU Operator Version: v24.9
k8s-device-plugin Version: v0.17.0
CUDA Version: 12.2
Driver Version: 535.86.10
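For completeness, MPS sharing is enabled through the device plugin's sharing configuration; a minimal sketch is below. The replica count of 4 is an assumption inferred from the 25% default active thread percentage in the control-daemon log (100% / 4). With the GPU Operator this is typically supplied as a ConfigMap referenced by the device plugin configuration.

version: v1
sharing:
  mps:
    resources:
      - name: nvidia.com/gpu
        replicas: 4               # assumed from the 25% default active thread percentage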
I do not see this issue when I test the same version of the device plugin (deployed via the GPU Operator) on an EKS node with a T4 and driver version 550.127.08. I also used a more recent tag of the nvcr.io/nvidia/k8s/cuda-sample image, vectoradd-cuda12.5.0-ubuntu22.04. The MPS daemon runs, accepts a client, and continues to run.