
MPS Functionality Not Working Correctly in k8s-device-plugin Versions v0.15 to v0.17 #1094

Open
haitwang-cloud opened this issue Dec 10, 2024 · 1 comment

haitwang-cloud commented Dec 10, 2024

We are experiencing issues with the MPS (Multi-Process Service) functionality in k8s-device-plugin versions v0.15 through v0.17: GPU sharing via MPS does not work in our cluster. Could you please help identify the breaking change?

Related issues include:

Additionally, I am referring to this document for MPS: https://docs.google.com/document/d/1H-ddA11laPQf_1olwXRjEDbzNihxprjPr74pZ4Vdf2M/edit?tab=t.0
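
For context, we enable MPS through the device plugin's sharing configuration. The following is a minimal sketch of the kind of config we apply, assuming the documented sharing.mps schema for v0.15+; the resource name and replica count are illustrative rather than our exact values:

# Sketch of a device plugin config enabling MPS sharing
# (resource name and replica count below are illustrative).
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu   # extended resource to be shared via MPS
      replicas: 4            # number of MPS shares advertised per physical GPU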

Symptoms:

  • When attempting to run the image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04 as a test workload, the MPS servers spawned by the mps-control-daemon repeatedly encounter a fatal exception and shut down (an illustrative test pod spec follows this list).
  • Upon checking the GPU status using nvidia-smi, we do not see any nvidia-cuda-mps-server processes running, even though other CUDA applications are utilizing the GPUs.
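
An illustrative test pod spec is shown below; the pod name is hypothetical, and it simply requests one share of an MPS-partitioned GPU:

# Illustrative test pod (the name is hypothetical).
apiVersion: v1
kind: Pod
metadata:
  name: mps-vectoradd-test
spec:
  restartPolicy: Never
  containers:
  - name: vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1   # one share of an MPS-partitioned GPU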

Logs:

device-plugin-mps-control-daemon Logs:

[2024-12-10 06:38:01.120 Control    64] Accepting connection...
[2024-12-10 06:38:01.120 Control    64] User did not send valid credentials
[2024-12-10 06:38:01.120 Control    64] Accepting connection...
[2024-12-10 06:38:01.120 Control    64] NEW CLIENT 0 from user 1001: Server is not ready, push client to pending list
[2024-12-10 06:38:01.120 Control    64] Starting new server 92 for user 1001
[2024-12-10 06:38:01.127 Control    64] Accepting connection...
[2024-12-10 06:38:01.158 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.170 Control    64] Server 92 exited with status 1
[2024-12-10 06:38:01.170 Control    64] Starting new server 95 for user 1001
[2024-12-10 06:38:01.177 Control    64] Accepting connection...
[2024-12-10 06:38:01.210 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.222 Control    64] Server 95 exited with status 1
[2024-12-10 06:38:01.223 Control    64] Starting new server 98 for user 1001
[2024-12-10 06:38:01.229 Control    64] Accepting connection...
[2024-12-10 06:38:01.260 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.272 Control    64] Server 98 exited with status 1
[2024-12-10 06:38:01.272 Control    64] Starting new server 101 for user 1001
[2024-12-10 06:38:01.279 Control    64] Accepting connection...
[2024-12-10 06:38:01.311 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.323 Control    64] Server 101 exited with status 1
[2024-12-10 06:38:01.324 Control    64] Starting new server 104 for user 1001
[2024-12-10 06:38:01.330 Control    64] Accepting connection...
[2024-12-10 06:38:01.361 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.373 Control    64] Server 104 exited with status 1
[2024-12-10 06:38:01.373 Control    64] Starting new server 107 for user 1001
[2024-12-10 06:38:01.377 Control    64] Accepting connection...
[2024-12-10 06:38:01.406 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.418 Control    64] Server 107 exited with status 1
[2024-12-10 06:38:01.418 Control    64] Removed Shm file at 
[2024-12-10 06:38:15.078 Control    64] Accepting connection...
[2024-12-10 06:38:15.078 Control    64] NEW UI
[2024-12-10 06:38:15.078 Control    64] Cmd:get_default_active_thread_percentage
[2024-12-10 06:38:15.078 Control    64] 25.0
[2024-12-10 06:38:15.078 Control    64] UI closed

nvidia-smi Output:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-16GB           Off | 00000000:3B:00.0 Off |                    0 |
| N/A   33C    P0              38W / 250W |   4574MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE-16GB           Off | 00000000:AF:00.0 Off |                    0 |
| N/A   33C    P0              39W / 250W |   3290MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE-16GB           Off | 00000000:D8:00.0 Off |                    0 |
| N/A   65C    P0             248W / 250W |   5624MiB / 16384MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   3623482      C   /usr/local/bin/python3.11                  4570MiB |
|    1   N/A  N/A   3623489      C   /usr/local/bin/python3.11                  3286MiB |
|    2   N/A  N/A   3623708      C   /usr/local/bin/python3.11                  3286MiB |
|    2   N/A  N/A   3628468      C   python                                     2334MiB |
+---------------------------------------------------------------------------------------+

Expected Behavior:

  • The mps-control-daemon should start and manage the nvidia-cuda-mps-server processes without encountering fatal exceptions.
  • We expect to see nvidia-cuda-mps-server processes in the nvidia-smi output while MPS-shared workloads are running.

Additional Information:

  • Kubernetes Version: v1.29.9
  • OS Image: Garden Linux 1443.10
  • Kernel Version: 6.6.41-amd64
  • Container Runtime: containerd://1.6.24
  • GPU Operator Version: v24.9
  • k8s-device-plugin Version: v0.17.0
  • CUDA Version: 12.2
  • Driver Version: 535.86.10
chipzoller (Contributor) commented

I do not see this issue when I test the same version of the device plugin (deployed via the GPU Operator) on an EKS node with a T4 GPU and driver version 550.127.08. I also used a more recent tag of the nvcr.io/nvidia/k8s/cuda-sample image, vectoradd-cuda12.5.0-ubuntu22.04. The MPS control daemon runs, accepts a client, and continues to run; a rough sketch of how sharing was configured in my test is below.
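
This is roughly how MPS sharing gets enabled when the device plugin is deployed through the GPU Operator, using the operator's devicePlugin.config mechanism (ClusterPolicy is pointed at a named config in a ConfigMap). The ConfigMap name, namespace, config key, and replica count below are illustrative, not my exact setup:

# Illustrative sharing config delivered to the device plugin via the GPU Operator;
# ClusterPolicy's spec.devicePlugin.config references this ConfigMap by name.
apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config   # illustrative name
  namespace: gpu-operator      # illustrative namespace
data:
  mps-any: |
    version: v1
    sharing:
      mps:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

The test pod itself just requests nvidia.com/gpu: 1 against the vectoradd-cuda12.5.0-ubuntu22.04 image, i.e. the same shape as the reproducer above apart from the tag.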
