[shim] Change NVIDIA GPU detection method #1945

un-def · 2024-11-01T15:10:18Z

Check for /dev/nvidiactl, not nvidia-smi binary, to detect NVIDIA GPU. Unlike the binary, the devfs file only exists if some conditions are met (the exact way how /dev/ndivia* character device files are created is complicated and setup-specific — involving some of: kernel module, udev, modprobe, nvidia-persistenced, X server, and more — but in general, it should be safe to assume that if NVIDIA GPU is available, then /dev/nvidiactl does exist.
Run nvidia-smi to get GPU info directly on the host, not inside a container. Using Docker is completely unnecessary, as NVIDIA Container Toolkit mounts libs and executables from the host — dstack-provided Docker image doesn't even contain nvidia-smi binary, it's always a bind-mounted file from the host.

Fixes: #1942

* Check for `/dev/nvidiactl`, not `nvidia-smi` binary, to detect NVIDIA GPU. Unlike the binary, the devfs file only exists if some conditions are met (the exact way how `/dev/ndivia*` character device files are created is complicated and setup-specific — involving some of: kernel module, udev, modprobe, nvidia-persistenced, X server, and more — but in general, it should be safe to assume that if NVIDIA GPU is available, then `/dev/nvidiactl` does exist. * Run `nvidia-smi` to get GPU info directly on the host, not inside a container. Using Docker is completely unnecessary, as NVIDIA Container Toolkit mounts libs and executables from the host — dstack-provided Docker image doesn't even contain `nvidia-smi` binary, it's always a bind-mounted file from the host. Fixes: #1942

jvstme · 2024-11-01T22:03:47Z

runner/internal/shim/gpu.go

-		Command: "docker",
-		Args: []string{
-			"run",
-			"--rm",
-			"--gpus", "all",
-			nvidiaSmiImage,
-			"nvidia-smi", "--query-gpu=gpu_name,memory.total", "--format=csv,nounits",
-		},


Running nvidia-smi in Docker actually has a small advantage, since it also validates that the NVIDIA container runtime is installed and working. However, it's not reporting errors to the server, so the erroneous instance is still added to the fleet, just without any GPUs in resources. Considering this validation doesn't work properly, I think it's OK to drop it for now the way it's done in this PR and later implement it properly in #1947.

Right, didn't think about it in this context. The motivation to get rid of Docker here was to not rely on some third-party (from the point of view of the self-hosted dstack administrator) binary blob (dstack-provided Docker image) — yes, we have these images pre-pulled in the dstack-provided OS images, but if I use SSH fleets or a cloud provider with my own OS images, this check may slow down provisioning a bit (the time to pull our Docker image), and left the image that I have no intention to use cached on the instance.

I think that it would be enough to check Client.Info to ensure that 1) Docker is installed and running, and 2) NVIDIA Container Toolkit runtime is installed.

We still rely on un1def/amd-smi image as a temporary solution to get the info about AMD GPU, but I think we can either make amd-smi a hard requirement (include it in dstack-provided OS images and require it to be present on AMD-powered user-provided OS images and SSH instances) or bring our statically-linked amd-smi binary (not sure if it as easy to do as to say).

un-def requested a review from jvstme November 1, 2024 15:19

jvstme approved these changes Nov 1, 2024

View reviewed changes

un-def merged commit 1a3dca6 into master Nov 4, 2024
23 checks passed

un-def deleted the issue_1942_shim_change_nvidia_gpu_detection_method branch November 4, 2024 09:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[shim] Change NVIDIA GPU detection method #1945

[shim] Change NVIDIA GPU detection method #1945

un-def commented Nov 1, 2024

jvstme Nov 1, 2024

un-def Nov 2, 2024

[shim] Change NVIDIA GPU detection method #1945

[shim] Change NVIDIA GPU detection method #1945

Conversation

un-def commented Nov 1, 2024

jvstme Nov 1, 2024

Choose a reason for hiding this comment

un-def Nov 2, 2024

Choose a reason for hiding this comment