Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[shim] Change NVIDIA GPU detection method #1945

Merged
merged 1 commit into from
Nov 4, 2024

Conversation

un-def
Copy link
Collaborator

@un-def un-def commented Nov 1, 2024

  • Check for /dev/nvidiactl, not nvidia-smi binary, to detect NVIDIA GPU. Unlike the binary, the devfs file only exists if some conditions are met (the exact way how /dev/ndivia* character device files are created is complicated and setup-specific — involving some of: kernel module, udev, modprobe, nvidia-persistenced, X server, and more — but in general, it should be safe to assume that if NVIDIA GPU is available, then /dev/nvidiactl does exist.
  • Run nvidia-smi to get GPU info directly on the host, not inside a container. Using Docker is completely unnecessary, as NVIDIA Container Toolkit mounts libs and executables from the host — dstack-provided Docker image doesn't even contain nvidia-smi binary, it's always a bind-mounted file from the host.

Fixes: #1942

* Check for `/dev/nvidiactl`, not `nvidia-smi` binary, to detect NVIDIA
  GPU. Unlike the binary, the devfs file only exists if some conditions
  are met (the exact way how `/dev/ndivia*` character device files are
  created is complicated and setup-specific — involving some of: kernel
  module, udev, modprobe, nvidia-persistenced, X server, and more — but
  in general, it should be safe to assume that if NVIDIA GPU
  is available, then `/dev/nvidiactl` does exist.
* Run `nvidia-smi` to get GPU info directly on the host, not inside a
  container. Using Docker is completely unnecessary, as NVIDIA Container
  Toolkit mounts libs and executables from the host — dstack-provided
  Docker image doesn't even contain `nvidia-smi` binary, it's always
  a bind-mounted file from the host.

Fixes: #1942
@un-def un-def requested a review from jvstme November 1, 2024 15:19
Comment on lines -59 to -66
Command: "docker",
Args: []string{
"run",
"--rm",
"--gpus", "all",
nvidiaSmiImage,
"nvidia-smi", "--query-gpu=gpu_name,memory.total", "--format=csv,nounits",
},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running nvidia-smi in Docker actually has a small advantage, since it also validates that the NVIDIA container runtime is installed and working. However, it's not reporting errors to the server, so the erroneous instance is still added to the fleet, just without any GPUs in resources. Considering this validation doesn't work properly, I think it's OK to drop it for now the way it's done in this PR and later implement it properly in #1947.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, didn't think about it in this context. The motivation to get rid of Docker here was to not rely on some third-party (from the point of view of the self-hosted dstack administrator) binary blob (dstack-provided Docker image) — yes, we have these images pre-pulled in the dstack-provided OS images, but if I use SSH fleets or a cloud provider with my own OS images, this check may slow down provisioning a bit (the time to pull our Docker image), and left the image that I have no intention to use cached on the instance.

I think that it would be enough to check Client.Info to ensure that 1) Docker is installed and running, and 2) NVIDIA Container Toolkit runtime is installed.

We still rely on un1def/amd-smi image as a temporary solution to get the info about AMD GPU, but I think we can either make amd-smi a hard requirement (include it in dstack-provided OS images and require it to be present on AMD-powered user-provided OS images and SSH instances) or bring our statically-linked amd-smi binary (not sure if it as easy to do as to say).

@un-def un-def merged commit 1a3dca6 into master Nov 4, 2024
23 checks passed
@un-def un-def deleted the issue_1942_shim_change_nvidia_gpu_detection_method branch November 4, 2024 09:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Not possible to use OS image with NVIDIA drivers on CPU instances
2 participants