GPU addon does not fully set up GPU access #272

Open
ca-scribner opened this issue Apr 11, 2024 · 2 comments

Comments

@ca-scribner

Summary

A few Charmed Kubeflow users have reported that, after running microk8s enable gpu, the GPUs are not exposed to their workloads unless they also set nvidia as the default runtime in the containerd-template.toml.

I'm not sure if this is a new bug or something that has happened for a while, but it has been raised twice (1, 2) this month.

What Should Happen Instead?

For a machine that has an NVIDIA GPU, running microk8s enable gpu (or the newer nvidia addon, on recent microk8s versions) should fully set up the GPU for use by pods.
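
Concretely, once the addon has fully set things up, a plain pod that requests a GPU should be able to run nvidia-smi without any runtimeClassName or containerd changes. A minimal smoke test might look like this (a sketch; the pod name and CUDA base image are illustrative, not taken from the reports):

# The node should advertise the GPU resource
microk8s kubectl describe nodes | grep 'nvidia.com/gpu'

# ...and a pod requesting it should see the device
microk8s kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.3.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Once the pod has completed, expect the usual nvidia-smi device table
microk8s kubectl logs gpu-smoke-test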

Reproduction Steps

See this thread for more details (sorry, I'm creating this to report someone else's issue)

Introspection Report

Sorry, I'm creating this to report someone else's issue and don't have the report

Can you suggest a fix?

No

Are you interested in contributing with a fix?

No, but will cc others who might

@AnotherStranger

Hello,
I was one of the mentioned reporters and I will try to provide some more specific reproduction steps.

OS: Fedora 39 Server

My Set-Up Procedure

  1. I installed the NVIDIA drivers using RPMFusion (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
  2. I also followed the How-To guide for CUDA (https://rpmfusion.org/Howto/CUDA?highlight=%28%5CbCategoryHowto%5Cb%29)
  3. I installed the Docker Engine and snapd
  4. I followed the setup guide for the NVIDIA Container Toolkit (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
  5. I set up microk8s following the Charmed Kubeflow guide (https://charmed-kubeflow.io/docs/get-started-with-charmed-kubeflow)
  6. After some trouble with firewalld, I got everything working.
  7. Once Kubeflow was running on the CPU, I wanted to expose GPUs to the cluster, so I ran microk8s enable gpu. However, this didn't work: the node got the nvidia labels as expected, but pods could not access the GPU.
  8. I could get the pods running by setting runtimeClassName: nvidia manually (see the sketch after this list).
  9. Because of this, I decided to set the default runtime to nvidia in /var/snap/microk8s/current/args/containerd-template.toml. With this change, everything started working as expected.
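
The step 8 workaround amounts to pinning the pod to the nvidia runtime class (a sketch; the pod name and image are illustrative, and it assumes the nvidia RuntimeClass that the GPU addon's operator creates in the cluster):

microk8s kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-runtime-test
spec:
  runtimeClassName: nvidia   # run this pod under the nvidia containerd runtime
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.3.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF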

I hope this helps.

@gustavosr98

gustavosr98 commented Apr 12, 2024

@ca-scribner for context, ref 2 is not related to the GPU add-on.

I am manually installing the toolkit rather than having MicroK8s install it, because of some specific considerations of running this on Ubuntu Core. The MicroK8s team is mostly aware of this.

Now, since I am also installing Kubeflow on top of this, I need containerd's default runtime to be nvidia.

And since I am installing the toolkit manually, I am also editing a file manually, and I am not sure how persistent this change will be across microk8s updates:

# Comment out the templated default runtime and set nvidia instead
sudo vi /var/snap/microk8s/current/args/containerd-template.toml

[..]
    # default_runtime_name = "${RUNTIME}"
    default_runtime_name = "nvidia"

# Restart containerd so the change takes effect
sudo systemctl restart snap.microk8s.daemon-containerd.service
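
If you prefer to script that edit rather than opening vi, something like this should work (a sketch; it assumes the stock template still contains the exact default_runtime_name = "${RUNTIME}" line shown above):

# Swap the templated default runtime for nvidia, keeping a .bak backup
sudo sed -i.bak 's/default_runtime_name = "${RUNTIME}"/default_runtime_name = "nvidia"/' \
  /var/snap/microk8s/current/args/containerd-template.toml
sudo systemctl restart snap.microk8s.daemon-containerd.service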
