GPU addon does not fully set up GPU access #272

Open
ca-scribner opened this issue Apr 11, 2024 · 2 comments

Comments

@ca-scribner

Summary

A few Charmed Kubeflow users have reported that, after running microk8s enable gpu, the GPUs are not exposed to their workloads unless they also set nvidia as the default runtime in the containerd-template.toml.

I'm not sure if this is a new bug or something that has happened for a while, but it has been raised twice (1, 2) this month.

What Should Happen Instead?

For a machine that has an NVIDIA GPU, running microk8s enable gpu (or the newer nvidia addon, on recent microk8s versions) should fully set up the GPU for use by pods.
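
Concretely, once the addon has fully set things up, a plain pod that requests a GPU should be able to run nvidia-smi without any runtimeClassName or containerd changes. A minimal smoke test might look like this (a sketch; the pod name and CUDA base image are illustrative, not taken from the reports):

# The node should advertise the GPU resource
microk8s kubectl describe nodes | grep 'nvidia.com/gpu'

# ...and a pod requesting it should see the device
microk8s kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.3.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Once the pod has completed, expect the usual nvidia-smi device table
microk8s kubectl logs gpu-smoke-test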

Reproduction Steps

See this thread for more details (sorry, I'm creating this to report someone else's issue)

Introspection Report

Sorry, I'm creating this to report someone else's issue and don't have the report

Can you suggest a fix?

No

Are you interested in contributing with a fix?

No, but will cc others who might

@AnotherStranger

Hello,
I was one of the mentioned reporters and I will try to provide some more specific reproduction steps.

OS: Fedora 39 Server

My Set-Up Procedure

  1. I installed the NVIDIA drivers using RPMFusion (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
  2. I also followed the How-To guide for CUDA (https://rpmfusion.org/Howto/CUDA?highlight=%28%5CbCategoryHowto%5Cb%29)
  3. I installed the Docker Engine and snapd
  4. I followed the setup guide for the NVIDIA Container Toolkit (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
  5. I set up microk8s following the Charmed Kubeflow guide (https://charmed-kubeflow.io/docs/get-started-with-charmed-kubeflow)
  6. After some trouble with firewalld, I got everything working.
  7. Once Kubeflow was running on the CPU, I wanted to expose GPUs to the cluster, so I ran microk8s enable gpu. However, this didn't work: the node got the nvidia labels as expected, but pods could not access the GPU.
  8. I could get the pods running by setting runtimeClassName: nvidia manually (see the sketch after this list).
  9. Because of this, I decided to set the default runtime to nvidia in /var/snap/microk8s/current/args/containerd-template.toml. With this change, everything started working as expected.
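
The step 8 workaround amounts to pinning the pod to the nvidia runtime class (a sketch; the pod name and image are illustrative, and it assumes the nvidia RuntimeClass that the GPU addon's operator creates in the cluster):

microk8s kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-runtime-test
spec:
  runtimeClassName: nvidia   # run this pod under the nvidia containerd runtime
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.3.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF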

I hope this helps.

@gustavosr98

gustavosr98 commented Apr 12, 2024

@ca-scribner for context, ref 2 is not related to the GPU add-on.

I am manually installing the toolkit rather than having MicroK8s install it, because of some specific considerations of running this on Ubuntu Core. The MicroK8s team is mostly aware of this.

Now, since I am also installing Kubeflow on top of this, I need containerd's default runtime to be nvidia.

And since I am installing the toolkit manually, I am also editing a file manually, and I am not sure how persistent this change will be across microk8s updates:

# Comment out the templated default runtime and set nvidia instead
sudo vi /var/snap/microk8s/current/args/containerd-template.toml

[..]
    # default_runtime_name = "${RUNTIME}"
    default_runtime_name = "nvidia"

# Restart containerd so the change takes effect
sudo systemctl restart snap.microk8s.daemon-containerd.service
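
If you prefer to script that edit rather than opening vi, something like this should work (a sketch; it assumes the stock template still contains the exact default_runtime_name = "${RUNTIME}" line shown above):

# Swap the templated default runtime for nvidia, keeping a .bak backup
sudo sed -i.bak 's/default_runtime_name = "${RUNTIME}"/default_runtime_name = "nvidia"/' \
  /var/snap/microk8s/current/args/containerd-template.toml
sudo systemctl restart snap.microk8s.daemon-containerd.service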
