Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the nvidia runtime cdi #11065

Merged
merged 1 commit into from
Oct 11, 2024
Merged

Conversation

manuelbuil
Copy link
Contributor

@manuelbuil manuelbuil commented Oct 10, 2024

Proposed Changes

Add the nvidia runtime cdi, so that pods can use the GPU with the full functionality

Types of Changes

Bugfix

Verification

In an env with a GPU and the OS drivers correctly installed, deploy the gpu operator:

apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
 name: gpu-operator
 namespace: kube-system
spec:
 repo: https://helm.ngc.nvidia.com/nvidia
 chart: gpu-operator
 targetNamespace: gpu-operator
 createNamespace: true
 valuesContent: |-
   toolkit:
     env:
     - name: CONTAINERD_SOCKET
       value: /run/k3s/containerd/containerd.sock

And after some minutes, check that /var/lib/rancher/rke2/agent/etc/containerd/config.toml includes the nvidia runtimes at the bottom (both nvidia and nvidia-cdi).

Then create a pod including:

      runtimeClassName: nvidia
      containers:
      - image: xxxxxxx
        name: xxxxxxx
        resources:
          limits:
            nvidia.com/gpu: 1
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all

And execute the command mount, you should see /dev/nvidia... devices being mounted and /usr/lib/libnvidia libraries being mounted

Testing

Linked Issues

#11087

User-Facing Change

Add nvidia cdi runtime to the list of supported and discoverable runtimes

Further Comments

Signed-off-by: manuelbuil <mbuil@suse.com>
@manuelbuil manuelbuil requested a review from a team as a code owner October 10, 2024 17:52
Copy link
Member

@brandond brandond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add the corresponding RuntimeClassName definition to the list at

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: nvidia
handler: nvidia
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: nvidia-experimental
handler: nvidia-experimental

Your current example is using the legacy nvidia runtime, not the new nvidia-cdi runtime that you're adding; did you want to use runtimeClassName: nvidia-cdi? Or is the nvidia operator modifying our nvidia RuntimeClass definition so that it uses the nvidia-cdi handler?

Copy link

codecov bot commented Oct 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 44.23%. Comparing base (7552203) to head (e767089).
Report is 4 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (7552203) and HEAD (e767089). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (7552203) HEAD (e767089)
e2etests 7 6
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11065      +/-   ##
==========================================
- Coverage   49.93%   44.23%   -5.71%     
==========================================
  Files         178      178              
  Lines       14816    14820       +4     
==========================================
- Hits         7399     6555     -844     
- Misses       6069     7056     +987     
+ Partials     1348     1209     -139     
Flag Coverage Δ
e2etests 36.44% <100.00%> (-9.61%) ⬇️
inttests 36.83% <100.00%> (-0.04%) ⬇️
unittests 13.57% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@manuelbuil
Copy link
Contributor Author

You should add the corresponding RuntimeClassName definition to the list at

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: nvidia
handler: nvidia
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: nvidia-experimental
handler: nvidia-experimental

Your current example is using the legacy nvidia runtime, not the new nvidia-cdi runtime that you're adding; did you want to use runtimeClassName: nvidia-cdi? Or is the nvidia operator modifying our nvidia RuntimeClass definition so that it uses the nvidia-cdi handler?

It's confusing and I don't have an good answer but this is what I see:
Things work (I see nvidia drivers and libs in the pod) with runtimeClassName: nvidia as long as:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia-cdi"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia-cdi".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi"
  SystemdCgroup = true

is present in /var/lib/rancher/rke2/agent/etc/containerd/config.toml. I see the handler still being nvidia:

NAME            HANDLER         AGE
nvidia          nvidia          9m23s

so, I don't think the operator is touching it.

If I define a new runtimeclass with nvidia-cdi, things work only if I install the gpu operator with:

   cdi:
     enabled: true

and I'm not sure how it works because containerd by default has cdi false, so I wonder if things work at all inside the container.

My guess is that the default nvidia runtime must be applying some workarounds for use cases when cdi is not enabled and to apply such workarounds, it requires the nvidia.cdi runtime to be present in the containerd config.

Since this PR by itself fixes the problem that the nvidia user reported, I'd merge the PR as it is. In the next days with more time, I can try to go deep and get a better understanding of what's going on (or ideally try to get help from nvidia)

@manuelbuil manuelbuil requested a review from brandond October 11, 2024 13:50
@brandond
Copy link
Member

Ok... If it is handled internally by the nvidia driver and we don't need to be able to reference it by runtimeClassName in pods then I guess we can skip that for now. That is very unusual though. I'll have to do some more research on how this is working under the hood.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants