feat: don't install nvidia drivers if nvidia-device-plugin is disabled #4358
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #4358      +/-   ##
==========================================
+ Coverage   72.07%   72.09%   +0.02%
==========================================
  Files         141      141
  Lines       21640    21665      +25
==========================================
+ Hits        15596    15619      +23
- Misses       5093     5094       +1
- Partials      951      952       +1
Continue to review full report at Codecov.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis, mboersma

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
There needs to be some documentation about the fact that the GPU driver is not installed when containerd is the container runtime, and that it is up to the customer to install the NVIDIA GPU Operator.
Reason for Change:
This PR makes a change so that nvidia drivers are not installed if the nvidia-device-plugin addon is disabled.
Additionally, because the known-working drivers that CSE installs don't work with containerd, we change the addons default flow so that the nvidia-device-plugin addon is disabled when containerd is the CRI.
The practical outcome of this is that you may use containerd with N-series VM SKUs, and no nvidia drivers will be installed. NVIDIA's gpu-operator addresses that scenario; see:
https://developer.nvidia.com/blog/announcing-containerd-support-for-the-nvidia-gpu-operator/
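As a rough sketch of the behavior described above (this is not the actual aks-engine code; the ClusterConfig type and function names here are hypothetical), the driver-install gate and the containerd-aware addon default could be expressed as:

```go
package main

import "fmt"

// ClusterConfig is a simplified, hypothetical stand-in for the parts of the
// API model that matter here: the container runtime and the addon toggles.
type ClusterConfig struct {
	ContainerRuntime string          // "docker" or "containerd"
	HasNSeriesAgents bool            // true if any agent pool uses an N-series VM SKU
	AddonEnabled     map[string]bool // explicit addon enable/disable settings
}

// nvidiaDevicePluginEnabled returns the effective state of the
// nvidia-device-plugin addon: an explicit setting wins, otherwise the default
// is enabled only for N-series clusters that are NOT running containerd,
// because the CSE-installed drivers are only known to work with docker.
func nvidiaDevicePluginEnabled(c ClusterConfig) bool {
	if v, ok := c.AddonEnabled["nvidia-device-plugin"]; ok {
		return v
	}
	return c.HasNSeriesAgents && c.ContainerRuntime != "containerd"
}

// shouldInstallNvidiaDrivers mirrors the change in this PR: drivers are only
// installed when the nvidia-device-plugin addon is enabled.
func shouldInstallNvidiaDrivers(c ClusterConfig) bool {
	return c.HasNSeriesAgents && nvidiaDevicePluginEnabled(c)
}

func main() {
	containerdCluster := ClusterConfig{
		ContainerRuntime: "containerd",
		HasNSeriesAgents: true,
		AddonEnabled:     map[string]bool{},
	}
	// With containerd, the addon defaults to disabled, so no drivers are
	// installed; the customer installs the NVIDIA gpu-operator instead.
	fmt.Println(shouldInstallNvidiaDrivers(containerdCluster)) // false
}
```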
A new E2E scenario has been added for N-series + containerd configurations; it installs the NVIDIA-curated gpu-operator helm chart and then validates GPU workloads using the existing CUDA job, roughly as sketched below.
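A minimal sketch of that E2E flow, assuming helm and kubectl are on the PATH; the helm repo URL, release name, and manifest path are illustrative assumptions, not taken from this PR:

```go
package main

import (
	"log"
	"os/exec"
)

// run executes a command and fails the (hypothetical) E2E step on any error.
func run(name string, args ...string) {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		log.Fatalf("%s %v failed: %v\n%s", name, args, err, out)
	}
	log.Printf("%s", out)
}

func main() {
	// Add NVIDIA's helm repo and install the gpu-operator chart, which handles
	// driver, container-toolkit, and device-plugin setup on containerd clusters.
	run("helm", "repo", "add", "nvidia", "https://nvidia.github.io/gpu-operator")
	run("helm", "repo", "update")
	run("helm", "install", "gpu-operator", "nvidia/gpu-operator",
		"--namespace", "gpu-operator", "--create-namespace", "--wait")

	// Reuse the existing CUDA validation: apply the CUDA job manifest and wait
	// for it to complete.
	run("kubectl", "apply", "-f", "cuda-vector-add.yaml")
	run("kubectl", "wait", "--for=condition=complete", "job/cuda-vector-add",
		"--timeout=10m")
}
```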
Issue Fixed:
Credit Where Due:
Does this change contain code from or inspired by another project?
If "Yes," did you notify that project's maintainers and provide attribution?
Requirements:
Notes: