
feat: Enable ability to build GPU drivers during image build #1147

Merged

Conversation


@drew-viles drew-viles commented Apr 26, 2023

What this PR does / why we need it:

This has been raised based on the chat in the recent office hours.

NVIDIA GPU drivers can be compiled by the GPU operator when it is deployed, without much fuss; however, this adds a pretty significant amount of time to the deployment of any new nodes.

By (optionally) baking it into the image, the GPU operator can be deployed much faster by skipping the need for driver compilation.

Because of the changes to the NVIDIA license structure noted here, there is now a requirement for a .tok file to be used for licensing. This means we can't just have a basic endpoint to pull the .run or .tok from, as these contain sensitive licensing information. As a result, this new addition presumes that the user will have an S3 endpoint to pull this information from. Since there are multiple paid and free solutions for this, it didn't seem unreasonable to take this approach.

A new S3 additional component has been added and the NVIDIA module makes use of this to acquire the required resources.
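To make the shape of this concrete, a build var file wiring the S3 source into the image build might look something like the following. These variable names are purely illustrative (not necessarily the exact ones this PR adds), so treat it as a sketch of the idea rather than the actual interface:

```json
{
  "nvidia_s3_endpoint": "https://s3.example.com",
  "nvidia_bucket": "nvidia-assets",
  "nvidia_installer_object": "NVIDIA-Linux-x86_64-grid.run",
  "nvidia_tok_object": "client_configuration_token.tok"
}
```

The role would then pull the .run installer and .tok license token from that bucket at build time, keeping both behind the user's own credentials.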

There is no book entry for the NVIDIA role, as I wasn't sure where it would be wanted given it's not a provider, but I have updated the additional components section to include the S3 addition. Maybe a community_roles section could be introduced where things like this could be stored?

Additional context

If I've missed anything with regards to the inclusion of the s3 feature in load_additional_modules, please let me know and I'll get that corrected.

This has only been tested on Ubuntu as of this PR.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 26, 2023
@k8s-ci-robot
Contributor

Hi @drew-viles. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 26, 2023

@AverageMarcus AverageMarcus left a comment


LGTM so far but it would be nice to have some sort of documentation around the nvidia vars and how to include it in your build.

But I'm not sure where that should go exactly. 🤔

@drew-viles
Contributor Author

Couldn't agree more. I don't want to dump it in with nothing to explain how it would be used. Let's see what others may have to say but I think a new place to put things like this and the security bits I mentioned on the call would be a good direction to go in. A sort of "optional addons" section.

@AverageMarcus
Member

I wonder if a README.md in the images/capi/ansible/roles/nvidia/ directory would be enough for now.

My knowledge of Ansible isn't all that great, and I've just realised I don't actually know whether all the other roles we have listed are included by default or whether they require some changes by the user to include them (as I think this one will?).

@drew-viles
Contributor Author

Valid point. I can always start with that and then if it needs some adjustment we can move it into the book. I'm just wondering how many people will go hunting into the roles directory or would head straight for the book.

I'll get a readme in for now anyway, at least that's something!

@drew-viles
Contributor Author

I'm just going to draft this for now as we've just noticed some of our latest images are not actually working with GPUs even though the image build including the NVIDIA role is completing without any errors. Once I've validated it's not this new role, I'll remove the draft status.

@drew-viles drew-viles marked this pull request as draft April 27, 2023 10:55
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 27, 2023
@drew-viles drew-viles force-pushed the nvidia-driver-support branch 4 times, most recently from 51eb23e to 6dcba56 Compare April 27, 2023 22:07
@drew-viles
Contributor Author

OK, got to the root of the problem. Originally the module was being included as part of node_custom_roles_post.

The problem we were having was that the kernel had been updated via the dist-upgrade that runs during the setup role.
Because the driver was installed after the new kernel, it was only compiled for the previous kernel, meaning that upon spinning up the image the new kernel was active but the driver wasn't configured to work with it.
It purely came down to the DKMS hook not running when the new kernel was installed, because it didn't know about the driver. Makes sense.

Anyway, as a result there has been a light adjustment to move it into the node_custom_roles_pre section.

Glad it came up now before a merge :-D

It's the first time we'd hit this, as the latest kernel was released around the 17th/18th of April and we hadn't built any new images until today.
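In packer var terms, the fix described above boils down to moving the role from the post hook to the pre hook, so the driver is registered with DKMS before the setup role's dist-upgrade pulls in a new kernel; the kernel package's DKMS hook can then rebuild the module for the incoming kernel. Roughly (a sketch, not the full var file):

```json
{
  "node_custom_roles_pre": "nvidia"
}
```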

@drew-viles drew-viles marked this pull request as ready for review April 27, 2023 22:07
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 27, 2023

@AverageMarcus AverageMarcus left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 28, 2023
@AverageMarcus
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 28, 2023
@drew-viles
Contributor Author

/test pull-azure-sigs

@drew-viles
Contributor Author

/test pull-azure-sigs

@drew-viles
Contributor Author

/test pull-azure-sigs

@AverageMarcus
Member

/lgtm
/approve

Thanks for the persistence in getting this done @drew-viles! 😁 Let's get it merged in!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 3, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AverageMarcus, drew-viles

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 3, 2023
@AverageMarcus
Member

What is going on with these tests?!

/retest-required
/hold for tests passing

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 3, 2023
@drew-viles
Contributor Author

What is going on with these tests?!

/retest-required /hold for tests passing

No idea. They seem a bit flaky tbh but I'm happy to refire as needed.

@drew-viles
Contributor Author

/test pull-packer-validate
/test pull-ova-all

@drew-viles
Contributor Author

drew-viles commented Jun 3, 2023

So on the packer failure, it looks like it's centos-7 via DigitalOcean that's failing:
Error: Failed to initialize build "centos-7" error initializing builder 'digitalocean': Unknown builder digitalocean make: *** [Makefile:494: validate-do-centos-7] Error 1

Yet here, it passed 🤦

@drew-viles
Contributor Author

/test pull-packer-validate

@AverageMarcus
Member

AverageMarcus commented Jun 5, 2023

Ah! We recently bumped the version of packer (a patch version though 🤨) and there's a note in the release notes about DigitalOcean:

packer-plugin-digitalocean: The Digital Ocean Packer plugin has been handed
over to the Digital Ocean team. New releases for this plugin are available
at https://github.com/digitalocean/packer-plugin-digitalocean.

Related PR: hashicorp/packer#12376

I'll open an issue for this now and look at getting a PR up shortly. (#1180)
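For anyone validating HCL2 templates locally against the relocated plugin, a required_plugins block pointing at the new source is the usual shape of the fix; the version constraint below is illustrative:

```hcl
packer {
  required_plugins {
    digitalocean = {
      # The plugin now lives under the digitalocean org rather than hashicorp
      source  = "github.com/digitalocean/digitalocean"
      version = ">= 1.0.0"
    }
  }
}
```

After adding this, `packer init` fetches the plugin from its new home so `digitalocean` is once again a known builder.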

@AverageMarcus
Member

Looks like that PR has fixed it.

@drew-viles Once that PR is merged in you'll need to rebase this PR then it should be good to merge 😄 🤞

This addition also creates a new s3 additional_component that can be used for other S3-related interactions.
NVIDIA drivers can be optionally installed using the added role. Because NVIDIA does not make the GRIDD drivers publicly available, this role requires an S3 endpoint, as that is probably the most accessible option for most users.
Users can use a variety of tools to create an S3 endpoint, be it AWS, Cloudflare, MinIO, or one of the many other options. With this in mind, this option seems the most logical; it also allows for an endpoint that can be secured, thus not breaking any license agreement with NVIDIA with regards to making the driver public.
Users should store their .run driver file and .tok file on the S3 endpoint. The gridd.conf will be generated based on the feature flag passed in.
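For context, the generated gridd.conf is a small INI-style file; a minimal sketch could look like the following (the value is an example, and the meaning of FeatureType is defined by NVIDIA's vGPU licensing documentation):

```ini
# gridd.conf - generated at build time from the feature flag passed in
# FeatureType selects the licensed feature (e.g. 1 = vGPU) per NVIDIA's docs
FeatureType=1
```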
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2023
@drew-viles
Contributor Author

Looks like that PR has fixed it.

@drew-viles Once that PR is merged in you'll need to rebase this PR then it should be good to merge 😄 🤞

Cool, all pushed in, let's see what happens :-p

Thanks @AverageMarcus !

@AverageMarcus
Member

/lgtm

🤞

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2023
@AverageMarcus
Member

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 5, 2023
@k8s-ci-robot k8s-ci-robot merged commit 59511c4 into kubernetes-sigs:master Jun 5, 2023
@drew-viles drew-viles deleted the nvidia-driver-support branch July 5, 2023 08:01