
feat: Enable ability to build GPU drivers during image build #1147

Merged

Conversation


@drew-viles drew-viles commented Apr 26, 2023

What this PR does / why we need it:

This has been raised based on the chat in the recent office hours.

NVIDIA GPU drivers can be compiled by the GPU operator when it is deployed, without much fuss; however, this adds a pretty significant amount of time to the deployment of any new nodes.

By (optionally) baking it into the image, the GPU operator can be deployed much faster by skipping the need for driver compilation.

Because of the changes to the NVIDIA license structure noted here, there is now a requirement for a .tok file to be used for licensing. This means we can't just have a basic endpoint to pull the .run or .tok from, as these contain sensitive licensing information. As a result, this new addition presumes that the user will have an S3 endpoint to pull this information from. Since there are multiple paid and free solutions for this, it didn't seem unreasonable to take this approach.

A new S3 additional component has been added and the NVIDIA module makes use of this to acquire the required resources.
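To make the shape of this concrete, a build var file wiring the S3 source into the image build might look something like the following. These variable names are purely illustrative (not necessarily the exact ones this PR adds), so treat it as a sketch of the idea rather than the actual interface:

```json
{
  "nvidia_s3_endpoint": "https://s3.example.com",
  "nvidia_bucket": "nvidia-assets",
  "nvidia_installer_object": "NVIDIA-Linux-x86_64-grid.run",
  "nvidia_tok_object": "client_configuration_token.tok"
}
```

The role would then pull the .run installer and .tok license token from that bucket at build time, keeping both behind the user's own credentials.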

There is no book entry for the NVIDIA role, as I wasn't sure where it would be wanted given it's not a provider, but I have updated the additional components section to include the S3 addition. Maybe a community_roles section could be introduced where things like this could be stored?

Additional context

If I've missed anything with regards to the inclusion of the s3 feature in load_additional_modules, please let me know and I'll get that corrected.

This has only been tested on Ubuntu as of this PR.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 26, 2023
@k8s-ci-robot
Contributor

Hi @drew-viles. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 26, 2023

@AverageMarcus AverageMarcus left a comment


LGTM so far but it would be nice to have some sort of documentation around the nvidia vars and how to include it in your build.

But I'm not sure where that should go exactly. 🤔

@drew-viles
Contributor Author

Couldn't agree more. I don't want to dump it in with nothing to explain how it would be used. Let's see what others may have to say but I think a new place to put things like this and the security bits I mentioned on the call would be a good direction to go in. A sort of "optional addons" section.

@AverageMarcus
Member

I wonder if a README.md in the images/capi/ansible/roles/nvidia/ directory would be enough for now.

My knowledge of Ansible isn't all that great, and I've just realised I don't actually know whether all the other roles we have listed are included by default or whether they require some changes by the user to include them (as I think this one will?).

@drew-viles
Contributor Author

Valid point. I can always start with that and then if it needs some adjustment we can move it into the book. I'm just wondering how many people will go hunting into the roles directory or would head straight for the book.

I'll get a readme in for now anyway, at least that's something!

@drew-viles
Contributor Author

I'm just going to draft this for now as we've just noticed some of our latest images are not actually working with GPUs even though the image build including the NVIDIA role is completing without any errors. Once I've validated it's not this new role, I'll remove the draft status.

@drew-viles drew-viles marked this pull request as draft April 27, 2023 10:55
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 27, 2023
@drew-viles drew-viles force-pushed the nvidia-driver-support branch 4 times, most recently from 51eb23e to 6dcba56 Compare April 27, 2023 22:07
@drew-viles
Contributor Author

OK, got to the root of the problem. Originally the module was being included as part of node_custom_roles_post.

The problem we were having was that the kernel had been updated via the dist-upgrade that runs during the setup role.
Because the driver was installed after the new kernel, it was only compiled for the previous kernel, meaning that upon spinning up the image the new kernel was active but the driver wasn't configured to work with it.
It purely came down to the DKMS hook not running when the new kernel was installed, because it didn't know about the driver. Makes sense.

Anyway, as a result there has been a light adjustment to move it into the node_custom_roles_pre section.

Glad it came up now before a merge :-D

It's the first time we'd hit this, as the latest kernel was released around the 17th/18th of April and we hadn't built any new images until today.
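In packer var terms, the fix described above boils down to moving the role from the post hook to the pre hook, so the driver is registered with DKMS before the setup role's dist-upgrade pulls in a new kernel; the kernel package's DKMS hook can then rebuild the module for the incoming kernel. Roughly (a sketch, not the full var file):

```json
{
  "node_custom_roles_pre": "nvidia"
}
```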

@drew-viles drew-viles marked this pull request as ready for review April 27, 2023 22:07
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 27, 2023

@AverageMarcus AverageMarcus left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 28, 2023
@AverageMarcus
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 28, 2023
@drew-viles
Contributor Author

/test pull-azure-sigs

@drew-viles
Contributor Author

/test pull-azure-sigs

@drew-viles
Contributor Author

/test pull-azure-sigs

@AverageMarcus
Member

/lgtm
/approve

Thanks for the persistence in getting this done @drew-viles! 😁 Let's get it merged in!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 3, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AverageMarcus, drew-viles

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 3, 2023
@AverageMarcus
Member

What is going on with these tests?!

/retest-required
/hold for tests passing

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 3, 2023
@drew-viles
Contributor Author

What is going on with these tests?!

/retest-required /hold for tests passing

No idea. They seem a bit flaky tbh but I'm happy to refire as needed.

@drew-viles
Contributor Author

/test pull-packer-validate
/test pull-ova-all

@drew-viles
Contributor Author

drew-viles commented Jun 3, 2023

So on the packer failure, it looks like it's centos-7 via DigitalOcean that's failing:
Error: Failed to initialize build "centos-7" error initializing builder 'digitalocean': Unknown builder digitalocean make: *** [Makefile:494: validate-do-centos-7] Error 1

Yet here, it passed 🤦

@drew-viles
Contributor Author

/test pull-packer-validate

@AverageMarcus
Member

AverageMarcus commented Jun 5, 2023

Ah! We recently bumped the version of packer (a patch version though 🤨) and there's a note in the release notes about DigitalOcean:

packer-plugin-digitalocean: The Digital Ocean Packer plugin has been handed
over to the Digital Ocean team. New releases for this plugin are available
at https://github.com/digitalocean/packer-plugin-digitalocean.

Related PR: hashicorp/packer#12376

I'll open an issue for this now and look at getting a PR up shortly. (#1180)
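For anyone validating HCL2 templates locally against the relocated plugin, a required_plugins block pointing at the new source is the usual shape of the fix; the version constraint below is illustrative:

```hcl
packer {
  required_plugins {
    digitalocean = {
      # The plugin now lives under the digitalocean org rather than hashicorp
      source  = "github.com/digitalocean/digitalocean"
      version = ">= 1.0.0"
    }
  }
}
```

After adding this, `packer init` fetches the plugin from its new home so `digitalocean` is once again a known builder.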

@AverageMarcus
Member

Looks like that PR has fixed it.

@drew-viles Once that PR is merged in you'll need to rebase this PR then it should be good to merge 😄 🤞

This addition also creates a new s3 additional_component that can be used for other S3-related interactions.
NVIDIA drivers can be optionally installed using the added role. Because NVIDIA does not make the GRIDD drivers publicly available, this role requires an S3 endpoint, as that is probably the most accessible option for most users.
Users can use a variety of tools to create an S3 endpoint, be it AWS, Cloudflare, MinIO, or one of the many other options. With this in mind, this option seems the most logical; it also allows for an endpoint that can be secured, thus not breaking any license agreement with NVIDIA with regards to making the driver public.
Users should store their .run driver file and .tok file on the S3 endpoint. The gridd.conf will be generated based on the feature flag passed in.
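For context, the generated gridd.conf is a small INI-style file; a minimal sketch could look like the following (the value is an example, and the meaning of FeatureType is defined by NVIDIA's vGPU licensing documentation):

```ini
# gridd.conf - generated at build time from the feature flag passed in
# FeatureType selects the licensed feature (e.g. 1 = vGPU) per NVIDIA's docs
FeatureType=1
```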
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2023
@drew-viles
Contributor Author

Looks like that PR has fixed it.

@drew-viles Once that PR is merged in you'll need to rebase this PR then it should be good to merge 😄 🤞

Cool, all pushed in, let's see what happens :-p

Thanks @AverageMarcus !

@AverageMarcus
Member

/lgtm

🤞

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2023
@AverageMarcus
Member

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 5, 2023
@k8s-ci-robot k8s-ci-robot merged commit 59511c4 into kubernetes-sigs:master Jun 5, 2023
@drew-viles drew-viles deleted the nvidia-driver-support branch July 5, 2023 08:01