feat: Enable ability to build GPU drivers during image build #1147
Conversation
Hi @drew-viles. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
LGTM so far but it would be nice to have some sort of documentation around the nvidia vars and how to include it in your build.
But I'm not sure where that should go exactly. 🤔
Couldn't agree more. I don't want to dump it in with nothing to explain how it would be used. Let's see what others may have to say, but I think a new place to put things like this and the security bits I mentioned on the call would be a good direction to go in. A sort of "optional addons" section.
I wonder if a README.md in the role's directory would be enough? My knowledge of Ansible isn't all that great, and I've just realised I don't actually know if all the other roles we have listed are included by default or whether they require some changes by the user to include them (like this one will, I think?)
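(For readers following along: the kind of opt-in wiring such a README would need to explain might look roughly like the sketch below. The role name `gpu_nvidia` and the flag `gpu_nvidia_enabled` are hypothetical placeholders, not the names actually introduced by this PR.)

```yaml
# Illustrative only: how an optional NVIDIA driver role could be gated behind a
# user-set flag in an image-builder style Ansible playbook. "gpu_nvidia" and
# "gpu_nvidia_enabled" are hypothetical names, not the ones added by this PR.
- hosts: all
  become: true
  roles:
    - role: node
    - role: gpu_nvidia
      when: gpu_nvidia_enabled | default(false) | bool
```

The point is simply that the role stays inert unless the user explicitly opts in.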
Force-pushed from ae31ffa to e7e7e1e
Valid point. I can always start with that and then if it needs some adjustment we can move it into the book. I'm just wondering how many people will go hunting into the roles directory or would head straight for the book. I'll get a readme in for now anyway, at least that's something!
I'm just going to draft this for now as we've just noticed some of our latest images are not actually working with GPUs even though the image build including the NVIDIA role is completing without any errors. Once I've validated it's not this new role, I'll remove the draft status.
Force-pushed from 51eb23e to 6dcba56
OK, got to the root of the problem. So originally the module was being built as part of an earlier step of the image build. The problem we were having was that the kernel had since been updated via the dist-upgrade that runs during the build, so the module no longer matched the kernel the image actually boots.

Anyway, as a result there has been a light adjustment to move the role to a later step, after that upgrade. Glad it came up now before a merge :-D It's the first time we'd hit this, as the latest kernel was released around 17th/18th April and we'd not built any new images until today.
/lgtm
/ok-to-test
Force-pushed from 6dcba56 to 92c0e34
/test pull-azure-sigs
/lgtm
Thanks for the persistence in getting this done @drew-viles! 😁 Let's get it merged in!
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AverageMarcus, drew-viles

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
What is going on with these tests?!
/retest-required
No idea. They seem a bit flaky tbh, but I'm happy to refire as needed.
/test pull-packer-validate
So on the Packer failure, it looks like it's centos-7 via DigitalOcean where it's failing. Yet here, it passed 🤦
/test pull-packer-validate
Ah! We recently bumped the version of Packer (a patch version though 🤨) and there's some notes in the release notes about DigitalOcean. Related PR: hashicorp/packer#12376
I'll open an issue for this now and look at getting a PR up shortly. (#1180)
Looks like that PR has fixed it. @drew-viles Once that PR is merged in you'll need to rebase this PR, then it should be good to merge 😄 🤞
This addition also creates a new s3 additional_component that can be used for other S3-related interactions. NVIDIA drivers can be optionally installed using the added role. Because NVIDIA does not make the GRIDD drivers publicly available, this role requires an S3 endpoint, as that is likely the option most readily available to most users. Users can use a variety of tools to provide an S3 endpoint, be it AWS, Cloudflare, MinIO or one of the many other options. With this in mind, this option seems the most logical; it also allows for an endpoint that can be secured, thus not breaking any license agreement with NVIDIA with regards to making the driver public. Users should store their .run driver file and .tok file on the S3 endpoint. The gridd.conf will be generated based on the feature flag passed in.
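(As a rough illustration of the configuration surface described above, a variables sketch might look like the following. Every name here is a hypothetical placeholder; the actual variables added by the PR may differ.)

```yaml
# Hypothetical variable sketch only; the real names added by this PR may differ.
nvidia_s3_endpoint: "https://s3.example.com"                # any S3-compatible endpoint (AWS, Cloudflare, MinIO, ...)
nvidia_s3_bucket: "nvidia-drivers"                          # bucket holding the licensed artefacts
nvidia_driver_run_file: "NVIDIA-Linux-x86_64-grid.run"      # the .run installer stored in the bucket
nvidia_license_token_file: "client_configuration_token.tok" # the .tok licensing token stored in the bucket
gridd_feature_type: 4                                       # feature flag used to generate gridd.conf
# Credentials should come from the environment rather than being committed anywhere.
nvidia_s3_access_key: "{{ lookup('env', 'NVIDIA_S3_ACCESS_KEY') }}"
nvidia_s3_secret_key: "{{ lookup('env', 'NVIDIA_S3_SECRET_KEY') }}"
```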
Force-pushed from 3da7b5d to 5a25506
Cool, all pushed in, let's see what happens :-p Thanks @AverageMarcus !
/lgtm 🤞
/unhold
What this PR does / why we need it:
This has been raised based on the chat in the recent office hours.
NVIDIA GPU drivers can be compiled by the GPU operator when it is deployed without much fuss; however, this adds a pretty significant amount of time to the deployment of any new nodes.
By (optionally) baking the drivers into the image, the GPU operator can be deployed much faster by skipping the need for driver compilation.
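(Not part of this PR, but for context on where the time saving lands: with the driver baked into the image, the GPU operator can be told to skip its driver container altogether. A minimal sketch, assuming the standard NVIDIA gpu-operator Helm chart:)

```yaml
# gpu-operator Helm values override (sketch, assuming the standard chart):
# the driver is already present in the node image, so the operator skips
# driver compilation/installation and only deploys the remaining components
# such as the container toolkit and device plugin.
driver:
  enabled: false
```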
Because of the changes to the NVIDIA license structure noted here, there is now a requirement for a `.tok` file to be used for licensing. This means we can't just have a basic endpoint to pull the `.run` or `.tok` from, as these contain sensitive licensing information. As a result, this new addition presumes that the user will have an S3 endpoint to pull this information from. Since there are multiple paid and free solutions for this, it didn't seem unreasonable to take this approach.

A new S3 additional component has been added, and the NVIDIA module makes use of this to acquire the required resources.
There is no book entry for the NVIDIA role as I wasn't sure where it would be wanted, since it's not a provider, but I have updated the additional components section to include the S3 addition. Maybe a `community_roles` section could be introduced where things like this could be stored?

Additional context
If I've missed anything with regards to the inclusion of the S3 feature in `load_additional_modules`, please let me know and I'll get that corrected.

This has only been tested on Ubuntu as of this PR.