Compatibility with Nvidia GPU Operator #4162

Open
dgr237 opened this issue Aug 23, 2024 · 1 comment
Labels
type/documentation Documentation update/creation

Comments

dgr237 commented Aug 23, 2024

What I'd like:
I am looking at Bottlerocket for EKS and would like to know whether this OS is compatible with the NVIDIA GPU Operator. The NVIDIA GPU Operator documentation gives installation steps for various host OSs, but Bottlerocket is not among those listed.

Our requirement is to be able to pin the NVIDIA drivers to a specific version that has been tested and certified for use by the business. I was therefore looking at the NVIDIA GPU Operator as a mechanism to do this. We are currently building custom AMIs based on AL2. This process is cumbersome because we have to build the AMI and release it for the business to test. If any issues are identified in testing, we have to start the process again and build with another version of the NVIDIA drivers.

Using the NVIDIA GPU Operator would simplify this process and let the business install different NVIDIA drivers independently, without the need to engineer new custom AMIs. Given that AL2 is due to be deprecated in 2025, we are deciding whether to replace the base AMI with Bottlerocket or AL2023. Whether the NVIDIA GPU Operator is supported will be one factor in determining which host OS we choose.

It would be good to have documentation on whether Bottlerocket is compatible with the NVIDIA GPU Operator and, if so, the steps needed to install it.

dgr237 added the status/needs-triage and type/enhancement labels Aug 23, 2024
Contributor

yeazelm commented Aug 23, 2024

Hello @dgr237, thanks for cutting this issue! We do have some documentation on the GPU operator, but I admit it is a bit buried in the QUICKSTART doc. We don't recommend using the GPU operator with Bottlerocket because the way it operates can cause issues with Bottlerocket. Bottlerocket includes much of what you need, but we recommend adding additional NVIDIA tools such as DCGM and GPU Feature Discovery by installing them in your cluster, following the helm install instructions provided for each project.

Bottlerocket includes the NVIDIA drivers in the root image, and you can't easily change which one is used, so pinning your own driver version isn't going to work with the GPU operator on Bottlerocket. Let me know if that helps!

yeazelm added the type/documentation label and removed the type/enhancement and status/needs-triage labels Aug 23, 2024