Control the CUDA toolkit/NVIDIA drivers while picking images #4164

Open
eladmotola opened this issue Aug 26, 2024 · 1 comment
Labels
area/accelerated-computing Issues related to GPUs/ASICs area/upgrades Related to upgrading Bottlerocket type/documentation Documentation update/creation

Comments

@eladmotola

What I'd like:
I am using Karpenter with EKS to auto-update my OS.
Is there any way to control which NVIDIA drivers/CUDA toolkit you install?
I mean, I don't want to come in one morning and find out that someone upgraded those things without my knowing.
For example, I can see in the AWS Marketplace that the Bottlerocket image is called amazon/bottlerocket-aws-k8s-1.29-nvidia-x86_64-v1.20.5-a3e8bda1. I can't tell from the image name which CUDA or NVIDIA driver versions it contains.

I would like to do something like:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: ${local.streamer_node_class_name}
spec:
  # Required, resolves a default ami and userdata
  amiFamily: Bottlerocket

  amiSelectorTerms:
    - name: "bottlerocket-aws-k8s-1.29-nvidia-x86_64-535-183-01-v12.1-*"

I understand that it's not possible to have a release for each CUDA/NVIDIA driver combination out there, but since you support this (and many thanks for that), can we make it less dangerous?

Any alternatives you've considered:

@eladmotola eladmotola added status/needs-triage Pending triage or re-evaluation type/enhancement New feature or request labels Aug 26, 2024
@yeazelm
Contributor

yeazelm commented Sep 4, 2024

Hello @eladmotola, thanks for cutting this issue! Bottlerocket currently stays on the latest Long Term Support Branch until a new one is available. When NVIDIA releases a new LTS branch, we will move to it. We use the CUDA libraries shipped in the .run archive for that driver, so they shouldn't change significantly from release to release.

Our Update Policy tries to provide some of our rationale around updates: https://github.com/bottlerocket-os/bottlerocket/blob/develop/SECURITY_FEATURES.md#update-policy

Our philosophy for variants is that the right time for an unexpected major version update to the kernel or orchestrator agent is "never". New variants can introduce newer LTS kernels or GPU drivers. On release, variants peg to a kernel and GPU driver version and relevant security patches are applied. However, in a situation where security patches are no longer available for the kernel or GPU driver, an existing variant may adopt a new version to address security vulnerabilities.

So the only time we would move an existing variant like aws-k8s-1.29-nvidia off its current versions would be when the kernel or driver is no longer getting security updates. Otherwise, we try to introduce new kernels or driver branches only on a new variant, so you would need to switch to a future variant like aws-k8s-1.31-nvidia or newer to get new versions. To be clear, the aws-k8s-1.31-nvidia variant will have the same versions right now, but it's the next to-be-released variant, so I used it as an example. Hopefully this helps you understand how you can predict when new changes will be introduced!
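
For anyone who also wants to avoid picking up new Bottlerocket releases within a variant automatically, here is a minimal sketch of pinning Karpenter to one exact AMI name instead of a wildcard. It is not an official recommendation from this thread. It assumes the `amazon/` prefix in the Marketplace listing quoted above is the owner alias rather than part of the AMI name, the `metadata.name` is hypothetical, and the other fields an EC2NodeClass needs (role, subnet and security group selectors) are omitted:

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: gpu-pinned   # hypothetical name, for illustration only
spec:
  amiFamily: Bottlerocket

  amiSelectorTerms:
    # Exact AMI name taken from the listing quoted in this issue.
    # There is no wildcard, so a newer release of the same variant
    # is not selected until this value is changed on purpose.
    - name: "bottlerocket-aws-k8s-1.29-nvidia-x86_64-v1.20.5-a3e8bda1"

  # role, subnetSelectorTerms, and securityGroupSelectorTerms omitted
```

With a selector like this, moving to a newer release (or to a newer variant with a different driver branch) becomes an explicit change to the `name` value rather than something that happens on its own.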

@yeazelm yeazelm added type/documentation Documentation update/creation area/upgrades Related to upgrading Bottlerocket area/accelerated-computing Issues related to GPUs/ASICs and removed type/enhancement New feature or request status/needs-triage Pending triage or re-evaluation labels Sep 4, 2024