Node doesn't expose GPU resource on g4dn.[n]xlarge #4087
Comments
Thanks for reporting this! By any chance, do you still have the instance running? It seems odd that the device plugin isn't showing any output.
Yes sir, I have the instance running. I agree; the previous time I reported the incident (Slack thread), the output was different:
But this time it is empty.
@arnaldo2792 let me know if there are steps you want me to perform to diagnose the issue.
I am investigating on this end. On the EC2
This shows that the nvidia kmod downloaded firmware to the GSP during boot. The desired state is:
The slightly better news is that we do have an issue open internally to select the "no GSP download" option on appropriate hardware, without requiring any configuration.
@larvacea I want to thank you for taking the time to investigate this strange issue. I am also happy that you found some breadcrumbs pointing to what the problem is. 👏
Here's one way to set the relevant kernel parameter using apiclient:
After the instance reboots,
My understanding is that if I set the kernel parameter
Also my understanding is that the parameter
Based on that understanding, I would say that my fix would be to add the following to the userData of the EC2NodeClass:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: random-name
spec:
  amiFamily: Bottlerocket
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      deleteOnTermination: true
      volumeSize: 4Gi
      volumeType: gp3
  - deviceName: /dev/xvdb
    ebs:
      deleteOnTermination: true
      iops: 3000
      snapshotID: snap-d4758cc7f5f11
      throughput: 500
      volumeSize: 60Gi
      volumeType: gp3
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  role: KarpenterNodeRole-prod
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: prod
  subnetSelectorTerms:
  - tags:
      Name: '*Private*'
      karpenter.sh/discovery: prod
  tags:
    nodepool: random-name
    purpose: prod
    vendor: random-name
  userData: |-
    [settings.boot.kernel-parameters]
    "nvidia.NVreg_EnableGpuFirmware"=["0"]

However, I don't know the internals of that process and maybe my understanding is wrong and I need to use
Please correct me if I am wrong.
The We intend to add logic to automate this and set the desired kmod option before we load the driver. In general-purpose Linux operating systems, one could solve the problem by putting the desired configuration in In Bottlerocket, Hope this helps.
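One way to confirm after a reboot that the parameter actually reached the kernel is to inspect the kernel command line. The following is a minimal sketch, assuming the contents of `/proc/cmdline` are passed in as a string; the helper names `kernel_param_values` and `gsp_firmware_disabled` are hypothetical and not part of Bottlerocket or the NVIDIA driver.

```python
def kernel_param_values(cmdline: str, name: str) -> list[str]:
    """Return every value given for `name` on a kernel command line string."""
    values = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Only tokens of the form key=value are considered.
        if sep and key == name:
            values.append(value)
    return values


def gsp_firmware_disabled(cmdline: str) -> bool:
    """True if nvidia.NVreg_EnableGpuFirmware=0 is present on the command line."""
    values = kernel_param_values(cmdline, "nvidia.NVreg_EnableGpuFirmware")
    # Assumption: if the parameter is repeated, the last occurrence takes effect.
    return bool(values) and values[-1] == "0"
```

This only shows that the parameter is present on the command line; it does not prove that the driver honored it.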
Image I'm using:
System Info:
What I expected to happen:
100% of the time that I start a Bottlerocket OS 1.20.3 (aws-k8s-1.26-nvidia) ami-09469fd78070eaac6 node on a g4dn.[n]xlarge instance type in EKS, it should expose the GPU count for pods.
What actually happened:
~5% of the time that I start a Bottlerocket OS 1.20.3 (aws-k8s-1.26-nvidia) ami-09469fd78070eaac6 node on a g4dn.[n]xlarge instance type in EKS, it doesn't expose the GPU count for pods. As a result, pods requiring nvidia.com/gpu: 1 cannot be scheduled and stay in the Pending state waiting for a node.
How to reproduce the problem:
Note: This issue has existed for more than a year; you can see the Slack thread here
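Because the failure is intermittent, it can help to scan for affected nodes rather than wait for pods to get stuck in Pending. The following is a minimal sketch, assuming the output of `kubectl get nodes -o json` is available as a string; the function name `nodes_missing_gpu` is hypothetical.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"


def nodes_missing_gpu(node_list_json: str, resource: str = GPU_RESOURCE) -> list[str]:
    """Return the names of nodes that do not advertise at least one GPU."""
    nodes = json.loads(node_list_json).get("items", [])
    missing = []
    for node in nodes:
        allocatable = node.get("status", {}).get("allocatable", {})
        # Extended resources are reported as strings, e.g. "1".
        if int(allocatable.get(resource, "0")) < 1:
            missing.append(node["metadata"]["name"])
    return missing
```

The input could be produced with something like `kubectl get nodes -l karpenter.sh/nodepool=random-name -o json`, so that only nodes from the GPU node pool are checked.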
Current settings:
Karpenter managed process:
resources, node labels and tolerations
NodePool and EC2NodeClass
karpenter.sh/nodepool=random-name
Node created:
As you can see, the node created fulfills the requirements for node labels and tolerations, but not for resources (gpu).
Inspecting the node:
Using the Session Manager -> admin container -> sheltie
From the Slack thread, someone suggested this: