Option to disable GSP Firmware module for Nvidia GPUs #3817
Comments
Thanks for bringing this up @chiragjn. As has been discussed on the linked issue, the GSP firmware can be disabled through a module option. On Bottlerocket, module options can be set through the user-data setting for the kernel command line. To disable GSP, you will have to set the following options in your user data:
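A minimal sketch of how such user data could be generated, under stated assumptions: it relies on Bottlerocket's `settings.boot.kernel-parameters` and `settings.boot.reboot-to-reconcile` settings and on the NVIDIA driver parameter `NVreg_EnableGpuFirmware`. None of these names appear in the text of this thread, and `render_user_data` is a hypothetical helper, so verify them against the Bottlerocket and NVIDIA documentation before relying on the result.

```python
def render_user_data(module, options):
    """Render Bottlerocket user data (TOML) that passes kernel module
    options as "<module>.<option>" entries on the kernel command line."""
    lines = [
        "[settings.boot]",
        # Assumption: reboot so that the changed boot settings take effect.
        "reboot-to-reconcile = true",
        "",
        "[settings.boot.kernel-parameters]",
    ]
    for name, value in options.items():
        lines.append('"{}.{}" = ["{}"]'.format(module, name, value))
    return "\n".join(lines) + "\n"


# Assumption: NVreg_EnableGpuFirmware=0 is the option that turns GSP off.
print(render_user_data("nvidia", {"NVreg_EnableGpuFirmware": "0"}))
```

The printed TOML is what would go into the node's user data. Generating it from a dictionary keeps the `<module>.<option>` naming consistent if more module options are needed later.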
I tested this in an A/B scenario with the following eksctl config: one nodegroup that boots the "vanilla" image (ng-bottlerocket-g4), and one nodegroup with the settings needed to disable the GSP firmware (ng-bottlerocket-g4-nogsp):
Instance g4dn.xlarge with GSP disabled:
Instance g4dn.xlarge without GSP disabled:
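One way to run the same comparison programmatically is to look at the GSP firmware field in the `nvidia-smi -q` report. The sketch below is an assumption about how such a check could be done, not the commands used in this test: `gsp_firmware_version` and `gsp_enabled` are hypothetical helpers, and they assume the report contains a `GSP Firmware Version` line that reads `N/A` when GSP is not in use.

```python
import re

# Matches a line such as "    GSP Firmware Version    : <value>".
_GSP_LINE = re.compile(r"^\s*GSP Firmware Version\s*:\s*(.+?)\s*$", re.MULTILINE)


def gsp_firmware_version(report):
    """Return the value of the GSP firmware field, or None if it is absent."""
    match = _GSP_LINE.search(report)
    return match.group(1) if match else None


def gsp_enabled(report):
    """Treat any value other than a missing field or "N/A" as GSP in use."""
    version = gsp_firmware_version(report)
    return version is not None and version != "N/A"


# Illustrative input only, not captured from a real node.
sample = "    GSP Firmware Version                  : N/A\n"
print(gsp_enabled(sample))
```

On a node, the report text could come from `subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True).stdout`.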
Would this fix your issue or is there anything extra that you would need from Bottlerocket?
Oh amazing, didn't know about this
This works as expected! Thanks again :)
I brought this up earlier in another thread, but I am creating an issue to track it separately.
When GSP Firmware is enabled, running dcgm exporter renders the GPU unresponsive and all interactions with the GPU start timing out leading to container creation failures or unresponsive gpu containers.
For example, when trying to create a pod with GPU access on a g5.xlarge node in an EKS 1.28 cluster running the Bottlerocket AMI (this node already has the dcgm exporter daemonset running), we get:
As such, the only working solution we have found is to disable GSP entirely. For AL2 I have done this using a user data init script, but with Bottlerocket I am afraid we don't have that option.
See these threads for more details:
awslabs/amazon-eks-ami#1523
NVIDIA/open-gpu-kernel-modules#446
For AL2, the EKS team has also decided to disable GSP, and that work is in progress.
How to reproduce
Run nvidia-smi. Most likely it will struggle to produce output or show ERR in a bunch of fields.
Run dmesg on the node; the logs would contain XID 119 errors.
Sadly, this issue is not 100% reproducible and takes some recycling of nodes to encounter. One of the users in the above thread has reported a guaranteed way of triggering it (awslabs/amazon-eks-ami#1523 (comment)), although I have not tested it.
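The dmesg check can be automated when recycling many nodes. A minimal sketch, assuming the kernel log reports Xid events in the usual NVRM form `NVRM: Xid (PCI:<address>): <code>, ...`; `find_xid_events` is a hypothetical helper, and the sample line is illustrative, not taken from an affected node.

```python
import re

# Captures the PCI device and the numeric Xid code from an NVRM log line.
_XID = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+)")


def find_xid_events(log_text, code=119):
    """Return (device, line) pairs for every Xid event with the given code."""
    events = []
    for line in log_text.splitlines():
        match = _XID.search(line)
        if match and int(match.group("code")) == code:
            events.append((match.group("device"), line.strip()))
    return events


# Illustrative log line only; real messages carry more detail after the code.
sample = "NVRM: Xid (PCI:0000:00:1e): 119, pid=1234"
for device, line in find_xid_events(sample):
    print(device, line)
```

Feeding the output of dmesg on the node into `find_xid_events` gives a quick yes/no signal for whether a node has hit the problem.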
What I'd like:
A kernel module setting to be able to disable GSP
Any alternatives you've considered:
Building Custom AMIs with the GSP files removed