eks gpu image for 1.16 yum update broken #484
Comments
Also requesting updated drivers for CUDA 10.2 support if possible. Thanks! (Looking at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html -- it looks like they're just using the NVIDIA ".run" files to do driver installs vs. ported rpm files.)
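As a rough sketch of what a ".run"-style install looks like on Amazon Linux 2 (the driver version and download URL below are placeholders, not taken from that doc):

```
# Build prerequisites for the NVIDIA .run installer
sudo yum install -y gcc make kernel-devel-$(uname -r)

# Download a Tesla driver .run package (version/URL are illustrative) and install it
# non-interactively; --dkms rebuilds the kernel module on kernel updates (assumes dkms is installed)
curl -fsSL -O https://us.download.nvidia.com/tesla/450.80.02/NVIDIA-Linux-x86_64-450.80.02.run
sudo sh NVIDIA-Linux-x86_64-450.80.02.run --silent --dkms
```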
We're seeing the exact same problem, initially posted about here: NVIDIA/nvidia-docker#1310
For 1.16, we've opted to just create our own combined CPU/GPU image at this point. The source image is the SSM non-GPU image, and there's a little scripting in Packer to apply the drivers, plus a systemd shim that sets the default runtime to nvidia on GPU nodes and skips it if the hardware isn't present. It's working great so far: it gives us control over the drivers, and it's an all-in-one image that can be used on both CPU and GPU nodes.
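A minimal sketch of what such a shim could look like, assuming a oneshot unit that runs a check script before docker starts (the unit and script names here are made up, not the poster's actual code):

```
# Hypothetical unit dropped in by the Packer build; the check script it calls
# decides at boot whether to make nvidia the default docker runtime.
sudo tee /etc/systemd/system/nvidia-runtime-shim.service >/dev/null <<'EOF'
[Unit]
Description=Default docker runtime to nvidia when GPU hardware is present
Before=docker.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/set-nvidia-runtime.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable nvidia-runtime-shim.service
```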
Is there any update on moving the GPU-optimized AMIs to CUDA 10.2?
@bkruger99 Is sharing the Dockerfile possible?
The optimized GPU AMI uses the AL2-graphics source to install the NVIDIA driver for security reasons; this gives AWS better control over security patches, etc. However, I agree driver updates are kind of slow. A new CUDA release is coming soon to support the NVIDIA A100. The OCI part uses an internal stack and does not use nvidia-docker2. If users plan to customize their own image, it would be better to remove all libnvidia packages and reinstall nvidia-docker2. The AMI is pretty clean: only the driver is installed. All remaining packages, like CUDA and cuDNN, should be installed at the container image level.
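A hedged sketch of that "remove and reinstall" path, assuming the libnvidia-container/nvidia-docker yum repos have already been added to the instance (package names follow the upstream NVIDIA repos and may differ on a given AMI):

```
# Drop the AMI's bundled libnvidia/container-runtime packages
sudo yum remove -y 'libnvidia-container*' nvidia-container-runtime

# Reinstall nvidia-docker2 from the NVIDIA repositories and restart docker
sudo yum install -y nvidia-docker2
sudo systemctl restart docker
```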
@neolunar7 It's an eks-packer script. I did it for work, so let me check with the boss to see if I can share; I don't believe there's anything secret in there that matters. I don't necessarily want to share something built against AWS's images, though, as that could become a support contention issue. @Jeffwan Will this be something that gets released to GitHub? What I'd ideally like to avoid in the future is broken AMIs when AL2-graphics is out of date, which is why I ended up rolling an all-in-one. Anything sharable would be excellent so we can continue the AIO path we're using, as it's been working out really well for us. Right now, I have a script that checks whether NVIDIA GPUs are present and, if they are, modifies the docker config file to use the nvidia runtime. The install command below is what I'm using.
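A minimal sketch of that kind of check, assuming it's saved as the hypothetical /usr/local/bin/set-nvidia-runtime.sh from the unit sketch above (the jq dependency and paths are assumptions, not the poster's actual install command):

```
#!/usr/bin/env bash
# Hypothetical GPU check (run as root): if an NVIDIA device shows up on the PCI
# bus, make the nvidia runtime docker's default; otherwise leave docker alone.
set -euo pipefail

if lspci | grep -qi nvidia; then
    # Assumes /etc/docker/daemon.json exists, is valid JSON, and already
    # declares the nvidia runtime; jq must be installed on the image.
    tmp=$(mktemp)
    jq '. + {"default-runtime": "nvidia"}' /etc/docker/daemon.json > "$tmp"
    mv "$tmp" /etc/docker/daemon.json
    systemctl restart docker
fi
```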
@bkruger99 did you end up making that AMI public? I'm facing a similar issue trying to run the 450/455 drivers on EKS nodes. My ideal would be the GPU Operator project from NVIDIA, but it doesn't support Amazon Linux right now.
I don't have the AMI public, as we have some company-specific things on it, but I can probably provide the relevant Packer parts and you can make your own.
Did you ever get around to doing this, @bkruger99?
I'm considering this issue stale; please open a new issue if you face similar problems with current AMI releases. |
What happened:
When issuing a `sudo yum -y upgrade`, yum errors out due to dependency resolution problems with NVIDIA-related packages.

What you expected to happen:
yum shouldn't error; it should just upgrade.
How to reproduce it (as minimally and precisely as possible):
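Based on the description above, reproduction amounts to running the upgrade on an affected node (the AMI name pattern is illustrative):

```
# On a node launched from the EKS 1.16 GPU-optimized AMI (e.g. amazon-eks-gpu-node-1.16-*)
sudo yum -y upgrade
# yum fails while resolving dependencies for nvidia/cuda packages
```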
Anything else we need to know?:
The amzn2-graphics repo seems to be the culprit in this case. Looking at V29 of the AL2 gpu+tensorflow+etc. image, the amzn2-graphics repo doesn't exist on that image, but it looks like things may be fairly statically built there, since no cuda/nvidia/etc. packages are available.
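To sanity-check that amzn2-graphics is where the conflicting packages come from, something along these lines on the affected node:

```
# List the enabled repos and the installed nvidia/cuda related packages
yum repolist enabled
rpm -qa | grep -iE 'nvidia|cuda'
```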
Environment:
Release information (run `cat /etc/eks/release` on a node):