eks gpu image for 1.16 yum update broken #484

Closed

bkruger99 opened this issue Jun 3, 2020 · 11 comments
Comments

@bkruger99

bkruger99 commented Jun 3, 2020

What happened:
When issuing a sudo yum -y upgrade, yum errors out due to a dependency resolution failure with NVIDIA-related packages.

What you expected to happen:
yum shouldn't error and it should just upgrade.

How to reproduce it (as minimally and precisely as possible):

sudo yum update -y
...
--> Finished Dependency Resolution
Error: Package: nvidia-container-toolkit-1.1.1-2.amzn2.x86_64 (nvidia-container-runtime)
           Requires: libnvidia-container-tools >= 1.1.1
           Installed: libnvidia-container-tools-1.0.0-1.amzn2.x86_64 (@amzn2-graphics)
               libnvidia-container-tools = 1.0.0-1.amzn2
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest

Anything else we need to know?:
The amzn2-graphics repo seems to be the culprit in this case. Looking at V29 of the AL2 GPU + TensorFlow image, the amzn2-graphics repo doesn't exist on that image; it looks like things may be fairly statically built there, since there are no cuda/nvidia packages available.
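A possible stopgap until the repo dependencies line up is to exclude the conflicting NVIDIA container packages from the update so the rest of the node can still be patched. This is only a sketch; the package globs are based on the error output above.

# Workaround sketch: skip the conflicting NVIDIA container packages so the rest
# of the update can proceed; the globs match the packages named in the error.
sudo yum update -y --exclude='nvidia-container*' --exclude='libnvidia-container*'

# Or persist the exclusion in /etc/yum.conf until the repo is fixed:
#   exclude=nvidia-container* libnvidia-container*

Note that this only defers the conflict; the repo itself still needs to ship matching libnvidia-container-tools and nvidia-container-toolkit versions.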

Environment:

  • AWS Region: us-west-2
  • Instance Type(s): any gpu
  • EKS Platform version: N/A, this is just the AMI itself.
  • Kubernetes version: N/A, ami only
  • AMI Version: eks 1.16 - pulled out of ssm
  • Kernel: Linux ip-X-X-X-X.us-west-2.compute.internal 4.14.177-139.253.amzn2.x86_64 #1 SMP Wed Apr 29 09:56:20 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux (granted, a new one gets installed via yum update, but it's moot in this case)
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-0ab1913ead3883bb5"
BUILD_TIME="Thu May  7 16:30:14 UTC 2020"
BUILD_KERNEL="4.14.177-139.253.amzn2.x86_64"
ARCH="x86_64"
@bkruger99
Author

bkruger99 commented Jun 3, 2020

Also requesting updated drivers for cuda 10.2 support if possible. Thanks!

(Looking at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html, it looks like they're just using the NVIDIA ".run" files to do driver installs rather than ported RPM files.)
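For reference, the ".run"-style install that page describes boils down to roughly the sketch below. The installer file name is a placeholder for whatever driver version matches your kernel and CUDA needs, and the prerequisite packages are an assumption for Amazon Linux 2.

# Sketch of a ".run"-style driver install on Amazon Linux 2 (placeholder file name).
sudo yum install -y gcc kernel-devel-$(uname -r)    # build prerequisites
chmod +x ./NVIDIA-Linux-x86_64-*.run                # installer downloaded from nvidia.com
sudo ./NVIDIA-Linux-x86_64-*.run --silent --no-opengl-files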

@jweber-mimsoftware

We're seeing the exact same problem, initially posted about here: NVIDIA/nvidia-docker#1310

@bkruger99
Author

For 1.16, we've opted to just create our own combined CPU/GPU image at this point. The source image is the SSM non-GPU image, and a little scripting in Packer applies the drivers plus a systemd shim that sets the default runtime to NVIDIA on GPU nodes and leaves it alone if the hardware isn't present. It's working great so far, gives us control over the drivers, and it's an all-in-one image that can be used on both CPU and GPU nodes.
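The shim itself isn't shown in the thread, but the idea is roughly the sketch below: a script, run from a systemd oneshot unit ordered before docker.service, that only flips Docker's default runtime when an NVIDIA GPU is present. The file paths and JSON contents are illustrative assumptions, not the author's actual code.

#!/usr/bin/env bash
# Hypothetical first-boot shim: switch Docker's default runtime to nvidia only
# when NVIDIA hardware is detected. Assumes pciutils (lspci) is on the node.
set -euo pipefail

if lspci | grep -qi nvidia; then
  # Overwrites /etc/docker/daemon.json; merge instead if you already manage it.
  cat > /etc/docker/daemon.json <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
  systemctl restart docker
fi

On non-GPU nodes the script is a no-op, which is what makes a single AMI usable for both node groups.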

@acesir

acesir commented Aug 22, 2020

Is there any update on moving the GPU-optimized AMIs to CUDA 10.2?

@neolunar7

@bkruger99 Is sharing the Dockerfile possible?

@Jeffwan
Contributor

Jeffwan commented Oct 8, 2020

The optimized GPU AMI uses the AL2-graphics source to install the NVIDIA driver for security reasons; this gives AWS better control over security patches, etc. However, I agree driver updates are somewhat slow. A new CUDA release supporting the NVIDIA A100 will be out soon.

The OCI part uses an internal stack rather than nvidia-docker2. If you plan to customize your own image, it would be better to remove all the libnvidia packages and reinstall nvidia-docker2.

The AMI is pretty clean; only the driver is installed. All remaining packages, like CUDA and cuDNN, should be installed at the container image level.
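The "remove and reinstall nvidia-docker2" path would look roughly like the sketch below. The repo URL follows NVIDIA's documented nvidia-docker repository pattern for Amazon Linux 2 (amzn2), but verify the exact URL and package names against NVIDIA's current instructions before relying on it.

# Sketch: clear out the AMI's libnvidia-container stack, then install
# nvidia-docker2 from NVIDIA's own repository.
sudo yum remove -y 'nvidia-container*' 'libnvidia-container*'

# Add NVIDIA's nvidia-docker repository for Amazon Linux 2.
curl -s -L https://nvidia.github.io/nvidia-docker/amzn2/nvidia-docker.repo \
  | sudo tee /etc/yum.repos.d/nvidia-docker.repo

sudo yum install -y nvidia-docker2
sudo systemctl restart docker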

@bkruger99
Author

@neolunar7 It's an eks-packer script. I did it for work, so let me check with the boss to see if I can share it. I don't believe we have anything secret in there that matters, but I don't necessarily want to share something that builds against AWS's images, as that could become a support contention issue.

@Jeffwan Will this be something that will be released to GitHub? What I'd ideally like to avoid in the future is broken AMIs when AL2-graphics is out of date, which is why I ended up rolling an all-in-one. Anything sharable would be excellent so we can continue the AIO path we're using, as it's been working out really well for us. Right now, I have a script that checks whether NVIDIA GPUs are present and, if they are, modifies the Docker config file to use the NVIDIA runtime. The install command below is what I'm using:

"sudo chmod a+x /tmp/nvidia.run && sudo /tmp/nvidia.run -acZs --no-opengl-files --dkms --no-drm -j 2",

@brianrudolf

@bkruger99 did you end up making that AMI public? I am facing a similar issue trying to run 450/455 drivers on EKS nodes. My ideal would be the GPU Operator project from NVIDIA, but they do not support Amazon Linux right now.

@bkruger99
Author

> @bkruger99 did you end up making that AMI public? I am facing a similar issue trying to run 450/455 drivers on EKS nodes. My ideal would be the GPU Operator project from NVIDIA, but they do not support Amazon Linux right now.

I don't have the AMI public, as we have some company-specific things on it, but I can probably provide the relevant Packer parts so you can make your own.

@anish

anish commented May 28, 2021

Did you ever get around to doing this, @bkruger99?

@cartermckinnon
Member

I'm considering this issue stale; please open a new issue if you face similar problems with current AMI releases.

cartermckinnon closed this as not planned (won't fix, can't repro, duplicate, stale) on Nov 4, 2022