eks gpu image for 1.16 yum update broken #484

Closed

bkruger99 opened this issue Jun 3, 2020 · 11 comments
Comments

@bkruger99

bkruger99 commented Jun 3, 2020

What happened:
When issuing a sudo yum -y upgrade, yum errors out due to a dependency resolution failure with NVIDIA-related packages.

What you expected to happen:
yum shouldn't error and it should just upgrade.

How to reproduce it (as minimally and precisely as possible):

sudo yum update -y
...
--> Finished Dependency Resolution
Error: Package: nvidia-container-toolkit-1.1.1-2.amzn2.x86_64 (nvidia-container-runtime)
           Requires: libnvidia-container-tools >= 1.1.1
           Installed: libnvidia-container-tools-1.0.0-1.amzn2.x86_64 (@amzn2-graphics)
               libnvidia-container-tools = 1.0.0-1.amzn2
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest

Anything else we need to know?:
The amzn2-graphics repo seems to be the culprit in this case. Looking at V29 of the AL2 GPU + TensorFlow image, the amzn2-graphics repo doesn't exist on that image; it looks like things may be fairly statically built there, since there are no cuda/nvidia packages available.
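A possible stopgap until the repo dependencies line up is to exclude the conflicting NVIDIA container packages from the update so the rest of the node can still be patched. This is only a sketch; the package globs are based on the error output above.

# Workaround sketch: skip the conflicting NVIDIA container packages so the rest
# of the update can proceed; the globs match the packages named in the error.
sudo yum update -y --exclude='nvidia-container*' --exclude='libnvidia-container*'

# Or persist the exclusion in /etc/yum.conf until the repo is fixed:
#   exclude=nvidia-container* libnvidia-container*

Note that this only defers the conflict; the repo itself still needs to ship matching libnvidia-container-tools and nvidia-container-toolkit versions.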

Environment:

  • AWS Region: us-west-2
  • Instance Type(s): any gpu
  • EKS Platform version: N/A, this is just the AMI itself.
  • Kubernetes version: N/A, ami only
  • AMI Version: eks 1.16 - pulled out of ssm
  • Kernel: Linux ip-X-X-X-X.us-west-2.compute.internal 4.14.177-139.253.amzn2.x86_64 #1 SMP Wed Apr 29 09:56:20 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux (granted, a new one gets installed via yum update, but it's moot in this case)
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-0ab1913ead3883bb5"
BUILD_TIME="Thu May  7 16:30:14 UTC 2020"
BUILD_KERNEL="4.14.177-139.253.amzn2.x86_64"
ARCH="x86_64"
@bkruger99
Author

bkruger99 commented Jun 3, 2020

Also requesting updated drivers for cuda 10.2 support if possible. Thanks!

(Looking at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html, it looks like they're just using the NVIDIA ".run" files to do driver installs rather than ported RPM files.)
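For reference, the ".run"-style install that page describes boils down to roughly the sketch below. The installer file name is a placeholder for whatever driver version matches your kernel and CUDA needs, and the prerequisite packages are an assumption for Amazon Linux 2.

# Sketch of a ".run"-style driver install on Amazon Linux 2 (placeholder file name).
sudo yum install -y gcc kernel-devel-$(uname -r)    # build prerequisites
chmod +x ./NVIDIA-Linux-x86_64-*.run                # installer downloaded from nvidia.com
sudo ./NVIDIA-Linux-x86_64-*.run --silent --no-opengl-files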

@jweber-mimsoftware

We're seeing the exact same problem, initially posted about here: NVIDIA/nvidia-docker#1310

@bkruger99
Author

For 1.16, we've opted to just create our own combined CPU/GPU image at this point. The source image is the SSM non-GPU image, and a little scripting in Packer applies the drivers plus a systemd shim that sets the default runtime to NVIDIA on GPU nodes and leaves it alone if the hardware isn't present. It's working great so far, gives us control over the drivers, and it's an all-in-one image that can be used on both CPU and GPU nodes.
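The shim itself isn't shown in the thread, but the idea is roughly the sketch below: a script, run from a systemd oneshot unit ordered before docker.service, that only flips Docker's default runtime when an NVIDIA GPU is present. The file paths and JSON contents are illustrative assumptions, not the author's actual code.

#!/usr/bin/env bash
# Hypothetical first-boot shim: switch Docker's default runtime to nvidia only
# when NVIDIA hardware is detected. Assumes pciutils (lspci) is on the node.
set -euo pipefail

if lspci | grep -qi nvidia; then
  # Overwrites /etc/docker/daemon.json; merge instead if you already manage it.
  cat > /etc/docker/daemon.json <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
  systemctl restart docker
fi

On non-GPU nodes the script is a no-op, which is what makes a single AMI usable for both node groups.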

@acesir

acesir commented Aug 22, 2020

Is there any update on moving the GPU-optimized AMIs to CUDA 10.2?

@neolunar7

@bkruger99 Is sharing the Dockerfile possible?

@Jeffwan
Contributor

Jeffwan commented Oct 8, 2020

The optimized GPU AMI uses the AL2-graphics source to install the NVIDIA driver for security reasons; this gives AWS better control over security patches, etc. However, I agree driver updates are somewhat slow. A new CUDA release supporting the NVIDIA A100 will be out soon.

The OCI part uses an internal stack rather than nvidia-docker2. If you plan to customize your own image, it would be better to remove all the libnvidia packages and reinstall nvidia-docker2.

The AMI is pretty clean; only the driver is installed. All remaining packages, like CUDA and cuDNN, should be installed at the container image level.
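The "remove and reinstall nvidia-docker2" path would look roughly like the sketch below. The repo URL follows NVIDIA's documented nvidia-docker repository pattern for Amazon Linux 2 (amzn2), but verify the exact URL and package names against NVIDIA's current instructions before relying on it.

# Sketch: clear out the AMI's libnvidia-container stack, then install
# nvidia-docker2 from NVIDIA's own repository.
sudo yum remove -y 'nvidia-container*' 'libnvidia-container*'

# Add NVIDIA's nvidia-docker repository for Amazon Linux 2.
curl -s -L https://nvidia.github.io/nvidia-docker/amzn2/nvidia-docker.repo \
  | sudo tee /etc/yum.repos.d/nvidia-docker.repo

sudo yum install -y nvidia-docker2
sudo systemctl restart docker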

@bkruger99
Author

@neolunar7 It's an eks-packer script. I did it for work, so let me check with the boss to see if I can share it. I don't believe we have anything secret in there that matters, but I don't necessarily want to share something that builds against AWS's images, as that could become a support contention issue.

@Jeffwan Will this be something that will be released to GitHub? What I'd ideally like to avoid in the future is broken AMIs when AL2-graphics is out of date, which is why I ended up rolling an all-in-one. Anything sharable would be excellent so we can continue the AIO path we're using, as it's been working out really well for us. Right now, I have a script that checks whether NVIDIA GPUs are present and, if they are, modifies the Docker config file to use the NVIDIA runtime. The install command below is what I'm using:

"sudo chmod a+x /tmp/nvidia.run && sudo /tmp/nvidia.run -acZs --no-opengl-files --dkms --no-drm -j 2",

@brianrudolf

@bkruger99 did you end up making that AMI public? I am facing a similar issue trying to run 450/455 drivers on EKS nodes. My ideal would be the GPU Operator project from NVIDIA, but they do not support Amazon Linux right now.

@bkruger99
Author

> @bkruger99 did you end up making that AMI public? I am facing a similar issue trying to run 450/455 drivers on EKS nodes. My ideal would be the GPU Operator project from NVIDIA, but they do not support Amazon Linux right now.

I don't have the AMI public, as we have some company-specific things on it, but I can probably provide the relevant Packer parts so you can make your own.

@anish

anish commented May 28, 2021

Did you ever get around to doing this, @bkruger99?

@cartermckinnon
Member

I'm considering this issue stale; please open a new issue if you face similar problems with current AMI releases.

cartermckinnon closed this as not planned (won't fix, can't repro, duplicate, stale) on Nov 4, 2022