Sandbox container image being GC'd in 1.29 #3745
Comments
@z0rc Thanks for opening this issue! We're able to reproduce it and are working on a fix now.
For anyone affected, @cartermckinnon shared a workaround: override the sandbox image URI to one that doesn't require credentials. This can be done through user-data settings.
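A minimal sketch of such a user-data override, assuming the Bottlerocket setting name `settings.kubernetes.pod-infra-container-image` and a public pause image tag; both should be checked against the Bottlerocket settings reference before use:

```toml
# Hedged sketch of Bottlerocket user-data (TOML), not the exact snippet from the
# thread. Assumes the setting key below exists on your variant and that a public
# pause image (no registry credentials required) is acceptable for your cluster.
[settings.kubernetes]
pod-infra-container-image = "registry.k8s.io/pause:3.9"
```

Pointing the sandbox image at a public registry sidesteps the credential problem when the private ECR-hosted pause image has been garbage collected and containerd has to re-pull it.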
@henry118 is looking into this in containerd/containerd#9726. Although Bottlerocket is on a version of containerd (1.6.26) that supports the sandbox image pinning required by kubelet 1.29, that pinning is not currently happening when the sandbox image is pulled by its containerd client.
Applying a DaemonSet to use the sandbox image should prevent the image from being garbage collected. A DaemonSet template and instructions can be found below.
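A minimal sketch of such a DaemonSet, assuming the container image field is set to the same URI the node uses as its sandbox image (a public pause image is used here as a placeholder); the actual template shared in the thread may differ:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: pause-image-keepalive
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: pause-image-keepalive
  template:
    metadata:
      labels:
        app: pause-image-keepalive
    spec:
      # Tolerate everything so the pod lands on every node in the cluster.
      tolerations:
        - operator: Exists
      containers:
        # Running the sandbox (pause) image as a normal container keeps it "in
        # use" on each node, so kubelet's image GC will not remove it.
        - name: pause
          image: registry.k8s.io/pause:3.9  # placeholder: use the node's configured sandbox image URI
          resources:
            requests:
              cpu: 1m
              memory: 8Mi
```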
Any updates on a fix?
@doramar97 there is a PR in review.
#3757 is merged and this fix will be in the next release. We will update this issue when it is available.
We have released 1.19.1, which should resolve this issue.
Hello, we still get this issue on
The node works fine until it doesn't anymore and pods get stuck.
@awoimbee if you still have the instance available, could you run
Is this happening intermittently? Consistently on nodes across your cluster(s)? Have you seen this issue on any other variants of Bottlerocket? As an aside, we intend to include the pause image in Bottlerocket images themselves starting in Bottlerocket 1.21.0: #3940
I have not reproduced it (yet) in a non-prod env (where I have the admin container running so I can export logs). I have some pretty big images; I think the pinning introduced in 1.19.1 is just not working. In all cases the disk was 20% to 60% full according to node-exporter.
Root cause: I was using a data disk snapshot from before v1.19.1, so the image was not pinned for me. Related: #614.
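For anyone hitting the same thing, one way to check whether the pause image is actually pinned on a node; this is a sketch assuming you have a root shell on the host (for example via the admin container), that crictl is available there, and that your containerd/crictl versions surface the CRI `pinned` field:

```sh
# Placeholder image ref: substitute the sandbox image URI configured on the node.
SANDBOX_IMAGE="registry.k8s.io/pause:3.9"

# If the image is pinned, the image status should report "pinned": true and
# kubelet's image GC will skip it; "pinned": false (or no field at all on older
# tooling) means it is still eligible for garbage collection.
crictl inspecti "$SANDBOX_IMAGE" | grep -i pinned
```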
Image I'm using:
Bottlerocket AMI, version 1.18.0, variant aws-k8s-1.29
What I expected to happen:
Pods should be able to run on long-running nodes (uptime > 12h).
What actually happened:
Pods fail to start. Kubernetes Event:
How to reproduce the problem:
Launch an EKS 1.29 cluster with Bottlerocket nodes. Let it run for some time, until kubelet garbage collects the pause image. The time can be reduced by increased image churn, where a lot of pods are launched and stopped using different images (CI runners, for example). Effectively it's the same as awslabs/amazon-eks-ami#1597 and is reproducible in the same way on Bottlerocket nodes.
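To make the repro faster without waiting for natural image churn, one option is to lower kubelet's image GC thresholds so garbage collection starts at lower disk usage. A sketch using the standard KubeletConfiguration fields; this is generic kubelet configuration rather than anything Bottlerocket-specific (on Bottlerocket the equivalent knobs would be set through its settings API rather than a config file):

```yaml
# Generic kubelet configuration sketch: with these values, image GC begins once
# the image filesystem passes 50% usage and frees images until it is below 40%,
# so unused images (including pause, if unpinned) are pruned much sooner.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 50
imageGCLowThresholdPercent: 40
```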