
Sandbox container image being GC'd in 1.29 #3745

Closed
z0rc opened this issue Jan 31, 2024 · 12 comments
Labels
status/in-progress (This issue is currently being worked on), type/bug (Something isn't working)

Comments

@z0rc
Contributor

z0rc commented Jan 31, 2024

Image I'm using:
Bottlerocket AMI, version 1.18.0, variant aws-k8s-1.29

What I expected to happen:
Pods should be able to run on long-running nodes (uptime > 12h).

What actually happened:
Pods fail to start. Kubernetes Event:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to pull image "602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to pull and unpack image "602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to resolve reference "602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause:3.1-eksbuild.1": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials

How to reproduce the problem:
Launch an EKS 1.29 cluster with Bottlerocket nodes and let it run until the kubelet garbage-collects the pause image. The time to reproduce can be shortened by increasing image churn, for example by launching and stopping many pods with different images, as CI runners do.

Effectively it's the same as awslabs/amazon-eks-ami#1597, and it's reproducible in the same way on Bottlerocket nodes.

@z0rc added the status/needs-triage and type/bug labels on Jan 31, 2024
@gthao313 added the status/in-progress label and removed status/needs-triage on Jan 31, 2024
@gthao313
Member

@z0rc Thanks for opening this issue! We're able to reproduce it and are working on a fix now.

@bcressey
Contributor

bcressey commented Feb 1, 2024

For anyone affected, @cartermckinnon shared a workaround of overriding the sandbox image URI to one that doesn't require credentials.

This can be done through user-data settings:

[settings.kubernetes]
pod-infra-container-image = "registry.k8s.io/pause:3.9"

Or this:

[settings.kubernetes]
pod-infra-container-image = "public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest"
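For a node that is already running, the same setting can in principle be changed through the Bottlerocket API instead of user data. A minimal sketch, assuming access via the control or admin container and that apiclient accepts these setting paths:

# Point the kubelet at a sandbox image that does not require credentials.
apiclient set kubernetes.pod-infra-container-image="registry.k8s.io/pause:3.9"
# Confirm the setting took effect.
apiclient get settings.kubernetes.pod-infra-container-image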

@bcressey
Contributor

bcressey commented Feb 1, 2024

@henry118 is looking into this in containerd/containerd#9726.

Although Bottlerocket is on a version of containerd (1.6.26) that supports the sandbox image pinning required by kubelet 1.29, that pinning does not currently happen when the sandbox image is pulled by Bottlerocket's own containerd client (host-ctr), which is used to support pulling from private ECR repositories.
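For anyone who wants to verify the pin state on a node, a minimal sketch, assuming shell access to the host (for example via the admin container and sheltie) and that the containerd CRI plugin marks pinned images with the io.cri-containerd.pinned=pinned label, as recent releases do:

# List the pause image in containerd's Kubernetes namespace and check its labels.
ctr --namespace k8s.io images list | grep pause
# A pinned image shows io.cri-containerd.pinned=pinned in the LABELS column;
# kubelet's image garbage collector skips pinned images.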

@ecpullen
Contributor

ecpullen commented Feb 2, 2024

Applying a DaemonSet that uses the sandbox image should prevent the image from being garbage collected. A DaemonSet template and instructions can be found below.

Sandbox Workaround
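The linked template isn't reproduced here, but the idea is roughly the sketch below: run the sandbox image as an ordinary container on every node so that kubelet's image garbage collector sees it as in use. The name pause-image-keeper is a placeholder, and the image URI is the us-east-2 example from the error message above; substitute the sandbox image your nodes are actually configured with.

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: pause-image-keeper
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: pause-image-keeper
  template:
    metadata:
      labels:
        app: pause-image-keeper
    spec:
      containers:
        - name: pause
          # Must match the sandbox image configured on the node; the default
          # imagePullPolicy (IfNotPresent) reuses the copy already on the node.
          image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause:3.1-eksbuild.1
          resources:
            requests:
              cpu: 1m
              memory: 4Mi
EOF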

@doramar97

Any updates on a fix?

@arnaldo2792
Contributor

@doramar97 there is a PR in review:

#3757

@yeazelm
Contributor

yeazelm commented Feb 6, 2024

#3757 is merged and this fix will be in the next release. We will update this issue when it is available.

@yeazelm
Contributor

yeazelm commented Feb 8, 2024

We have released 1.19.1, which should resolve this issue.

@yeazelm yeazelm closed this as completed Feb 8, 2024
@awoimbee

awoimbee commented Jul 18, 2024

Hello, we still get this issue on Bottlerocket OS 1.20.3 (aws-k8s-1.30):

2024-07-15T07:22:38Z Warning: FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to pull image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to pull and unpack image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to resolve reference "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.1-eksbuild.1": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials - <my-redacted-pod-name>

The node works fine until it doesn't anymore and pods get stuck.

@ginglis13
Contributor

@awoimbee if you still have the instance available, could you run logdog and provide those logs here? (https://github.com/bottlerocket-os/bottlerocket/tree/c8cc60ddf439af3b0ba53af3c082dbcdcfdf3bb1?tab=readme-ov-file#logs)

The node works fine until it doesn't anymore and pods get stuck.

Is this happening intermittently? Consistently on nodes across your cluster(s)? Have you seen this issue on any other variants of Bottlerocket?

As an aside, we intend to include the pause image in Bottlerocket images themselves starting in Bottlerocket 1.21.0: #3940
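For anyone gathering the logs mentioned above, a rough sketch of the usual flow from the admin container, assuming the default logdog output location:

# From the admin container:
sudo sheltie    # drop into a root shell on the host
logdog          # writes a support bundle, by default to /var/log/support/bottlerocket-logs.tar.gz
exit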

@awoimbee

awoimbee commented Aug 2, 2024

I have not reproduced it (yet) in a non-prod environment (where I have the admin container running so I can export logs).
This happened 6 times today (2024-08-02), on nodes from 11 days old to 20 minutes old (EKS 1.30, Bottlerocket 1.20.3 - 1.20.5).

I have some pretty big images, and I think the pinning introduced in 1.19.1 is just not working; in all cases the disk was 20% to 60% full according to node-exporter (node_filesystem_avail_bytes).

@awoimbee

awoimbee commented Aug 9, 2024

Root cause: I was using a data disk snapshot taken before v1.19.1, so the image was not pinned for me.
Since v1.21.0 the image is finally localhost/kubernetes/pause:0.1.0, so "pull access denied" is no longer possible :)

Related: #614, especially:

we should also document a workflow for this via snapshot of the data volume.
