
Sandbox container image being GC'd in 1.29 #3745

Closed
z0rc opened this issue Jan 31, 2024 · 12 comments
Labels
status/in-progress (This issue is currently being worked on), type/bug (Something isn't working)

Comments

@z0rc
Contributor

z0rc commented Jan 31, 2024

Image I'm using:
Bottlerocket AMI, version 1.18.0, variant aws-k8s-1.29

What I expected to happen:
Pods should be able to run on long-running nodes (uptime > 12h).

What actually happened:
Pods fail to start. Kubernetes Event:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to pull image "602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to pull and unpack image "602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to resolve reference "602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause:3.1-eksbuild.1": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials

How to reproduce the problem:
Launch an EKS 1.29 cluster with Bottlerocket nodes and let it run until the kubelet garbage-collects the pause image. The time to reproduce can be shortened by increasing image churn, for example by launching and stopping many pods with different images, as CI runners do.

Effectively it's the same as awslabs/amazon-eks-ami#1597, and it's reproducible in the same way on Bottlerocket nodes.

@z0rc added the status/needs-triage and type/bug labels on Jan 31, 2024
@gthao313 added the status/in-progress label and removed status/needs-triage on Jan 31, 2024
@gthao313
Member

@z0rc Thanks for opening this issue! We're able to reproduce it and are working on a fix now.

@bcressey
Contributor

bcressey commented Feb 1, 2024

For anyone affected, @cartermckinnon shared a workaround of overriding the sandbox image URI to one that doesn't require credentials.

This can be done through user-data settings:

[settings.kubernetes]
pod-infra-container-image = "registry.k8s.io/pause:3.9"

Or this:

[settings.kubernetes]
pod-infra-container-image = "public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest"
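For a node that is already running, the same setting can in principle be changed through the Bottlerocket API instead of user data. A minimal sketch, assuming access via the control or admin container and that apiclient accepts these setting paths:

# Point the kubelet at a sandbox image that does not require credentials.
apiclient set kubernetes.pod-infra-container-image="registry.k8s.io/pause:3.9"
# Confirm the setting took effect.
apiclient get settings.kubernetes.pod-infra-container-image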

@bcressey
Contributor

bcressey commented Feb 1, 2024

@henry118 is looking into this in containerd/containerd#9726.

Although Bottlerocket is on a version of containerd (1.6.26) that supports the sandbox image pinning required by kubelet 1.29, that pinning does not currently happen when the sandbox image is pulled by Bottlerocket's own containerd client (host-ctr), which is used to support pulling from private ECR repositories.
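For anyone who wants to verify the pin state on a node, a minimal sketch, assuming shell access to the host (for example via the admin container and sheltie) and that the containerd CRI plugin marks pinned images with the io.cri-containerd.pinned=pinned label, as recent releases do:

# List the pause image in containerd's Kubernetes namespace and check its labels.
ctr --namespace k8s.io images list | grep pause
# A pinned image shows io.cri-containerd.pinned=pinned in the LABELS column;
# kubelet's image garbage collector skips pinned images.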

@ecpullen
Contributor

ecpullen commented Feb 2, 2024

Applying a DaemonSet that uses the sandbox image should prevent the image from being garbage collected. A DaemonSet template and instructions can be found below.

Sandbox Workaround
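The linked template isn't reproduced here, but the idea is roughly the sketch below: run the sandbox image as an ordinary container on every node so that kubelet's image garbage collector sees it as in use. The name pause-image-keeper is a placeholder, and the image URI is the us-east-2 example from the error message above; substitute the sandbox image your nodes are actually configured with.

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: pause-image-keeper
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: pause-image-keeper
  template:
    metadata:
      labels:
        app: pause-image-keeper
    spec:
      containers:
        - name: pause
          # Must match the sandbox image configured on the node; the default
          # imagePullPolicy (IfNotPresent) reuses the copy already on the node.
          image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause:3.1-eksbuild.1
          resources:
            requests:
              cpu: 1m
              memory: 4Mi
EOF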

@doramar97

Any updates on a fix?

@arnaldo2792
Contributor

@doramar97 there is a PR in review:

#3757

@yeazelm
Contributor

yeazelm commented Feb 6, 2024

#3757 is merged and this fix will be in the next release. We will update this issue when it is available.

@yeazelm
Contributor

yeazelm commented Feb 8, 2024

We have released 1.19.1, which should resolve this issue.

@yeazelm yeazelm closed this as completed Feb 8, 2024
@awoimbee

awoimbee commented Jul 18, 2024

Hello, we still get this issue on Bottlerocket OS 1.20.3 (aws-k8s-1.30):

2024-07-15T07:22:38Z Warning: FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to pull image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to pull and unpack image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to resolve reference "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.1-eksbuild.1": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials - <my-redacted-pod-name>

The node works fine until it doesn't anymore and pods get stuck.

@ginglis13
Contributor

@awoimbee if you still have the instance available, could you run logdog and provide those logs here? (https://github.com/bottlerocket-os/bottlerocket/tree/c8cc60ddf439af3b0ba53af3c082dbcdcfdf3bb1?tab=readme-ov-file#logs)

The node works fine until it doesn't anymore and pods get stuck.

Is this happening intermittently? Consistently on nodes across your cluster(s)? Have you seen this issue on any other variants of Bottlerocket?

As an aside, we intend to include the pause image in Bottlerocket images themselves starting in Bottlerocket 1.21.0: #3940
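For anyone gathering the logs mentioned above, a rough sketch of the usual flow from the admin container, assuming the default logdog output location:

# From the admin container:
sudo sheltie    # drop into a root shell on the host
logdog          # writes a support bundle, by default to /var/log/support/bottlerocket-logs.tar.gz
exit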

@awoimbee

awoimbee commented Aug 2, 2024

I have not reproduced it (yet) in a non-prod environment (where I have the admin container running so I can export logs).
This happened 6 times today (2024-08-02), on nodes from 11 days old to 20 minutes old (EKS 1.30, Bottlerocket 1.20.3 - 1.20.5).

I have some pretty big images, and I think the pinning introduced in 1.19.1 is just not working; in all cases the disk was 20% to 60% full according to node-exporter (node_filesystem_avail_bytes).

@awoimbee

awoimbee commented Aug 9, 2024

Root cause: I was using a data disk snapshot taken before v1.19.1, so the image was not pinned for me.
Since v1.21.0 the image is finally localhost/kubernetes/pause:0.1.0, so "pull access denied" is no longer possible :)

Related: #614, especially:

we should also document a workflow for this via snapshot of the data volume.
