[Bug] Layer not found error message #931
tl;dr: there's a bug that causes us to resolve layers as if they were manifests, which in turn causes containerd to pull all subsequent layers ahead of time. This happens when using the CRI on a sparse index (i.e. one where not all layers have zTOCs).

**Impact**

As you saw from your testing, this bug doesn't immediately prevent SOCI from working. What it does do is cause containerd to pull all subsequent layers ahead of time (similar behavior to #693).

**Tracing the bug**

Tracking this down: when we do a non-lazy layer pull, we attempt to resolve the layer's descriptor against the registry's manifests endpoint. Since this is a layer, not a manifest, the resolution fails. Why does this even happen?
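To make the failure mode concrete, here is a minimal sketch (with hypothetical helper names, not SOCI's actual code) of why routing a layer digest to the manifests endpoint fails: an OCI registry serves manifests and blobs from different endpoints, so a HEAD against `/manifests/<layer-digest>` gets a 404, while `/blobs/<layer-digest>` would succeed.

```python
# Sketch only: illustrates the manifests-vs-blobs endpoint split, not SOCI's code.
MANIFEST_TYPES = {
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.docker.distribution.manifest.v2+json",
}

def resolve_url(repo: str, desc: dict) -> str:
    """Pick the registry endpoint based on the descriptor's media type."""
    endpoint = "manifests" if desc["mediaType"] in MANIFEST_TYPES else "blobs"
    return f"https://registry.example/v2/{repo}/{endpoint}/{desc['digest']}"

layer = {
    "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
    "digest": "sha256:a177c22b4e0d76a18351a1a31c666de1643a68f2a3b4c6408762ffef8e5318cc",
}
# The buggy path sends this layer digest to the manifests endpoint, which the
# registry answers with 404; the blobs endpoint is the correct one for a layer.
print(resolve_url("nvidia/k8s/container-toolkit", layer))
```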
**Potential fixes**

Fixing this is a little weird because there's at least one other bug, so even if we managed to resolve this correctly, there would still be more to fix. We should probably also look at upstreaming our layer size annotation. This would definitely be helpful for stargz, and it might be helpful for other remote snapshotters as well.
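The idea behind a layer size annotation can be sketched as follows. This is an illustration only: the annotation key below is hypothetical, not SOCI's actual key. When a descriptor arrives with `size` 0 (as in the "size of descriptor is 0" error in this thread), a size annotation would let a snapshotter recover the size without a registry round trip.

```python
# Hypothetical annotation key for illustration; not the real SOCI annotation.
HYPOTHETICAL_SIZE_ANNOTATION = "example.com/image-layer-size"

def effective_size(desc: dict) -> int:
    """Return the descriptor's size, falling back to a size annotation.

    Returns -1 when neither the size field nor the annotation is usable.
    """
    if desc.get("size", 0) > 0:
        return desc["size"]
    ann = desc.get("annotations", {})
    return int(ann.get(HYPOTHETICAL_SIZE_ANNOTATION, -1))
```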
I am running into this as well:

```json
{
  "error": "cannot unpack the layer: cannot fetch layer: size of descriptor is 0; unable to resolve: unable to resolve ref (nvcr.io/nvidia/k8s/container-toolkit@sha256:745cad9a8a1e0a0d92738687a85b5a314d324dfca7c2dc6f2b2111508f6fbec9): Head \"https://nvcr.io/v2/nvidia/k8s/container-toolkit/manifests/sha256:745cad9a8a1e0a0d92738687a85b5a314d324dfca7c2dc6f2b2111508f6fbec9\": HEAD https://nvcr.io/v2/nvidia/k8s/container-toolkit/manifests/sha256:745cad9a8a1e0a0d92738687a85b5a314d324dfca7c2dc6f2b2111508f6fbec9 giving up after 3 attempt(s)",
  "key": "k8s.io/260/extract-45204369-6hIj sha256:a177c22b4e0d76a18351a1a31c666de1643a68f2a3b4c6408762ffef8e5318cc",
  "level": "warning",
  "msg": "failed to prepare snapshot; deferring to container runtime",
  "parent": "k8s.io/259/sha256:63c5d3862f93c51ccb88bbc83cdc6a515e90e7d375631f0bee85b2f01b5cf715",
  "time": "2023-12-08T14:43:25.786365548Z"
}
```

In this case, when a GPU node comes up on EKS, we run 4-5 containers with images from `nvcr.io`. I updated the retry configuration to:

```toml
[http]
MinWaitMsec = 15
MaxRetries = 2
```

I do not have concrete numbers on hand right now, but I could see a noticeable improvement. Do you guys have any suggestions? Can I express something like "if there is no SOCI index present, immediately defer all the layers to the container runtime" via the configuration? Thanks.
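To get concrete numbers on how often this happens, one option is to scan the snapshotter's JSON logs for the warning shown above. A minimal sketch (the log file path and any further filtering are up to your setup):

```python
import json

# Count how often the snapshotter deferred to the container runtime by
# matching the "msg" field of JSON log lines against the warning above.
DEFERRAL_MSG = "failed to prepare snapshot; deferring to container runtime"

def count_deferrals(lines):
    n = 0
    for line in lines:
        try:
            rec = json.loads(line)
        except ValueError:
            continue  # skip non-JSON lines
        if rec.get("msg") == DEFERRAL_MSG:
            n += 1
    return n

logs = [
    '{"level":"warning","msg":"failed to prepare snapshot; deferring to container runtime"}',
    '{"level":"info","msg":"prepared remote snapshot"}',
]
print(count_deferrals(logs))  # → 1
```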
Description
When attempting to run a container with a layer that does not contain a ztoc through the CRI API, I receive an unusual "not found" error message.
I will caveat this by saying the container still runs, but I'm unsure of the repercussions of this message, as it does not appear when running the same image through the containerd gRPC API with nerdctl.
Error Message:
Steps to reproduce the bug
The container image I have been using is a self-built nginx container image, but the error message is not image-dependent (it appears on any non-indexed image too, like the pause container image).
Self Built Container Image:
Dockerfile
Index.html
Container Image Manifest
111222333444.dkr.ecr.eu-west-1.amazonaws.com/nginx-demo@sha256:1f6b58ad873037ea19d0b2e717cc1f60023963a7ea730a0686ec922545b47581
SOCI Index
111222333444.dkr.ecr.eu-west-1.amazonaws.com/nginx-demo@sha256:90ae56adc78f9363b3b2397f87b9d5e971e004694120ccbc0b58cc8d4b5654df
My containerd config
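For reference, a containerd configuration that routes CRI image pulls through the SOCI snapshotter typically looks something like the following sketch; the socket path and option values are assumptions, so verify them against your installation and the SOCI docs.

```toml
# Sketch: containerd config enabling the SOCI snapshotter for CRI pulls.
# Socket path and option names are assumptions; check your installation.
[proxy_plugins.soci]
  type = "snapshot"
  address = "/run/soci-snapshotter-grpc/soci-snapshotter-grpc.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "soci"
  # Needed so containerd passes image refs/annotations to the snapshotter.
  disable_snapshot_annotations = false
```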
Running with `nerdctl`
Snapshotter Logs showing 2 remote snapshots and 1 local snapshot
Running with `crictl`
Snapshotter Logs for failed local snapshot for pause container
Snapshotter Logs for Nginx Container. 2 Successful Remote Snapshots, 1 Failed local snapshot
Describe the results you expected
I was expecting the SOCI snapshotter to create a local snapshot.
Host information
Any additional context or information about the bug
No response