copying system image from manifest list: writing/storing blob: happened during read: unexpected EOF #17193
Saw multiple occurrences of this flake while working on this PR (#16297) as well.
Looks like a registry flake to me. @mtrmac, I think we need to tweak how EOFs are caught during retry. I did not investigate any further than guessing that https://github.com/containers/common/blob/main/pkg/retry/retry.go#L76-L79 is not enough.
Yes; it’s already on a short-term list as https://issues.redhat.com/browse/RUN-1636 .
Now our number-two flake (#16973 is number one by far).
containers/image#1816 is a WIP trying to improve this. Note #17221, filed early to collect Podman production data. If that doesn’t catch the flake (…
Just the last four days:

Last five days:
Help is on the way. Let me vendor c/image directly now, since @mtrmac's latest changes are expected to fix it.
To fix the registry flakes we see in containers#17193, make c/image more tolerant toward unexpected EOFs from registries. Also pin docker/docker to v20.10.23+incompatible, as it doesn't compile; imagebuilder needs to be fixed first.

Fixes: containers#17193
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
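For readers unfamiliar with how such a pin looks in practice, the commit message above corresponds to a `go.mod` fragment along these lines (illustrative; only the docker/docker version is taken from the commit message, the surrounding lines are assumptions):

```
// go.mod fragment (sketch): pin docker/docker to the last version that
// still compiles against the current imagebuilder, per the commit above.
require (
	github.com/containers/image/v5 v5.24.0 // hypothetical vendored version
	github.com/docker/docker v20.10.23+incompatible // pinned: newer versions don't compile yet
)
```

The `+incompatible` suffix is Go modules' marker for a pre-modules v2+ release, which is why the pin has that unusual shape.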
#17221 was merged; that should help, but we do see failures that don’t fit the current heuristic.
Same failure, new error message seen (in f36 rootless). Is this helpful?
I don’t know what a good way to collect those “heuristic tuning data” messages would be… in principle they could show us something unexpected. (Real telemetry is … a thought.) So far all of those have been something like the above: a failure very soon after starting to fetch data, less than a megabyte into the stream (a megabyte being the current cut-off). That’s probably a clear enough indication that we need to allow a retry in that case, so more instances of that kind of failure (last retry 0, small total, small “since progress”) would not make a difference. Other kinds of failure messages would still be interesting.
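The "heuristic tuning" idea above — retry when the failure happens very early in the stream — can be sketched as follows. The function name, the retry limit, and the `readSinceProgress` criterion are assumptions for illustration; only the 1 MiB cut-off comes from the comment.

```go
package main

import "fmt"

// earlyFailureCutoff mirrors the "megabyte being the current cut-off"
// mentioned in the comment above.
const earlyFailureCutoff = 1 << 20 // 1 MiB

// shouldRetryBodyRead is a hypothetical sketch of the heuristic: a read
// failure very soon after starting to fetch data is treated as a transient
// hiccup and retried; a failure deep into a large transfer is only retried
// if the connection was still making progress, to avoid re-downloading
// gigabytes over a genuinely dead connection.
func shouldRetryBodyRead(retriesDone int, totalRead, readSinceProgress int64) bool {
	if retriesDone >= 3 { // give up after a few attempts (arbitrary limit)
		return false
	}
	if totalRead < earlyFailureCutoff {
		return true // "a failure very soon after starting to fetch data"
	}
	return readSinceProgress > 0
}

func main() {
	// The flake pattern described above: last retry 0, small total.
	fmt.Println(shouldRetryBodyRead(0, 512, 512)) // prints "true"
}
```

The failure messages quoted in this thread (small total, small “since progress”, retry 0) would all fall into the first branch, which is the comment's argument that more instances of that shape no longer add information.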
Here's one I don't know how to categorize:
From a not very careful check, I’d categorize that as “other”: it is the remote server successfully sending a “handshake failure” TLS alert record, i.e. it’s not an EOF we see from the server. (It may be possible that the connection was disrupted in fundamentally the same way, causing the server to see an EOF from us, and responding to that with a “handshake failure” alert. I’m not sure.)
I'm seeing a lot of flakes now that say "502 Bad Gateway" instead of "happened during read":
Is this a new variant of this same issue? Is it #16973 (quay flakes)? Or should I file a new issue?
So, 502 is a different behavior. (We don’t see inside the registry to understand whether it’s the same cause, and it might not matter.) For context, currently there are two levels of retries:

- c/image internally retries reading a blob body after some kinds of read failures;
- c/common/pkg/retry wraps the whole operation and retries it when the returned error has a recognized type.
Looking at the code, it seems that a 502 doesn’t trigger the c/image retry (as it shouldn’t), but it also doesn’t trigger c/common/pkg/retry (because the error returned by c/image doesn’t have a recognizable Go type). So, if the failures don’t trigger a retry …

(This is under the assumption that retrying could help the operation succeed. I’m not sure about that, nor how to reasonably measure that. Retrying will very likely increase the load on the server, possibly making the failure harder to recover from.)
New flake, two instances, both in the same PR at a similar clock time.

(Am not tagging `remote` because even though the failure says remote, the failure actually happened while pulling cache images, which I think is done using non-remote podman.)