CI: cgroup.freeze flake is back #7148
Comments
We need to revendor buildah to get containers/buildah#2434. The race is in Buildah, which is trying to kill a container that has already exited.
@TomSweeneyRedHat is working on a rebase for podman-1.15 to fix 2.0.*, but we could separately declare a new master branch to vendor it in.
Ping, what is the status of this? I see #7203 bringing in a new buildah v1.15.1-0.20200731151214-29f4d01c621c, but the flake continues to happen:
46 podman build - stdin test
42 podman build - basic test
Another one today:
39 podman build - basic test
Someone pretty please, help. This is flaking at least once per day, causing much lost effort. Most recently in the v2.0.5 build PR. @TomSweeneyRedHat @rhatdan @giuseppe @mheon anyone, please?
I've been trying to reproduce locally without any luck so far. Let's improve the buildah error message and hope it shows something more useful: containers/buildah#2559
Woot, I just reproduced it with podman @ master @ 8fdc116:
Unfortunately I don't see the 'got output' string.
@giuseppe I hand-patched buildah per your 2559 above. The output is not helpful:
What else do you suggest that I could try?
My reproducer (please don't judge me):
# cat foo.sh
#!/bin/bash
set -e
timeout --foreground -v --kill=10 60 ../bin/podman build -t foo - <<EOF
FROM quay.io/libpod/alpine_labels:latest
RUN mkdir /workdir
WORKDIR /workdir
RUN /bin/echo hello
RUN apk add nginx
RUN echo jnycVMjJfiJTIZtnvhgIta9v359dTI7fiYFrWmeNbzg1zu5e6M > /v4LZDGa4woigKduvCVsX
EOF
../bin/podman rmi foo
# chmod 755 foo.sh
# while ./foo.sh;do echo;done
On a 1minutetip f32 VM, it seems to fail within 15-25 minutes.
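For a flake that takes this long to show up, it can help to keep the output of the failing run rather than letting it scroll away. The following is only a sketch of one way to do that: `run_until_failure` is a hypothetical helper, not part of podman or its test suite, and the iteration cap is an arbitrary assumption.

```shell
#!/bin/bash
# Hypothetical helper (not part of podman): rerun a command until it fails,
# keeping the combined stdout/stderr of the failing run for inspection.
run_until_failure() {
    local max="$1"; shift          # maximum number of attempts
    local log i
    log=$(mktemp)
    for ((i = 1; i <= max; i++)); do
        # Each run overwrites the log, so only the latest run is kept.
        if ! "$@" >"$log" 2>&1; then
            echo "failed on iteration $i; log saved in $log"
            return 1
        fi
    done
    rm -f "$log"
    echo "no failure in $max iterations"
}
```

With the reproducer above, this would be called as `run_until_failure 1000 ./foo.sh`.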
by the time crun attempts to read from the cgroup, systemd might have already cleaned it up. When using systemd, on ENOENT state reports the container as "stopped" instead of an error.
Closes: containers/podman#7148
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
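The idea in that commit message can be illustrated with a small shell sketch. This is not crun's code (crun is written in C and its status logic checks more than this); `container_status` is a hypothetical function, and the only real interface assumed is the cgroup v2 `cgroup.freeze` file, which contains `0` or `1`. The point is that a missing file is mapped to "stopped" rather than surfaced as an error.

```shell
#!/bin/bash
# Hypothetical sketch, not crun's implementation: derive a status string
# from the cgroup.freeze file inside the given cgroup directory.
container_status() {
    local freeze_file="$1/cgroup.freeze"
    local frozen
    if ! frozen=$(cat "$freeze_file" 2>/dev/null); then
        # The cgroup may already have been cleaned up (ENOENT):
        # report "stopped" instead of failing.
        if [ ! -e "$freeze_file" ]; then
            echo stopped
            return 0
        fi
        # Any other read failure is still treated as an error.
        echo "error: cannot read $freeze_file" >&2
        return 1
    fi
    if [ "$frozen" = "1" ]; then
        echo paused
    else
        echo running
    fi
}
```

Calling `container_status` on a cgroup path that no longer exists prints `stopped` and returns success, which is the behavior change the commit message describes.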
Thanks for the hint. I haven't yet been able to reproduce it on my machine (or in a VM), but after inspecting the code more closely, I found the root cause of the race: containers/crun#474. It can be easily reproduced by adding a sleep just before crun tries to read the cgroup:
then I get:
The error is now open pidfd instead of kill because crun now supports pidfd for killing processes. I'll backport the patch to the Fedora package as soon as it is merged.
The nightmarish cgroup.freeze flake is back:
This is on Fedora 31, and yes, it has crun-0.14.1-1
@giuseppe PTAL. I know this is probably going to need to be fixed in crun but I'm filing here as a way to track future flakes.