race: new rm -fa with dependencies is not actually waiting for container removal #18874
Comments
@mheon PTAL
I'll take a look this afternoon
@edsantiago I don't suppose you have any idea what makes this reproduce? I'm not seeing any obvious reason why. Is there a possibility it's not removing the second container at all? Or is it definitely being removed, just after the command exits?
Sorry, I haven't looked into it beyond reporting. I will do so now - will start by looking at the test for obvious races, and will then see if I can reproduce it.
@mheon If you look at the logs you can see that in AfterEach() both podman stop --all and podman rm -fa output this cid, so yes, I think this is not a race; the container was not removed at all for some reason.
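For context on why the surviving CID shows up in both commands' output: the per-test cleanup stops and then force-removes everything, printing what each command returns. Below is a generic Ginkgo sketch of that kind of AfterEach() (my own illustration using os/exec directly, not the actual podman e2e helpers):

```go
package cleanup_test

import (
	"fmt"
	"os/exec"
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func TestCleanup(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "cleanup sketch")
}

var _ = Describe("some podman test", func() {
	It("runs containers", func() {
		// ... test body elided ...
	})

	// After each test, stop and force-remove everything. Any container that
	// survived `podman rm -fa` shows up in both commands' output, which is
	// what the CI logs for this flake look like.
	AfterEach(func() {
		for _, args := range [][]string{
			{"stop", "--all", "-t", "0"},
			{"rm", "-fa"},
		} {
			out, _ := exec.Command("podman", args...).CombinedOutput()
			fmt.Printf("podman %v: %s\n", args, out)
		}
	})
})
```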
Gosh, I thought it'd be so easy to reproduce. It's not. Giving up for now; here's my attempt at a reproducer:

```bash
#!/bin/bash
td=/tmp/mypodmantmpdir
mkdir -p $td/root $td/runroot
PM="bin/podman --cgroup-manager cgroupfs --storage-driver vfs --events-backend file --db-backend sqlite --network-backend netavark --tmpdir $td --root $td/root --runroot $td/runroot"
IMAGE=quay.io/libpod/alpine:latest
$PM rm -fa
set -e
while :;do
$PM run -d --name mytop $IMAGE top
$PM run -d --network container:mytop $IMAGE top
$PM rm -fa
left=$($PM ps -aq)
echo $left
test -z "$left"
done
```
Could not reproduce on my f38 laptop, over many hours. Tried 1mt: reproduced on the second iteration, podman main @ 7c76907. Given how quickly it reproduces -- less than a minute, usually less than five seconds -- I confirmed that none of the options matter. Here is the barest-bones reproducer:

```bash
#!/bin/bash
PM=bin/podman
IMAGE=quay.io/libpod/alpine:latest
$PM rm -fa
set -e
while :;do
$PM run -d --name mytop $IMAGE top
$PM run -d --network container:mytop $IMAGE top
$PM rm -fa
left=$($PM ps -aq)
echo $left
test -z "$left"
done
```

@mheon in answer to your question, the container is not being removed. Here's my `podman ps`:

```
# bin/podman ps
CONTAINER ID  IMAGE                         COMMAND     CREATED        STATUS        PORTS       NAMES
659d48b527d8  quay.io/libpod/alpine:latest  top         4 minutes ago  Up 4 minutes              mytop
```

(that's "ps", not "ps -a").
Incidentally, my observations (Cirrus logs as well as dozens of iterations on 1mt) show that it is always the first container that survives. This may be just chance, but it seems unlikely over this many failures.
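To make that "first container survives" pattern concrete, here is a toy model of a parallel remove-all (entirely my own sketch with made-up names, not libpod code): each worker removes a container after first removing anything that depends on it, and treats "dependent already gone" as fatal. Under the wrong interleaving, the dependency (cid1) is the container left behind.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNoSuchCtr stands in for a "no such container" error.
var errNoSuchCtr = errors.New("no such container")

// store is a toy container store: NOT libpod, just a model of
// "remove all containers in parallel, removing dependents first".
type store struct {
	mu   sync.Mutex
	ctrs map[string]bool   // existing containers
	deps map[string]string // dependent -> the container it depends on
}

func (s *store) removeOne(cid string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.ctrs[cid] {
		return errNoSuchCtr
	}
	delete(s.ctrs, cid)
	return nil
}

// removeWithDependents removes cid, first removing anything that depends
// on it. Treating "dependent already gone" as fatal is the modeled bug:
// the dependency (cid1) is then never removed.
func (s *store) removeWithDependents(cid string) error {
	for dependent, dependency := range s.deps {
		if dependency == cid {
			if err := s.removeOne(dependent); err != nil {
				return fmt.Errorf("removing dependent %s of %s: %w", dependent, cid, err)
			}
		}
	}
	return s.removeOne(cid)
}

func main() {
	s := &store{
		ctrs: map[string]bool{"cid1": true, "cid2": true},
		deps: map[string]string{"cid2": "cid1"}, // cid2 run with --network container:cid1
	}

	// "rm -fa": remove every container on its own goroutine.
	var wg sync.WaitGroup
	for _, cid := range []string{"cid1", "cid2"} {
		cid := cid
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := s.removeWithDependents(cid); err != nil {
				fmt.Println("worker error:", err)
			}
		}()
	}
	wg.Wait()

	// Depending on interleaving, cid1 may survive: its worker tried to
	// remove cid2 first, found it already gone, and gave up.
	fmt.Println("leftover:", s.ctrs)
}
```

The interleaving is racy, which would fit the observation that it reproduces within seconds but not on every iteration.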
More (possibly useless) data. This reproduces so quickly on 1mt that I wrote a test to confirm my assertion above, that the survivor is always cid1. It is (so far, over hundreds of iterations) but I'm seeing, in my loop, rather a lot of these warnings:

I don't know where they're coming from, and I don't know what that SHA refers to. It's not an error, because ...

Oh, here's more fun. In the process of playing, I ended up with the umount/EINVAL bug (#18831)!

```
# ./reproducer
Error: creating container storage: the container name "mytop" is already in use by 335f21f6bee6407c79bf12181427c7db01b431fa2533112e44c9487002a14bd2. You have to remove that container to be able to reuse that name: that name is already in use

# bin/podman ps -aq        <<<--- no output

# bin/podman ps -a --external
CONTAINER ID  IMAGE                         COMMAND     CREATED        STATUS      PORTS       NAMES
335f21f6bee6  quay.io/libpod/alpine:latest  storage     6 minutes ago  Storage                 mytop

# bin/podman rm -af        <<<-- no output, no effect

# bin/podman rm -f 335f
WARN[0000] Unmounting container "335f" while attempting to delete storage: unmounting "/var/lib/containers/storage/overlay/e3149559a6e1ae65a7b0b2e140ceb2bb7e7bdc06b9654bd5d46ba5d3adf7ad46/merged": invalid argument
Error: removing storage for container "335f": unmounting "/var/lib/containers/storage/overlay/e3149559a6e1ae65a7b0b2e140ceb2bb7e7bdc06b9654bd5d46ba5d3adf7ad46/merged": invalid argument
```

From this point on, even ...

In case it helps, here is my final confirm-the-survivor reproducer:

```bash
#!/bin/bash
PM=bin/podman
IMAGE=quay.io/libpod/alpine:latest
$PM rm -fa
echo
set -e
while :;do
cid1=$($PM run -d --name mytop $IMAGE top)
cid2=$($PM run -d --network container:mytop $IMAGE top)
rm=$($PM rm -fa)
# Order is not predictable. Usually it's cid2 cid1, but sometimes reverse
# if [[ "$(echo $rm)" != "$cid2 $cid1" ]]; then
# echo
# echo "rm: got $rm"
# echo "expected $cid2 $cid1"
# exit 1
# fi
left=$($PM ps -aq --notruncate)
if [[ -n "$left" ]]; then
if [[ "$left" != "$cid1" ]]; then
echo
echo "WHOA! Leftover container is not cid1!"
echo " cid1 : $cid1"
echo " cid2 : $cid2"
echo " left : $left"
exit 1
fi
echo "Triggered, usual case."
$PM rm -fa >/dev/null
fi
done
```
@mheon Can you take a look? I think this should be addressed before the next release.
I don't think this helps in any way, given the super-easy reproducer, but just in case:
Basically, all distros, root & rootless. Have not seen it happen with podman-remote nor on aarch64 (yet).
New variation:
That is: both containers were stopped and rm'ed (at least according to the log), but there was an error/warning logged. UPDATE: never mind, this is probably #18452 (the "open pidfd" flake)
Given that a reproducer already exists, I have nothing specific to contribute, other than to say that after upgrading to 4.6.0 this has started to affect us as well.
I'll take a further look tomorrow
Welcome back, @mheon! I hope your PTO was restful. Quick reminder that this is still a huge problem, flaking very often (but passing on ginkgo retry):
When removing a container's dependency, getting an error that the container has already been removed (ErrNoSuchCtr and ErrCtrRemoved) should not be fatal. We wanted the container gone, it's gone, no need to error out.

[NO NEW TESTS NEEDED] This is a race and thus hard to test for.

Fixes containers#18874

Signed-off-by: Matt Heon <mheon@redhat.com>
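For readers skimming the thread, the pattern that commit message describes looks roughly like the following. This is my own self-contained sketch, with stand-in error values and a hypothetical removeRelated helper, not the actual libpod diff:

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the ErrNoSuchCtr / ErrCtrRemoved sentinels named in the
// commit message above.
var (
	errNoSuchCtr  = errors.New("no such container")
	errCtrRemoved = errors.New("container has already been removed")
)

// removeRelated is a hypothetical helper standing in for the real call that
// removes a container's dependency/dependent during `rm -fa`.
func removeRelated(id string) error {
	return errCtrRemoved // pretend a parallel worker beat us to it
}

// removeContainer shows the tolerant pattern: if the related container is
// already gone, that counts as success rather than a fatal error.
func removeContainer(id string, related []string) error {
	for _, rel := range related {
		if err := removeRelated(rel); err != nil {
			if errors.Is(err, errNoSuchCtr) || errors.Is(err, errCtrRemoved) {
				fmt.Printf("%s already removed, continuing\n", rel)
				continue
			}
			return fmt.Errorf("removing %s (required by %s): %w", rel, id, err)
		}
	}
	fmt.Printf("removed %s\n", id)
	return nil
}

func main() {
	_ = removeContainer("cid1", []string{"cid2"})
}
```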
This is a big one, failing often; it's only a matter of time before it hits us even in main.