Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remote: podman kill: loses exit code #9751

Closed
edsantiago opened this issue Mar 18, 2021 · 13 comments · Fixed by #10028
Closed

remote: podman kill: loses exit code #9751

edsantiago opened this issue Mar 18, 2021 · 13 comments · Fixed by #10028
Assignees
Labels
flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. remote Problem is in podman-remote

Comments

@edsantiago
Copy link
Member

edsantiago commented Mar 18, 2021

New flake. Only three instances in two months, but that's enough to make me worry:

[+0104s] not ok 51 podman stop - unlock while waiting for timeout
(from function `is' in file test/system/helpers.bash, line 406,
 in test file test/system/050-stop.bats, line 95)
  `is "$output" "137" "Exit code of killed container"' failed
# podman-remote rm --all --force
# podman-remote ps --all --external --format {{.ID}} {{.Names}}
# podman-remote images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
quay.io/libpod/testimage:20200929 766ff5a3a7e4
# podman-remote run --name stopme -d quay.io/libpod/testimage:20200929 sh -c trap 'echo Received SIGTERM, ignoring' SIGTERM; echo READY; while :; do sleep 1; done
2da656948070442ce2585b9f2a83d003b0a1c44ea5a67b47a3b7b4a9d1f4bc9d
# podman-remote ps -a
CONTAINER ID  IMAGE                              COMMAND               CREATED                 STATUS                     PORTS   NAMES
2da656948070  quay.io/libpod/testimage:20200929  sh -c trap 'echo ...  Less than a second ago  Up Less than a second ago          stopme
# podman-remote inspect --format {{.State.Status}} stopme
stopping
# podman-remote kill stopme
2da656948070442ce2585b9f2a83d003b0a1c44ea5a67b47a3b7b4a9d1f4bc9d
# podman-remote inspect --format {{.State.ExitCode}} stopme
0
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#|     FAIL: Exit code of killed container
#| expected: '137'
#|   actual: '0'
#\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

sys: podman stop - unlock while waiting for timeout

It's possible that the bug is in the test itself.

@edsantiago edsantiago added flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. remote Problem is in podman-remote labels Mar 18, 2021
@edsantiago
Copy link
Member Author

I just saw it again, f33:

 ✗ podman stop - unlock while waiting for timeout
   (from function `is' in file /usr/share/podman/test/system/helpers.bash, line 411,
    in test file /usr/share/podman/test/system/050-stop.bats, line 138)
     `is "$output" "137" "Exit code of killed container"' failed
   # /usr/bin/podman-remote rm --all --force
   # /usr/bin/podman-remote ps --all --external --format {{.ID}} {{.Names}}
   # /usr/bin/podman-remote images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
   quay.io/libpod/testimage:20210223 4abe85964f5c
   # /usr/bin/podman-remote run --name stopme -d quay.io/libpod/testimage:20210223 sh -c trap 'echo Received SIGTERM, ignoring' SIGTERM; echo READY; while :; do sleep 1; done
   cd52dbb101531433498ca372ffd9db273684ecac76a1e091e078261a396084ac
   # /usr/bin/podman-remote ps -a
   CONTAINER ID  IMAGE                              COMMAND               CREATED                 STATUS    PORTS   NAMES
   cd52dbb10153  quay.io/libpod/testimage:20210223  sh -c trap 'echo ...  Less than a second ago  stopping          stopme
   # /usr/bin/podman-remote inspect --format {{.State.Status}} stopme
   stopping
   # /usr/bin/podman-remote kill stopme
   stopme
   # /usr/bin/podman-remote inspect --format {{.State.ExitCode}} stopme
   0
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #|     FAIL: Exit code of killed container
   #| expected: '137'
   #|   actual: '0'
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@edsantiago
Copy link
Member Author

I'm starting to think there's a more general race in podman-remote exit-status handling:

Trivial reproducer (assumes podman system service running elsewhere):

$ cat Containerfile
FROM quay.io/libpod/nosuchimage:nosuchtag
RUN echo hi
$ while :;do ../bin/podman-remote build -t build_test --pull-never . && break;done
...runs for a minute or two, then

STEP 1: FROM quay.io/libpod/nosuchimage:nosuchtag
Error: error creating build container: pull policy is "never" but "quay.io/libpod/nosuchimage:nosuchtag" could not be found locally

STEP 1: FROM quay.io/libpod/nosuchimage:nosuchtag
$ <--- back to prompt, with exit status 0. This should never happen.

My interpretation: the FROM, running on the server, is failing. The server aborts, the client aborts, but the client somehow doesn't get the error message nor exit status. The above is, so far, a 100% reliable reproducer, root and rootless.

@edsantiago
Copy link
Member Author

Another one in kill

sys: podman stop - unlock while waiting for timeout

@edsantiago
Copy link
Member Author

And a few more:

sys: podman stop - unlock while waiting for timeout

I'm having trouble keeping track of which ones I've already posted, so I may have missed some

@rhatdan
Copy link
Member

rhatdan commented Apr 14, 2021

This is very strange, since podman and podman-remote kill are using the same code path for kill. One would think this is a race condition between kill and inspect, but it should be faster with local execution then remote. It could be the kill kills the process inside of the container but the cleanup call does not happen until after the inspect. I would figure this would be the same race on local versus remote.

@mheon WDYT?

@rhatdan
Copy link
Member

rhatdan commented Apr 14, 2021

We do have this code for Docker, we could do the same for libpod, not sure why we would not want to.

	// Docker waits for the container to stop if the signal is 0 or
	// SIGKILL.
	if !utils.IsLibpodRequest(r) {
		sig, err := signal.ParseSignalNameOrNumber(query.Signal)
		if err != nil {
			utils.InternalServerError(w, err)
			return
		}
		if sig == 0 || syscall.Signal(sig) == syscall.SIGKILL {
			opts := entities.WaitOptions{
				Condition: []define.ContainerStatus{define.ContainerStateExited, define.ContainerStateStopped},
				Interval:  time.Millisecond * 250,
			}
			if _, err := containerEngine.ContainerWait(r.Context(), []string{name}, opts); err != nil {
				utils.Error(w, "Something went wrong.", http.StatusInternalServerError, err)
				return
			}
		}
	}

@mheon
Copy link
Member

mheon commented Apr 14, 2021

The test as written is inherently racy. podman kill does not and has never made a promise that it waits until the container is dead. As such, once podman-remote kill exits, the signal has been sent, and the kernel will process it when it next wakes up the container. So the race here is if the container gets scheduled before Podman gets past the initial state check for the container. I imagine the container being a child of podman-remote somewhat reduces its priority to run, as it's backgrounded?

The test should be rewritten to use podman stop to remove races.

@mheon
Copy link
Member

mheon commented Apr 14, 2021

Alternatively, it needs something to force inspect to run after the container has stopped. Maybe a loop to check container state?

@edsantiago
Copy link
Member Author

@mheon this is more than just podman-remote kill; or perhaps there's another different bug? Search in-page for 9920, that one manifests as podman-remote build.

@mheon
Copy link
Member

mheon commented Apr 14, 2021

That sounds like a separate bug? From where I stand, the original issue here with podman kill is definitely a race in the code. podman build may also be a race but since the commands are different it sounds distinct.

@rhatdan
Copy link
Member

rhatdan commented Apr 14, 2021

If we changed the test to do

podman kill
podman wait
podman inspect

The issue probably goes away correct?

@edsantiago
Copy link
Member Author

I'll submit a PR for that. (And will file the build problem as a new issue)

edsantiago added a commit to edsantiago/libpod that referenced this issue Apr 14, 2021
Add 'podman wait' between kill & inspect.

Fixes: containers#9751

Signed-off-by: Ed Santiago <santiago@redhat.com>
jmguzik pushed a commit to jmguzik/podman that referenced this issue Apr 26, 2021
Add 'podman wait' between kill & inspect.

Fixes: containers#9751

Signed-off-by: Ed Santiago <santiago@redhat.com>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. remote Problem is in podman-remote
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants