Podman-remote run should wait for exit code #3934
Hopefully fixes: #3870 |
@mheon @baude @edsantiago PTAL |
LGTM, any chance we can test that?
Code LGTM |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: rhatdan The full list of commands accepted by this bot can be found here. The pull request process is described here
@edsantiago has a test that was revealing this error.
Testing negatives/error-cases of a race is very difficult to do reliably in automation. I can manually run the change through the reproducer I used before, however the results cannot be conclusive - since code-change nearly always causes a timing change 😕 |
@edsantiago that problem is not happening on master. Cirrus-CI (internally) is supposed to decrypt those. If re-running doesn't clear up the problem, then I'll inform their support.
I think that's a red herring. My suspicion is that the problem is elsewhere, but Cirrus is showing |
It mangles the output with |
@rhatdan There's some problem here, when I run Ed's reproducer using code from this PR on Fedora 30:
On Fedora 30, it simply hangs after printing 'hi' and never returns (I waited 5 minutes). I'll look at the Cirrus-CI logs next...
...ya, similar story in the integration-tests: |
Ref: VM Images I'm using are:
#!/bin/bash
set -xe
tmpdir=$(mktemp -d --tmpdir podman-remote-test.XXXXXX)
cat >$tmpdir/Dockerfile <<EOF
FROM quay.io/libpod/alpine_labels:latest
RUN apk add nginx
RUN echo hi >/myfile
EOF
pr() { podman-remote "$@"; }
while :;do
pr build -t build_test --format=docker $tmpdir
pr run --rm build_test cat /myfile
pr rmi -f build_test
done
like this from the repository root:
make podman podman-remote
make install PREFIX=/usr
systemctl enable io.podman.socket
systemctl enable io.podman.service
systemctl restart io.podman.socket
systemctl restart io.podman.service
chmod +x repro.sh
./repro.sh
...cut... |
@cevich @edsantiago Total rewrite of original patch, turns out the error, I believe, was on the server side. We were not waiting for a full exit, so I think the client side could exit before the container was cleaned up. While in this code, I figured out why container exit codes were not being propagated. |
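A client-side check that a container's exit status actually comes back through the remote client could look like the sketch below. It is not part of the patch: `expect_exit` is a hypothetical helper name, and the commented-out usage assumes the `build_test` image from the reproducer above and a running io.podman service.

```shell
#!/bin/bash
# Hypothetical helper (not from this PR): run a command and compare its
# exit status with the expected one.
expect_exit() {
    local expected=$1; shift
    local actual=0
    # Capture the status without aborting under `set -e`.
    "$@" || actual=$?
    if [ "$actual" -ne "$expected" ]; then
        echo "FAIL: expected exit $expected, got $actual: $*" >&2
        return 1
    fi
    echo "ok: exit $actual: $*"
}

# Stand-in that runs without podman: a shell that exits with status 3.
expect_exit 3 sh -c 'exit 3'

# Assumed usage against the remote client:
# expect_exit 3 podman-remote run --rm build_test sh -c 'exit 3'
```

If the server replies before the container has fully exited, the remote variant would be expected to report a mismatched status, which is the symptom described above.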
LGTM assuming happy tests. |
LGTM but I'd like a nod from @baude pre-merge |
System tests are failing for most remote client jobs. CGroups v2 remote succeeded, though? |
The major difference between reproducing above, and previously is the |
Rebased onto #3985 just to be safe, spun up a VM w/o updating its packages. Running the reproducer above I'm still getting Perhaps interestingly, this is also some kind of race, because when I change to
@baude Matt took a look at a VM behaving this way, and concluded:
would you mind giving it a go (rebase this PR against master to be safe) when you have a chance? (note: current CI results here are not helpful until rhatdan rebases the PR) |
We have leaked the exit code numbers all over the code; this patch replaces the numbers with constants. Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
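The same idea, named constants instead of bare numbers, can be sketched for test scripts. The commit itself changes the Go code; this is only a shell analogue. The three values are the ones documented in podman-run(1); the variable and function names are hypothetical.

```shell
#!/bin/bash
# Exit statuses documented in podman-run(1); names are illustrative only.
EXIT_PODMAN_ERROR=125    # the error is with podman itself
EXIT_CANNOT_INVOKE=126   # the contained command cannot be invoked
EXIT_NOT_FOUND=127       # the contained command cannot be found

# Hypothetical helper: turn an exit status into a readable description.
describe_exit() {
    case "$1" in
        "$EXIT_PODMAN_ERROR")  echo "podman itself failed" ;;
        "$EXIT_CANNOT_INVOKE") echo "command cannot be invoked" ;;
        "$EXIT_NOT_FOUND")     echo "command not found" ;;
        *)                     echo "container command exited with $1" ;;
    esac
}

describe_exit 127
describe_exit 3
```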
This change matches what is happening on the podman local side and should eliminate a race condition. Also exit commands on the server side should start to return to client. Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
This is a new error in a start test I haven't seen before though it's Ubuntu, so maybe a flake? |
...digging a bit, the @mheon If the above is true, this race is in many places in our integration tests. There are even some which assert the presence of remaining containers, which could race with an erroneous removal to produce false-positive results. Is there a way to detect whether the removal process has completed for a given container ID (like some file that should not exist or something like that)? This would be the ideal solution in both (positive and negative) test scenarios, to prevent racing on an existence-check. For example,
uggg, sorry, I shouldn't tie up Dan's PR with more problems, let me open an issue for that race... |
...opened #4021 |
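Until there is a definitive "removal finished" signal, a test can at least poll with a timeout instead of checking existence once. The sketch below is an assumption-laden illustration, not something from this PR: `wait_until_gone` is a hypothetical helper, and the commented-out usage assumes `container exists` is available through the remote client.

```shell
#!/bin/bash
# Hypothetical helper: poll an "exists" check until it starts failing,
# or give up after a timeout given in seconds.
wait_until_gone() {
    local timeout=$1; shift
    local deadline=$(( $(date +%s) + timeout ))
    while "$@" >/dev/null 2>&1; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "timed out waiting for: $*" >&2
            return 1
        fi
        sleep 0.2
    done
    return 0
}

# Stand-in without podman: a marker file plays the container and is
# removed in the background after one second.
marker=$(mktemp)
( sleep 1; rm -f "$marker" ) &
wait_until_gone 10 test -e "$marker" && echo "gone"
wait

# Assumed usage:
# wait_until_gone 10 podman-remote container exists "$cid"
```

Polling only helps the positive case (waiting for something to disappear). Tests that assert a container is still present would still need a real completion signal, as asked above.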
(re-ran flaked test) |
After this PR, we should have a guarantee that the container is gone when the original Podman process exits. |
Never mind. We don't have an explicit remove in https://github.com/containers/libpod/blob/82ac0d8925dbb5aa738f1494ecb002eb6daca992/pkg/adapter/containers_remote.go#L462. We probably need to add one.
@mheon the remove happens on the server side in the adapter code. |
I think this is ready to merge. @mheon @giuseppe @vrothberg @TomSweeneyRedHat @cevich @edsantiago PTAL |
Are we guaranteed that the client doesn't exit until the server is done removing, though? |
/lgtm |
Possibly not, that's why I opened #4021 |
I believe that we are guaranteed for the removal to happen before the front end gets signaled. |
This change matches what is happening on the podman local side
and should eliminate a race condition.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>