libpod: wait for healthy on main thread #22658

giuseppe · 2024-05-09T14:14:07Z

wait for the healthy status on the thread where the container lock is held. Otherwise, if it is performed from a go routine, a different thread is used (since the runtime.LockOSThread() call doesn't have any effect), causing pthread_mutex_unlock() to fail with EPERM.

Closes: #22651

Does this PR introduce a user-facing change?

None

mheon · 2024-05-09T14:18:32Z

LGTM

edsantiago · 2024-05-09T14:48:10Z

LGTM aside from the unused-arg thing. Do note that there's another c.start() call elsewhere in this module, in restart, does that also need to waitForHealthy()? And did you look for any other calls I may have missed?

giuseppe · 2024-05-09T19:16:58Z

LGTM aside from the unused-arg thing. Do note that there's another c.start() call elsewhere in this module, in restart, does that also need to waitForHealthy()? And did you look for any other calls I may have missed?

thanks, I've fixed the other occurrences and hopefully added enough new comments

packit-as-a-service · 2024-05-09T19:20:38Z

Ephemeral COPR build failed. @containers/packit-build please check.

test/system/260-sdnotify.bats

packit-as-a-service · 2024-05-13T20:02:42Z

Cockpit tests failed for commit 2006ee127641f21739952af7272d398094706706. @martinpitt, @jelly, @mvollmer please check.

Luap99

<+015ms> # # podman run --name 2HlcfDHskG --health-cmd=touch /terminate --sdnotify=healthy quay.io/libpod/testimage:20240123 sh -c while test \! -e /terminate; do sleep 0.1; done; echo finished
<+0120s> # finished
         # timeout: sending signal TERM to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman’
<+005ms> # [ rc=124 (** EXPECTED 0 **) ]
         # *** TIMED OUT ***

New test is flaking?! I don't see anything obvious wrong but we should not merge known flaky test can you check again

edsantiago · 2024-05-14T11:56:21Z

I have a slight suspicion that this might not be the fault of the new test; that it might be something in the new images. This happens often when we switch VM images: something subtle breaks.

The failure you report is in f40 aarch64. You'll notice that the container DID THE ECHO. That is, as best I can tell, the healthcheck and shell loop and everything worked. Container exit (infrastructure) failed.

Then there's the rerun:

[+0651s] not ok 193 [065] podman cp dir from container to container
...
<+014ms> # # podman rm -t 0 -f srccreated 8280f5a13c12dbcebbd24e6acc8efd07a059567542ce4f10cc987fe12085e820 77525efc1774daddb5b5d3e6ae6b5e5876d04fb23ba4e86fe48dc36f6cd4e896 00bf5080d0cc9ab85c2f31b091d33ff09094f6f7fca6554471899432544e80dd 387e81dd4f82591a8e0df3c6581600594f0ab962cb04accc2795840495a267ea 2c24d9cc35e83ce1d50307d561e0d35c043a789ac1249bef8ae96ef36fce35fd 4c6888367a5b02766ddcd5b2600a5f9250257265846b78f57b700c4ecfdc15dd 5d54065e29e2a1a64befd1a44f2f4e393108686901576214272217daf588eaea f486d8768ad311b50917ba01d7de2321bcd7c9c10751eb908b168817c5212226 82c5e6b1e1f33f810cb7213f31ceffcd927adf62acbdc23a06726e5ff1a9c002 8b0a6abfdc8d7a059ecf811499e4109caaf7135d21cbf5b5d8aff1173425fde9 5521a3c3f5b45543e14087cde6beb06fc874821ca9dbbb5ec48fa05680e7f869 c3b299970bf2d3dd8113177b36dc17056de3a39574eeff0d7e8a10bda89e1029 980b649605583afbabd7fbed8bd0b72afb5b0ec78bdd3f1d3c7b3b9c3dd45838 65a5fc9df9e6b0fbb41a1f2663e501e659987c3765f052c382163540de4d3890
<+2.22s> # time="2024-05-14T09:03:45Z" level=error msg="Rolling back transaction to add exit code: sql: transaction has already been committed or rolled back"
         # c3b299970bf2d3dd8113177b36dc17056de3a39574eeff0d7e8a10bda89e1029
         # srccreated
         # 5521a3c3f5b45543e14087cde6beb06fc874821ca9dbbb5ec48fa05680e7f869
         # 65a5fc9df9e6b0fbb41a1f2663e501e659987c3765f052c382163540de4d3890
         # 82c5e6b1e1f33f810cb7213f31ceffcd927adf62acbdc23a06726e5ff1a9c002
         # 8280f5a13c12dbcebbd24e6acc8efd07a059567542ce4f10cc987fe12085e820
         # 387e81dd4f82591a8e0df3c6581600594f0ab962cb04accc2795840495a267ea
         # f486d8768ad311b50917ba01d7de2321bcd7c9c10751eb908b168817c5212226
         # 4c6888367a5b02766ddcd5b2600a5f9250257265846b78f57b700c4ecfdc15dd
         # 77525efc1774daddb5b5d3e6ae6b5e5876d04fb23ba4e86fe48dc36f6cd4e896
         # 8b0a6abfdc8d7a059ecf811499e4109caaf7135d21cbf5b5d8aff1173425fde9
         # 980b649605583afbabd7fbed8bd0b72afb5b0ec78bdd3f1d3c7b3b9c3dd45838
         # 2c24d9cc35e83ce1d50307d561e0d35c043a789ac1249bef8ae96ef36fce35fd
         # 00bf5080d0cc9ab85c2f31b091d33ff09094f6f7fca6554471899432544e80dd
         # Error: cannot remove container 5d54065e29e2a1a64befd1a44f2f4e393108686901576214272217daf588eaea as it could not be stopped: committing transaction to add exit code: disk I/O error: resource temporarily unavailable

I think something is broken in aarch64-land.

Luap99 · 2024-05-14T12:20:14Z

The failure you report is in f40 aarch64. You'll notice that the container DID THE ECHO. That is, as best I can tell, the healthcheck and shell loop and everything worked. Container exit (infrastructure) failed.

Sure, except that the first two runs both failed on this exact test:
https://api.cirrus-ci.com/v1/artifact/task/5603905776648192/html/sys-podman-fedora-40-aarch64-root-host-sqlite.log.html
https://api.cirrus-ci.com/v1/artifact/task/6453525402615808/html/sys-podman-fedora-40-aarch64-root-host-sqlite.log.html

I don't care what fault it is we should never merge known flakes, period.
If we agree that code is well and not at fault then maybe comment the test out with a FIXME + issue or something but I if this flakes twice on this PR then it can only causes havoc for all other PR's as well.

wait for the healthy status on the thread where the container lock is held. Otherwise, if it is performed from a go routine, a different thread is used (since the runtime.LockOSThread() call doesn't have any effect), causing pthread_mutex_unlock() to fail with EPERM. Closes: containers#22651 Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

giuseppe · 2024-05-14T20:56:00Z

I've stressed the command locally and I've noticed it hangs sometimes, so perhaps it just happens more easily on aarch64.

I've added a patch to address the hang I've seen

packit-as-a-service · 2024-05-14T20:58:39Z

Ephemeral COPR build failed. @containers/packit-build please check.

packit-as-a-service · 2024-05-14T21:14:14Z

Cockpit tests failed for commit c08d4a1fb7f213f4509dc11112ca5ca581dac803. @martinpitt, @jelly, @mvollmer please check.

packit-as-a-service · 2024-05-14T21:15:14Z

Cockpit tests failed for commit 1ce0a9886dec45c6f81c26b8c73c141aa72e3e6e. @martinpitt, @jelly, @mvollmer please check.

packit-as-a-service · 2024-05-14T22:01:39Z

Ephemeral COPR build failed. @containers/packit-build please check.

packit-as-a-service · 2024-05-14T22:17:19Z

Cockpit tests failed for commit 3576c9f20d080136dc08391e9fb0cab95d57e61d. @martinpitt, @jelly, @mvollmer please check.

do not wait for the healthcheck status to change if the container is stopped. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

packit-as-a-service · 2024-05-15T07:41:15Z

Ephemeral COPR build failed. @containers/packit-build please check.

packit-as-a-service · 2024-05-15T07:58:08Z

Cockpit tests failed for commit 35375e0. @martinpitt, @jelly, @mvollmer please check.

Luap99

@giuseppe What is the status? Good to remove the draft or do you plan more changes?

LGTM for reference

giuseppe · 2024-05-16T15:46:08Z

yes sorry, forgot to remove the Draft

openshift-ci · 2024-05-16T15:46:34Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [Luap99,giuseppe]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mheon · 2024-05-16T15:55:26Z

/lgtm

openshift-ci bot added release-note-none approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 9, 2024

giuseppe mentioned this pull request May 9, 2024

--sdnotify=healthy leads to a panic when a rootless container is started #22651

Closed

giuseppe force-pushed the libpod-wait-for-healthy-on-main-thread branch from 154924d to 8395122 Compare May 9, 2024 19:16

giuseppe force-pushed the libpod-wait-for-healthy-on-main-thread branch from 8395122 to 1e9e448 Compare May 9, 2024 19:33

Luap99 reviewed May 13, 2024

View reviewed changes

test/system/260-sdnotify.bats Outdated Show resolved Hide resolved

giuseppe force-pushed the libpod-wait-for-healthy-on-main-thread branch from 1e9e448 to 2006ee1 Compare May 13, 2024 19:30

Luap99 requested changes May 14, 2024

View reviewed changes

giuseppe force-pushed the libpod-wait-for-healthy-on-main-thread branch from 2006ee1 to 1ce0a98 Compare May 14, 2024 20:54

giuseppe force-pushed the libpod-wait-for-healthy-on-main-thread branch from 1ce0a98 to c08d4a1 Compare May 14, 2024 20:55

giuseppe marked this pull request as draft May 14, 2024 21:02

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 14, 2024

giuseppe force-pushed the libpod-wait-for-healthy-on-main-thread branch from c08d4a1 to 3576c9f Compare May 14, 2024 21:57

container_api: do not wait for healtchecks if stopped

35375e0

do not wait for the healthcheck status to change if the container is stopped. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

giuseppe force-pushed the libpod-wait-for-healthy-on-main-thread branch from 3576c9f to 35375e0 Compare May 15, 2024 07:36

This was referenced May 15, 2024

machine: Volume ops test: statfs /private/tmp/ci/ginkgoNNN: no such file or directory #22569

Closed

minikube flakes #21931

Closed

Luap99 reviewed May 16, 2024

View reviewed changes

giuseppe marked this pull request as ready for review May 16, 2024 15:46

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 16, 2024

Luap99 approved these changes May 16, 2024

View reviewed changes

openshift-ci bot assigned mheon May 16, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 16, 2024

openshift-merge-bot bot merged commit f7a3046 into containers:main May 16, 2024
88 of 91 checks passed

edsantiago mentioned this pull request May 20, 2024

podman healthcheck + sdnotify: Error: container is stopped #22760

Closed

edsantiago mentioned this pull request Jul 22, 2024

podman-machine uncategorized flakes #22551

Open

stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Aug 15, 2024

stale-locking-app bot locked as resolved and limited conversation to collaborators Aug 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

libpod: wait for healthy on main thread #22658

libpod: wait for healthy on main thread #22658

giuseppe commented May 9, 2024

mheon commented May 9, 2024

edsantiago commented May 9, 2024

giuseppe commented May 9, 2024

packit-as-a-service bot commented May 9, 2024

packit-as-a-service bot commented May 13, 2024

Luap99 left a comment

edsantiago commented May 14, 2024

Luap99 commented May 14, 2024

giuseppe commented May 14, 2024

packit-as-a-service bot commented May 14, 2024

packit-as-a-service bot commented May 14, 2024

packit-as-a-service bot commented May 14, 2024

packit-as-a-service bot commented May 14, 2024

packit-as-a-service bot commented May 14, 2024

packit-as-a-service bot commented May 15, 2024

packit-as-a-service bot commented May 15, 2024

Luap99 left a comment

giuseppe commented May 16, 2024

openshift-ci bot commented May 16, 2024

mheon commented May 16, 2024

libpod: wait for healthy on main thread #22658

libpod: wait for healthy on main thread #22658

Conversation

giuseppe commented May 9, 2024

Does this PR introduce a user-facing change?

mheon commented May 9, 2024

edsantiago commented May 9, 2024

giuseppe commented May 9, 2024

packit-as-a-service bot commented May 9, 2024

packit-as-a-service bot commented May 13, 2024

Luap99 left a comment

Choose a reason for hiding this comment

edsantiago commented May 14, 2024

Luap99 commented May 14, 2024

giuseppe commented May 14, 2024

packit-as-a-service bot commented May 14, 2024

packit-as-a-service bot commented May 14, 2024

packit-as-a-service bot commented May 14, 2024

packit-as-a-service bot commented May 14, 2024

packit-as-a-service bot commented May 14, 2024

packit-as-a-service bot commented May 15, 2024

packit-as-a-service bot commented May 15, 2024

Luap99 left a comment

Choose a reason for hiding this comment

giuseppe commented May 16, 2024

openshift-ci bot commented May 16, 2024

mheon commented May 16, 2024