
libpod: wait for healthy on main thread #22658

Conversation

giuseppe (Member) commented May 9, 2024

Wait for the healthy status on the thread where the container lock is held. Otherwise, if the wait is performed from a goroutine, a different thread is used (since the runtime.LockOSThread() call doesn't have any effect), causing pthread_mutex_unlock() to fail with EPERM.

Closes: #22651

Does this PR introduce a user-facing change?

None
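
For context, a minimal Go sketch of the constraint this change works around (hypothetical names, not the actual libpod code): the container lock behaves like an error-checking pthread mutex, which must be unlocked by the same OS thread that locked it, so the blocking wait for "healthy" has to stay inline on that thread rather than in a goroutine.

```go
package main

import (
	"runtime"
	"time"
)

// lockContainer and unlockContainer are placeholders for the real container
// lock, which must be unlocked by the same OS thread that locked it.
func lockContainer()   {}
func unlockContainer() {}

// waitForHealthy is a placeholder for polling the healthcheck result.
func waitForHealthy() { time.Sleep(100 * time.Millisecond) }

func start() {
	// Pin the current goroutine to its OS thread for the lock/unlock pair.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	lockContainer()
	defer unlockContainer() // must run on the same thread as lockContainer

	// Do the blocking wait inline, on this thread. Spawning a goroutine for
	// it would not help: the goroutine runs on a different OS thread, and a
	// runtime.LockOSThread() call inside it only pins that other thread, so
	// an unlock issued from there would fail with EPERM.
	waitForHealthy()
}

func main() { start() }
```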

@openshift-ci openshift-ci bot added the release-note-none and approved labels May 9, 2024
mheon (Member) commented May 9, 2024

LGTM

edsantiago (Member)

LGTM aside from the unused-arg thing. Do note that there's another c.start() call elsewhere in this module, in restart; does that also need to waitForHealthy()? And did you look for any other calls I may have missed?

@giuseppe giuseppe force-pushed the libpod-wait-for-healthy-on-main-thread branch from 154924d to 8395122 on May 9, 2024 19:16
giuseppe (Member, Author) commented May 9, 2024

> LGTM aside from the unused-arg thing. Do note that there's another c.start() call elsewhere in this module, in restart; does that also need to waitForHealthy()? And did you look for any other calls I may have missed?

thanks, I've fixed the other occurrences and hopefully added enough new comments

Ephemeral COPR build failed. @containers/packit-build please check.

@giuseppe giuseppe force-pushed the libpod-wait-for-healthy-on-main-thread branch from 8395122 to 1e9e448 on May 9, 2024 19:33
@giuseppe giuseppe force-pushed the libpod-wait-for-healthy-on-main-thread branch from 1e9e448 to 2006ee1 on May 13, 2024 19:30

Cockpit tests failed for commit 2006ee127641f21739952af7272d398094706706. @martinpitt, @jelly, @mvollmer please check.

Luap99 (Member) left a comment

<+015ms> # # podman run --name 2HlcfDHskG --health-cmd=touch /terminate --sdnotify=healthy quay.io/libpod/testimage:20240123 sh -c while test \! -e /terminate; do sleep 0.1; done; echo finished
<+0120s> # finished
         # timeout: sending signal TERM to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman’
<+005ms> # [ rc=124 (** EXPECTED 0 **) ]
         # *** TIMED OUT ***

New test is flaking?! I don't see anything obviously wrong, but we should not merge a known flaky test. Can you check again?

edsantiago (Member)

I have a slight suspicion that this might not be the fault of the new test; that it might be something in the new images. This happens often when we switch VM images: something subtle breaks.

The failure you report is in f40 aarch64. You'll notice that the container DID THE ECHO. That is, as best I can tell, the healthcheck and shell loop and everything worked. Container exit (infrastructure) failed.

Then there's the rerun:

[+0651s] not ok 193 [065] podman cp dir from container to container
...
<+014ms> # # podman rm -t 0 -f srccreated 8280f5a13c12dbcebbd24e6acc8efd07a059567542ce4f10cc987fe12085e820 77525efc1774daddb5b5d3e6ae6b5e5876d04fb23ba4e86fe48dc36f6cd4e896 00bf5080d0cc9ab85c2f31b091d33ff09094f6f7fca6554471899432544e80dd 387e81dd4f82591a8e0df3c6581600594f0ab962cb04accc2795840495a267ea 2c24d9cc35e83ce1d50307d561e0d35c043a789ac1249bef8ae96ef36fce35fd 4c6888367a5b02766ddcd5b2600a5f9250257265846b78f57b700c4ecfdc15dd 5d54065e29e2a1a64befd1a44f2f4e393108686901576214272217daf588eaea f486d8768ad311b50917ba01d7de2321bcd7c9c10751eb908b168817c5212226 82c5e6b1e1f33f810cb7213f31ceffcd927adf62acbdc23a06726e5ff1a9c002 8b0a6abfdc8d7a059ecf811499e4109caaf7135d21cbf5b5d8aff1173425fde9 5521a3c3f5b45543e14087cde6beb06fc874821ca9dbbb5ec48fa05680e7f869 c3b299970bf2d3dd8113177b36dc17056de3a39574eeff0d7e8a10bda89e1029 980b649605583afbabd7fbed8bd0b72afb5b0ec78bdd3f1d3c7b3b9c3dd45838 65a5fc9df9e6b0fbb41a1f2663e501e659987c3765f052c382163540de4d3890
<+2.22s> # time="2024-05-14T09:03:45Z" level=error msg="Rolling back transaction to add exit code: sql: transaction has already been committed or rolled back"
         # c3b299970bf2d3dd8113177b36dc17056de3a39574eeff0d7e8a10bda89e1029
         # srccreated
         # 5521a3c3f5b45543e14087cde6beb06fc874821ca9dbbb5ec48fa05680e7f869
         # 65a5fc9df9e6b0fbb41a1f2663e501e659987c3765f052c382163540de4d3890
         # 82c5e6b1e1f33f810cb7213f31ceffcd927adf62acbdc23a06726e5ff1a9c002
         # 8280f5a13c12dbcebbd24e6acc8efd07a059567542ce4f10cc987fe12085e820
         # 387e81dd4f82591a8e0df3c6581600594f0ab962cb04accc2795840495a267ea
         # f486d8768ad311b50917ba01d7de2321bcd7c9c10751eb908b168817c5212226
         # 4c6888367a5b02766ddcd5b2600a5f9250257265846b78f57b700c4ecfdc15dd
         # 77525efc1774daddb5b5d3e6ae6b5e5876d04fb23ba4e86fe48dc36f6cd4e896
         # 8b0a6abfdc8d7a059ecf811499e4109caaf7135d21cbf5b5d8aff1173425fde9
         # 980b649605583afbabd7fbed8bd0b72afb5b0ec78bdd3f1d3c7b3b9c3dd45838
         # 2c24d9cc35e83ce1d50307d561e0d35c043a789ac1249bef8ae96ef36fce35fd
         # 00bf5080d0cc9ab85c2f31b091d33ff09094f6f7fca6554471899432544e80dd
         # Error: cannot remove container 5d54065e29e2a1a64befd1a44f2f4e393108686901576214272217daf588eaea as it could not be stopped: committing transaction to add exit code: disk I/O error: resource temporarily unavailable

I think something is broken in aarch64-land.

Luap99 (Member) commented May 14, 2024

> The failure you report is in f40 aarch64. You'll notice that the container DID THE ECHO. That is, as best I can tell, the healthcheck and shell loop and everything worked. Container exit (infrastructure) failed.

Sure, except that the first two runs both failed on this exact test:
https://api.cirrus-ci.com/v1/artifact/task/5603905776648192/html/sys-podman-fedora-40-aarch64-root-host-sqlite.log.html
https://api.cirrus-ci.com/v1/artifact/task/6453525402615808/html/sys-podman-fedora-40-aarch64-root-host-sqlite.log.html

I don't care whose fault it is; we should never merge known flakes, period.
If we agree the code is fine and not at fault, then maybe comment the test out with a FIXME + issue or something, but if this flakes twice on this PR then it will only cause havoc for all other PRs as well.

@giuseppe giuseppe force-pushed the libpod-wait-for-healthy-on-main-thread branch from 2006ee1 to 1ce0a98 on May 14, 2024 20:54
Wait for the healthy status on the thread where the container lock is
held.  Otherwise, if it is performed from a goroutine, a different
thread is used (since the runtime.LockOSThread() call doesn't have any
effect), causing pthread_mutex_unlock() to fail with EPERM.

Closes: containers#22651

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe giuseppe force-pushed the libpod-wait-for-healthy-on-main-thread branch from 1ce0a98 to c08d4a1 on May 14, 2024 20:55
giuseppe (Member, Author)

I've stressed the command locally and noticed that it sometimes hangs, so perhaps it just happens more easily on aarch64.

I've added a patch to address the hang I've seen.

Ephemeral COPR build failed. @containers/packit-build please check.

@giuseppe giuseppe marked this pull request as draft May 14, 2024 21:02
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label May 14, 2024

Cockpit tests failed for commit c08d4a1fb7f213f4509dc11112ca5ca581dac803. @martinpitt, @jelly, @mvollmer please check.

Cockpit tests failed for commit 1ce0a9886dec45c6f81c26b8c73c141aa72e3e6e. @martinpitt, @jelly, @mvollmer please check.

@giuseppe giuseppe force-pushed the libpod-wait-for-healthy-on-main-thread branch from c08d4a1 to 3576c9f on May 14, 2024 21:57

Ephemeral COPR build failed. @containers/packit-build please check.

Cockpit tests failed for commit 3576c9f20d080136dc08391e9fb0cab95d57e61d. @martinpitt, @jelly, @mvollmer please check.

Do not wait for the healthcheck status to change if the container is
stopped.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
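
A sketch of the hang-avoidance idea in this follow-up commit (hypothetical names, not the actual libpod code): while polling for the "healthy" status, also bail out if the container has already stopped, otherwise the wait can block forever on a container whose healthcheck will never run again.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// container stands in for the real libpod container; status is a placeholder
// for querying its running and healthy state.
type container struct {
	status func() (running bool, healthy bool)
}

// waitForHealthy polls until the container reports healthy, but returns early
// if the container has stopped, instead of waiting forever.
func (c *container) waitForHealthy(interval time.Duration) error {
	for {
		running, healthy := c.status()
		if healthy {
			return nil
		}
		if !running {
			return errors.New("container exited before becoming healthy")
		}
		time.Sleep(interval)
	}
}

func main() {
	start := time.Now()
	// Simulate a container that stops after one second without ever becoming
	// healthy; without the !running check the loop would never return.
	c := &container{status: func() (bool, bool) {
		return time.Since(start) < time.Second, false
	}}
	fmt.Println(c.waitForHealthy(100 * time.Millisecond))
}
```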
@giuseppe giuseppe force-pushed the libpod-wait-for-healthy-on-main-thread branch from 3576c9f to 35375e0 on May 15, 2024 07:36

Ephemeral COPR build failed. @containers/packit-build please check.

Cockpit tests failed for commit 35375e0. @martinpitt, @jelly, @mvollmer please check.

Luap99 (Member) left a comment

@giuseppe What is the status? Is it good to remove the draft, or do you plan more changes?

LGTM for reference

giuseppe (Member, Author)

Yes, sorry, I forgot to remove the draft.

@giuseppe giuseppe marked this pull request as ready for review May 16, 2024 15:46
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label May 16, 2024
openshift-ci bot (Contributor) commented May 16, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mheon (Member) commented May 16, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm label May 16, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit f7a3046 into containers:main May 16, 2024
88 of 91 checks passed
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR label Aug 15, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Aug 15, 2024

Successfully merging this pull request may close these issues.

--sdnotify=healthy leads to a panic when a rootless container is started