
pasta udp tests: new bytecheck helper #24238

Merged
2 commits merged into containers:main on Oct 17, 2024

Conversation

edsantiago
Member

@edsantiago edsantiago commented Oct 11, 2024

...for debugging #24147, because "md5sum mismatch" is not
the best way to troubleshoot bytestream differences.

socat is run inside the container, so this requires building a
new testimage (20241011). Bump to new VMs which include it:

containers/automation_images#389

This new image breaks APIv2 tests, almost certainly due
to containers/buildah#5595 . Fix
those tests, and add a few new ones while we're at it.

Signed-off-by: Ed Santiago santiago@redhat.com
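As a rough illustration of the bytecheck idea described above (a hypothetical sketch only, not the helper merged in this PR; the container name, port number, and published-port setup are assumptions, while socat being in testimage:20241011 matches the description):

# Hypothetical sketch: push a known byte pattern over UDP through pasta and
# report the first differing byte instead of only an md5sum mismatch.
rand=$(mktemp)
got=$(mktemp)
dd if=/dev/urandom of="$rand" bs=1k count=64 status=none

# Assumed setup: rootless podman with pasta networking and a published UDP port.
podman run -d --name bytecheck-demo -p 5000:5000/udp \
    quay.io/libpod/testimage:20241011 top
podman exec -d bytecheck-demo socat -u UDP4-LISTEN:5000 CREATE:/tmp/received
sleep 1

# Send the pattern from the host through the forwarded port, then fetch the result.
socat -u OPEN:"$rand" UDP4:127.0.0.1:5000
sleep 1
podman cp bytecheck-demo:/tmp/received "$got"

if cmp -s "$rand" "$got"; then
    echo "byte streams match"
else
    echo "first mismatches (byte offset, octal values):"
    cmp -l "$rand" "$got" | head -5
fi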


@openshift-ci openshift-ci bot added the release-note-none and approved labels Oct 11, 2024
@edsantiago edsantiago force-pushed the pasta-bytecheck branch 2 times, most recently from c7fb508 to 5fb4046 Compare October 11, 2024 18:42

Cockpit tests failed for commit 5fb40466bedc4c76f3ad0705b3c49da26109cc0b. @martinpitt, @jelly, @mvollmer please check.

Member

@Luap99 Luap99 left a comment


LGTM

@martinpitt
Contributor

Hmm, testCheckpointRestore started to fail rather often on Fedora 41 a few weeks ago. It's definitely a flake (most podman PRs are fine, and retries work), and we don't see this on our CI weather report at all.

The screenshot shows the error message, but it's not very detailed -- calling crun checkpoint failed. The journal also just says

CRIU checkpointing failed -52. Please check CRIU logfile /var/lib/containers/storage/overlay-containers/98fddba2278ccdbfa2da032d64c5b08b852bfd6c2da61a6da8d5ee8128c8b7f2/userdata/dump.log

That file isn't captured in test artifacts, though.

I upgraded a local F41 VM with all packages from podman-next COPR and ran the test 50 times successfully. So maybe the testing farm environment is different, but I also never saw that failure in our cockpit-podman PRs (i.e. with distro packages instead of podman-next).

For now I'll make a PR that attaches that dump.log as an artifact in case of failures. Do you have any other ideas about how to debug this?
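A minimal sketch of that idea (the artifacts directory, environment variable, and search path are assumptions; the real hook lives in the cockpit-podman test harness):

# Hypothetical failure hook: preserve any CRIU dump.log so it shows up in the
# test artifacts instead of vanishing with the test VM.
ARTIFACTS=${TEST_ARTIFACTS:-/tmp/artifacts}
mkdir -p "$ARTIFACTS"
find /var/lib/containers/storage/overlay-containers -name dump.log \
    -exec cp --parents -t "$ARTIFACTS" {} + 2>/dev/null || true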

@edsantiago
Member Author

See also #24230. There's a new version of criu out there, and it's broken.

@martinpitt
Contributor

See cockpit-project/cockpit-podman#1878 for the "collect criu dump.log" bit.


Ephemeral COPR build failed. @containers/packit-build please check.


Cockpit tests failed for commit a64799191ade12605f9c81af5cc5c3a80e816026. @martinpitt, @jelly, @mvollmer please check.

@edsantiago
Member Author

@containers/podman-maintainers PTAL.

Merging this will make it much much easier to test zstd. The pasta helper is less important now that @sbrivio-rh has a reproducer for #24147, but it's still a nice-to-have for tomorrow's bugs.

test/NEW-IMAGES Outdated
@@ -12,3 +12,4 @@
#
# Format is one FQIN per line. Enumerate them below:
#
quay.io/libpod/testimage:20241011
Member


Shouldn't we get this into the CI image first? This will cause quay.io flakes...

Member Author


It already is in the latest CI images! (The zstd ones. Getting podman to work with zstd is just a SMOP).

Oh all right. I'll build new images with only that change...

Member


Yep, I know it is annoying, but I think we have established that we cannot pull from quay.io reliably enough for podman CI usage. It is not a question of if but when it will flake. If I knew the zstd image would be ready quickly I might be fine with carrying this, but given the amount of bugs I doubt it will happen fast enough.

@martinpitt
Contributor

A-ha! crun checkpoint failed again, and now we have the dump.log:

(00.103272) Running network-lock scripts
Error (criu/util.c:640): execvp("iptables-restore", ...) failed: No such file or directory
(00.104749) Error (criu/util.c:655): exited, status=1
Error (criu/util.c:640): execvp("ip6tables-restore", ...) failed: No such file or directory
(00.105530) Error (criu/util.c:655): exited, status=1
(00.105550) Error (criu/net.c:3137): net: Locking network failed: iptables-restore returned -1. This may be connected to disabled CONFIG_NETFILTER_XT_MARK kernel build config option.
(00.105616) net: Unlock network
(00.105625) Running network-unlock scripts
Error (criu/util.c:640): execvp("iptables-restore", ...) failed: No such file or directory
(00.106290) Error (criu/util.c:655): exited, status=1
Error (criu/util.c:640): execvp("ip6tables-restore", ...) failed: No such file or directory
(00.106861) Error (criu/util.c:655): exited, status=1
Error (criu/util.c:640): execvp("iptables-restore", ...) failed: No such file or directory
(00.109232) Error (criu/util.c:655): exited, status=1
Error (criu/util.c:640): execvp("ip6tables-restore", ...) failed: No such file or directory
(00.109776) Error (criu/util.c:655): exited, status=1
(00.109796) Unfreezing tasks into 1
(00.109801) 	Unseizing 48726 into 1
(00.109829) Error (criu/cr-dump.c:2111): Dumping FAILED.

@edsantiago That looks distinctly different from #24230. Also, this is a flake; it only fails in about 30% of the cases. At first glance this makes no sense, though: either iptables is installed or it isn't, so why wouldn't this be a persistent failure? Or is the iptables part not actually fatal (and something that would always happen), and the real error is something else and hidden?

@martinpitt
Contributor

Ah, or are you saying it's persistently broken in this PR only, due to bumping the test image? (I am not sure now if I saw that failure in other PRs)

@edsantiago edsantiago marked this pull request as draft October 16, 2024 14:48
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label Oct 16, 2024
@edsantiago
Member Author

@martinpitt I'm sorry, I'm a bit lost. Are you suggesting that this PR is somehow breaking cockpit? That seems unlikely to me. My mail filters autodelete all packit-as-a-service emails, because those tests are 100% (or nearly 100%) cry-wolf. However, searching my trash for packit + cockpit finds 15 such emails since the beginning of October, only two of them my pasta-udp PR. Do you have a reason for believing that my PR is somehow especially harmful to cockpit?

@martinpitt
Contributor

@edsantiago I don't have conclusions yet. I'm saying that there is a bug in runc or criu that sometimes breaks podman container checkpoint, via crun checkpoint. What I don't know yet is whether this PR somehow triggers it (via the bumped test image or otherwise) or whether it is a flake. I am not permitted to retry TF runs here, so I can't say for sure. But I checked a dozen other PRs, and checkpointing doesn't fail in them, so there is at least a high probability that the changes here are somehow responsible for it.

Above you mentioned #24230 as a possible cause of a broken criu, but that looks distinctly different. That's why I mentioned you specifically.

@Luap99
Member

Luap99 commented Oct 16, 2024

This PR doesn't change any podman code, only our own tests, so there is no way this breaks the cockpit tests unless you somehow depend on our tests. I clicked rerun to see if this is indeed just a flake.

@martinpitt
Contributor

Passed now, thanks @Luap99 for retrying!

I can reproduce the failure if I remove the iptables-legacy package from my system. Neither podman nor criu depends on it. Still a mystery why criu would only sometimes fail that way, or why iptables-legacy would only sometimes be installed... I'll check whether that's somehow different on F40 or without podman-next, and then file a bugzilla. (Tomorrow, I'm EOD today.)
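For reference, a hedged sketch of that reproduction (package, container name, and image are assumptions; the dump.log path follows the journal message quoted earlier):

# Provoke the checkpoint failure by removing the legacy iptables tools that
# CRIU's network-lock step execs (Fedora, run as root; criu must be installed).
dnf -y remove iptables-legacy    # takes iptables-restore / ip6tables-restore with it

podman run -d --name cr-test quay.io/libpod/testimage:20241011 top
if ! podman container checkpoint cr-test; then
    ctr_id=$(podman inspect --format '{{.Id}}' cr-test)
    # CRIU leaves its log in the container's userdata directory
    cat "/var/lib/containers/storage/overlay-containers/$ctr_id/userdata/dump.log"
fi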

...for debugging containers#24147, because "md5sum mismatch" is not
the best way to troubleshoot bytestream differences.

socat is run inside the container, so this requires building a
new testimage (20241011). Bump to new CI VMs[1] which include it.

 [1] containers/automation_images#389

Signed-off-by: Ed Santiago <santiago@redhat.com>
I'm assuming this was buildah#5595: the COMMENT field moved around.
Deal with it, and add a few more checks while we're at it.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@edsantiago edsantiago marked this pull request as ready for review October 16, 2024 16:18
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label Oct 16, 2024
@edsantiago
Member Author

Done. The VM bump is IMO low risk: no significant changes since yesterday's VM bump (#24270).

@martinpitt
Contributor

Same criu checkpoint failure just happened in #24300, so it's confirmed to be independent of this PR. This is fallout from containers-common-extra dropping its iptables dependency and netavark now depending on nftables. That uncovered a missing iptables dependency in criu, which I reported as https://bugzilla.redhat.com/show_bug.cgi?id=2319310 (it has already affected RHEL 10 for a while).

I added a workaround in cockpit-project/cockpit-podman#1883 which will hopefully silence this particular piece of noise.
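For reference, one way to confirm the dependency gap described above on an affected system (a sketch; exact package names can vary between Fedora releases):

# Nothing in the container stack pulls in the legacy iptables tools anymore...
rpm -q --whatrequires iptables-legacy
# ...yet criu does not declare a dependency on the binaries it execs at runtime.
rpm -q --requires criu | grep -i iptables || echo "criu declares no iptables requirement"
command -v iptables-restore || echo "iptables-restore not installed"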

Member

@Luap99 Luap99 left a comment


LGTM

Member

@giuseppe giuseppe left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm label Oct 17, 2024
Contributor

openshift-ci bot commented Oct 17, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, giuseppe, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Luap99,edsantiago,giuseppe]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 7d5377b into containers:main Oct 17, 2024
86 checks passed
@edsantiago edsantiago deleted the pasta-bytecheck branch October 17, 2024 11:19
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR label Jan 16, 2025
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Jan 16, 2025