
System tests: new checkpoint test #11957

Merged 1 commit into containers:main on Nov 18, 2021

Conversation

@edsantiago (Member) commented Oct 13, 2021

Includes a test for the stdout-goes-away bug (crun #756).

Skip on Ubuntu due to a many-months-old kernel bug that
keeps getting fixed and then un-fixed.

Signed-off-by: Ed Santiago <santiago@redhat.com>

@openshift-ci bot (Contributor) commented Oct 13, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Oct 13, 2021
@edsantiago (Member Author):

Side note: it is unlikely that system tests would've caught #11911, because it would never have occurred to me to run two checkpoint/restores. It's still a good idea to test this in gating, though.

@rhatdan (Member) commented Oct 13, 2021

Yikes, a few sleeps...

@edsantiago (Member Author):

I know. I'm such a hypocrite. (PS: can you think of a better way to check for sane output?)

@rhatdan (Member) commented Oct 13, 2021

I don't understand the code well enough to guess. Can we just check the time and make sure the logs are after a certain time?

@rhatdan (Member) commented Oct 13, 2021

podman logs --since ...?

@edsantiago (Member Author):

I guess I'll think about new ways to do this. Maybe reduce the sleep 1 in the loop to something smaller, and use finer-grain timestamps. That should work...
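Something like this, roughly (not the final code; the container name, the 0.1s interval, and the assumptions that the test image's date supports %N, its sleep accepts fractional seconds, and podman logs --since accepts an ISO-8601/RFC3339 timestamp are all illustrative):

```bash
# Rough sketch only, not this PR's test code.
run_podman run -d --name ckpt_probe $IMAGE \
    sh -c 'while :; do date +%s.%N; sleep 0.1; done'

run_podman container checkpoint ckpt_probe
restored_at=$(date --iso-8601=seconds)       # wall-clock moment of the restore
run_podman container restore ckpt_probe

sleep 0.5                                    # give the restored container time to log
run_podman logs --since "$restored_at" ckpt_probe
[ -n "$output" ]                             # expect fresh log lines after restore

run_podman rm -f ckpt_probe
```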

@edsantiago (Member Author):

The subsecond-resolution work is much harder than I expected it to be. I will not be able to finish it this week.

In the meantime, though, I'm wondering if this PR is worth pursuing -- it looks like checkpoint doesn't work on Ubuntu?

# podman container checkpoint bcd9878ba9cb9ab872cc046d246a21e0e05f723ae126a1b630456454f81a8790
CRIU checkpointing failed -52
Please check CRIU logfile /var/lib/containers/storage/overlay-containers/bcd9878ba9cb9ab872cc046d246a21e0e05f723ae126a1b630456454f81a8790/userdata/dump.log

Error: `/usr/bin/crun checkpoint --image-path /var/lib/containers/storage/overlay-containers/bcd9878ba9cb9ab872cc046d246a21e0e05f723ae126a1b630456454f81a8790/userdata/checkpoint --work-path /var/lib/containers/storage/overlay-containers/bcd9878ba9cb9ab872cc046d246a21e0e05f723ae126a1b630456454f81a8790/userdata bcd9878ba9cb9ab872cc046d246a21e0e05f723ae126a1b630456454f81a8790` failed: exit status 1
[ rc=125 (** EXPECTED 0 **) ]

@rhatdan (Member) commented Oct 14, 2021

@adrianreber WDYT?

@adrianreber (Collaborator):

> @adrianreber WDYT?

I think this is a very good idea. I was not aware of this test suite.

@edsantiago About the failed Ubuntu test: can you get the mentioned log file? If I can get a look at it, I can probably help you.

@edsantiago (Member Author):

@adrianreber sorry, this is an ephemeral CI system. Logs are gone. And I personally have no access to Ubuntu anywhere, so I have no way of reproducing manually.

@adrianreber (Collaborator):

Okay. I have an Ubuntu VM locally. I can try your steps to see if I can reproduce it.

@adrianreber (Collaborator):

Not reproducible on my system.

@cevich you mentioned it is possible to get access to the CI system. Can you get me onto the failed Ubuntu system?

@cevich (Member) commented Oct 14, 2021

> @cevich you mentioned it is possible to get access to the CI system. Can you get me onto the failed Ubuntu system?

Not that specific failed instance, but I can get you a VM identical to it.

@adrianreber (Collaborator):

> @cevich you mentioned it is possible to get access to the CI system. Can you get me onto the failed Ubuntu system?
>
> Not that specific failed instance, but I can get you a VM identical to it.

Sounds good.

@cevich (Member) commented Oct 14, 2021

@adrianreber I tried pinging you on IRC. I have a VM for you based on this PR. E-mail me your ssh public key and ping me on IRC, and I'll get you into it.

@cevich (Member) commented Oct 14, 2021

Maybe IRC is broken. 35.223.76.159 should let you in as root now. There's a /root/ci_env.sh you can run; it will drop you into a shell where you will effectively be Cirrus-CI. Please let me know when you're done so I can manually remove the VM.

@adrianreber (Collaborator):

@cevich I can access the VM. Thanks.

@edsantiago The problem is the kernel. Ubuntu was carrying a non-upstream patch (for shiftfs) which broke CRIU on overlayfs. This is, however, fixed. Some information can be found in checkpoint-restore/criu#1316.

I know that the latest kernel in 20.04 fixes it. Not sure about 21.04, as used in this case.

@cevich Can we upgrade the kernel using apt-get and reboot?

@cevich (Member) commented Oct 14, 2021

> @cevich Can we upgrade the kernel using apt-get and reboot?

As a test in GCP, yes you sure can. In CI at runtime, I'm afraid not.

However, if there is a kernel update that fixes it, you may be in luck, since I just built a set of fresh VM images: #11972. Specifically, the .cirrus.yml change to IMAGE_SUFFIX: "c4979650947448832" is all you need. Just be aware that there might be some new gremlins hiding in the new images - I haven't looked closely.

@adrianreber (Collaborator):

> @cevich Can we upgrade the kernel using apt-get and reboot?
>
> As a test in GCP, yes you sure can. In CI at runtime, I'm afraid not.

I was able to update the kernel to 5.11.0-1020-gcp and now checkpointing works. So the Ubuntu image needs to be updated to have the latest kernel and then the failing tests should work.

@adrianreber (Collaborator):

@cevich You can remove the VM again.

@cevich (Member) commented Oct 14, 2021

> @cevich You can remove the VM again.

Done, thanks for letting me know.

> I was able to update the kernel to 5.11.0-1020-gcp and now checkpointing works. So the Ubuntu image needs to be updated to have the latest kernel and then the failing tests should work.

Great, so we have some easy fixes then:

  1. Wait: the new Ubuntu image will be brought online via "Bump Fedora to release 35" (#11795), then you just rebase this. I have no idea how long it will take to get tests passing in that PR; it could be weeks, it could be days.
  2. Update the images in this PR to IMAGE_SUFFIX: "c4979650947448832" in .cirrus.yml and hope nothing else breaks (it shouldn't).
  3. Wait a few weeks until I can bring in Ubuntu 21.10 (which will probably monkey-wrench the whole works anyway).

I'm fine with option 2 - my speculative testing of it today eventually gave a 100% pass of the podman tests. Really the only danger is that we uncover something new via #11795 shortly thereafter. But that's not a concern of this PR.

@edsantiago added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Oct 18, 2021
@edsantiago (Member Author):

@cevich, @adrianreber, tests now passing on Ubuntu with your suggested .cirrus.yml fix. Thank you.

@containers/podman-maintainers PTAL, I've significantly reworked it all from the first iteration. But DO NOT MERGE. This cannot merge until someone fixes #9752, because podman container restore triggers that a lot and we will end up with frequent flakes.

@adrianreber (Collaborator):

> Update: I just tested using @cevich's ubuntu 2110 VM, and it failed again with the same -52 error:
>
> # # podman container checkpoint ddfdbfe64d6ba0680490ec6f45300c8713df8171e922dd807d224e49efd72255
> # CRIU checkpointing failed -52.  Please check CRIU logfile /var/lib/containers/storage/overlay-containers/ddfdbfe64d6ba0680490ec6f45300c8713df8171e922dd807d224e49efd72255/userdata/dump.log
> # Error: `/usr/bin/crun checkpoint --image-path /var/lib/containers/storage/overlay-containers/ddfdbfe64d6ba0680490ec6f45300c8713df8171e922dd807d224e49efd72255/userdata/checkpoint --work-path /var/lib/containers/storage/overlay-containers/ddfdbfe64d6ba0680490ec6f45300c8713df8171e922dd807d224e49efd72255/userdata ddfdbfe64d6ba0680490ec6f45300c8713df8171e922dd807d224e49efd72255` failed: exit status 1
> # [ rc=125 (** EXPECTED 0 **) ]
>
> Package versions: kernel 5.13.0-1005-gcp, criu-3.16.1-1-amd64, crun-100:1.3-1-amd64
>
> When we last ran into this problem @adrianreber suggested that this was fixed in kernel 5.11.something.
>
> I need guidance here: I know nothing about checkpointing; all I did was write these tests because we didn't have any. Obviously it doesn't work reliably, and I'm nervous about introducing new flakes into CI or into RHEL. Should I just close this PR?

It seems like 21.10 again has the broken non-upstream kernel patch which breaks overlayfs and CRIU. It seems to be fixed in 21.04 and 20.04, but it is not fixed in 21.10.

As long as Ubuntu has the broken shiftfs patch that breaks overlayfs from CRIU's point of view, it is not possible to run checkpoint/restore tests on overlayfs. The checkpoint test would work on 20.04 and anything not Ubuntu.

@adrianreber (Collaborator):

Here is the link to the Ubuntu bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1857257

Fixed in 20.04 and 21.04 but apparently re-introduced in 21.10.

@cevich (Member) commented Nov 12, 2021

Given this may eventually be fixed at some unpredictable moment, it may not be safe to just skip based on "Ubuntu".
@edsantiago Adrian pointed to this reproducer (same bug report) as a way the tests could discover whether the bug/patch is present or not. Maybe that could be (somewhat) easily transformed into a podman-command conditional for skipping the test?
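Something along these lines, maybe (completely hypothetical sketch -- the helper name, the throwaway-container probe, and the cleanup details are not anything this PR implements; it assumes the system-test helpers run_podman, $PODMAN, $IMAGE, and bats skip()):

```bash
# Hypothetical probe: try one checkpoint and skip the real tests if it
# fails the way the shiftfs-patched Ubuntu kernels do.
function skip_if_checkpoint_broken() {
    run_podman run -d --name criu_probe $IMAGE sleep 60

    # plain bats 'run' so a checkpoint failure doesn't abort the test
    run $PODMAN container checkpoint criu_probe
    local probe_rc=$status

    run_podman rm -f criu_probe

    if [[ $probe_rc -ne 0 ]]; then
        skip "checkpoint/restore appears broken on this kernel (probe rc=$probe_rc)"
    fi
}
```

The probe costs a container start, so presumably it would run once per test file rather than per test.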

@cevich (Member) commented Nov 12, 2021

Alternative idea: I could run a script at image-build time, and place a marker file somewhere to indicate if it's a "known-broken" environment.
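The test-side check could then stay tiny; the marker path below is purely hypothetical (whatever the image-build script decides to write):

```bash
# Purely hypothetical marker path, written by an image-build script
# on environments known to have the broken shiftfs/CRIU kernel.
if [[ -e /etc/ci-criu-known-broken ]]; then
    skip "VM image is marked as having the broken shiftfs/CRIU kernel"
fi
```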

Commit:

Includes a test for the stdout-goes-away bug (crun containers#756).

Skip on Ubuntu due to a many-months-old kernel bug that
keeps getting fixed and then un-fixed.

Signed-off-by: Ed Santiago <santiago@redhat.com>

@cevich (Member) commented Nov 17, 2021

🎉

@edsantiago (Member Author):

> 🎉

Well... not quite. One of the logs is incomplete: the last line is 224, but it should be 282. (I don't know if you see the big red warning; it might be produced by my greasemonkey). Anyhow, this is not a complete log. I don't know if tests succeeded or not. @cevich do you have any idea how to look into that?

@cevich (Member) commented Nov 17, 2021

Oh! That's new. Those outputs stream to google-cloud-storage objects; the Cirrus-CI agent shuffles the bits back and forth. There is a warning up on the cloud status page:

> Global: CPU and memory utilization metrics are intermittently missing for Google Kubernetes Engine (GKE).
> This issue does not affect auto-scaling capabilities. Short term mitigation is available.

Cirrus-CI runs all its services there, so possibly that's in play, but that's only a guess. The easiest thing is to just re-run the task, and if it happens again I can ask their support to investigate.

@edsantiago changed the title from "WIP - System tests: new checkpoint test" to "System tests: new checkpoint test" on Nov 17, 2021
@openshift-ci bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress) on Nov 17, 2021
@edsantiago (Member Author):

Followup: I found the colorized logs, and they confirm that the test ran to completion (and passed).

I'm having a very hard time understanding what the above status message could possibly have to do with truncated logs; so much so that I'm just going to stick my head in the sand. I'm deeply concerned about truncated log files, but not enough to fiddle with Cirrus. I'm just going to make a note and, if I see the problem recur, worry about it then.

@edsantiago (Member Author):

@containers/podman-maintainers PTAL. I had been hoping to wait for #11795 to merge first, but recent developments (parallel test development efforts by @rst0git ) force me to bump this up in priority.

@edsantiago removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Nov 17, 2021
@cevich (Member) commented Nov 17, 2021

> I'm deeply concerned about truncated log files, but not enough to fiddle with Cirrus

The google-status thing isn't an explicit indicator; it just rings warning bells for me. If something's failing in GKE (where all Cirrus services run), it's quite possible it will have permanent/transient effects on VM and/or output handling. In any case, the re-run passed and the logs are complete, so this must have been just a "technical hiccup".

Yes, please do let me know if you notice this again.

@rhatdan (Member) commented Nov 18, 2021

Is this ready to merge?

@edsantiago (Member Author):

From my perspective it's ready. I would like eyeballs on it, though, particularly from those who understand checkpointing.

@adrianreber (Collaborator):

Looks good from my side.

if is_rootless; then
# ...however, is that a genuine cast-in-stone limitation, or one
# that can some day be fixed? If one day some PR removes that
# restriction, fail loudly here, so the developer can enable tests.
@rst0git (Contributor) commented Nov 18, 2021

There is a pull request that aims to enable rootless checkpoint/restore (checkpoint-restore/criu#1155).
It is still work in progress, but CAP_CHECKPOINT_RESTORE has been merged.

@edsantiago (Member Author):

Excellent! Thank you! I expect that this test will fail then, in one of our update-CI-VM PRs. At that time, we'll need to figure out a different conditional (because, I'm sure, some VMs will have the feature and some will not).
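Purely speculative, but the conditional might end up looking something like this (the kernel-5.9 threshold for CAP_CHECKPOINT_RESTORE is an assumption, and a real check would also need to probe CRIU and the OCI runtime):

```bash
# Speculative only: kernel gate for CAP_CHECKPOINT_RESTORE (added in 5.9).
if is_rootless; then
    min_kernel=5.9
    lowest=$(printf '%s\n%s\n' "$min_kernel" "$(uname -r)" | sort -V | head -n1)
    if [[ $lowest != "$min_kernel" ]]; then
        skip "rootless checkpoint assumed to need kernel >= $min_kernel (CAP_CHECKPOINT_RESTORE)"
    fi
fi
```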

# enough to cause a gap in the timestamps in the log. But checkpoint
# doesn't seem to work like that: upon restore, even if we sleep a long
# time, the newly-started container seems to pick back up close to
# where it left off. (Maybe it's something about /proc/uptime?)
@rst0git (Contributor):

I'm not sure if this answers your question, but CRIU supports checkpoint/restore of time namespace.
crun supports timens (containers/crun#743), but runc does not yet (opencontainers/runc#2345).

@edsantiago (Member Author):

@rst0git it will take me some time to figure out if that explains what I was seeing, but it looks very likely and is a great lead. Thank you!

@rhatdan (Member) commented Nov 18, 2021

/lgtm

@openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged) on Nov 18, 2021
@openshift-merge-robot merged commit c26af00 into containers:main on Nov 18, 2021
@github-actions bot added the "locked - please file new issue/PR" label (Assist humans wanting to comment on an old issue or PR with locked comments) on Sep 22, 2023
@github-actions bot locked as resolved and limited conversation to collaborators on Sep 22, 2023