
System tests: new checkpoint test #11957

Merged 1 commit into containers:main on Nov 18, 2021

Conversation

@edsantiago (Member) commented Oct 13, 2021

Includes a test for the stdout-goes-away bug (crun #756).

Skip on Ubuntu due to a many-months-old kernel bug that
keeps getting fixed and then un-fixed.

Signed-off-by: Ed Santiago <santiago@redhat.com>

@openshift-ci bot (Contributor) commented Oct 13, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Oct 13, 2021
@edsantiago (Member Author):

Side note: it is unlikely that system tests would've caught #11911, because it would never have occurred to me to run two checkpoint/restores. It's still a good idea to test this in gating, though.

@rhatdan (Member) commented Oct 13, 2021

Yikes, a few sleeps...

@edsantiago (Member Author):

I know. I'm such a hypocrite. (PS: can you think of a better way to check for sane output?)

@rhatdan (Member) commented Oct 13, 2021

I don't understand the code well enough to guess. Can we just check the time and make sure the logs are after a certain time?

@rhatdan (Member) commented Oct 13, 2021

podman logs --since ...?

@edsantiago (Member Author):

I guess I'll think about new ways to do this. Maybe reduce the sleep 1 in the loop to something smaller, and use finer-grain timestamps. That should work...
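Something like this, roughly (not the final code; the container name, the 0.1s interval, and the assumptions that the test image's date supports %N, its sleep accepts fractional seconds, and podman logs --since accepts an ISO-8601/RFC3339 timestamp are all illustrative):

```bash
# Rough sketch only, not this PR's test code.
run_podman run -d --name ckpt_probe $IMAGE \
    sh -c 'while :; do date +%s.%N; sleep 0.1; done'

run_podman container checkpoint ckpt_probe
restored_at=$(date --iso-8601=seconds)       # wall-clock moment of the restore
run_podman container restore ckpt_probe

sleep 0.5                                    # give the restored container time to log
run_podman logs --since "$restored_at" ckpt_probe
[ -n "$output" ]                             # expect fresh log lines after restore

run_podman rm -f ckpt_probe
```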

@edsantiago (Member Author):

The subsecond-resolution work is much harder than I expected it to be. I will not be able to finish it this week.

In the meantime, though, I'm wondering if this PR is worth pursuing -- it looks like checkpoint doesn't work on Ubuntu?

# podman container checkpoint bcd9878ba9cb9ab872cc046d246a21e0e05f723ae126a1b630456454f81a8790
CRIU checkpointing failed -52
Please check CRIU logfile /var/lib/containers/storage/overlay-containers/bcd9878ba9cb9ab872cc046d246a21e0e05f723ae126a1b630456454f81a8790/userdata/dump.log

Error: `/usr/bin/crun checkpoint --image-path /var/lib/containers/storage/overlay-containers/bcd9878ba9cb9ab872cc046d246a21e0e05f723ae126a1b630456454f81a8790/userdata/checkpoint --work-path /var/lib/containers/storage/overlay-containers/bcd9878ba9cb9ab872cc046d246a21e0e05f723ae126a1b630456454f81a8790/userdata bcd9878ba9cb9ab872cc046d246a21e0e05f723ae126a1b630456454f81a8790` failed: exit status 1
[ rc=125 (** EXPECTED 0 **) ]

@rhatdan (Member) commented Oct 14, 2021

@adrianreber WDYT?

@adrianreber (Collaborator):

> @adrianreber WDYT?

I think this is a very good idea. I was not aware of this test suite.

@edsantiago About the failed Ubuntu test: can you get the mentioned log file? If I can get a look at it, I can probably help you.

@edsantiago (Member Author):

@adrianreber sorry, this is an ephemeral CI system. Logs are gone. And I personally have no access to Ubuntu anywhere, so I have no way of reproducing manually.

@adrianreber (Collaborator):

Okay. I have an Ubuntu VM locally. I can try your steps to see if I can reproduce it.

@adrianreber (Collaborator):

Not reproducible on my system.

@cevich you mentioned it is possible to get access to the CI system. Can you get me onto the failed Ubuntu system?

@cevich (Member) commented Oct 14, 2021

> @cevich you mentioned it is possible to get access to the CI system. Can you get me onto the failed Ubuntu system?

Not that specific failed instance, but I can get you a VM identical to it.

@adrianreber (Collaborator):

> @cevich you mentioned it is possible to get access to the CI system. Can you get me onto the failed Ubuntu system?
>
> Not that specific failed instance, but I can get you a VM identical to it.

Sounds good.

@cevich (Member) commented Oct 14, 2021

@adrianreber I tried pinging you on IRC. I have a VM for you based on this PR. E-mail me your ssh public key and ping me on IRC, and I'll get you into it.

@cevich (Member) commented Oct 14, 2021

Maybe IRC is broken. 35.223.76.159 should let you in as root now. There's a /root/ci_env.sh you can run; it will drop you into a shell where you will effectively be Cirrus-CI. Please let me know when you're done so I can manually remove the VM.

@adrianreber (Collaborator):

@cevich I can access the VM. Thanks.

@edsantiago The problem is the kernel. Ubuntu was carrying a non-upstream patch (for shiftfs) which broke CRIU on overlayfs. This is, however, fixed. Some information can be found in checkpoint-restore/criu#1316.

I know that the latest kernel in 20.04 fixes it. Not sure about 21.04, as used in this case.

@cevich Can we upgrade the kernel using apt-get and reboot?

@cevich (Member) commented Oct 14, 2021

> @cevich Can we upgrade the kernel using apt-get and reboot?

As a test in GCP, yes you sure can. In CI at runtime, I'm afraid not.

However, if there is a kernel update that fixes it, you may be in luck, since I just built a set of fresh VM images: #11972. Specifically, the .cirrus.yml change to IMAGE_SUFFIX: "c4979650947448832" is all you need. Just be aware that there might be some new gremlins hiding in the new images - I haven't looked closely.

@adrianreber (Collaborator):

> @cevich Can we upgrade the kernel using apt-get and reboot?
>
> As a test in GCP, yes you sure can. In CI at runtime, I'm afraid not.

I was able to update the kernel to 5.11.0-1020-gcp and now checkpointing works. So the Ubuntu image needs to be updated to have the latest kernel and then the failing tests should work.

@adrianreber (Collaborator):

@cevich You can remove the VM again.

@cevich (Member) commented Oct 14, 2021

> @cevich You can remove the VM again.

Done, thanks for letting me know.

> I was able to update the kernel to 5.11.0-1020-gcp and now checkpointing works. So the Ubuntu image needs to be updated to have the latest kernel and then the failing tests should work.

Great, so we have some easy fixes then:

  1. Wait: the new Ubuntu image will be brought online via "Bump Fedora to release 35" (#11795), then you just rebase this. I have no idea how long it will take to get tests passing in that PR; it could be weeks, it could be days.
  2. Update the images in this PR to IMAGE_SUFFIX: "c4979650947448832" in .cirrus.yml and hope nothing else breaks (it shouldn't).
  3. Wait a few weeks until I can bring in Ubuntu 21.10 (which will probably monkey-wrench the whole works anyway).

I'm fine with option 2 - my speculative testing of it today eventually gave a 100% pass of the podman tests. Really the only danger is that we uncover something new via #11795 shortly thereafter. But that's not a concern of this PR.

@edsantiago added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Oct 18, 2021
@edsantiago (Member Author):

@cevich, @adrianreber, tests now passing on Ubuntu with your suggested .cirrus.yml fix. Thank you.

@containers/podman-maintainers PTAL, I've significantly reworked it all from the first iteration. But DO NOT MERGE. This cannot merge until someone fixes #9752, because podman container restore triggers that a lot and we will end up with frequent flakes.

@adrianreber (Collaborator):

> Update: I just tested using @cevich's ubuntu 2110 VM, and it failed again with the same -52 error:
>
> # # podman container checkpoint ddfdbfe64d6ba0680490ec6f45300c8713df8171e922dd807d224e49efd72255
> # CRIU checkpointing failed -52.  Please check CRIU logfile /var/lib/containers/storage/overlay-containers/ddfdbfe64d6ba0680490ec6f45300c8713df8171e922dd807d224e49efd72255/userdata/dump.log
> # Error: `/usr/bin/crun checkpoint --image-path /var/lib/containers/storage/overlay-containers/ddfdbfe64d6ba0680490ec6f45300c8713df8171e922dd807d224e49efd72255/userdata/checkpoint --work-path /var/lib/containers/storage/overlay-containers/ddfdbfe64d6ba0680490ec6f45300c8713df8171e922dd807d224e49efd72255/userdata ddfdbfe64d6ba0680490ec6f45300c8713df8171e922dd807d224e49efd72255` failed: exit status 1
> # [ rc=125 (** EXPECTED 0 **) ]
>
> Package versions: kernel 5.13.0-1005-gcp, criu-3.16.1-1-amd64, crun-100:1.3-1-amd64
>
> When we last ran into this problem @adrianreber suggested that this was fixed in kernel 5.11.something.
>
> I need guidance here: I know nothing about checkpointing; all I did was write these tests because we didn't have any. Obviously it doesn't work reliably, and I'm nervous about introducing new flakes into CI or into RHEL. Should I just close this PR?

It seems like 21.10 again has the broken non-upstream kernel patch which breaks overlayfs and CRIU. It seems to be fixed in 21.04 and 20.04, but it is not fixed in 21.10.

As long as Ubuntu has the broken shiftfs patch that breaks overlayfs from CRIU's point of view, it is not possible to run checkpoint/restore tests on overlayfs. The checkpoint test would work on 20.04 and anything not Ubuntu.

@adrianreber (Collaborator):

Here is the link to the Ubuntu bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1857257

Fixed in 20.04 and 21.04 but apparently re-introduced in 21.10.

@cevich (Member) commented Nov 12, 2021

Given this may eventually be fixed at some unpredictable moment, it may not be safe to just skip based on "Ubuntu".
@edsantiago Adrian pointed to this reproducer (same bug report) as a way the tests could discover whether the bug/patch is present or not. Maybe that could be (somewhat) easily transformed into a podman-command conditional for skipping the test?
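Something along these lines, maybe (completely hypothetical sketch -- the helper name, the throwaway-container probe, and the cleanup details are not anything this PR implements; it assumes the system-test helpers run_podman, $PODMAN, $IMAGE, and bats skip()):

```bash
# Hypothetical probe: try one checkpoint and skip the real tests if it
# fails the way the shiftfs-patched Ubuntu kernels do.
function skip_if_checkpoint_broken() {
    run_podman run -d --name criu_probe $IMAGE sleep 60

    # plain bats 'run' so a checkpoint failure doesn't abort the test
    run $PODMAN container checkpoint criu_probe
    local probe_rc=$status

    run_podman rm -f criu_probe

    if [[ $probe_rc -ne 0 ]]; then
        skip "checkpoint/restore appears broken on this kernel (probe rc=$probe_rc)"
    fi
}
```

The probe costs a container start, so presumably it would run once per test file rather than per test.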

@cevich (Member) commented Nov 12, 2021

Alternative idea: I could run a script at image-build time, and place a marker file somewhere to indicate if it's a "known-broken" environment.
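The test-side check could then stay tiny; the marker path below is purely hypothetical (whatever the image-build script decides to write):

```bash
# Purely hypothetical marker path, written by an image-build script
# on environments known to have the broken shiftfs/CRIU kernel.
if [[ -e /etc/ci-criu-known-broken ]]; then
    skip "VM image is marked as having the broken shiftfs/CRIU kernel"
fi
```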

Commit:

Includes a test for the stdout-goes-away bug (crun containers#756).

Skip on Ubuntu due to a many-months-old kernel bug that
keeps getting fixed and then un-fixed.

Signed-off-by: Ed Santiago <santiago@redhat.com>

@cevich (Member) commented Nov 17, 2021

🎉

@edsantiago (Member Author):

> 🎉

Well... not quite. One of the logs is incomplete: the last line is 224, but it should be 282. (I don't know if you see the big red warning; it might be produced by my greasemonkey). Anyhow, this is not a complete log. I don't know if tests succeeded or not. @cevich do you have any idea how to look into that?

@cevich (Member) commented Nov 17, 2021

Oh! That's new. Those outputs stream to google-cloud-storage objects; the Cirrus-CI agent shuffles the bits back and forth. There is a warning up on the cloud status page:

> Global: CPU and memory utilization metrics are intermittently missing for Google Kubernetes Engine (GKE).
> This issue does not affect auto-scaling capabilities. Short term mitigation is available.

Cirrus-CI runs all its services there, so possibly that's in play, but that's only a guess. The easiest thing is to just re-run the task, and if it happens again I can ask their support to investigate.

@edsantiago changed the title from "WIP - System tests: new checkpoint test" to "System tests: new checkpoint test" on Nov 17, 2021
@openshift-ci bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress) on Nov 17, 2021
@edsantiago (Member Author):

Followup: I found the colorized logs, and they confirm that the test ran to completion (and passed).

I'm having a very hard time understanding what the above status message could possibly have to do with truncated logs; so much so that I'm just going to stick my head in the sand. I'm deeply concerned about truncated log files, but not enough to fiddle with Cirrus. I'm just going to make a note and, if I see the problem recur, worry about it then.

@edsantiago (Member Author):

@containers/podman-maintainers PTAL. I had been hoping to wait for #11795 to merge first, but recent developments (parallel test development efforts by @rst0git ) force me to bump this up in priority.

@edsantiago removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Nov 17, 2021
@cevich (Member) commented Nov 17, 2021

> I'm deeply concerned about truncated log files, but not enough to fiddle with Cirrus

The google-status thing isn't an explicit indicator; it just rings warning bells for me. If something's failing in GKE (where all Cirrus services run), it's quite possible it will have permanent/transient effects on VM and/or output handling. In any case, the re-run passed and the logs are complete, so this must have been just a "technical hiccup".

Yes, please do let me know if you notice this again.

@rhatdan (Member) commented Nov 18, 2021

Is this ready to merge?

@edsantiago (Member Author):

From my perspective it's ready. I would like eyeballs on it, though, particularly from those who understand checkpointing.

@adrianreber (Collaborator):

Looks good from my side.

if is_rootless; then
# ...however, is that a genuine cast-in-stone limitation, or one
# that can some day be fixed? If one day some PR removes that
# restriction, fail loudly here, so the developer can enable tests.
@rst0git (Contributor) commented Nov 18, 2021

There is a pull request that aims to enable rootless checkpoint/restore (checkpoint-restore/criu#1155).
It is still work in progress, but CAP_CHECKPOINT_RESTORE has been merged.

@edsantiago (Member Author):

Excellent! Thank you! I expect that this test will fail then, in one of our update-CI-VM PRs. At that time, we'll need to figure out a different conditional (because, I'm sure, some VMs will have the feature and some will not).
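Purely speculative, but the conditional might end up looking something like this (the kernel-5.9 threshold for CAP_CHECKPOINT_RESTORE is an assumption, and a real check would also need to probe CRIU and the OCI runtime):

```bash
# Speculative only: kernel gate for CAP_CHECKPOINT_RESTORE (added in 5.9).
if is_rootless; then
    min_kernel=5.9
    lowest=$(printf '%s\n%s\n' "$min_kernel" "$(uname -r)" | sort -V | head -n1)
    if [[ $lowest != "$min_kernel" ]]; then
        skip "rootless checkpoint assumed to need kernel >= $min_kernel (CAP_CHECKPOINT_RESTORE)"
    fi
fi
```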

# enough to cause a gap in the timestamps in the log. But checkpoint
# doesn't seem to work like that: upon restore, even if we sleep a long
# time, the newly-started container seems to pick back up close to
# where it left off. (Maybe it's something about /proc/uptime?)
@rst0git (Contributor):

I'm not sure if this answers your question, but CRIU supports checkpoint/restore of time namespace.
crun supports timens (containers/crun#743), but runc does not yet (opencontainers/runc#2345).

@edsantiago (Member Author):

@rst0git it will take me some time to figure out if that explains what I was seeing, but it looks very likely and is a great lead. Thank you!

@rhatdan (Member) commented Nov 18, 2021

/lgtm

@openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged) on Nov 18, 2021
@openshift-merge-robot merged commit c26af00 into containers:main on Nov 18, 2021
@github-actions bot added the "locked - please file new issue/PR" label (Assist humans wanting to comment on an old issue or PR with locked comments) on Sep 22, 2023
@github-actions bot locked as resolved and limited conversation to collaborators on Sep 22, 2023