System tests: new checkpoint test #11957
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: edsantiago
Side note: it is unlikely that system tests would've caught #11911, because it would never have occurred to anyone to run two checkpoint/restores. It's still a good idea to test this in gating, though.
Yikes, a few sleeps...
I know. I'm such a hypocrite. (PS can you think of a better way to check for sane output?)
I don't understand the code well enough to guess. Can we just check the time and make sure the logs are after a certain time?
podman logs --since ...?
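The "make sure the logs are after a certain time" idea can be sketched as below. This is only an illustration under stated assumptions: the helper name `all_lines_after` is made up, and each log line is assumed to start with a Unix epoch timestamp. Real `podman logs --timestamps` output prints human-readable timestamps rather than epoch seconds, so it would need parsing first.

```bash
# Hypothetical helper (not part of the podman test suite).
# Reads log lines on stdin; each line is assumed to begin with an
# epoch timestamp. Succeeds only if there is at least one line and
# every timestamp is >= the cutoff passed as $1.
all_lines_after() {
    awk -v cutoff="$1" '
        $1 + 0 < cutoff + 0 { bad = 1 }      # a line older than the cutoff
        END { exit (bad || NR == 0) }         # empty input also counts as failure
    '
}
```

A test could record the cutoff with `date +%s` just before the restore and then pipe the container's output through the helper.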
I guess I'll think about new ways to do this. Maybe reduce the |
The subsecond-resolution work is much harder than I expected it to be. I will not be able to finish it this week. ITM, though, I'm wondering if this PR is worth pursuing -- it looks like checkpoint doesn't work on ubuntu?
@adrianreber WDYT?
I think this is a very good idea. I was not aware of this test suite. @edsantiago About the failed Ubuntu test: can you get the mentioned log file? If I can take a look at it, I can probably help you.
@adrianreber sorry, this is an ephemeral CI system. Logs are gone. And I personally have no access to Ubuntu anywhere, so I have no way of reproducing manually.
Okay. I have an Ubuntu VM locally. I can try your steps to see if I can reproduce it.
Not reproducible on my system. @cevich you mentioned it is possible to get access to the CI system. Can you get me onto the failed Ubuntu system?
Not that specific failed instance, but I can get you a VM identical to it.
Sounds good.
@adrianreber I tried pinging you on IRC. I have a VM for you based on this PR. E-mail me your ssh public key, and ping me on IRC, I'll get you into it.
Maybe IRC is broken. 35.223.76.159 should let you in as root now. There's a |
@cevich I can access the VM. Thanks. @edsantiago The problem is the kernel. Ubuntu was carrying a non-upstream patch (for shiftfs) which broke CRIU on overlayfs. This is, however, fixed. Some information can be found in checkpoint-restore/criu#1316. I know that the latest kernel in 20.04 fixes it. Not sure about 21.04, as in this case. @cevich Can we upgrade the kernel using apt-get and reboot?
As a test in GCP, yes you sure can. In CI at runtime, I'm afraid not. However if there is a kernel update that fixes it, you may be in luck since I just built a set of fresh VM images: #11972 Specifically the |
I was able to update the kernel to 5.11.0-1020-gcp and now checkpointing works. So the Ubuntu image needs to be updated to have the latest kernel and then the failing tests should work.
@cevich You can remove the VM again.
Done, thanks for letting me know.
Great, so we have some easy fixes then:
I'm fine with option-2 - my speculative testing of it today was 100% pass (eventually) of podman tests. Really the only danger is that we uncover something new via 11795 shortly thereafter. But that's not a concern of this PR.
@cevich, @adrianreber, tests now passing on Ubuntu with your suggested @containers/podman-maintainers PTAL, I've significantly reworked it all from the first iteration. But DO NOT MERGE. This cannot merge until someone fixes #9752, because |
It seems like 21.10 again has the broken non-upstream kernel patch which breaks overlayfs and CRIU. It seems to be fixed in 21.04 and 20.04, but it is not fixed in 21.10. As long as Ubuntu has the broken shiftfs patch that breaks overlayfs from CRIU's point of view, it is not possible to run checkpoint/restore tests on overlayfs. The checkpoint test would work on 20.04 and anything not Ubuntu.
This is the link to the Ubuntu bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1857257. Fixed in 20.04 and 21.04 but apparently re-introduced in 21.10.
Given this may eventually be fixed at some unpredictable moment, it may not be safe to just skip based on "Ubuntu".
Alternative idea: I could run a script at image-build time, and place a marker file somewhere to indicate if it's a "known-broken" environment.
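A minimal sketch of the marker-file idea, for illustration only. The marker path, the environment variable `CHECKPOINT_BROKEN_MARKER`, and the function name are all hypothetical; the thread does not say where such a file would live or what it would be called.

```bash
# Hypothetical marker path, assumed to be written by an image-build
# script when the kernel is known to break CRIU on overlayfs.
CHECKPOINT_BROKEN_MARKER=${CHECKPOINT_BROKEN_MARKER:-/etc/ci/checkpoint-known-broken}

# Return 0 (true) if the image was flagged as known-broken at build time.
checkpoint_known_broken() {
    [ -e "$CHECKPOINT_BROKEN_MARKER" ]
}

# Print a skip reason when the marker is present; print nothing otherwise.
checkpoint_skip_reason() {
    if checkpoint_known_broken; then
        echo "checkpoint/restore is known-broken on this image"
    fi
}
```

Keying the skip on a file produced at image-build time, rather than on the distro name, means the test re-enables itself as soon as a rebuilt image no longer writes the marker.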
Includes a test for the stdout-goes-away bug (crun containers#756). Skip on Ubuntu due to a many-months-old kernel bug that keeps getting fixed and then un-fixed. Signed-off-by: Ed Santiago <santiago@redhat.com>
🎉
Well... not quite. One of the logs is incomplete: the last line is 224, but it should be 282. (I don't know if you see the big red warning; it might be produced by my greasemonkey). Anyhow, this is not a complete log. I don't know if tests succeeded or not. @cevich do you have any idea how to look into that?
Oh! That's new. That output streams to google-cloud-storage objects; the Cirrus-CI agent shuffles the bits back and forth. There is a warning up on the cloud status page:
Cirrus-CI does run all their services there, so possibly that's in play, but that's only a guess. The easiest thing is to just re-run the task, and if it happens again I can ask their support to investigate.
Followup: I found the colorized logs, and they confirm that the test ran to completion (and passed). I'm having a very hard time understanding what the above status message could possibly have to do with truncated logs; so much so that I'm just going to stick my head in the sand. I'm deeply concerned about truncated log files, but not enough to fiddle with Cirrus. I'm just going to make a note and, if I see the problem recur, worry about it then.
The google-status thing isn't an explicit indicator, it just rings warning bells for me. If something's failing in GKE (where all Cirrus services run), it's easily possible that will have permanent/transient effects on VM and/or output handling. In any case, the re-run passed and logs are complete, so this must have been just a "technical hiccup". Yes, please do let me know if you notice this again.
Is this ready to merge?
From my perspective it's ready. I would like eyeballs on it, though, particularly from those who understand checkpointing.
Looks good from my side.
if is_rootless; then
    # ...however, is that a genuine cast-in-stone limitation, or one
    # that can some day be fixed? If one day some PR removes that
    # restriction, fail loudly here, so the developer can enable tests.
There is a pull request that aims to enable rootless checkpoint/restore (checkpoint-restore/criu#1155). It is still a work in progress, but CAP_CHECKPOINT_RESTORE has been merged.
Excellent! Thank you! I expect that this test will fail then, in one of our update-CI-VM PRs. At that time, we'll need to figure out a different conditional (because, I'm sure, some VMs will have the feature and some will not).
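The "fail loudly" guard discussed in this review thread could look roughly like the sketch below. It is not the code from the PR: `try_checkpoint` and `check_rootless_guard` are hypothetical stand-ins so the example runs on its own, whereas a real test would invoke podman through the suite's own helpers inside the `if is_rootless` branch.

```bash
# Hypothetical stand-in for an actual checkpoint attempt. Here it
# always fails, mimicking the rootless limitation described above.
try_checkpoint() { return 1; }

# Skip while rootless checkpoint still fails as expected; fail with a
# nonzero status if it unexpectedly starts working, so that someone
# notices and enables the real tests.
check_rootless_guard() {
    if try_checkpoint; then
        echo "FAIL: rootless checkpoint succeeded; time to enable these tests"
        return 1
    fi
    echo "skip: checkpoint does not work rootless"
}
```

The point of the design is that the skip is conditional on observing the failure, so a silent, permanent skip cannot outlive the limitation it was written for.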
# enough to cause a gap in the timestamps in the log. But checkpoint
# doesn't seem to work like that: upon restore, even if we sleep a long
# time, the newly-started container seems to pick back up close to
# where it left off. (Maybe it's something about /proc/uptime?)
I'm not sure if this answers your question, but CRIU supports checkpoint/restore of time namespaces. crun supports timens (containers/crun#743), but runc does not yet (opencontainers/runc#2345).
@rst0git it will take me some time to figure out if that explains what I was seeing, but it looks very likely and is a great lead. Thank you!
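The timestamp-gap check that the code comment in this thread describes can be sketched as follows. This is an illustration, not the test's actual logic: the helper name `has_gap` is made up, and the input is assumed to be one integer timestamp per line. As the comment notes, a restored container may resume close to where it left off, in which case a check like this would report no gap.

```bash
# Hypothetical helper: read one integer timestamp per line on stdin
# and succeed if any two consecutive values differ by at least $1.
has_gap() {
    awk -v min="$1" '
        NR > 1 && $1 - prev >= min { found = 1 }   # jump between neighbors
        { prev = $1 }
        END { exit !found }                          # 0 only if a gap was seen
    '
}
```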
/lgtm