tests/int/cpt: fix lazy-pages flakiness

"checkpoint --lazy-pages and restore" test sometimes fails on restore
in our CI on Fedora 33 when systemd cgroup driver is used:

> (00.076104) Error (compel/src/lib/infect.c:1513): Task 48521 is in unexpected state: f7f
> (00.076122) Error (compel/src/lib/infect.c:1520): Task stopped with 15: Terminated
> ...
> (00.078246) Error (criu/cr-restore.c:2483): Restoring FAILED.

I think what happens is

1. The test runs runc checkpoint in lazy-pages mode in background.

2. The test runs criu lazy-pages in background.

3. The test runs runc restore.

Now, all three are working together: criu restore restores, criu
lazy-pages listens for page faults on a uffd and fetches missing pages
from runc checkpoint, which serves those pages.

At some point criu lazy-pages decides to fetch the rest of the pages,
and once it's done it exits, and runc checkpoint, as there are no more
pages to serve, exits too.

At the end of runc checkpoint the container is removed (see "defer
destroy(container)" in checkpoint.go). This involves a call to
cgroupManager.Destroy, which, in case systemd manager is used, calls
stopUnit, which makes systemd not just remove the unit but also send
SIGTERM to its processes, if there are any.

As the container is being restored into the same systemd unit,
sometimes this results in sending SIGTERM to a process which criu
restores, and thus restoring fails.

The remedy here is to change the name of the systemd unit to which the
container is restored.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
szuecs · Apr 1, 2021 · 36fe3cc
1 parent 2dd62b3
Showing 1 changed file with 3 additions and 1 deletion.
diff --git a/tests/integration/checkpoint.bats b/tests/integration/checkpoint.bats
@@ -211,11 +211,13 @@ function simple_cr() {
 	lp_pid=$!
 
 	# Restore lazily from checkpoint.
-	# The restored container needs a different name as the checkpointed
+	# The restored container needs a different name (as well as systemd
+	# unit name, in case systemd cgroup driver is used) as the checkpointed
 	# container is not yet destroyed. It is only destroyed at that point
 	# in time when the last page is lazily transferred to the destination.
 	# Killing the CRIU on the checkpoint side will let the container
 	# continue to run if the migration failed at some point.
+	[ -n "$RUNC_USE_SYSTEMD" ] && set_cgroups_path
 	runc_restore_with_pipes ./image-dir test_busybox_restore --lazy-pages
 
 	wait $cpt_pid
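
To illustrate why a fresh cgroups path avoids the race, here is a minimal sketch assuming runc's documented `slice:prefix:name` form for systemd `cgroupsPath`, which maps to a scope unit named `prefix-name.scope`. The helpers `make_cgroups_path` and `systemd_unit_name` are hypothetical names for this illustration only; they are not the real `set_cgroups_path` from the test helpers, whose implementation is not shown in this diff.

```shell
# Hypothetical helper (not runc's set_cgroups_path): build a systemd-style
# cgroupsPath of the form "slice:prefix:name". The suffix is passed in so
# the result is deterministic; it defaults to the shell's PID.
make_cgroups_path() {
	printf 'machine.slice:runc-cgroups:integration-test-%s\n' "${1:-$$}"
}

# Hypothetical helper: derive the scope unit name that the
# "slice:prefix:name" convention maps to, i.e. "prefix-name.scope".
systemd_unit_name() {
	_rest=${1#*:} # drop the leading "slice:" component
	printf '%s-%s.scope\n' "${_rest%%:*}" "${_rest#*:}"
}

checkpointed=$(make_cgroups_path cpt)
restored=$(make_cgroups_path restore)

# Two different paths yield two different units, so stopping the
# checkpointed container's unit does not touch the restored processes.
echo "checkpointed unit: $(systemd_unit_name "$checkpointed")"
echo "restored unit:     $(systemd_unit_name "$restored")"
```

With distinct unit names, the `stopUnit` call made when the checkpointed container is destroyed targets only the old unit, which is the effect the one-line change in the test relies on.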