tests/int/cpt: fix lazy-pages flakiness
The "checkpoint --lazy-pages and restore" test sometimes fails on restore in our CI on Fedora 33 when the systemd cgroup driver is used:

> (00.076104) Error (compel/src/lib/infect.c:1513): Task 48521 is in unexpected state: f7f
> (00.076122) Error (compel/src/lib/infect.c:1520): Task stopped with 15: Terminated
> ...
> (00.078246) Error (criu/cr-restore.c:2483): Restoring FAILED.

I think what happens is:

1. The test runs runc checkpoint in lazy-pages mode in the background.
2. The test runs criu lazy-pages in the background.
3. The test runs runc restore.

Now all three are working together: criu restore restores, criu lazy-pages listens for page faults on a uffd and fetches missing pages from runc checkpoint, which serves those pages.

At some point criu lazy-pages decides to fetch the rest of the pages, and once it's done it exits, and runc checkpoint, as there are no more pages to serve, exits too.

At the end of runc checkpoint the container is removed (see "defer destroy(container)" in checkpoint.go). This involves a call to cgroupManager.Destroy, which, in case the systemd manager is used, calls stopUnit, which makes systemd not just remove the unit, but also send SIGTERM to its processes, if there are any.

As the container is being restored into the same systemd unit, this sometimes results in SIGTERM being sent to a process that criu is restoring, and thus the restore fails.

The remedy here is to change the name of the systemd unit into which the container is restored.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
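The remedy can be illustrated with a minimal sketch. It is not the actual change made to the test. It assumes the systemd cgroup driver's "slice:prefix:name" form of cgroupsPath and assumes the resulting unit is named "prefix-name.scope". The helper names and the "-restore" suffix are hypothetical.

```python
def split_cgroups_path(cgroups_path: str) -> tuple[str, str, str]:
    """Split an assumed "slice:prefix:name" cgroupsPath into its three parts."""
    parts = cgroups_path.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected slice:prefix:name, got {cgroups_path!r}")
    return parts[0], parts[1], parts[2]


def unit_name(cgroups_path: str) -> str:
    """Unit name for a cgroupsPath, assuming the "prefix-name.scope" convention."""
    _slice, prefix, name = split_cgroups_path(cgroups_path)
    return f"{prefix}-{name}.scope"


def restore_cgroups_path(cgroups_path: str, suffix: str = "-restore") -> str:
    """Return a cgroupsPath whose unit name differs from the checkpointed one.

    The suffix is a hypothetical choice; any value that yields a distinct
    unit name keeps the old unit's stop from reaching the restored processes.
    """
    slice_, prefix, name = split_cgroups_path(cgroups_path)
    return f"{slice_}:{prefix}:{name}{suffix}"


if __name__ == "__main__":
    old = "machine.slice:runc-cgroups:integration-test"
    new = restore_cgroups_path(old)
    print(unit_name(old))
    print(unit_name(new))
```

With distinct unit names, stopping the old unit at the end of runc checkpoint only affects the old unit, so the processes criu is restoring in the new unit do not receive the SIGTERM.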