runc 1.0.x fails with criu 3.16 #3278

Closed
kolyshkin opened this issue Nov 12, 2021 · 9 comments · Fixed by #3282
@kolyshkin
Contributor

It seems that runc 1.0.x is not working with criu 3.16.

The failures are like this (from #3277)

=== RUN   TestCheckpoint
    checkpoint_test.go:145: === /tmp/criu484433595/dump.log ===
    checkpoint_test.go:145: open /tmp/criu484433595/dump.log: no such file or directory
    checkpoint_test.go:146: criu failed: type DUMP errno 56
        log file: /tmp/criu484433595/dump.log
--- FAIL: TestCheckpoint (0.29s)

All we have is errno 56. Apparently this is EBADRQC /* Invalid request code */, as returned by setup_opts_from_req from criu/cr-service.
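For readers who want to check the errno mapping themselves, here is a minimal Go sketch that decodes a raw errno value using the standard `syscall` package. The helper names `describeErrno` and `isEBADRQC` are hypothetical and are not part of runc or go-criu. The sketch assumes a Linux target on a common architecture such as amd64, where `EBADRQC` is 56.

```go
package main

import (
	"fmt"
	"syscall"
)

// describeErrno turns a raw errno value, such as the 56 reported in
// "criu failed: type DUMP errno 56", into the platform's error string.
// Hypothetical helper for illustration only.
func describeErrno(n int) string {
	return syscall.Errno(n).Error()
}

// isEBADRQC reports whether n equals EBADRQC on the build platform.
// Linux-only: the constant is not defined on every OS.
func isEBADRQC(n int) bool {
	return syscall.Errno(n) == syscall.EBADRQC
}

func main() {
	const reported = 56 // value taken from the failing test output
	fmt.Printf("errno %d: %s (EBADRQC: %v)\n",
		reported, describeErrno(reported), isEBADRQC(reported))
}
```

This only confirms what the number means. It does not say which request option CRIU rejected.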

My suspicion is that github.com/checkpoint-restore/go-criu needs to be updated to fix this. The question is whether that would break criu 3.15 compatibility.

// Cc @adrianreber @avagin

@kolyshkin kolyshkin changed the title from "runc 1.0.x with criu 3.16" to "runc 1.0.x fails with criu 3.16" Nov 12, 2021
@kolyshkin
Contributor Author

kolyshkin commented Nov 12, 2021

Alas, updating go-criu to v5.2.0 (#3279) is not helping. What else could it be?

Also, I just noticed that criu 3.16 is not yet in Fedora 35; this might be the reason why this issue was not found earlier.

@kolyshkin
Contributor Author

> Also, I just noticed that criu 3.16 is not yet in Fedora 35; this might be the reason why this issue was not found earlier.

Sorry, I was wrong about it. Fedora 35 has criu 3.16 for sure.

@kolyshkin
Contributor Author

Reproduced locally (Fedora 35) with runc from release-1.0 branch and criu-3.16:

=== RUN   TestUsernsCheckpoint
    checkpoint_test.go:145: === /tmp/criu040057477/dump.log ===
    checkpoint_test.go:145: open /tmp/criu040057477/dump.log: no such file or directory
    checkpoint_test.go:146: criu failed: type DUMP errno 56
        log file: /tmp/criu040057477/dump.log
--- FAIL: TestUsernsCheckpoint (0.27s)
=== RUN   TestCheckpoint
    checkpoint_test.go:145: === /tmp/criu055391572/dump.log ===
    checkpoint_test.go:145: open /tmp/criu055391572/dump.log: no such file or directory
    checkpoint_test.go:146: criu failed: type DUMP errno 56
        log file: /tmp/criu055391572/dump.log
--- FAIL: TestCheckpoint (0.30s)

It's funny but integration tests work:

[kir@kir-rhat runc]$ sudo bats tests/integration/checkpoint.bats
 ✓ checkpoint and restore 
 ✓ checkpoint and restore (with --debug)
 - checkpoint and restore (cgroupns) (skipped: test requires cgroups_v1)
 ✓ checkpoint --pre-dump (bad --parent-path)
 ✓ checkpoint --pre-dump and restore
 ✓ checkpoint --lazy-pages and restore
 ✓ checkpoint and restore in external network namespace
 ✓ checkpoint and restore with container specific CRIU config
 ✓ checkpoint and restore with nested bind mounts

@kolyshkin
Contributor Author

I suspect this is something trivial but have to patch criu to find out what.

@adrianreber
Contributor

Interesting. I just tried it on Fedora 35 with CRIU 3.16.1 and the latest runc git checkout and I see no errors. I tried it with cgroup v1 and v2 and it just works.

@kolyshkin
Contributor Author

> Interesting. I just tried it on Fedora 35 with CRIU 3.16.1 and the latest runc git checkout and I see no errors. I tried it with cgroup v1 and v2 and it just works.

@adrianreber you probably missed the 1.0.x part. Here's a complete repro:

[kir@kir-rhat runc]$ git remote -v | grep origin
origin	https://github.com/opencontainers/runc (fetch)
origin	https://github.com/opencontainers/runc (push)
[kir@kir-rhat runc]$ git fetch origin
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (1/1), 676 bytes | 676.00 KiB/s, done.
From https://github.com/opencontainers/runc
   c1103d98..f247ad20  master     -> origin/master
[kir@kir-rhat runc]$ git checkout release-1.0 ### <<< THIS
Switched to branch 'release-1.0'
Your branch is up to date with 'origin/release-1.0'.
[kir@kir-rhat runc]$ make; go test -v -run Checkpo -exec sudo ./libcontainer/integration/
go build -trimpath "-mod=vendor" "-buildmode=pie"  -tags "seccomp" -ldflags "-X main.gitCommit=v1.0.2-4-g23c6e4f5 -X main.version=1.0.2+dev " -o runc .
=== RUN   TestUsernsCheckpoint
    checkpoint_test.go:145: === /tmp/criu1051146552/dump.log ===
    checkpoint_test.go:145: open /tmp/criu1051146552/dump.log: no such file or directory
    checkpoint_test.go:146: criu failed: type DUMP errno 56
        log file: /tmp/criu1051146552/dump.log
--- FAIL: TestUsernsCheckpoint (0.29s)
=== RUN   TestCheckpoint
    checkpoint_test.go:145: === /tmp/criu1983195949/dump.log ===
    checkpoint_test.go:145: open /tmp/criu1983195949/dump.log: no such file or directory
    checkpoint_test.go:146: criu failed: type DUMP errno 56
        log file: /tmp/criu1983195949/dump.log
--- FAIL: TestCheckpoint (0.20s)
FAIL
FAIL	github.com/opencontainers/runc/libcontainer/integration	0.506s
FAIL
[kir@kir-rhat runc]$ 

@adrianreber
Contributor

Ah, right, I did not check the 1.0 branch. The following patch (or something like it) would fix it:

diff --git a/libcontainer/integration/checkpoint_test.go b/libcontainer/integration/checkpoint_test.go
index f2870ae0..e22c8145 100644
--- a/libcontainer/integration/checkpoint_test.go
+++ b/libcontainer/integration/checkpoint_test.go
@@ -136,7 +136,7 @@ func testCheckpoint(t *testing.T, userns bool) {
        checkpointOpts := &libcontainer.CriuOpts{
                ImagesDirectory: imagesDir,
                WorkDirectory:   imagesDir,
-               ParentImage:     "../criu-parent",
+               ParentImage:     "../"+parentDir[strings.LastIndex(parentDir, "/")+1:],
        }
        dumpLog := filepath.Join(checkpointOpts.WorkDirectory, "dump.log")
        restoreLog := filepath.Join(checkpointOpts.WorkDirectory, "restore.log")

The test is doing a pre-dump, but the pre-dump was never working because the path to the parent checkpoint was not correct. CRIU used to ignore it, but now it errors out.
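As a rough illustration of the path fix above, here is a sketch that derives the parent path with `filepath.Rel` instead of slicing the string by hand. It assumes, as the patch implies, that CRIU resolves the parent image path relative to the images directory and that the two directories are siblings. The function name `parentImagePath` and the example paths are hypothetical and are not taken from the runc tree.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// parentImagePath expresses parentDir relative to imagesDir, which is the
// form the patch above builds by hand ("../" + last path component).
// Hypothetical helper; assumes both paths are absolute or share a base.
func parentImagePath(imagesDir, parentDir string) (string, error) {
	return filepath.Rel(imagesDir, parentDir)
}

func main() {
	// Example directories only; the real test creates temporary ones.
	rel, err := parentImagePath("/tmp/criu123", "/tmp/criu-parent456")
	if err != nil {
		panic(err)
	}
	fmt.Println(rel) // prints "../criu-parent456"
}
```

For sibling directories this gives the same result as the string-slicing version, and it still works if the directories are not direct siblings.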

@kolyshkin
Contributor Author

Ahhh... thanks so much @adrianreber!

I have actually fixed it in #3112, which I closed in favor of #3100. I forgot to port either to 1.0.x.

Will do.

@kolyshkin
Contributor Author

Fixed by #3282

@kolyshkin kolyshkin linked a pull request Nov 18, 2021 that will close this issue