docker checkpoint create fails #1316

Closed
Pigrenok opened this issue Dec 18, 2020 · 9 comments

@Pigrenok

Hello!

I am experimenting with criu and Docker's experimental docker checkpoint feature. Unfortunately, the experiments have not been successful so far.

I have Ubuntu 20.04 with 5.8.0-29-generic kernel (available from Ubuntu repo)

$ uname -a
Linux PigerBook 5.8.0-29-generic #31~20.04.1-Ubuntu SMP Fri Nov 6 16:10:42 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

I originally tried it on kernel 5.6.0-1034-oem with the same result. After reading issue #860, I updated to 5.8.0-29-generic, because the end of that issue states that the problem was fixed in 5.8.0-16.17 (according to Ubuntu kernel bug https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1857257).

I use Docker 20.10.0, build 7287ab3. I enabled experimental mode in Docker, so the docker checkpoint command became available. I also installed the criu package from the Ubuntu repo:
criu version 3.15

I also check that criu works:

$ sudo criu check
Looks good.

After that I try to run the simple example from criu.org/Docker:
docker run -d --name looper --security-opt seccomp:unconfined busybox \
    /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'

Docker starts the container, and the logs show that everything is fine:

$ docker logs looper
0
1
2
3
4
5
6
7
8
9

After that I try to create a checkpoint and get an error:

$ docker checkpoint create looper checkpoint1
Error response from daemon: Cannot checkpoint container looper: runc did not terminate successfully: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v2.task/moby/9840600feebfd863ddbdc2251c0f0af0ec802f4e408f1e73cfe7542913586b38/criu-dump.log: unknown

The file criu-dump.log is here:
criu-dump.log

The error is indicated at the end of that file and is as follows:

(00.046036) Error (criu/files-reg.c:1689): Can't lookup mount=1700 for fd=-3 path=/bin/sh
(00.046045) Error (criu/cr-dump.c:1250): Collect mappings (pid: 7856) failed with -1
(00.046086) Unlock network
(00.046089) Running network-unlock scripts
(00.046091)     RPC
(00.048419) Unfreezing tasks into 1
(00.048432)     Unseizing 7856 into 1
(00.048440)     Unseizing 7977 into 1
(00.048455) Error (criu/cr-dump.c:1768): Dumping FAILED.
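
For longer logs, the error lines can be pulled out with grep instead of scrolling to the end of the file. This is only a minimal sketch: criu_errors is a made-up helper name, not part of CRIU, and it assumes the log marks errors with the "Error (" prefix seen in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper (not part of CRIU): print only the error lines
# of a CRIU log file, prefixed with their line numbers.
criu_errors() {
    grep -n 'Error (' "$1"
}

# Usage (the path is whatever the daemon reported in its error message):
#   criu_errors /path/to/criu-dump.log
```

The first error line is usually the interesting one; the later ones tend to be consequences of it, as in the excerpt above.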

I also ran the checks mentioned in issue #860:

$ grep "^1700\>" /proc/*/mountinfo
and it did not return anything.

Inside the container (with the output):

# exec 100< /bin/sh
# cat /proc/$$/fdinfo/100
pos:	0
flags:	0100000
mnt_id:	1901
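
The two checks above can be combined into one script that reads the mnt_id of an open file descriptor and looks it up in mountinfo. The following is a rough sketch under stated assumptions: mnt_id_of and mount_id_listed are hypothetical helper names, not CRIU tooling, and both take file paths so they can be pointed at any /proc entry.

```shell
#!/bin/sh
# Hypothetical helpers (not part of CRIU or Docker).

# Print the mnt_id field of an fdinfo file such as /proc/$$/fdinfo/100.
mnt_id_of() {
    awk '$1 == "mnt_id:" { print $2 }' "$1"
}

# Succeed if mount ID $1 appears as the first field of mountinfo file $2.
mount_id_listed() {
    awk -v id="$1" '$1 == id { found = 1 } END { exit !found }' "$2"
}

# Usage inside the container:
#   exec 100< /bin/sh
#   id=$(mnt_id_of /proc/$$/fdinfo/100)
#   if mount_id_listed "$id" /proc/$$/mountinfo; then
#       echo "mount $id found"
#   else
#       echo "mount $id missing"
#   fi
```

A mount ID that is reported for an open file but missing from mountinfo is the situation the "Can't lookup mount" error in the dump log describes.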

Is this still a kernel issue, or is it a completely different bug somewhere else? Is there any way around this problem?

Thank you very much in advance.

@adrianreber
Member

This is still the same bug. The state in the launchpad bug report is wrong. Please also report there that it is still broken, but I have doubts that it will ever get resolved.

My recommendation right now is to not use Ubuntu if you want to do checkpoints on overlayfs. Last time I checked, Docker was also broken: our CI runs with Docker do not work any more. So, do not use Ubuntu and Docker to checkpoint and restore containers currently. Try Podman on something other than Ubuntu. That should work pretty reliably.

@Pigrenok
Author

Thanks, Adrian! I will report in launchpad.

Is there a way to avoid use of overlayfs?

@adrianreber
Member

> Thanks, Adrian! I will report in launchpad.
>
> Is there a way to avoid use of overlayfs?

Yes, for CRIU CI we use devicemapper as graphdriver. Not sure if that still exists with latest docker. Podman has a dir graphdriver which also avoids overlayfs.

@nelson-liu

nelson-liu commented Jan 1, 2021

@adrianreber thanks for all the helpful info on this error. I noticed that the Podman CI tests for CRIU (e.g., https://github.com/checkpoint-restore/criu/runs/1619725048) are running on Ubuntu (see https://github.com/checkpoint-restore/criu/runs/1619725048#step:3:165):

+ uname -a
Linux fv-az119-109 5.4.0-1032-azure #33-Ubuntu SMP Fri Nov 13 14:23:34 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Maybe I missed something, but how do the tests get CRIU + Podman + Ubuntu working?

edit: is this the reason?

# overlaysfs behaves differently on Ubuntu and breaks CRIU
# https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1857257
podman --storage-driver vfs info

Looks like the Podman storage driver is changed from overlayfs to vfs.

@adrianreber
Member

You are correct. Podman uses the vfs backend to be able to run on the Ubuntu Kernel.

You can either not use Ubuntu or not use overlayfs. The Ubuntu Kernel has a non-upstreamed patch which breaks CRIU on overlayfs.

For docker we are trying to use the devicemapper backend to work around the broken Ubuntu Kernel.

@forkjoseph

forkjoseph commented May 20, 2021

For anyone in the future, configuring vfs in docker worked (https://docs.docker.com/storage/storagedriver/vfs-driver/):
add "storage-driver": "vfs" in /etc/docker/daemon.json, restart docker engine.

Tested in Ubuntu 18.04 LTS.
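
The configuration step above could look roughly like the sketch below. The helper name vfs_daemon_json is made up for illustration, and the usage lines assume a systemd-managed Docker engine, which the thread does not state. Writing the file this way replaces the whole daemon.json, so other settings (for example the experimental flag needed for docker checkpoint) would have to be merged in by hand.

```shell
#!/bin/sh
# Hypothetical helper: print a minimal daemon.json that selects the
# vfs storage driver. It only writes to stdout; nothing is changed
# on the system until the output is redirected.
vfs_daemon_json() {
    printf '{\n  "storage-driver": "vfs"\n}\n'
}

# Usage (assumes systemd; overwrites any existing daemon.json):
#   vfs_daemon_json | sudo tee /etc/docker/daemon.json
#   sudo systemctl restart docker
#   docker info --format '{{.Driver}}'   # prints the active storage driver
```

Images and containers created under the previous storage driver are not visible after the switch, so the looper container has to be started again before retrying the checkpoint.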

@adrianreber
Member

> For anyone in the future, configuring vfs in docker worked (https://docs.docker.com/storage/storagedriver/vfs-driver/):
> add "storage-driver": "vfs" in /etc/docker/daemon.json, restart docker engine.

Thanks for the information. You have to be aware, however, that vfs can be really slow.

@rst0git
Member

rst0git commented May 20, 2021

A fix for the problem with overlayfs has been merged in the master-next branch for Ubuntu Focal (20.04)
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/focal/commit/?h=master-next&id=28eab192cf0e37156fc41b36f06790d5ca984834

@ZacBlanco

> A fix for the problem with overlayfs has been merged in the master-next branch for Ubuntu Focal (20.04)
> https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/focal/commit/?h=master-next&id=28eab192cf0e37156fc41b36f06790d5ca984834

The Ubuntu kernel with this patch has been released as of this writing. I am able to checkpoint with 5.8.0-55-generic (hwe) and 5.4.0-74-generic without using the vfs driver any more.
