error when checkpointing container: Can't lookup mount #860

Closed
giovannivenancio opened this issue Nov 19, 2019 · 43 comments
Labels
bug kernel no-auto-close Don't auto-close as a stale issue stale-issue

Comments

@giovannivenancio

I have a freshly installed Ubuntu 18.04 and I get the following error when checkpointing a container:

docker checkpoint create cr checkpoint1
Error response from daemon: Cannot checkpoint container cr: runc did not terminate sucessfully: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v1.linux/moby/f2e3e31d2e1b75c6ad339f8931d38e296344f87917d7ef54b6de16a1268d2faf/criu-dump.log: unknown

When inspecting the log, there is this error:

(00.008480) Error (criu/files-reg.c:1338): Can't lookup mount=795 for fd=-3 path=/bin/sh

criu-dump.log
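
To pull the failing mount ID out of a dump log without reading it by hand, something like the following may help. It is a minimal sketch that assumes the error line has the same shape as the one quoted above; `failed_mount_ids` is a hypothetical helper name, not part of CRIU, and `$LOG` stands for the log path that docker prints in its error message.

```shell
# Hypothetical helper (not part of CRIU): read a CRIU dump log on stdin and
# print every mount ID reported in a "Can't lookup mount=<id>" error line.
failed_mount_ids() {
    sed -n "s/.*Can't lookup mount=\([0-9][0-9]*\) for fd=.*/\1/p"
}

# Example:
#   failed_mount_ids < "$LOG"
```

The printed ID is the number to look for in mountinfo in the follow-up checks.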

The container was created as follows:

docker run --security-opt=seccomp:unconfined --name cr -d busybox /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'

I'm using Docker 19.03.5 and CRIU 3.12.

Any ideas? Thanks in advance!

@avagin
Member

avagin commented Nov 19, 2019

What kernel do you use?

@avagin
Member

avagin commented Nov 19, 2019

(00.000079) Running on wraeclast Linux 5.0.0-36-generic #39~18.04.1-Ubuntu SMP Tue Nov 12 11:09:50 UTC 2019 x86_64

@avagin
Member

avagin commented Nov 19, 2019

Cc: @Snorch

@Snorch
Member

Snorch commented Nov 20, 2019

Do you have mount 795 on the host? If the container you are trying to dump is still running, or you can reproduce the problem, run:

grep "^795\>" /proc/*/mountinfo

If the above gives anything, please also show the output of:

lsns | grep mnt

There are three possibilities:

  1. The dumped process has an open fd on a mount outside of the dumped mount namespaces.
  2. The process has an open file on a mount in a detached mntns.
  3. The process has an open file on overlayfs, which sometimes shows the mnt_id of a pseudo mount that never existed.

If (1), you might need some external mount given to criu. (2) is not supported. For (3) I'm not sure if we can handle these.
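
The `grep` over `/proc/*/mountinfo` suggested above can be wrapped in a small function that only looks at the first column, which is the mount ID. This is a rough sketch rather than anything CRIU ships; `mount_id_visible` is a hypothetical name.

```shell
# Hypothetical helper: succeed if mount ID $1 appears as the first field
# (the mount ID column) of any of the mountinfo files given after it.
mount_id_visible() {
    [ "$#" -ge 2 ] || return 2      # need an ID and at least one file
    id=$1
    shift
    # Files that cannot be read (e.g. the process has exited) are skipped.
    cat "$@" 2>/dev/null |
        awk -v id="$id" '$1 == id { found = 1 } END { exit !found }'
}

# Example: search every process on the host for mount 795.
#   mount_id_visible 795 /proc/[0-9]*/mountinfo && echo visible || echo "not visible"
```

A "not visible" result on the host and inside the container would be consistent with case (3).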

Likely your case is (3):

(00.006249) type overlay source overlay mnt_id 898 s_dev 0x55 / @ ./ flags 0x280000 options lowerdir=/var/lib/docker/overlay2/l/FQGSJGYMN3LHZQBI7BRPIJG32M:/var/lib/docker/overlay2/l/5PSVBCDONUTQRDXFW2KXS3OEVC,upperdir=/var/lib/docker/overlay2/d39179831d1d7b13c89b4b5c50168d857c107e805e6c4ccc9cb5fa80af405504/diff,workdir=/var/lib/docker/overlay2/d39179831d1d7b13c89b4b5c50168d857c107e805e6c4ccc9cb5fa80af405504/work,xino=off

To verify it (updated), run the following inside your docker container (e.g. via nsenter -t $CTPID -m):

exec 100< /bin/sh
cat /proc/$$/fdinfo/100

It should give you 795 or, if the number has changed, some other ID that does not appear in mountinfo.
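
For repeated checks, the `mnt_id` line can be read out of an fdinfo file with a one-line helper. This is only a sketch; `fdinfo_mnt_id` is a hypothetical name, and the example in the comments assumes bash (for the three-digit fd number) running inside the container's mount namespace.

```shell
# Hypothetical helper: print the mnt_id recorded in an fdinfo file.
fdinfo_mnt_id() {
    awk '$1 == "mnt_id:" { print $2 }' "$1"
}

# Example (bash, inside the container's mount namespace):
#   exec 100< /bin/sh
#   id=$(fdinfo_mnt_id /proc/$$/fdinfo/100)
#   grep -q "^$id " /proc/self/mountinfo || echo "mnt_id $id is not in mountinfo"
```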

@avagin Also, the strange thing is that the fd number is < 0; I'm not sure what that means.

@giovannivenancio
Author

grep "^795\>" /proc/*/mountinfo

Returns nothing. Also, the results of the nsenter verification:

root@d5edadab7ac3:/# exec 100< /bin/sh
root@d5edadab7ac3:/# cat /proc/$$/fdinfo/100
pos:	0
flags:	0100000
mnt_id:	795

I'm not sure how relevant this is, but:

  1. Last week, checkpointing (and restoring) worked just fine. Then, without any major changes to the OS (apart from apt updates), it stopped working. For this reason I reinstalled Ubuntu, but the error persists.

  2. I also tried installing CRIU on a VM (using the same setup: Ubuntu 18.04, CRIU 3.12) and checkpointing works there...

@avagin
Member

avagin commented Nov 20, 2019

I think the problem might be in the linux kernel. Can you try to downgrade the kernel?

@Snorch
Member

Snorch commented Nov 21, 2019

This is not exactly the same problem, but it is related to the bad (pseudo) st_dev that stat reports on overlay:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1751243

In git://kernel.ubuntu.com/ubuntu/autotest-client-tests:

+++ b/ubuntu_unionmount_overlayfs_suite/0001-Fix-check-for-file-on-overlayfs.patch
@@ -0,0 +1,50 @@
+From e19b161d30d648cf0ac5bd68df84b82322de7451 Mon Sep 17 00:00:00 2001
+From: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
+Date: Thu, 31 May 2018 13:52:30 +0200
+Subject: [PATCH][unionmount-testsuite] Fix check for file on overlayfs
+
+BugLink: https://bugs.launchpad.net/bugs/1751243
+
+After kernel 4.15, the unmodified files do not have the same st_dev as
+the lower filesystem anymore, but they are assigned an anonymous bdev
+instead. So checking if a file which is expected to be unmodified has
+the same st_dev as the lower filesystem doesn't work anymore, we need to
+check if they do not have the same st_dev as the overlay filesystem.
+
+Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
+Acked-by: Colin Ian King <colin.king@canonical.com>
+---
+ context.py | 13 +++++--------
+ 1 file changed, 5 insertions(+), 8 deletions(-)
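
The check described in that patch message can be approximated from a shell with `stat`. The sketch below is not the test suite's own code (that lives in `context.py`); `same_st_dev` is a hypothetical helper, the paths in the comments are placeholders, and it assumes a `stat` that supports `-c %d` (GNU coreutils or busybox).

```shell
# Hypothetical helper: succeed if two paths report the same st_dev.
# Returns 2 if either path cannot be stat'ed.
same_st_dev() {
    a=$(stat -c %d "$1") || return 2
    b=$(stat -c %d "$2") || return 2
    [ "$a" = "$b" ]
}

# Per the patch description, after kernel 4.15 an unmodified file on an
# overlay mount is expected NOT to share st_dev with the overlay itself:
#   same_st_dev /path/to/overlay /path/to/overlay/unmodified-file \
#       || echo "st_dev differs from the overlay: file looks unmodified"
```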

@Snorch
Member

Snorch commented Nov 21, 2019

It looks like mnt_id now has the same problem as st_dev. That is bad, because we can't find which mount to open the file on if the kernel hides this information from us.

@giovannivenancio
Author

I think the problem might be in the linux kernel. Can you try to downgrade the kernel?

That was the problem. Downgrading the kernel solves the error!

So, for reference, I was using kernel version 5.0.0-36-generic and downgrading to kernel version 5.0.0-23-generic solves the error.

Thank you, I really appreciate the effort! Cheers.

@adrianreber
Member

I reported the Ubuntu kernel problems here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1857257

@adrianreber
Member

This bug is now also seen in multiple other projects which have to disable CRIU tests on Travis:

containerd/containerd#3898
opencontainers/runc#2196
opencontainers/runc#2198

@anagainaru

Any update with this?

I have the same problem when trying to use CRIU Version: 3.6, Docker version 19.03.6 on the 5.3.0-40-generic Ubuntu kernel. I would prefer not having to downgrade my kernel.

@adrianreber
Member

@anagainaru You have to complain here:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1857257

There is a patch to fix it, but it seems not to have been applied by Ubuntu. You could always switch to another Linux distribution.

@avagin
Member

avagin commented Apr 23, 2020

Here is the fix: https://lists.openvz.org/pipermail/criu/2020-April/044973.html

@buck2202

buck2202 commented Aug 3, 2020

The status of the launchpad bug is a bit confusing to me--the janitor reported back in May that it was fixed in 5.3/5.4, comments suggest the fix caused other issues and it was either rolled back or mistakenly not rolled back, and now the janitor reports it fixed again in 5.4 in the focal repo, but no mention of other releases with currently-active kernel support. Can anyone give any insight to the status of this?

@adrianreber
Member

The status of the launchpad bug is a bit confusing to me--the janitor reported back in May that it was fixed in 5.3/5.4, comments suggest the fix caused other issues and it was either rolled back or mistakenly not rolled back, and now the janitor reports it fixed again in 5.4 in the focal repo, but no mention of other releases with currently-active kernel support. Can anyone give any insight to the status of this?

I am also confused by the state of that bug. Someone would need to test it. I am not using Ubuntu, so I cannot test it. For CRIU we rely on Travis, and Travis is based on Ubuntu, so it would be nice to have a fixed kernel in Travis. But Travis also takes some time to update their images, so even if it is fixed I cannot verify it via Travis. Additionally, Travis uses the GCE variant of the Ubuntu kernel and I am not sure how that kernel version maps to the kernel version in the launchpad bug report.

No answer for your question, but I can confirm that the state of that bug is also very unclear to me.

@buck2202

buck2202 commented Aug 4, 2020

Alright. A big part of my CRIU usage is on google cloud, so I went ahead and created some clean ubuntu images with latest stable docker from the download.docker repo, and CRIU from the PPA. I tried to checkpoint a container created from giovannivenancio's MWE above just to see where things stand.

docker run --security-opt=seccomp:unconfined --name cr -d busybox /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'
docker checkpoint create cr chk

16.04/xenial installs with 4.15.0-1080-gcp #90~16.04.1-Ubuntu. Checkpointing still works fine here

18.04/bionic installs with 5.3.0-1032-gcp #34~18.04.1-Ubuntu. Fails with

Error response from daemon: Cannot checkpoint container cr: runc did not terminate sucessfully: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v1.linux/moby/a3fd40353bfa4b041fb4eb9f38e6140e243700106208e9de77ec1f97bc206986/criu-dump.log: unknown

(00.170336) Error (criu/files-reg.c:1372): Can't lookup mount=339 for fd=-3 path=/bin/sh
(00.170347) Error (criu/cr-dump.c:1247): Collect mappings (pid: 1927) failed with -1

20.04/focal installs with 5.4.0-1021-gcp #21-Ubuntu. Fails with

Error response from daemon: Cannot checkpoint container cr: runc did not terminate sucessfully: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v1.linux/moby/802ae7117527f88f524fe32db4959b191df694de4ed33cf00a6ceeef5055993a/criu-dump.log: unknown

(00.255897) Error (criu/files-reg.c:1371): Can't lookup mount=436 for fd=-3 path=/bin/sh
(00.255910) Error (criu/cr-dump.c:1247): Collect mappings (pid: 1149) failed with -1

TL;DR: still broken, but I'm not sure of the mapping between gcp and vanilla ubuntu versions.

edit: here's a mapping, but not totally helpful to get back to baseline ubuntu https://people.canonical.com/~kernel/info/kernel-version-map.html

| Ubuntu Kernel Version | Ubuntu Kernel Tag | Mainline Kernel Version |
|---|---|---|
| (bionic) 5.3.0-1032.34~18.04.1 | Ubuntu-gcp-5.3-5.3.0-1032.34_18.04.1 | 5.3.18 |
| (focal) 5.4.0-1021.21 | Ubuntu-gcp-5.4.0-1021.21 | 5.4.44 |
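
When comparing reports across gcp, azure, aws and generic kernels, it can help to split a `uname -r` string into its base version, ABI number and flavour before consulting the version map. A small sketch, assuming the `<version>-<abi>-<flavour>` shape seen throughout this thread; the three function names are hypothetical.

```shell
# Hypothetical helpers: split a `uname -r` style release string such as
# 5.3.0-1032-gcp into its three parts.

# Base version, e.g. 5.3.0
kernel_base() {
    printf '%s\n' "${1%%-*}"
}

# ABI number, e.g. 1032
kernel_abi() {
    rest=${1#*-}
    printf '%s\n' "${rest%%-*}"
}

# Flavour, e.g. gcp
kernel_flavour() {
    printf '%s\n' "${1##*-}"
}

# Example:
#   kernel_flavour "$(uname -r)"
```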

@adrianreber
Member

Thanks a lot for trying it out! Good to know that it is still not fixed. I will re-open the launchpad bug.

As far as I understand it, this is related to Ubuntu's out-of-tree shiftfs implementation, which is not part of the upstream Linux kernel. That is why it only happens on Ubuntu.

@buck2202

buck2202 commented Aug 4, 2020

No problem at all, thanks for following up on launchpad. I'm keeping my work on xenial to avoid manually blocking kernel updates, but I'll keep these instances around to retry if I see any movement on a fix

@buck2202

buck2202 commented Sep 9, 2020

Just wanted to drop by again since launchpad is again confusing. There's another fix-released comment, but still "confirmed" status for focal and "won't fix" for eoan. The two most recent LTS releases are on 5.4 kernels, so I guess it's not surprising that it's still broken in both of them.

Summary:

| Release | kernel | working? |
|---|---|---|
| 18.04 server LTS | 5.4.0-1024-gcp #24~18.04.1-Ubuntu | no |
| 20.04 server LTS | 5.4.0-1024-gcp #24-Ubuntu | no |

Wonder if it's worth tagging launchpad for bionic as well?

@adrianreber
Member

My guess right now is that this will not be fixed any time soon. As far as I understand it, this is related to non-upstream kernel patches concerning 'shiftfs'. From what I heard, this will not be upstreamed in the version Ubuntu carries. Other implementations of similar features are also not upstream (yet). So if we are lucky it will be fixed with the next LTS release, but that is about 1.5 years of waiting, and as the feature is not yet in the kernel it is not clear whether this will happen.

If this breaks your workflow you have to either run an old kernel or switch distributions.

@buck2202

buck2202 commented Sep 9, 2020

Understood, just wanted to follow up since there was a new janitor post and you had said previously that you didn't have an easy way of testing it. I'll still re-test and check back in if it looks like their fix makes it back to focal. Hopefully that would cover bionic as well, since they're currently on the same LTS kernel for gcloud at least.

I've kept my google cloud instances on xenial and my (linux mint) desktops on the still-supported 4.15 kernel. I might move away from ubuntu/derivatives at some point, but I'm trying to keep things static until my current work wraps up.

@adrianreber
Member

I removed the stale-issue label and added the no-auto-close label as long as this is not fixed in Ubuntu.

#887 is another example where we were impacted by this kernel bug.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@github-actions

github-actions bot commented Apr 2, 2021

A friendly reminder that this issue had no activity for 30 days.

@mihalicyn
Member

@buck2202

buck2202 commented Aug 3, 2021

update for checkpoint creation on google cloud images:

I ran

docker run --security-opt=seccomp:unconfined --name cr -d busybox /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'
docker checkpoint create cr chk

and confirmed a zero exit code

| base image | kernel | working? |
|---|---|---|
| 18.04 server LTS (hwe) | 5.4.0-1049-gcp #53~18.04.1-Ubuntu SMP | yes |
| 20.04 server LTS | 5.8.0-1038-gcp #40~20.04.1-Ubuntu SMP | yes |
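
For anyone re-running this check on a new image, the two commands and the exit-code test can be bundled into one function. This is a minimal sketch that assumes the `docker` CLI is on the PATH and checkpointing is available in the daemon; `try_checkpoint` is a hypothetical name, and its defaults are the container and checkpoint names used in this thread.

```shell
# Sketch of the reproduction above: start the busybox counter container,
# then try to checkpoint it. Returns 0 if the checkpoint was created,
# 2 if the container could not be started, and docker's exit code otherwise.
try_checkpoint() {
    name=${1:-cr}
    chk=${2:-chk}
    docker run --security-opt=seccomp:unconfined --name "$name" -d busybox \
        /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done' \
        >/dev/null || return 2
    docker checkpoint create "$name" "$chk"
}

# Example:
#   try_checkpoint cr chk && echo "working" || echo "not working on $(uname -r)"
```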

@adrianreber
Member

Thanks everyone for working on fixes and checking the state of the kernel. Closing this now as it seems to be fixed in Ubuntu.

@108anup

108anup commented Feb 8, 2022

Edit: Upgrading to kernel: 5.4.0-1068-azure worked.


Hey was facing a similar issue. I am guessing I need to update the kernel based on the thread, is that correct?

OS: Ubuntu 18.04.4
kernel: 5.3.0-1034-azure (this is on an azure machine)
CRIU version: 3.6 (installed from ppa)
docker version: 20.10.7

cmds:

docker run --security-opt=seccomp:unconfined --name cr -d busybox /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'
docker checkpoint create cr checkpoint1

tail of the log:

(00.033224) ========================================
(00.033228) Dumping task (pid: 15327)
(00.033232) ========================================
(00.033235) Obtaining task stat ...
(00.033274)
(00.033277) Collecting mappings (pid: 15327)
(00.033280) ----------------------------------------
(00.033454) Error (criu/files-reg.c:1281): Can't lookup mount=387 for fd=-3 path=/bin/sh
(00.033470) Error (criu/cr-dump.c:1249): Collect mappings (pid: 15327) failed with -1
(00.033505) Unlock network
(00.033509) Running network-unlock scripts
(00.033512)     RPC
iptables-restore: invalid option -- 'w'
ip6tables-restore: invalid option -- 'w'
(00.061090) Unfreezing tasks into 1
(00.061159) Error (criu/cr-dump.c:1709): Dumping FAILED.

@elchead

elchead commented Mar 7, 2022

Had the same issue and got it to work on Ubuntu 21.04 LTS after performing a kernel upgrade through sudo apt-get dist-upgrade to version 5.11.0-1027-azure x86_64 (Ubuntu 22 did not work with kernel 5.13)

@adrianreber
Member

Had the same issue and got it to work on Ubuntu 21.04 LTS after performing a kernel upgrade through sudo apt-get dist-upgrade to version 5.11.0-1027-azure x86_64 (Ubuntu 22 did not work with kernel 5.13)

Unfortunately we have also seen it being reintroduced by Ubuntu. Not much we can do.

@avagin avagin reopened this Mar 7, 2022
@avagin
Member

avagin commented Mar 7, 2022

@elchead you need to find what kernel change broke the workflow and report it to ubuntu.
Cc: @Snorch @mihalicyn

@rst0git
Member

rst0git commented Mar 7, 2022

Ubuntu 22 did not work

@elchead What is the error message you see with Ubuntu 22?

I noticed that Ubuntu 22.04 has upgraded to glibc 2.35: #1696

$ podman run ubuntu:22.04 ldd --version
ldd (Ubuntu GLIBC 2.35-0ubuntu1) 2.35
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

@elchead

elchead commented Mar 18, 2022

not sure yet which change broke it but I can further confirm that

kernel 5.4.0.1010.11 did not work, but 5.4.0-1068-azure does as stated above.
5.13.0.1017.19 also does not work (latest kernel available through azure on 20.04 LTS)

@baelter

baelter commented Mar 31, 2022

Also seeing this:

~# uname -r
5.13.0-39-generic
~# criu --version
Version: 3.16.1
~# crun --version
crun version 1.4.4
commit: 6521fcc5806f20f6187eb933f9f45130c86da230
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL

@elchead

elchead commented Apr 1, 2022

@baelter

You can try this:
apt install -y linux-image-unsigned-5.4.0-1068-azure

Follow the steps from here to change the boot kernel:
https://meetrix.io/blog/aws/changing-default-ubuntu-kernel.html
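
On Ubuntu with GRUB, booting a specific installed kernel usually comes down to pointing `GRUB_DEFAULT` in `/etc/default/grub` at that kernel's menu entry and regenerating the config. The sketch below only prints the line to use. It assumes the stock Ubuntu menu titles ("Advanced options for Ubuntu" as the submenu, "Ubuntu, with Linux <release>" as the entry), which should be checked against `/boot/grub/grub.cfg` on the machine in question; `grub_default_for` is a hypothetical helper.

```shell
# Hypothetical helper: print a GRUB_DEFAULT line selecting the given kernel
# release, assuming the stock Ubuntu submenu and entry titles.
grub_default_for() {
    printf 'GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux %s"\n' "$1"
}

# Example:
#   grub_default_for 5.4.0-1068-azure
#   # put the printed line in /etc/default/grub, then:
#   #   sudo update-grub && sudo reboot
```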

@avagin
Member

avagin commented Apr 5, 2022

5.13.0-1017-azure has this issue. I filed a new ubuntu issue:
https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1967924

@rst0git rst0git closed this as completed Aug 14, 2022
@baelter

baelter commented Aug 15, 2022

I'm not using an azure build but standard that comes with ubuntu, so this issue is not limited to that build.

@rst0git
Member

rst0git commented Aug 15, 2022

Thanks @baelter, the problem should be fixed with the standard kernel that comes with Ubuntu:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1967924

@RonaldGalea

I have the same issue on an Ubuntu AWS VM, the kernel version is:
Linux version 5.15.0-1019-aws

I can't find what version this is exactly, and whether it's expected to work or not?

@mihalicyn
Member

As far as I understand, your kernel was built on Wed, 17 Aug 2022, so it doesn't contain my latest fix for this issue. Please try upgrading your kernel to a recent version. It should work. If not, then ping me ;-)

@RonaldGalea

Updated to 5.15.0-1023-aws and it works as expected, thank you very much :)
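
As a quick reference, the per-kernel results reported in this thread can be collected into a lookup keyed by `uname -r`. This is only a summary of what commenters observed at the time, with differing CRIU and Docker versions, so it should not be read as an authoritative compatibility list; `thread_report` is a hypothetical name.

```shell
# Results reported in this thread for the busybox checkpoint test,
# keyed by `uname -r`. Prints "works", "broken" or "unreported".
thread_report() {
    case "$1" in
        5.0.0-23-generic | 4.15.0-1080-gcp | 5.4.0-1049-gcp) echo works ;;
        5.8.0-1038-gcp | 5.4.0-1068-azure | 5.11.0-1027-azure) echo works ;;
        5.15.0-1023-aws) echo works ;;
        5.0.0-36-generic | 5.3.0-40-generic | 5.13.0-39-generic) echo broken ;;
        5.3.0-1032-gcp | 5.4.0-1021-gcp | 5.4.0-1024-gcp) echo broken ;;
        5.3.0-1034-azure | 5.13.0-1017-azure | 5.15.0-1019-aws) echo broken ;;
        *) echo unreported ;;
    esac
}

# Example:
#   thread_report "$(uname -r)"
```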
