Allow CRIU to be used as non-root (Take 2) #1930

ymanton · 2022-06-28T13:30:35Z

We at the OpenJ9 and Open Liberty projects have been experimenting with CRIU to improve JVM and Java web application start-up times, primarily in container deployments.¹ To that end we've been testing #1155 for a while now and having success and we want to help get it across the finish line.

With @adrianreber's agreement I've rebased the patches in #1155 and added 4 additional patches as follows:

Patches to add the --unprivileged option to libcriu and the RPC interface, and a fix to do the required check_caps() on the service worker path, equivalent to what's done on the tool path:

3807cc9 non-root: Call check_caps() in service worker mode as well
8df76f0 non-root: Allow unprivileged flag to be set via libcriu

Patches to move some new code (that I believe requires root) recently landed in criu-dev to the root-only init path introduced by in #1155:

b32f887 non-root: Do kerndat_has_move_mount_set_group() in root-only init
f42e25f non-root: Do kerndat_has_nftables_concat() in root-only init

As suggested by @adrianreber I'm currently working on where to store criu.kdat, if/how to use XDG_RUNTIME_PATH for that purpose, and what to do when it's not set (e.g. when running via sudo).

I'm also looking over #1155 to see if there are any other unresolved questions.

In the meantime I wanted to open this PR now in case anyone has any fresh thoughts on the subject, comments on how to proceed, and so on.

This use-case is a bit different from using CRIU to facilitate process migration and has some unique challenges, including the fact that we're using CRIU "under the covers" in a scenario that to end-users shouldn't appear too different from simply starting a process from scratch. Allowing non-root users to restore processes, particularly inside unprivileged containers, makes this use-case much more accessible to end-users because it reduces the privileges required to "start" a process in this way, relative to starting the process from scratch. ↩

criu/include/kerndat.h

scripts/ci/vagrant.sh

adrianreber · 2022-06-28T14:27:45Z

@ymanton Thanks for picking up #1155

I had following proposal in a chat how to deal with kdat as non-root. Currently the nice thing about the kdat file is that will disappear after a reboot because it is on a tmpfs. Unfortunately as non-root there is no easy way to allow all users to write to /run. My proposal was to use XDG_RUNTIME_DIR (like I have seen it by container engines) if it is defined. It usually points to a tmpfs. This way we automatically have a location for each user for kdat. Unfortunately XDG_RUNTIME_DIR is not set when doing su/sudo. Only a real login session set up by systemd has XDG_RUNTIME_DIR pointing to the right location. If XDG_RUNTIME_DIR is not set or not writeable, non-root CRIU should fall back to just ignore kdat and run all the checks all the time. A correctly set up session will correctly use kdat but a session through sudo/su will still work, although CRIU might be slower as it has to redo all checks everytime.

criu/include/cr_options.h

criu/cgroup.c

criu/cr-check.c

criu/fdstore.c

criu/kerndat.c

ymanton · 2022-07-25T19:38:35Z

Went back and fixed the kdat file XDG_RUNTIME_DIR patch to allow the tmpfs check be done in non-root mode as well (was previously skipped since the kdat file was being stored in HOME).
Fixed some linking issues affecting unittest (it now uses some of the new caps util funcs @adrianreber added but the function defs were not visible at link time).
Addressed some review comments.

criu/fdstore.c

criu/files.c

criu/image.c

criu/kerndat.c

criu/include/kerndat.h

avagin · 2022-08-10T16:04:48Z

Thanks for working on this. I want to ask one more thing. We are trying to support the commit history in the main tree as clean as possible. The main idea is very similar to the Linux kernel process. In this PR, I see that a few patches fix problems introduced by other patches. I think we need to merge all fixes in proper changes. I understand that the origin patches are authored by @adrianreber. I think we can use tags like Co-authored-by, Co-developed-by, Originally-by to mention all developers involved in the process.

ymanton · 2022-08-15T22:30:40Z

Thanks for working on this. I want to ask one more thing. We are trying to support the commit history in the main tree as clean as possible. The main idea is very similar to the Linux kernel process. In this PR, I see that a few patches fix problems introduced by other patches. I think we need to merge all fixes in proper changes. I understand that the origin patches are authored by @adrianreber. I think we can use tags like Co-authored-by, Co-developed-by, Originally-by to mention all developers involved in the process.

No problem, thanks for letting me know. I've merged all my changes together with @adrianreber's and added Co-authored-bys to the patches that were significantly altered. Going forward I'll maintain them this way as things progress.

github-actions · 2022-09-26T01:03:44Z

A friendly reminder that this PR had no activity for 30 days.

adrianreber

Looks good. Thanks for picking it up and thanks for solving the open questions.

criu/files.c

The idea behind the rootless CRIU code is, that CRIU reads out its effective capabilities and stores that in the global opts structure. Different parts of CRIU can then, based on the existing capabilities, automatically enable or disable certain code paths. Currently at least CAP_CHECKPOINT_RESTORE is required. CRIU will not start without this capability. Signed-off-by: Adrian Reber <areber@redhat.com>

This adds the function check_caps() which checks if CRIU is running with at least CAP_CHECKPOINT_RESTORE. That is the minimum capability CRIU needs to do a minimal checkpoint and restore from it. In addition helper functions are added to easily query for other capability for enhanced checkpoint/restore support. Co-authored-by: Younes Manton <ymanton@ca.ibm.com> Signed-off-by: Adrian Reber <areber@redhat.com> Signed-off-by: Younes Manton <ymanton@ca.ibm.com>

This commit enables checkpointing and restoring of applications as non-root. First goal was to enable checkpoint and restore of the env00 and pthread00 test case. This uses the information from opts.unprivileged and opts.cap_eff to skip certain code paths which do not work as non-root. Co-authored-by: Adrian Reber <areber@redhat.com> Signed-off-by: Younes Manton <ymanton@ca.ibm.com>

This patch modifies how kerndat is handled in unprivileged mode. Initialization and functionality that can only be done as root is made separate from common code. The kerndat file's location is defined as $XDG_RUNTIME_DIR/criu.kdat in unprivileged mode. Since we expect that directory to be on tmpfs we maintain the same behavior as the root-mode kerndat which lives in /run. Co-authored-by: Adrian Reber <areber@redhat.com> Signed-off-by: Younes Manton <ymanton@ca.ibm.com>

This adds the non-root section and information about the parameter --unprivileged to the man page. Co-authored-by: Anna Singleton <annabeths111@gmail.com> Signed-off-by: Adrian Reber <areber@redhat.com> Signed-off-by: Anna Singleton <annabeths111@gmail.com>

avagin · 2022-10-19T14:36:01Z

Why CI checks have not been executed?..

adrianreber · 2022-10-19T14:37:33Z

Why CI checks have not been executed?..

There was an GitHub Actions outage yesterday. @ymanton please push once more to trigger CI.

These are the minimal changes to make zdtm.py successfully run the env00 and pthread test case as non-root using the '--rootless' zdtm option. Co-authored-by: Younes Manton <ymanton@ca.ibm.com> Signed-off-by: Adrian Reber <areber@redhat.com> Signed-off-by: Younes Manton <ymanton@ca.ibm.com>

Run env00 and pthread00 test as non-root as initial proof of concept. Signed-off-by: Adrian Reber <areber@redhat.com>

adrianreber · 2022-10-20T06:39:39Z

The cross-compile test results fail sometimes because of broken repositories like in this case and can be ignored.

The CentOS 7 errors are because of a package installation problem. @rst0git I think the epel package installation fix is from you. Can you maybe add a ||: after the epel activation line. The problem is that on CentOS 7 epel might already be installed, if you test it in RHEL the package is not installed. CentOS 7 is anyway not relevant for the non-root CRIU operation as the kernel is too old.

The Vagrant errors seem to be something new for shared memory already tracked somewhere else: #1982

The non-root Vagrant test setup is running successfully.

I think from the CI side this is ready.

ymanton · 2022-10-20T13:48:07Z

@adrianreber Thanks for double checking, I was unsure about Vagrant Fedora based test (no VDSO) and started looking at it.

avagin · 2022-10-25T14:28:14Z

@ymanton thank you for moving this pr to the finish line. @adrianreber thank you for implementing the kernel part and the initial version of the userspace changes.

purplesyringa · 2022-10-27T23:47:12Z

Perhaps this should better be asked in a separate issue, but I'm not sure if what I'm asking about is reasonable, so here goes.

One thing this PR has not handled at all is map_files. Problem is, proc(5) says that

Capabilities are required to read the contents of the symbolic links in this directory: before Linux 5.9, the reading process requires CAP_SYS_ADMIN in the initial user namespace; since Linux 5.9, the reading process must have either CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE in the user namespace where it resides.

...but that does not match what the kernel does, e.g. on HEAD we have

https://github.com/torvalds/linux/blob/27bc50fc90647bbf7b734c3fc306a5e61350da53/fs/proc/base.c#L2249

...which checks whether CAP_CHECKPOINT_RESTORE is present in the init user namespace. This means that reading map_files always fails in containers. Now one would think this is seldom needed, but the kernel uses a deleted /dev/zero device for every shared anonymous mapping:

https://github.com/torvalds/linux/blob/aae703b02f92bde9264366c545e87cec451de471/mm/shmem.c#L4243-L4264

I am not sure if deleted files can be gracefully handled in the general case, but the comment suggests so. Would this be an incentive to implement that sooner than later?

UPD: actually, the current logic seems fishy as is:

if (is_memfd(vfi_dev)) {
    ...
}
if (is_hugetlb_dev(vfi_dev, &hugetlb_flag) || is_anon_shmem_map(vfi_dev)) {
    ...
}

because the implementation of is_memfd matches that of is_anon_shmem_map.

purplesyringa · 2022-10-28T02:14:25Z

You certainly know what you are doing more than me, but perhaps it might be of interest to you to know what I needed to patch to get CRIU of httpd in podman working:

Patch

diff --git a/criu/proc_parse.c b/criu/proc_parse.c
index 946b0fc40..a24783da7 100644
--- a/criu/proc_parse.c
+++ b/criu/proc_parse.c
@@ -313,25 +313,8 @@ static int vma_get_mapfile_user(const char *fname, struct vma_area *vma, struct
 
 	vfi_dev = makedev(vfi->dev_maj, vfi->dev_min);
 
-	if (is_memfd(vfi_dev)) {
-		char tmp[PATH_MAX];
-		strlcpy(tmp, fname, PATH_MAX);
-		strip_deleted(tmp, strlen(tmp));
-
-		/*
-		 * The error EPERM will be shown in the following pr_perror().
-		 * It comes from the previous open() call.
-		 */
-		pr_perror("Can't open mapped [%s]", tmp);
-
-		/*
-		 * TODO Perhaps we could do better than failing and dump the
-		 * memory like what is being done in shmem.c
-		 */
-		return -1;
-	}
-
 	if (is_hugetlb_dev(vfi_dev, &hugetlb_flag) || is_anon_shmem_map(vfi_dev)) {
+		vma->e->status |= VMA_AREA_REGULAR;
 		if (!(vma->e->flags & MAP_SHARED))
 			vma->e->status |= VMA_ANON_PRIVATE;
 		else
diff --git a/criu/shmem.c b/criu/shmem.c
index 81e701586..2de2ea9af 100644
--- a/criu/shmem.c
+++ b/criu/shmem.c
@@ -724,7 +724,7 @@ static int next_data_segment(int fd, unsigned long pfn, unsigned long *next_data
 	return 0;
 }
 
-static int do_dump_one_shmem(int fd, void *addr, struct shmem_info *si)
+static int do_dump_one_shmem(int fd, void *addr, struct shmem_info *si, bool seek_data_supported)
 {
 	struct page_pipe *pp;
 	struct page_xfer xfer;
@@ -750,7 +750,7 @@ static int do_dump_one_shmem(int fd, void *addr, struct shmem_info *si)
 		unsigned long pgaddr;
 		int st = -1;
 
-		if (pfn >= next_hole_pfn && next_data_segment(fd, pfn, &next_data_pnf, &next_hole_pfn))
+		if (seek_data_supported && pfn >= next_hole_pfn && next_data_segment(fd, pfn, &next_data_pnf, &next_hole_pfn))
 			goto err_xfer;
 
 		if (si->pstate_map && is_shmem_tracking_en()) {
@@ -808,20 +808,50 @@ static int dump_one_shmem(struct shmem_info *si)
 {
 	int fd, ret = -1;
 	void *addr;
+	bool seek_data_supported;
 
 	pr_info("Dumping shared memory %ld\n", si->shmid);
 
-	fd = open_proc(si->pid, "map_files/%lx-%lx", si->start, si->end);
-	if (fd < 0)
-		goto err;
 
-	addr = mmap(NULL, si->size, PROT_READ, MAP_SHARED, fd, 0);
-	if (addr == MAP_FAILED) {
-		pr_err("Can't map shmem 0x%lx (0x%lx-0x%lx)\n", si->shmid, si->start, si->end);
-		goto errc;
+	fd = __open_proc(si->pid, EPERM, O_RDONLY, "map_files/%lx-%lx", si->start, si->end);
+	if (fd >= 0) {
+		addr = mmap(NULL, si->size, PROT_READ, MAP_SHARED, fd, 0);
+		if (addr == MAP_FAILED) {
+			pr_err("Can't map shmem 0x%lx (0x%lx-0x%lx)\n", si->shmid, si->start, si->end);
+			goto errc;
+		}
+
+		seek_data_supported = true;
+	} else {
+		if(errno != EPERM) {
+			goto err;
+		}
+
+		fd = open_proc(si->pid, "mem");
+		if(fd < 0) {
+			goto err;
+		}
+
+		addr = mmap(NULL, si->size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+		if (addr == MAP_FAILED) {
+			pr_err("Can't map empty space for shmem 0x%lx (0x%lx-0x%lx)\n", si->shmid, si->start, si->end);
+			goto errc;
+		}
+
+		if(lseek(fd, si->start, SEEK_SET) < 0) {
+			pr_perror("Can't seek virtual memory");
+			return -1;
+		}
+
+		if(read(fd, addr, si->size) < si->size) {
+			pr_perror("Can't read virtual memory");
+			return -1;
+		}
+
+		seek_data_supported = false;
 	}
 
-	ret = do_dump_one_shmem(fd, addr, si);
+	ret = do_dump_one_shmem(fd, addr, si, seek_data_supported);
 
 	munmap(addr, si->size);
 errc:
@@ -849,7 +879,7 @@ int dump_one_memfd_shmem(int fd, unsigned long shmid, unsigned long size)
 		goto err;
 	}
 
-	ret = do_dump_one_shmem(fd, addr, &si);
+	ret = do_dump_one_shmem(fd, addr, &si, true);
 
 	munmap(addr, size);
 err:
@@ -875,7 +905,7 @@ int dump_one_sysv_shmem(void *addr, unsigned long size, unsigned long shmid)
 	if (fd < 0)
 		return -1;
 
-	ret = do_dump_one_shmem(fd, addr, si);
+	ret = do_dump_one_shmem(fd, addr, si, true);
 	close(fd);
 	return ret;
 }
diff --git a/criu/sockets.c b/criu/sockets.c
index db772707b..d50866c65 100644
--- a/criu/sockets.c
+++ b/criu/sockets.c
@@ -12,6 +12,7 @@
 
 #include "int.h"
 #include "bitops.h"
+#include "cr_options.h"
 #include "libnetlink.h"
 #include "sockets.h"
 #include "unix_diag.h"
@@ -457,6 +458,27 @@ int sk_collect_one(unsigned ino, int family, struct socket_desc *d, struct ns_id
 
 int do_restore_opt(int sk, int level, int name, void *val, int len)
 {
+	void *buf;
+	socklen_t cur_len;
+	bool match;
+
+	if(opts.unprivileged) {
+		buf = xmalloc(len);
+		if (!buf) {
+			return -1;
+		}
+		cur_len = len;
+		if (getsockopt(sk, level, name, buf, &cur_len) < 0) {
+			pr_perror("Can't get %d:%d (len %d)", level, name, len);
+			return -1;
+		}
+		match = cur_len == len && memcmp(buf, val, len) == 0;
+		xfree(buf);
+		if(match) {
+			return 0;
+		}
+	}
+
 	if (setsockopt(sk, level, name, val, len) < 0) {
 		pr_perror("Can't set %d:%d (len %d)", level, name, len);
 		return -1;
@@ -469,9 +491,20 @@ static int sk_setbufs(void *arg, int fd, pid_t pid)
 {
 	u32 *buf = (u32 *)arg;
 
-	if (restore_opt(fd, SOL_SOCKET, SO_SNDBUFFORCE, &buf[0]))
+	int snd_opt_name;
+	int rcv_opt_name;
+
+	if (!opts.unprivileged) {
+		snd_opt_name = SO_SNDBUFFORCE;
+		rcv_opt_name = SO_RCVBUFFORCE;
+	} else {
+		snd_opt_name = SO_SNDBUF;
+		rcv_opt_name = SO_RCVBUF;
+	}
+
+	if (restore_opt(fd, SOL_SOCKET, snd_opt_name, &buf[0]))
 		return -1;
-	if (restore_opt(fd, SOL_SOCKET, SO_RCVBUFFORCE, &buf[1]))
+	if (restore_opt(fd, SOL_SOCKET, rcv_opt_name, &buf[1]))
 		return -1;
 
 	return 0;

The omission of vma->e->status |= VMA_AREA_REGULAR; seems to be a somewhat unrelated bug that should probably have been noticed earlier. Using lseek + read + vmsplice is inefficient, I guess process_vm_readv should be used instead, but I just wanted to get it working first. I also updated do_restore_opt to not set options if they are already correct because we might not have enough permissions for that, and SO_RCVBUF/SO_SNDBUF are used instead of their *FORCE counterparts in unprivileged mode. Maybe some of this is technically wrong, please take a better look (e.g. perhaps inaccessible map_files should be handled similarly in dump_one_sysv_shmem too).

I tested this inside

podman run -it --cap-add=cap_checkpoint_restore --cap-add=cap_sys_ptrace --cap-add=cap_sys_resource --security-opt seccomp=unconfined --security-opt unmask=/proc/sys php:8.1-apache bash

CAP_CHECKPOINT_RESTORE is needed for obvious reasons, CAP_SYS_PTRACE is for process_vm_readv IIRC, CAP_SYS_RESOURCE is for rlimit, seccomp is so that vmsplice does not yield EPERM (I don't quite understand the reason for this one tbh), /proc/sys is unmasked for ns_next_pid. CAP_NET_ADMIN is not used due to a bug described in one of my review comments on this PR.

Hope this helps someone.

ymanton · 2022-10-28T15:34:03Z

@imachug Thanks for testing, I'll get back to you shortly on your comments.

ymanton · 2022-11-01T17:01:45Z

Perhaps this should better be asked in a separate issue, but I'm not sure if what I'm asking about is reasonable, so here goes.

One thing this PR has not handled at all is map_files. Problem is, proc(5) says that

Capabilities are required to read the contents of the symbolic links in this directory: before Linux 5.9, the reading process requires CAP_SYS_ADMIN in the initial user namespace; since Linux 5.9, the reading process must have either CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE in the user namespace where it resides.

...but that does not match what the kernel does, e.g. on HEAD we have

https://github.com/torvalds/linux/blob/27bc50fc90647bbf7b734c3fc306a5e61350da53/fs/proc/base.c#L2249

...which checks whether CAP_CHECKPOINT_RESTORE is present in the init user namespace. This means that reading map_files always fails in containers.

You're right about the man-page, it appears to be incorrect. I've sent an email to the maintainers and list and CC'd you. Thanks for catching that. As for what to do about it or how to improve it, I'm not exactly sure. Wouldn't the kernel behaviour have to change or is there something you think CRIU can do better if we're not in the init user ns but need access to map_files contents?

purplesyringa · 2022-11-01T17:13:25Z

Can we avoid using map_files in this case and read the memory of the process directly?

purplesyringa · 2022-11-01T17:33:35Z

Ah, I guess I see what the problem is: we want to dump the whole file, and mmap may only map the file partially. Moreover, for deleted files, we might not even know if two mappings are of the same file or not. Given that mremap does not really work for shared memory, I guess the best that can be done is a) attempting to modify one mapping and checking how it affects others, to figure out which mappings are parts of the same file, and then b) dumping different parts of each file from memory of different processes, if no single process contains the whole file. This leaves the case when a part of a file is not mapped anywhere, but then we can just fill it with zeroes -- that should be indistinguishable from the "real" file if map_files is unavailable.

adrianreber reviewed Jun 28, 2022

View reviewed changes

criu/include/kerndat.h Outdated Show resolved Hide resolved

scripts/ci/vagrant.sh Outdated Show resolved Hide resolved