Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow CRIU to be used as non-root (Take 2) #1930

Merged
merged 7 commits into from
Oct 25, 2022

Conversation

ymanton
Copy link
Contributor

@ymanton ymanton commented Jun 28, 2022

We at the OpenJ9 and Open Liberty projects have been experimenting with CRIU to improve JVM and Java web application start-up times, primarily in container deployments.1 To that end we've been testing #1155 for a while now and having success and we want to help get it across the finish line.


With @adrianreber's agreement I've rebased the patches in #1155 and added 4 additional patches as follows:

Patches to add the --unprivileged option to libcriu and the RPC interface, and a fix to do the required check_caps() on the service worker path, equivalent to what's done on the tool path:

  • 3807cc9 non-root: Call check_caps() in service worker mode as well
  • 8df76f0 non-root: Allow unprivileged flag to be set via libcriu

Patches to move some new code (that I believe requires root) recently landed in criu-dev to the root-only init path introduced by in #1155:

  • b32f887 non-root: Do kerndat_has_move_mount_set_group() in root-only init
  • f42e25f non-root: Do kerndat_has_nftables_concat() in root-only init

As suggested by @adrianreber I'm currently working on where to store criu.kdat, if/how to use XDG_RUNTIME_PATH for that purpose, and what to do when it's not set (e.g. when running via sudo).

I'm also looking over #1155 to see if there are any other unresolved questions.

In the meantime I wanted to open this PR now in case anyone has any fresh thoughts on the subject, comments on how to proceed, and so on.

Footnotes

  1. This use-case is a bit different from using CRIU to facilitate process migration and has some unique challenges, including the fact that we're using CRIU "under the covers" in a scenario that to end-users shouldn't appear too different from simply starting a process from scratch. Allowing non-root users to restore processes, particularly inside unprivileged containers, makes this use-case much more accessible to end-users because it reduces the privileges required to "start" a process in this way, relative to starting the process from scratch.

criu/include/kerndat.h Outdated Show resolved Hide resolved
scripts/ci/vagrant.sh Outdated Show resolved Hide resolved
@adrianreber
Copy link
Member

@ymanton Thanks for picking up #1155

I had following proposal in a chat how to deal with kdat as non-root. Currently the nice thing about the kdat file is that will disappear after a reboot because it is on a tmpfs. Unfortunately as non-root there is no easy way to allow all users to write to /run. My proposal was to use XDG_RUNTIME_DIR (like I have seen it by container engines) if it is defined. It usually points to a tmpfs. This way we automatically have a location for each user for kdat. Unfortunately XDG_RUNTIME_DIR is not set when doing su/sudo. Only a real login session set up by systemd has XDG_RUNTIME_DIR pointing to the right location. If XDG_RUNTIME_DIR is not set or not writeable, non-root CRIU should fall back to just ignore kdat and run all the checks all the time. A correctly set up session will correctly use kdat but a session through sudo/su will still work, although CRIU might be slower as it has to redo all checks everytime.

criu/cgroup.c Outdated Show resolved Hide resolved
criu/cr-check.c Outdated Show resolved Hide resolved
criu/fdstore.c Show resolved Hide resolved
criu/kerndat.c Show resolved Hide resolved
@ymanton
Copy link
Contributor Author

ymanton commented Jul 25, 2022

  • Went back and fixed the kdat file XDG_RUNTIME_DIR patch to allow the tmpfs check be done in non-root mode as well (was previously skipped since the kdat file was being stored in HOME).
  • Fixed some linking issues affecting unittest (it now uses some of the new caps util funcs @adrianreber added but the function defs were not visible at link time).
  • Addressed some review comments.

criu/fdstore.c Outdated Show resolved Hide resolved
criu/files.c Outdated Show resolved Hide resolved
criu/image.c Outdated Show resolved Hide resolved
criu/kerndat.c Outdated Show resolved Hide resolved
criu/kerndat.c Outdated Show resolved Hide resolved
criu/include/kerndat.h Outdated Show resolved Hide resolved
@avagin
Copy link
Member

avagin commented Aug 10, 2022

Thanks for working on this. I want to ask one more thing. We are trying to support the commit history in the main tree as clean as possible. The main idea is very similar to the Linux kernel process. In this PR, I see that a few patches fix problems introduced by other patches. I think we need to merge all fixes in proper changes. I understand that the origin patches are authored by @adrianreber. I think we can use tags like Co-authored-by, Co-developed-by, Originally-by to mention all developers involved in the process.

@ymanton
Copy link
Contributor Author

ymanton commented Aug 15, 2022

Thanks for working on this. I want to ask one more thing. We are trying to support the commit history in the main tree as clean as possible. The main idea is very similar to the Linux kernel process. In this PR, I see that a few patches fix problems introduced by other patches. I think we need to merge all fixes in proper changes. I understand that the origin patches are authored by @adrianreber. I think we can use tags like Co-authored-by, Co-developed-by, Originally-by to mention all developers involved in the process.

No problem, thanks for letting me know. I've merged all my changes together with @adrianreber's and added Co-authored-bys to the patches that were significantly altered. Going forward I'll maintain them this way as things progress.

@ymanton ymanton force-pushed the unprivileged branch 2 times, most recently from f38498f to d6cd284 Compare August 26, 2022 16:55
@github-actions
Copy link

A friendly reminder that this PR had no activity for 30 days.

@avagin avagin removed the stale-pr label Sep 26, 2022
@avagin avagin self-assigned this Sep 26, 2022
@avagin avagin added the no-auto-close Don't auto-close as a stale issue label Sep 26, 2022
Copy link
Member

@adrianreber adrianreber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks for picking it up and thanks for solving the open questions.

criu/files.c Outdated Show resolved Hide resolved
adrianreber and others added 5 commits October 18, 2022 07:10
The idea behind the rootless CRIU code is, that CRIU reads out its
effective capabilities and stores that in the global opts structure.

Different parts of CRIU can then, based on the existing capabilities,
automatically enable or disable certain code paths.

Currently at least CAP_CHECKPOINT_RESTORE is required. CRIU will not
start without this capability.

Signed-off-by: Adrian Reber <areber@redhat.com>
This adds the function check_caps() which checks if CRIU is running
with at least CAP_CHECKPOINT_RESTORE. That is the minimum capability
CRIU needs to do a minimal checkpoint and restore from it.

In addition helper functions are added to easily query for other
capability for enhanced checkpoint/restore support.

Co-authored-by: Younes Manton <ymanton@ca.ibm.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
This commit enables checkpointing and restoring of applications as
non-root.

First goal was to enable checkpoint and restore of the env00 and
pthread00 test case.

This uses the information from opts.unprivileged and opts.cap_eff to
skip certain code paths which do not work as non-root.

Co-authored-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
This patch modifies how kerndat is handled in unprivileged mode.

Initialization and functionality that can only be done as root is
made separate from common code. The kerndat file's location is
defined as $XDG_RUNTIME_DIR/criu.kdat in unprivileged mode. Since
we expect that directory to be on tmpfs we maintain the same behavior
as the root-mode kerndat which lives in /run.

Co-authored-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
This adds the non-root section and information about the parameter
--unprivileged to the man page.

Co-authored-by: Anna Singleton <annabeths111@gmail.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Anna Singleton <annabeths111@gmail.com>
@ymanton ymanton force-pushed the unprivileged branch 2 times, most recently from 01c43ab to bad5a60 Compare October 18, 2022 15:24
@ymanton ymanton marked this pull request as ready for review October 18, 2022 16:12
@avagin
Copy link
Member

avagin commented Oct 19, 2022

Why CI checks have not been executed?..

@adrianreber
Copy link
Member

Why CI checks have not been executed?..

There was an GitHub Actions outage yesterday. @ymanton please push once more to trigger CI.

These are the minimal changes to make zdtm.py successfully run the
env00 and pthread test case as non-root using the '--rootless' zdtm option.

Co-authored-by: Younes Manton <ymanton@ca.ibm.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
Run env00 and pthread00 test as non-root as initial proof of concept.

Signed-off-by: Adrian Reber <areber@redhat.com>
@adrianreber
Copy link
Member

The cross-compile test results fail sometimes because of broken repositories like in this case and can be ignored.

The CentOS 7 errors are because of a package installation problem. @rst0git I think the epel package installation fix is from you. Can you maybe add a ||: after the epel activation line. The problem is that on CentOS 7 epel might already be installed, if you test it in RHEL the package is not installed. CentOS 7 is anyway not relevant for the non-root CRIU operation as the kernel is too old.

The Vagrant errors seem to be something new for shared memory already tracked somewhere else: #1982

The non-root Vagrant test setup is running successfully.

I think from the CI side this is ready.

@ymanton
Copy link
Contributor Author

ymanton commented Oct 20, 2022

@adrianreber Thanks for double checking, I was unsure about Vagrant Fedora based test (no VDSO) and started looking at it.

@avagin
Copy link
Member

avagin commented Oct 25, 2022

@ymanton thank you for moving this pr to the finish line. @adrianreber thank you for implementing the kernel part and the initial version of the userspace changes.

@purplesyringa
Copy link
Contributor

purplesyringa commented Oct 27, 2022

Perhaps this should better be asked in a separate issue, but I'm not sure if what I'm asking about is reasonable, so here goes.

One thing this PR has not handled at all is map_files. Problem is, proc(5) says that

Capabilities are required to read the contents of the symbolic links in this directory: before Linux 5.9, the reading process requires CAP_SYS_ADMIN in the initial user namespace; since Linux 5.9, the reading process must have either CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE in the user namespace where it resides.

...but that does not match what the kernel does, e.g. on HEAD we have

https://github.com/torvalds/linux/blob/27bc50fc90647bbf7b734c3fc306a5e61350da53/fs/proc/base.c#L2249

...which checks whether CAP_CHECKPOINT_RESTORE is present in the init user namespace. This means that reading map_files always fails in containers. Now one would think this is seldom needed, but the kernel uses a deleted /dev/zero device for every shared anonymous mapping:

https://github.com/torvalds/linux/blob/aae703b02f92bde9264366c545e87cec451de471/mm/shmem.c#L4243-L4264

I am not sure if deleted files can be gracefully handled in the general case, but the comment suggests so. Would this be an incentive to implement that sooner than later?


UPD: actually, the current logic seems fishy as is:

if (is_memfd(vfi_dev)) {
    ...
}
if (is_hugetlb_dev(vfi_dev, &hugetlb_flag) || is_anon_shmem_map(vfi_dev)) {
    ...
}

because the implementation of is_memfd matches that of is_anon_shmem_map.

@purplesyringa
Copy link
Contributor

purplesyringa commented Oct 28, 2022

You certainly know what you are doing more than me, but perhaps it might be of interest to you to know what I needed to patch to get CRIU of httpd in podman working:

Patch
diff --git a/criu/proc_parse.c b/criu/proc_parse.c
index 946b0fc40..a24783da7 100644
--- a/criu/proc_parse.c
+++ b/criu/proc_parse.c
@@ -313,25 +313,8 @@ static int vma_get_mapfile_user(const char *fname, struct vma_area *vma, struct
 
 	vfi_dev = makedev(vfi->dev_maj, vfi->dev_min);
 
-	if (is_memfd(vfi_dev)) {
-		char tmp[PATH_MAX];
-		strlcpy(tmp, fname, PATH_MAX);
-		strip_deleted(tmp, strlen(tmp));
-
-		/*
-		 * The error EPERM will be shown in the following pr_perror().
-		 * It comes from the previous open() call.
-		 */
-		pr_perror("Can't open mapped [%s]", tmp);
-
-		/*
-		 * TODO Perhaps we could do better than failing and dump the
-		 * memory like what is being done in shmem.c
-		 */
-		return -1;
-	}
-
 	if (is_hugetlb_dev(vfi_dev, &hugetlb_flag) || is_anon_shmem_map(vfi_dev)) {
+		vma->e->status |= VMA_AREA_REGULAR;
 		if (!(vma->e->flags & MAP_SHARED))
 			vma->e->status |= VMA_ANON_PRIVATE;
 		else
diff --git a/criu/shmem.c b/criu/shmem.c
index 81e701586..2de2ea9af 100644
--- a/criu/shmem.c
+++ b/criu/shmem.c
@@ -724,7 +724,7 @@ static int next_data_segment(int fd, unsigned long pfn, unsigned long *next_data
 	return 0;
 }
 
-static int do_dump_one_shmem(int fd, void *addr, struct shmem_info *si)
+static int do_dump_one_shmem(int fd, void *addr, struct shmem_info *si, bool seek_data_supported)
 {
 	struct page_pipe *pp;
 	struct page_xfer xfer;
@@ -750,7 +750,7 @@ static int do_dump_one_shmem(int fd, void *addr, struct shmem_info *si)
 		unsigned long pgaddr;
 		int st = -1;
 
-		if (pfn >= next_hole_pfn && next_data_segment(fd, pfn, &next_data_pnf, &next_hole_pfn))
+		if (seek_data_supported && pfn >= next_hole_pfn && next_data_segment(fd, pfn, &next_data_pnf, &next_hole_pfn))
 			goto err_xfer;
 
 		if (si->pstate_map && is_shmem_tracking_en()) {
@@ -808,20 +808,50 @@ static int dump_one_shmem(struct shmem_info *si)
 {
 	int fd, ret = -1;
 	void *addr;
+	bool seek_data_supported;
 
 	pr_info("Dumping shared memory %ld\n", si->shmid);
 
-	fd = open_proc(si->pid, "map_files/%lx-%lx", si->start, si->end);
-	if (fd < 0)
-		goto err;
 
-	addr = mmap(NULL, si->size, PROT_READ, MAP_SHARED, fd, 0);
-	if (addr == MAP_FAILED) {
-		pr_err("Can't map shmem 0x%lx (0x%lx-0x%lx)\n", si->shmid, si->start, si->end);
-		goto errc;
+	fd = __open_proc(si->pid, EPERM, O_RDONLY, "map_files/%lx-%lx", si->start, si->end);
+	if (fd >= 0) {
+		addr = mmap(NULL, si->size, PROT_READ, MAP_SHARED, fd, 0);
+		if (addr == MAP_FAILED) {
+			pr_err("Can't map shmem 0x%lx (0x%lx-0x%lx)\n", si->shmid, si->start, si->end);
+			goto errc;
+		}
+
+		seek_data_supported = true;
+	} else {
+		if(errno != EPERM) {
+			goto err;
+		}
+
+		fd = open_proc(si->pid, "mem");
+		if(fd < 0) {
+			goto err;
+		}
+
+		addr = mmap(NULL, si->size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+		if (addr == MAP_FAILED) {
+			pr_err("Can't map empty space for shmem 0x%lx (0x%lx-0x%lx)\n", si->shmid, si->start, si->end);
+			goto errc;
+		}
+
+		if(lseek(fd, si->start, SEEK_SET) < 0) {
+			pr_perror("Can't seek virtual memory");
+			return -1;
+		}
+
+		if(read(fd, addr, si->size) < si->size) {
+			pr_perror("Can't read virtual memory");
+			return -1;
+		}
+
+		seek_data_supported = false;
 	}
 
-	ret = do_dump_one_shmem(fd, addr, si);
+	ret = do_dump_one_shmem(fd, addr, si, seek_data_supported);
 
 	munmap(addr, si->size);
 errc:
@@ -849,7 +879,7 @@ int dump_one_memfd_shmem(int fd, unsigned long shmid, unsigned long size)
 		goto err;
 	}
 
-	ret = do_dump_one_shmem(fd, addr, &si);
+	ret = do_dump_one_shmem(fd, addr, &si, true);
 
 	munmap(addr, size);
 err:
@@ -875,7 +905,7 @@ int dump_one_sysv_shmem(void *addr, unsigned long size, unsigned long shmid)
 	if (fd < 0)
 		return -1;
 
-	ret = do_dump_one_shmem(fd, addr, si);
+	ret = do_dump_one_shmem(fd, addr, si, true);
 	close(fd);
 	return ret;
 }
diff --git a/criu/sockets.c b/criu/sockets.c
index db772707b..d50866c65 100644
--- a/criu/sockets.c
+++ b/criu/sockets.c
@@ -12,6 +12,7 @@
 
 #include "int.h"
 #include "bitops.h"
+#include "cr_options.h"
 #include "libnetlink.h"
 #include "sockets.h"
 #include "unix_diag.h"
@@ -457,6 +458,27 @@ int sk_collect_one(unsigned ino, int family, struct socket_desc *d, struct ns_id
 
 int do_restore_opt(int sk, int level, int name, void *val, int len)
 {
+	void *buf;
+	socklen_t cur_len;
+	bool match;
+
+	if(opts.unprivileged) {
+		buf = xmalloc(len);
+		if (!buf) {
+			return -1;
+		}
+		cur_len = len;
+		if (getsockopt(sk, level, name, buf, &cur_len) < 0) {
+			pr_perror("Can't get %d:%d (len %d)", level, name, len);
+			return -1;
+		}
+		match = cur_len == len && memcmp(buf, val, len) == 0;
+		xfree(buf);
+		if(match) {
+			return 0;
+		}
+	}
+
 	if (setsockopt(sk, level, name, val, len) < 0) {
 		pr_perror("Can't set %d:%d (len %d)", level, name, len);
 		return -1;
@@ -469,9 +491,20 @@ static int sk_setbufs(void *arg, int fd, pid_t pid)
 {
 	u32 *buf = (u32 *)arg;
 
-	if (restore_opt(fd, SOL_SOCKET, SO_SNDBUFFORCE, &buf[0]))
+	int snd_opt_name;
+	int rcv_opt_name;
+
+	if (!opts.unprivileged) {
+		snd_opt_name = SO_SNDBUFFORCE;
+		rcv_opt_name = SO_RCVBUFFORCE;
+	} else {
+		snd_opt_name = SO_SNDBUF;
+		rcv_opt_name = SO_RCVBUF;
+	}
+
+	if (restore_opt(fd, SOL_SOCKET, snd_opt_name, &buf[0]))
 		return -1;
-	if (restore_opt(fd, SOL_SOCKET, SO_RCVBUFFORCE, &buf[1]))
+	if (restore_opt(fd, SOL_SOCKET, rcv_opt_name, &buf[1]))
 		return -1;
 
 	return 0;

The omission of vma->e->status |= VMA_AREA_REGULAR; seems to be a somewhat unrelated bug that should probably have been noticed earlier. Using lseek + read + vmsplice is inefficient, I guess process_vm_readv should be used instead, but I just wanted to get it working first. I also updated do_restore_opt to not set options if they are already correct because we might not have enough permissions for that, and SO_RCVBUF/SO_SNDBUF are used instead of their *FORCE counterparts in unprivileged mode. Maybe some of this is technically wrong, please take a better look (e.g. perhaps inaccessible map_files should be handled similarly in dump_one_sysv_shmem too).

I tested this inside

podman run -it --cap-add=cap_checkpoint_restore --cap-add=cap_sys_ptrace --cap-add=cap_sys_resource --security-opt seccomp=unconfined --security-opt unmask=/proc/sys php:8.1-apache bash

CAP_CHECKPOINT_RESTORE is needed for obvious reasons, CAP_SYS_PTRACE is for process_vm_readv IIRC, CAP_SYS_RESOURCE is for rlimit, seccomp is so that vmsplice does not yield EPERM (I don't quite understand the reason for this one tbh), /proc/sys is unmasked for ns_next_pid. CAP_NET_ADMIN is not used due to a bug described in one of my review comments on this PR.

Hope this helps someone.

@ymanton
Copy link
Contributor Author

ymanton commented Oct 28, 2022

@imachug Thanks for testing, I'll get back to you shortly on your comments.

@ymanton
Copy link
Contributor Author

ymanton commented Nov 1, 2022

Perhaps this should better be asked in a separate issue, but I'm not sure if what I'm asking about is reasonable, so here goes.

One thing this PR has not handled at all is map_files. Problem is, proc(5) says that

Capabilities are required to read the contents of the symbolic links in this directory: before Linux 5.9, the reading process requires CAP_SYS_ADMIN in the initial user namespace; since Linux 5.9, the reading process must have either CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE in the user namespace where it resides.

...but that does not match what the kernel does, e.g. on HEAD we have

https://github.com/torvalds/linux/blob/27bc50fc90647bbf7b734c3fc306a5e61350da53/fs/proc/base.c#L2249

...which checks whether CAP_CHECKPOINT_RESTORE is present in the init user namespace. This means that reading map_files always fails in containers.

You're right about the man-page, it appears to be incorrect. I've sent an email to the maintainers and list and CC'd you. Thanks for catching that. As for what to do about it or how to improve it, I'm not exactly sure. Wouldn't the kernel behaviour have to change or is there something you think CRIU can do better if we're not in the init user ns but need access to map_files contents?

@purplesyringa
Copy link
Contributor

Can we avoid using map_files in this case and read the memory of the process directly?

@purplesyringa
Copy link
Contributor

Ah, I guess I see what the problem is: we want to dump the whole file, and mmap may only map the file partially. Moreover, for deleted files, we might not even know if two mappings are of the same file or not. Given that mremap does not really work for shared memory, I guess the best that can be done is a) attempting to modify one mapping and checking how it affects others, to figure out which mappings are parts of the same file, and then b) dumping different parts of each file from memory of different processes, if no single process contains the whole file. This leaves the case when a part of a file is not mapped anywhere, but then we can just fill it with zeroes -- that should be indistinguishable from the "real" file if map_files is unavailable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-auto-close Don't auto-close as a stale issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants