Allow CRIU to be used as non-root (Take 2) #1930
@ymanton Thanks for picking up #1155. I had the following proposal in a chat on how to deal with kdat as non-root. Currently the nice thing about the kdat file is that it will disappear after a reboot because it is on a tmpfs. Unfortunately as non-root there is no easy way to allow all users to write to
Thanks for working on this. I want to ask one more thing. We are trying to keep the commit history in the main tree as clean as possible. The main idea is very similar to the Linux kernel process. In this PR, I see that a few patches fix problems introduced by other patches. I think we need to fold all fixes into the changes they belong to. I understand that the original patches are authored by @adrianreber. I think we can use tags like Co-authored-by, Co-developed-by, Originally-by to mention all developers involved in the process.
No problem, thanks for letting me know. I've merged all my changes together with @adrianreber's and added Co-authored-bys to the patches that were significantly altered. Going forward I'll maintain them this way as things progress. |
A friendly reminder that this PR had no activity for 30 days. |
Looks good. Thanks for picking it up and thanks for solving the open questions.
The idea behind the rootless CRIU code is that CRIU reads out its effective capabilities and stores them in the global opts structure. Different parts of CRIU can then, based on the existing capabilities, automatically enable or disable certain code paths. Currently at least CAP_CHECKPOINT_RESTORE is required. CRIU will not start without this capability.
Signed-off-by: Adrian Reber <areber@redhat.com>
This adds the function check_caps(), which checks whether CRIU is running with at least CAP_CHECKPOINT_RESTORE. That is the minimum capability CRIU needs to do a minimal checkpoint and restore from it. In addition, helper functions are added to easily query for other capabilities for enhanced checkpoint/restore support.
Co-authored-by: Younes Manton <ymanton@ca.ibm.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
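The capability handling described in the two commit messages above can be sketched roughly as follows. This is a minimal illustration, not CRIU's actual code: `struct demo_opts` and the `demo_*` functions are hypothetical names, and only the `capget(2)` system call and the kernel's `CAP_CHECKPOINT_RESTORE` constant (value 40) are real. The sketch assumes a Linux system with the kernel uapi headers installed.

```c
#define _GNU_SOURCE
#include <linux/capability.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Older kernel headers may not define this capability yet. */
#ifndef CAP_CHECKPOINT_RESTORE
#define CAP_CHECKPOINT_RESTORE 40
#endif

/* Hypothetical stand-in for the relevant fields of a global options struct. */
struct demo_opts {
	bool unprivileged;
	uint32_t cap_eff[2]; /* 64 capability bits as two 32-bit words */
};

/* Read the calling process's effective capability set via capget(2). */
int demo_read_cap_eff(struct demo_opts *o)
{
	struct __user_cap_header_struct hdr;
	struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];

	memset(&hdr, 0, sizeof(hdr));
	memset(data, 0, sizeof(data));
	hdr.version = _LINUX_CAPABILITY_VERSION_3;
	hdr.pid = 0; /* 0 means the calling process */

	if (syscall(SYS_capget, &hdr, data) < 0)
		return -1;

	o->cap_eff[0] = data[0].effective;
	o->cap_eff[1] = data[1].effective;
	return 0;
}

/* Query one capability bit in the stored effective set. */
bool demo_has_cap(const struct demo_opts *o, int cap)
{
	if (cap < 0 || cap >= 64)
		return false;
	return (o->cap_eff[cap / 32] >> (cap % 32)) & 1u;
}

/* Minimum requirement: refuse to continue without CAP_CHECKPOINT_RESTORE. */
int demo_check_caps(const struct demo_opts *o)
{
	return demo_has_cap(o, CAP_CHECKPOINT_RESTORE) ? 0 : -1;
}
```

A caller would fill the struct once at startup with `demo_read_cap_eff()`, bail out if `demo_check_caps()` fails, and later use `demo_has_cap()` to decide whether an optional code path can be attempted.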
This commit enables checkpointing and restoring of applications as non-root. The first goal was to enable checkpoint and restore of the env00 and pthread00 test cases. This uses the information from opts.unprivileged and opts.cap_eff to skip certain code paths which do not work as non-root.
Co-authored-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
This patch modifies how kerndat is handled in unprivileged mode. Initialization and functionality that can only be done as root is made separate from common code. The kerndat file's location is defined as $XDG_RUNTIME_DIR/criu.kdat in unprivileged mode. Since we expect that directory to be on tmpfs we maintain the same behavior as the root-mode kerndat which lives in /run.
Co-authored-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
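The path selection described in this commit message might look something like the sketch below. `demo_kdat_path()` is a hypothetical helper, not a CRIU function. The behaviour when `XDG_RUNTIME_DIR` is unset (here: return an error) is purely an assumption, since the conversation leaves that question open.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define DEMO_KDAT_NAME "criu.kdat"

/*
 * Build the kerndat cache path into buf.
 * Root mode uses /run; unprivileged mode uses $XDG_RUNTIME_DIR.
 * Returns 0 on success, -1 if the directory is unknown or buf is too small.
 */
int demo_kdat_path(bool unprivileged, char *buf, size_t len)
{
	const char *dir = "/run";
	int n;

	if (unprivileged) {
		dir = getenv("XDG_RUNTIME_DIR");
		/* Assumption: treat an unset or empty variable as an error. */
		if (!dir || dir[0] == '\0')
			return -1;
	}

	n = snprintf(buf, len, "%s/%s", dir, DEMO_KDAT_NAME);
	if (n < 0 || (size_t)n >= len)
		return -1; /* output error or truncated path */
	return 0;
}
```

Both locations are expected to be on tmpfs, so in either mode the cached file disappears on reboot.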
This adds the non-root section and information about the parameter --unprivileged to the man page.
Co-authored-by: Anna Singleton <annabeths111@gmail.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Anna Singleton <annabeths111@gmail.com>
Why have the CI checks not been executed?
There was a GitHub Actions outage yesterday. @ymanton please push once more to trigger CI.
These are the minimal changes to make zdtm.py successfully run the env00 and pthread00 test cases as non-root using the '--rootless' zdtm option.
Co-authored-by: Younes Manton <ymanton@ca.ibm.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
Run the env00 and pthread00 tests as non-root as an initial proof of concept.
Signed-off-by: Adrian Reber <areber@redhat.com>
The cross-compile test results fail sometimes because of broken repositories, as in this case, and can be ignored.
The CentOS 7 errors are because of a package installation problem. @rst0git I think the epel package installation fix is from you. Can you maybe add a
The Vagrant errors seem to be something new for shared memory, already tracked elsewhere: #1982
The non-root Vagrant test setup is running successfully. I think from the CI side this is ready.
@adrianreber Thanks for double checking, I was unsure about |
@ymanton thank you for moving this pr to the finish line. @adrianreber thank you for implementing the kernel part and the initial version of the userspace changes. |
Perhaps this should better be asked in a separate issue, but I'm not sure if what I'm asking about is reasonable, so here goes. One thing this PR has not handled at all is
...but that does not match what the kernel does, e.g. on HEAD we have ...which checks whether CAP_CHECKPOINT_RESTORE is present in the init user namespace. This means that reading I am not sure if deleted files can be gracefully handled in the general case, but the comment suggests so. Would this be an incentive to implement that sooner than later? UPD: actually, the current logic seems fishy as is:
because the implementation of |
You certainly know what you are doing more than me, but perhaps it might be of interest to you to know what I needed to patch to get CRIU of httpd in podman working:
Patch
diff --git a/criu/proc_parse.c b/criu/proc_parse.c
index 946b0fc40..a24783da7 100644
--- a/criu/proc_parse.c
+++ b/criu/proc_parse.c
@@ -313,25 +313,8 @@ static int vma_get_mapfile_user(const char *fname, struct vma_area *vma, struct
vfi_dev = makedev(vfi->dev_maj, vfi->dev_min);
- if (is_memfd(vfi_dev)) {
- char tmp[PATH_MAX];
- strlcpy(tmp, fname, PATH_MAX);
- strip_deleted(tmp, strlen(tmp));
-
- /*
- * The error EPERM will be shown in the following pr_perror().
- * It comes from the previous open() call.
- */
- pr_perror("Can't open mapped [%s]", tmp);
-
- /*
- * TODO Perhaps we could do better than failing and dump the
- * memory like what is being done in shmem.c
- */
- return -1;
- }
-
if (is_hugetlb_dev(vfi_dev, &hugetlb_flag) || is_anon_shmem_map(vfi_dev)) {
+ vma->e->status |= VMA_AREA_REGULAR;
if (!(vma->e->flags & MAP_SHARED))
vma->e->status |= VMA_ANON_PRIVATE;
else
diff --git a/criu/shmem.c b/criu/shmem.c
index 81e701586..2de2ea9af 100644
--- a/criu/shmem.c
+++ b/criu/shmem.c
@@ -724,7 +724,7 @@ static int next_data_segment(int fd, unsigned long pfn, unsigned long *next_data
return 0;
}
-static int do_dump_one_shmem(int fd, void *addr, struct shmem_info *si)
+static int do_dump_one_shmem(int fd, void *addr, struct shmem_info *si, bool seek_data_supported)
{
struct page_pipe *pp;
struct page_xfer xfer;
@@ -750,7 +750,7 @@ static int do_dump_one_shmem(int fd, void *addr, struct shmem_info *si)
unsigned long pgaddr;
int st = -1;
- if (pfn >= next_hole_pfn && next_data_segment(fd, pfn, &next_data_pnf, &next_hole_pfn))
+ if (seek_data_supported && pfn >= next_hole_pfn && next_data_segment(fd, pfn, &next_data_pnf, &next_hole_pfn))
goto err_xfer;
if (si->pstate_map && is_shmem_tracking_en()) {
@@ -808,20 +808,50 @@ static int dump_one_shmem(struct shmem_info *si)
{
int fd, ret = -1;
void *addr;
+ bool seek_data_supported;
pr_info("Dumping shared memory %ld\n", si->shmid);
- fd = open_proc(si->pid, "map_files/%lx-%lx", si->start, si->end);
- if (fd < 0)
- goto err;
- addr = mmap(NULL, si->size, PROT_READ, MAP_SHARED, fd, 0);
- if (addr == MAP_FAILED) {
- pr_err("Can't map shmem 0x%lx (0x%lx-0x%lx)\n", si->shmid, si->start, si->end);
- goto errc;
+ fd = __open_proc(si->pid, EPERM, O_RDONLY, "map_files/%lx-%lx", si->start, si->end);
+ if (fd >= 0) {
+ addr = mmap(NULL, si->size, PROT_READ, MAP_SHARED, fd, 0);
+ if (addr == MAP_FAILED) {
+ pr_err("Can't map shmem 0x%lx (0x%lx-0x%lx)\n", si->shmid, si->start, si->end);
+ goto errc;
+ }
+
+ seek_data_supported = true;
+ } else {
+ if(errno != EPERM) {
+ goto err;
+ }
+
+ fd = open_proc(si->pid, "mem");
+ if(fd < 0) {
+ goto err;
+ }
+
+ addr = mmap(NULL, si->size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ if (addr == MAP_FAILED) {
+ pr_err("Can't map empty space for shmem 0x%lx (0x%lx-0x%lx)\n", si->shmid, si->start, si->end);
+ goto errc;
+ }
+
+ if(lseek(fd, si->start, SEEK_SET) < 0) {
+ pr_perror("Can't seek virtual memory");
+ return -1;
+ }
+
+ if(read(fd, addr, si->size) < si->size) {
+ pr_perror("Can't read virtual memory");
+ return -1;
+ }
+
+ seek_data_supported = false;
}
- ret = do_dump_one_shmem(fd, addr, si);
+ ret = do_dump_one_shmem(fd, addr, si, seek_data_supported);
munmap(addr, si->size);
errc:
@@ -849,7 +879,7 @@ int dump_one_memfd_shmem(int fd, unsigned long shmid, unsigned long size)
goto err;
}
- ret = do_dump_one_shmem(fd, addr, &si);
+ ret = do_dump_one_shmem(fd, addr, &si, true);
munmap(addr, size);
err:
@@ -875,7 +905,7 @@ int dump_one_sysv_shmem(void *addr, unsigned long size, unsigned long shmid)
if (fd < 0)
return -1;
- ret = do_dump_one_shmem(fd, addr, si);
+ ret = do_dump_one_shmem(fd, addr, si, true);
close(fd);
return ret;
}
diff --git a/criu/sockets.c b/criu/sockets.c
index db772707b..d50866c65 100644
--- a/criu/sockets.c
+++ b/criu/sockets.c
@@ -12,6 +12,7 @@
#include "int.h"
#include "bitops.h"
+#include "cr_options.h"
#include "libnetlink.h"
#include "sockets.h"
#include "unix_diag.h"
@@ -457,6 +458,27 @@ int sk_collect_one(unsigned ino, int family, struct socket_desc *d, struct ns_id
int do_restore_opt(int sk, int level, int name, void *val, int len)
{
+ void *buf;
+ socklen_t cur_len;
+ bool match;
+
+ if(opts.unprivileged) {
+ buf = xmalloc(len);
+ if (!buf) {
+ return -1;
+ }
+ cur_len = len;
+ if (getsockopt(sk, level, name, buf, &cur_len) < 0) {
+ pr_perror("Can't get %d:%d (len %d)", level, name, len);
+ return -1;
+ }
+ match = cur_len == len && memcmp(buf, val, len) == 0;
+ xfree(buf);
+ if(match) {
+ return 0;
+ }
+ }
+
if (setsockopt(sk, level, name, val, len) < 0) {
pr_perror("Can't set %d:%d (len %d)", level, name, len);
return -1;
@@ -469,9 +491,20 @@ static int sk_setbufs(void *arg, int fd, pid_t pid)
{
u32 *buf = (u32 *)arg;
- if (restore_opt(fd, SOL_SOCKET, SO_SNDBUFFORCE, &buf[0]))
+ int snd_opt_name;
+ int rcv_opt_name;
+
+ if (!opts.unprivileged) {
+ snd_opt_name = SO_SNDBUFFORCE;
+ rcv_opt_name = SO_RCVBUFFORCE;
+ } else {
+ snd_opt_name = SO_SNDBUF;
+ rcv_opt_name = SO_RCVBUF;
+ }
+
+ if (restore_opt(fd, SOL_SOCKET, snd_opt_name, &buf[0]))
return -1;
- if (restore_opt(fd, SOL_SOCKET, SO_RCVBUFFORCE, &buf[1]))
+ if (restore_opt(fd, SOL_SOCKET, rcv_opt_name, &buf[1]))
return -1;
return 0;
The omission of I tested this inside
CAP_CHECKPOINT_RESTORE is needed for obvious reasons, CAP_SYS_PTRACE is for Hope this helps someone. |
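The sockets.c part of the patch above skips `setsockopt()` when the socket already holds the wanted value, so an unprivileged restore does not fail on options it has no need to change. A standalone sketch of that compare-then-set idea follows. `demo_restore_opt_if_changed()` is a hypothetical name, and plain `malloc()`/`free()` stand in for CRIU's allocation helpers. Unlike the patch as posted, this version releases the scratch buffer on the `getsockopt()` error path as well.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Set a socket option only if the current value differs from val.
 * Returns 0 if the option already matches or was set, -1 on error.
 */
int demo_restore_opt_if_changed(int sk, int level, int name,
				const void *val, socklen_t len)
{
	socklen_t cur_len = len;
	void *cur = malloc(len);
	int same;

	if (!cur)
		return -1;

	if (getsockopt(sk, level, name, cur, &cur_len) < 0) {
		free(cur); /* do not leak the scratch buffer on failure */
		return -1;
	}

	same = (cur_len == len) && memcmp(cur, val, len) == 0;
	free(cur);

	if (same)
		return 0; /* nothing to do, no privilege needed */

	return setsockopt(sk, level, name, val, len);
}
```

Note that the comparison only helps for options whose stored value reads back exactly as written. Linux doubles the value passed to `SO_SNDBUF` and `SO_RCVBUF`, so for those a raw byte comparison against the requested size would not necessarily match.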
@imachug Thanks for testing, I'll get back to you shortly on your comments. |
You're right about the man-page, it appears to be incorrect. I've sent an email to the maintainers and list and CC'd you. Thanks for catching that. As for what to do about it or how to improve it, I'm not exactly sure. Wouldn't the kernel behaviour have to change or is there something you think CRIU can do better if we're not in the init user ns but need access to |
Can we avoid using |
Ah, I guess I see what the problem is: we want to dump the whole file, and |
We at the OpenJ9 and Open Liberty projects have been experimenting with CRIU to improve JVM and Java web application start-up times, primarily in container deployments. [1] To that end we've been testing #1155 for a while now with success, and we want to help get it across the finish line.
With @adrianreber's agreement I've rebased the patches in #1155 and added 4 additional patches as follows:
Patches to add the --unprivileged option to libcriu and the RPC interface, and a fix to do the required check_caps() on the service worker path, equivalent to what's done on the tool path:
Patches to move some new code (that I believe requires root) recently landed in criu-dev to the root-only init path introduced in #1155:
As suggested by @adrianreber I'm currently working on where to store criu.kdat, if/how to use XDG_RUNTIME_PATH for that purpose, and what to do when it's not set (e.g. when running via sudo).
I'm also looking over #1155 to see if there are any other unresolved questions.
In the meantime I wanted to open this PR now in case anyone has any fresh thoughts on the subject, comments on how to proceed, and so on.
Footnotes
[1] This use-case is a bit different from using CRIU to facilitate process migration and has some unique challenges, including the fact that we're using CRIU "under the covers" in a scenario that to end-users shouldn't appear too different from simply starting a process from scratch. Allowing non-root users to restore processes, particularly inside unprivileged containers, makes this use-case much more accessible to end-users because it reduces the privileges required to "start" a process in this way, relative to starting the process from scratch.