-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GSoC 2021] Introduce io_uring dump/restore support #1597
base: criu-dev
Are you sure you want to change the base?
Conversation
Hi, Kumar! Please take a look on tests failures.
Regards, |
I started the CI tests which were not running. |
7ca36e7
to
cab53ef
Compare
This should fix everything except the 'header not found' problem for <linux/io_uring.h>. |
@@ -704,6 +709,8 @@ int prepare_mm_pid(struct pstree_item *i) | |||
ret = collect_filemap(vma); | |||
else if (vma_area_is(vma, VMA_AREA_SOCKET)) | |||
ret = collect_socket_map(vma); | |||
else if (vma_area_is(vma, VMA_AREA_IO_URING)) | |||
ret = collect_io_uring_map(vma); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to make sure that CRIU continues to work with older kernel versions that don't support io_uring. Adding a kerndat check for that might help. For example, consider the case when a process tree (container) has been migrated to a system with older kernel version where io_uring is not supported.
criu/proc_parse.c
Outdated
@@ -792,6 +817,7 @@ int parse_smaps(pid_t pid, struct vm_area_list *vma_area_list, dump_filemap_t du | |||
if (handle_vma(pid, vma_area, str + path_off, map_files_dir, &vfi, &prev_vfi, &vm_file_fd)) | |||
goto err; | |||
|
|||
//pr_info("dump_filemap=%p IO_URING=%d?\n", dump_filemap, (int) vma_area_is(vma_area, VMA_AREA_IO_URING)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be pr_debug
?
You can add |
IoUringFileEntry *iofe = io_uring_get_iofe(ctx); | ||
unsigned int nr; | ||
pid_t pid; | ||
int r; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
str
, r
, and f
are not very descriptive. Maybe we could use better variable names?
@rst0git Thanks for the review. I'll update this with your suggestions applied once I get the kernel side patches in order, which shouldn't take too long. |
A prerequisite series for this and other stuff planned for later got in, now the iterators are left. After that (leaving out some specific things on CRIU side, like buffer support), this can move forward. |
The iterators are in the criu-iter branch, so if someone is feeling bored, I'd appreciate review from as many people :). Note that this is feature complete , i.e. all tests part of the branch pass on BPF CI (when run locally using vmtest.sh in Note:
|
Change made through this commit: - Include copy of flog as a seperate tree. - Modify the makefile to add and compile flog code. Signed-off-by: prakritigoyal19 <prakritigoyal19@gmail.com>
CID 302713 (checkpoint-restore#1 of 1): Missing varargs init or cleanup (VARARGS) va_end was not called for argptr. Signed-off-by: Adrian Reber <areber@redhat.com>
Separate commit for easier criu-dev <-> master transfer. Acked-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Adrian Reber <areber@redhat.com>
Support for external net namespaces has been introduced with commit c2b21fb (criu: add support for external net namespaces). Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
Currently CRIU cannot handle Checkpoint Restore operations when a device file is involved in a process, however, CRIU allows flexible extensions via special plugins but still, for certain complex devices such as a GPU, the existing hooks are not sufficient. This introduces few new hooks that will be used to support Checkpoint Restore operation with AMD GPU devices and potentially to other similar devices too. - HANDLE_DEVICE_VMA - UPDATE_VMA_MAP - RESUME_DEVICES_LATE *HANDLE_DEVICE_VMA: Hook to detect a suitable plugin to handle device file VMA with PF | IO mappings. *UPDATE_VMA_MAP: Hook to handle VMAs during a device file restore. When restoring VMAs for the device files, criu runs sys_mmap in the pie restore context but the offsets and file path within a device file may change during restore operation so it needs to be adjusted properly. *RESUME_DEVICES_LATE: Hook to do some special handling in late restore phase. During criu restore phase when a device is getting restored with the help of a plugin, some device specific operations might need to be delayed until criu finalizes the VMA placements in address space of the target process. But by the time criu finalizes this, its too late since pie phase is over and control is back to criu master process. This hook allows an external trigger to each resuming task to check whether it has a device specific operation pending such as issuing an ioctl call? Since this is called from criu master process context, supply the pid of the target process and give a chance to each plugin registered to run device specific operation if the target pid is valid. A future patch will add consumers for these plugin hooks to support AMD GPUs. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
This is just a placeholder dummy plugin and will be replaced by a proper plugin that implements support for AMD GPU devices. This just facilitates the initial pull request and CI build test trigger for early code review of CRIU specific changes. Future PRs will bring in more support for amdgpu_plugin to enable CRIU with AMD ROCm. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Restore operation fails when we perform CR operation of multiple independent proceses that have device files because criu caches the ids for the device files with same mnt_ids, inode pair. This change ensures that even in case of a cached id found for a device, a unique subid is generated and returned which is used for dumping. Suggested-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Since commit e42f5e0 ("tcp: allow to specify --tcp-close on dump"), --tcp-close option can be used when checkpointing. This option skips checkpointing established socket's state (including once established but now closed socket). However, when restoring, we still try to restore closed socket's state. As a result, a non-existent protobuf image is opened. This commit skips TCP_CLOSE socket when restoring established TCP connection and removes the redundant check for TCP_LISTEN socket as TCP_LISTEN socket cannot reach this function. Suggested-by: Andrei Vagin <avagin@gmail.com> Suggested-by: Radostin Stoyanov <radostin@redhat.com> Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
The expected behavior of --tcp-close option when dumpping is to close all established tcp connections including connection that is once established but now closed. This adds an explicit description about that behavior. Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Resolve the following python3 portability issues: 1) Python 3 needs explicit relative import path. 2) Coredumps are binary data, not unicode strings. Use byte strings (b"" instead of "") and open files in binary format. 3) Some functions (for example: filter) return a list in python 2, but an iterator in python 3. Port code to a common subset of python 2 and python 3 using itertool. 4) Division operator / changed meaning in Python 3. Use explicit integer division (//) where appropriate. Signed-off-by: Andrey Vyazovtsev <viazovtsev.av@phystech.edu>
Previous commit added support for python3 in criu-coredump. For convenience, add two files (coredump-python2 and coredump-python3) that start criu-coredump with respective python version. Edit env.sh accordingly. Signed-off-by: Andrey Vyazovtsev <viazovtsev.av@phystech.edu>
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
PEP8 recommends for comparisons to singletons like None to always be done with 'is' or 'is not', never the equality operators. https://python.org/dev/peps/pep-0008/#programming-recommendations Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
We face that btrfs returns anonymous device in stat instead of real superblock dev for volumes, thus all btrfs volume mounts does not pass check_mountpoint_fd due to dev missmatch between stat and mountinfo. We can use special helper get_sdev_from_fd instead of stat to try to get real dev of fd for btrfs. We move check_mountpoint_fd from open_mountpoint into get_clean_fd and ns_open_mountpoint to the point where temporary mount we open fd to is still in mountinfo, thus get_sdev_from_fd would be able to find tmp mount in mountinfo. Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
All the bugs that were in the way got fixed. We can enable all tests. Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
CI badges and logo are already present in the readme file. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Signed-off-by: Adrian Reber <areber@redhat.com>
A couple of months (or years) ago I looked into lgtm.com for CRIU. Today on a pull request I saw result from lgtm.com for the first time and it failed. Not sure what triggered the lgtm.com message into the CRIU repository, but with the .lgtm.yml file in this commit lgtm.com can actually build CRIU. Signed-off-by: Adrian Reber <areber@redhat.com>
This commit adds feature check support to libcriu. It already exists in the CLI and RPC and this just extends it to libcriu. This commit provides one function to do all possible feature checks in one call. The parameter to the feature check function is a structure and the user can enable which features should be checked. Using a structure makes the function extensible without the need to break the API/ABI in the future. Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
When requested iovs are huge, criu needs to invoke more then one preadv()s. In this situation criu truncates memory image with offset of first preadv() and length of last one, which leads to leakage of memory image. This patch fixs truncating with right offset and length. Signed-off-by: Liu Hua <weldonliu@tencent.com>
This case sometimes will cause SIGILL signal in arm64 platform. <<ARM Coretex-A series Programmer's Guide for ARMv8-A>> notes: The ARM architecture does not require the hardware to ensure coherency between instruction caches and memory, even for locations of shared memory. Therefore, we need flush dcache and icache for self-modifying code. - https://developer.arm.com/documentation/den0024/a/Caches/Point-of-coherency-and-unification Signed-off-by: fu.lin <fulin10@huawei.com>
This is a confusing change as it seems the original code was just wrong. GCC 12 complains with: In function ‘__conv_val’, inlined from ‘std_strtoul’ at compel/plugins/std/string.c:202:7: compel/plugins/std/string.c:154:24: error: array subscript 97 is above array bounds of ‘const char[37]’ [-Werror=array-bounds] 154 | return &conv_tab[__tolower(c)] - conv_tab; | ^~~~~~~~~~~~~~~~~~~~~~~ compel/plugins/std/string.c: In function ‘std_strtoul’: compel/plugins/std/string.c:10:19: note: while referencing ‘conv_tab’ 10 | static const char conv_tab[] = "0123456789abcdefghijklmnopqrstuvwxyz"; | ^~~~~~~~ cc1: all warnings being treated as errors Which sounds correct. The array conv_tab has just 37 elements. If I understand the code correctly we are trying to convert anything that is character between a-z and A-Z to a number for cases where the base is larger than 10. For a base 11 conversion b|B should return 11. For a base 35 conversion z|Z should return 35. This is all for a strtoul() implementation. The original code was: static const char conv_tab[] = "0123456789abcdefghijklmnopqrstuvwxyz"; return &conv_tab[__tolower(c)] - conv_tab; and that seems wrong. If conv_tab would have been some kind of hash it could have worked, but '__tolower()' will always return something larger than 97 ('a') which will always overflow the array. But maybe I just don't get that part of the code. I replaced it with return __tolower(c) - 'a' + 10; which does the right thing: 'A' = 10, 'B' = 11 ... 'Z' = 35 Signed-off-by: Adrian Reber <areber@redhat.com>
This fixes: criu/config.c: In function ‘parse_statement’: criu/config.c:232:43: error: the comparison will always evaluate as ‘true’ for the pointer operand in ‘*(configuration + (sizetype)((long unsigned int)i * 8)) + ((sizetype)offset + 1)’ must not be NULL [-Werror=address] 232 | if (configuration[i] + offset + 1 != 0 && strchr(configuration[i] + offset, ' ')) { | ^~ Signed-off-by: Adrian Reber <areber@redhat.com>
Parasite creation started to fail with GCC 12: On x86_64 with: ./compel/compel-host hgen -f criu/pie/restorer.built-in.o -o criu/pie/restorer-blob.h Error (compel/src/lib/handle-elf-host.c:337): Unexpected undefined symbol: `strlen'. External symbol in PIE? On aarch64 with: ld: criu/pie/restorer.o: in function `lsm_set_label': /drone/src/criu/pie/restorer.c:174: undefined reference to `strlen' Line 174 is: "for (len = 0; label[len]; len++)" Adding '-ffreestanding' to parasite compilation fixes these errors because, according to GCC developers: "strlen is a standard C function, so I don't see any bug in that being used unless you do a freestanding compilation (-nostdlib isn't that)." Signed-off-by: Adrian Reber <areber@redhat.com>
A friendly reminder that this PR had no activity for 30 days. |
Running cross compile tests with Debian unstable sometimes fails due to missing or outdated packages. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
autofs.c:66:17: error: pointer 'str' may be used after 'realloc' [-Werror=use-after-free] autofs.c: In function 'check_automount': ../lib/zdtmtst.h:131:9: error: pointer 'mountpoint' may be used after 'free' [-Werror=use-after-free] 131 | test_msg("ERR: %s:%d: " format " (errno = %d (%s))\n", __FILE__, __LINE__, ##arg, errno, strerror(errno)) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ autofs.c:277:17: note: in expansion of macro 'pr_perror' 277 | pr_perror("%s: failed to close fd %d", mountpoint, p->fd); | ^~~~~~~~~ autofs.c:268:9: note: call to 'free' here 268 | free(mountpoint); | ^~~~~~~~~~~~~~~~ Fixes: checkpoint-restore#1731 v2: (@Snorch) always update `str` after successful realloc() Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Signed-off-by: Ashutosh Mehra <asmehra@redhat.com>
This commit introduces basic support in CRIU to make use of eBPF kernel features to aid in the checkpoint/restore process. The immediate usecase is to provide an API to find fd to file pointer mapping, and vice-versa, for quick lookup from one file set (e.g. task, epoll, io_uring) to another. This is done by making use of eBPF iterator. This makes use of task, epoll, and io_uring file iterator features to be introduced in upcoming linux kernel versions. No dependency on clang's BPF toolchain or libbpf is taken, as we don't need those features just yet. It might be inevitable as we make more use of BPF, but for now we can tolerate just writing raw BPF assembly. To this end, also import bpf_insn.h from kernel's samples/bpf directory. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
As the io_uring interface is getting mature, is there any plan to resume the effort in supporting io_uring in CRIU? We have a data-intensive application that is going to significantly benefit from it. |
f392ea1
to
4d137b8
Compare
Currently, missing features are:
All of these depend on the eBPF iterator patches that are in https://github.com/kkdwivedi/linux/tree/criu-iter.
The following blockers exist currently:
For gathering data about io_uring itself:
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>