Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't dump pages which only contain zero bytes #2331

Open
wants to merge 198 commits into
base: criu-dev
Choose a base branch
from

Conversation

simonis
Copy link

@simonis simonis commented Jan 15, 2024

I want to propose a new command line option zero-pages which, if enabled, detects pages which only contain zero bytes and skip their dumping to the image file. At restore, such pages will be replaced by the kernels zero page automatically.This can be useful for runtimes like Java, which often allocate large memory regions without fully using them (e.g. for the heap). For a simple Helloworld Java program, this new feature shrinks the image size be about 20% from 13mb to 11mb.

I've enabled GitHub Actions on my fork and all the test are green except the Alpine test which I see failing on upstream as well and the Code Linter test where I can't understand what he's objecting.

I'm a first time contributor to this project so please let me know if I've missed something.

prakritigoyal19 and others added 30 commits June 11, 2023 23:30
Change made through this commit:
- Include copy of flog as a seperate tree.
- Modify the makefile to add and compile flog code.

Signed-off-by: prakritigoyal19 <prakritigoyal19@gmail.com>
CID 302713 (checkpoint-restore#1 of 1): Missing varargs init or cleanup (VARARGS)
 va_end was not called for argptr.

Signed-off-by: Adrian Reber <areber@redhat.com>
Separate commit for easier criu-dev <-> master transfer.

Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
It is mapped, not maped. Same applies for mmap I guess.

Found by codespell, except it wants to change it to mapped,
which will make it less specific.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Brought to you by

    codespell -w

(using codespell v2.1.0).

[v2: use "make indent" on the result]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Fixes: checkpoint-restore#2121

Signed-off-by: Pengda Yang <daz-3ux@proton.me>
The TOS(type of service) field in the ip header allows you specify the
priority of the socket data.

Signed-off-by: Suraj Shirvankar <surajshirvankar@gmail.com>
Signed-off-by: Suraj Shirvankar <surajshirvankar@gmail.com>
The pipe_size type is unsigned int, when the fcntl call fails and
return -1, it will cause a negative rollover problem.

Signed-off-by: zhoujie <zhoujie133@huawei.com>
Newer Intel CPUs (Sapphire Rapids) have a much larger xsave area than
before. Looking at older CPUs I see 2440 bytes.

    # cpuid -1 -l 0xd -s 0
    ...
        bytes required by XSAVE/XRSTOR area     = 0x00000988 (2440)

On newer CPUs (Sapphire Rapids) it grows to 11008 bytes.

    # cpuid -1 -l 0xd -s 0
    ...
        bytes required by XSAVE/XRSTOR area     = 0x00002b00 (11008)

This increase the xsave area from one page to four pages.

Without this patch the fpu03 test fails, with this patch it works again.

Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Using the fact that we know criu_pid and criu is a parent of restored
process we can create pidfile with pid on caller pidns level.

We need to move mount namespace creation to child so that criu-ns can
see caller pidns proc.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
By default, the file name 'amdgpu_plugin.txt' is used also as the name
for the corresponding man page (`man amdgpu_plugin`). However, when
this man page is installed system-wide it would be more appropriate
to have a prefix 'criu-' (e.g., `man criu-amdgpu-plugin`).

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
crun wants to set empty_ns and this interface is missing from the
library. This adds it to libcriu.

Signed-off-by: Adrian Reber <areber@redhat.com>
--criu-binary argument provides a way to supply the CRIU binary
location to run_criu().

Related to: checkpoint-restore#1909

Signed-off-by: Dhanuka Warusadura <csx@tuta.io>
These changes remove and update the changes introduced in
7177938 in favor of the
Python version in CI.

os.waitstatus_to_exitcode() function appeared in Python 3.9

Related to: checkpoint-restore#1909

Signed-off-by: Dhanuka Warusadura <csx@tuta.io>
These changes add test implementations for criu-ns script.

Fixes: checkpoint-restore#1909

Signed-off-by: Dhanuka Warusadura <csx@tuta.io>
These changes fix the `ImportError: No module named pathlib`
error when executing criu-ns tests located at criu/test/others/criu-ns

Signed-off-by: Dhanuka Warusadura <csx@tuta.io>
CentOS 7 CI environment uses Python 2. To execute criu-ns
script in CentOS 7 changing the current shebang line to
python is required.

This reverse the changes made in a15a63f

Signed-off-by: Dhanuka Warusadura <csx@tuta.io>
This is a patch proposed by Thomas here:
https://lore.kernel.org/all/87ilczc7d9.ffs@tglx/

It removes (created id > desired id) "sanity" check and adds proper
checking that ids start at zero and increment by one each time when we
create/delete a posix timer.

First purpose of it is to fix infinite looping in create_posix_timers on
old pre 3.11 kernels.

Second purpose is to allow kernel interface of creating posix timers
with desired id change from iterating with predictable next id to just
setting next id directly. And at the same time removing predictable next
id so that criu with this patch would not get to infinite loop in
create_posix_timers if this happens.

Thanks a lot to Thomas!

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This hook allows to start image streamer process from an action script.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
…tions

This does cgroup namespace creation separately from joining task
cgroups. This makes the code more logical, because creating cgroup
namespace also involves joining cgroups but these cgroups can be
different to task's cgroups as they are cgroup namespace roots
(cgns_prefix), and mixing all of them together may lead to
misunderstanding.

Another positive thing is that we consolidate !item->parent checks in
one place in restore_task_with_children.

Signed-off-by: Valeriy Vdovin <valeriy.vdovin@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
4.15-based kernels don't allow F_*SEAL for memfds created with MFD_HUGETLB.
Since seals are not possible in this case, fake F_GETSEALS result as if it
was queried for a non-sealing-enabled memfd.

Signed-off-by: Michał Mirosław <emmir@google.com>
Linux 4.15 doesn't like empty string for cgroup2 mount options.
Pass NULL then to satisfy the kernel check. Log the options for
easier debugging.

Signed-off-by: Michał Mirosław <emmir@google.com>
The original commit added saving THP_DISABLED flag value, but missed
restoring it. There is restoring code, but used only when --lazy_pages
mode is enabled. Restore the prctl flag always. While at it, rename the
`has_thp_enabled` -> `!thp_disabled` for consistency.

Fixes: bbbd597 (2017-06-28 "mem: add dump state of THP_DISABLED prctl")
Signed-off-by: Michał Mirosław <emmir@google.com>
If prctl(SET_THP_DISABLE) is not used due to bad semantics, log it
for easier debugging.

Signed-off-by: Michał Mirosław <emmir@google.com>
While at it, don't carry over stale errno to the fail() message.

Signed-off-by: Michał Mirosław <emmir@google.com>
Signed-off-by: Michał Mirosław <emmir@google.com>
Add a sanity check for THP_DISABLE. This discovered a broken commit
in Google's kernel tree.

Signed-off-by: Michał Mirosław <emmir@google.com>
@adrianreber
Copy link
Member

I think the code linter wants a \n at the end of the warning.

Without looking too closely, the idea of this PR sounds like a good one.

@simonis
Copy link
Author

simonis commented Jan 16, 2024

I think the code linter wants a \n at the end of the warning.

Thanks, I've fixed that now.

Without looking too closely, the idea of this PR sounds like a good one.

Thanks.

criu/crtools.c Outdated Show resolved Hide resolved
criu/crtools.c Outdated Show resolved Hide resolved
criu/mem.c Outdated Show resolved Hide resolved
criu/include/cr_options.h Outdated Show resolved Hide resolved
criu/config.c Outdated Show resolved Hide resolved
When iptables-nft is used as backend for iptables, the rules for
network locking are translated into the following nft rules:

```
$ iptables-restore-translate -f lock.txt
add table ip filter
add chain ip filter CRIU
insert rule ip filter INPUT counter jump CRIU
insert rule ip filter OUTPUT counter jump CRIU
add rule ip filter CRIU mark 0xc114 counter accept
add rule ip filter CRIU counter drop
```

These rules create the following chains:

```
table ip filter { # handle 1
	chain CRIU { # handle 1
		meta mark 0x0000c114 counter packets 16 bytes 890 accept # handle 6
		counter packets 1 bytes 60 drop # handle 7
		meta mark 0x0000c114 counter packets 0 bytes 0 accept # handle 8
		counter packets 0 bytes 0 drop # handle 9
	}

	chain INPUT { # handle 2
		type filter hook input priority filter; policy accept;
		counter packets 8 bytes 445 jump CRIU # handle 3
		counter packets 0 bytes 0 jump CRIU # handle 10
	}

	chain OUTPUT { # handle 4
		type filter hook output priority filter; policy accept;
		counter packets 9 bytes 505 jump CRIU # handle 5
		counter packets 0 bytes 0 jump CRIU # handle 11
	}
}
```

In order to delete the CRIU chain, we need to first delete all four
jump targets. Otherwise, `-X CRIU` would fail with the following error:

iptables-restore v1.8.10 (nf_tables):
line 5: CHAIN_DEL failed (Resource busy): chain CRIU

Reported-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
nft does not support xtables compat expressions
https://git.netfilter.org/nftables/commit/?id=79195a8cc9e9d9cf2d17165bf07ac4cc9d55539f

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Show appropriate error messages when restore of nftables fails.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
criu/mem.c Outdated Show resolved Hide resolved
@rst0git
Copy link
Member

rst0git commented Jan 19, 2024

Code Linter test where I can't understand what he's objecting

@simonis Would it be possible to run make indent before committing your changes? This should fix the code style problems such as missing spaces. In addition, it would be great if you can add a commit message with a description that explains what changes are introduced and why. The following blog post describes how to write good commit messages: https://cbea.ms/git-commit/

if (should_dump_page(vma->e, at[pfn])) {
if (opts.skip_zero_pages) {
remote[0].iov_base = (void*)vaddr;
nread = process_vm_readv(item->pid->real, local, 1, remote, 1, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the idea to read process memory twice. We have to avoid this.

btw: #2292 solves the same problem in a more optimal way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it would be better if we can avoid reading process memory twice.

#2292 solves the same problem in a more optimal way.

I think this is different. In #2292, we exclude pages with zero PFN (PAGE_IS_PFNZERO), while this option skips zero-filled memory (e.g., memory that has been filled with zeros using memset() would be skipped with this option).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just a performance or a correctness issue? If it's just about performance, I think the benefit might justify the additional overhead and after all the feature is on by default.

In the case this is a correctness problem, do you have a suggestion how we can avoid this?

PS: and yes, @rst0git is right - this change is about skipping regular pages which are filled with only zero bytes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be useful for runtimes like Java, which often allocate large memory regions without fully using them (e.g. for the heap). For a simple Helloworld Java program, this new feature shrinks the image size be about 20% from 13mb to 11mb.

we exclude pages with zero PFN (PAGE_IS_PFNZERO), while this option skips zero-filled memory

I believe we need a zdtm test, which can reproduce such a zero page without PAGE_IS_PFNZERO but with zero data. Maybe, we even want to fix kernel to report such a page as "PAGE_IS_PFNZERO" instead.

I agree with Andrei that manually checking page content with memcmp is an anti-pattern. Nack from me, at list in current state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also agree. Reading a page second time and using memcmp() sounds not optimal.

I still like the idea of not including empty pages in the checkpoint, but it sounds difficult.

If the kernel could track it, that would be nice. Not sure the kernel has a better alternative than memcmp() to find a zeroed memory page.

At this point I think it would be nice to see some numbers. How much faster is restoring if something like this PR is applied. Although I don't like the memcmp() it would only be used if the corresponding command-line option is explicitly selected by the user. Maybe that makes it acceptable. Can the second reading of the page be avoided?

Maybe some post-processing of the checkpoint image would be an alternative. Remove the empty pages after checkpointing and have support during restore to handle pages like this.

Copy link
Member

@avagin avagin Jan 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just a performance or a correctness issue? If it's just about performance, I think the benefit might justify the additional overhead and after all the feature is on by default.

I don't understand why we need to read process pages to do this check? Why can't we do that before dumping these pages into the image (page_xfer_dump_pages)?

Copy link
Member

@Snorch Snorch Jan 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe memset(0) is used in specific scenarios where applications/libraries are dealing with sensitive data and/or use custom memory management.

It's not just the "malloc+memset(0)" use case. Java does mmap and pretouch memory so we can have a lot of pages with only zero content (but not the kernel zero page) in some scenarios.

First, I thought it does not work, upd: that was stupid of me not to enable --skip-zero-pages, with option it works fine, sorry.

Second, If Java put so much effort to have those zeroed pages in RSS isn't it a bad idea to restore those pages like they are "PAGE_IS_PFNZERO" ones? =)

[root@turmoil tmp]# ./malloc-test 
Enter any char to stop

------ In another terminal ------
[root@turmoil snorch]# grep 2097164 -A3 -B1 /proc/$(pidof malloc-test)/smaps
7fb87b744000-7fb8fb747000 rw-p 00000000 00:00 0 
Size:            2097164 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:             2097160 kB
[root@turmoil tmp]# /home/snorch/devel/ms/criu/criu/criu dump --skip-zero-pages -v4 -o dump.log -t $(pidof malloc-test) -j -D /images-dir/
[root@turmoil tmp]# /home/snorch/devel/ms/criu/criu/criu restore -j -D /images-dir/

------ In another terminal ------
[root@turmoil snorch]# grep 2097164 -A3 -B1 /proc/$(pidof malloc-test)/smaps
7fb87b744000-7fb8fb747000 rw-p 00000000 00:00 0 
Size:            2097164 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:             1048584 kB

upd2: If we want to preserve them in RSS, we can remember those special zero-filled pages in images on dump without saving their data and then on restore put them to RSS by writing zeroes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second, If Java put so much effort to have those zeroed pages in RSS isn't it a bad idea to restore those pages like they are "PAGE_IS_PFNZERO" ones? =)

@Snorch, Java (or OpenJDK based JVMs to be more specific) was not designed and optimized for cloud/container use cases but rather for large, monolithic application servers. In the old days, pretouching/zeroing memory was a way to pre-allocate physical memory and not potentially get it from swap later. For current cloud/container use cases the huge memory footprint can be problem. With check-pointing, a small image is more important and COWing a PAGE_IS_PFNZERO page is much faster then loading and populating it from disk.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we need to read process pages to do this check? Why can't we do that before dumping these pages into the image (page_xfer_dump_pages)?

Thanks a lot for your suggestion @avagin. I'm short of time for the next week because of FOSDEM/Jfokus but I'll try to come up with a new version which moves the zero check to page_xfer_dump_pages() afterwards.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonis FOSDEM was my favorite conference when I lived close by. btw @mihalicyn are there too, he is one of criu maintainers. He will be happy to help with any questions.

PAGEMAP_SCAN is a new ioctl that allows to get page attributes in a more
effeciant way than reading pagemap files.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
This change adds a new injectable fault (135) to disable PAGEMAP_SCAN and fault
back to read pagemap files.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
@simonis
Copy link
Author

simonis commented Jan 22, 2024

Code Linter test where I can't understand what he's objecting

@simonis Would it be possible to run make indent before committing your changes? This should fix the code style problems such as missing spaces. In addition, it would be great if you can add a commit message with a description that explains what changes are introduced and why. The following blog post describes how to write good commit messages: https://cbea.ms/git-commit/

Done (had to install a newer version of clang-format).

@simonis simonis force-pushed the skip-zero-pages branch 2 times, most recently from 4bddca9 to 9d70e5f Compare January 24, 2024 15:59
@simonis
Copy link
Author

simonis commented Jan 24, 2024

I believe we need a zdtm test, which can reproduce such a zero page without PAGE_IS_PFNZERO but with zero data.

I've now added a new zdtm test which verifies that the --skip-zero-pages option does indeed optimize pages which only contain zero bytes.

Introduces a new command line option '--skip-zero-bytes'
which detects pages which only contain zero bytes and
prohibits that they get dumped in the processes image file.
It is a potentially expensive operation because it checks for
every single process page if it contains only zeros, but
it can significantly decrease the image size and improve the
startup-time if many such pages exist. It effectively
replaces such pages which the kernel's zero-page on restore.

Signed-off-by: Volker Simonis <volker.simonis@gmail.com>
@@ -2697,6 +2701,9 @@ def get_cli_args():
rp.add_argument("--noauto-dedup",
help="Manual deduplicate images on iterations",
action='store_true')
rp.add_argument("--skip-zero-pages",
Copy link
Member

@rst0git rst0git Jan 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonis Would you be able to also enable testing with existing ZDTM tests using --skip-zero-pages in run-ci-tests.sh?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonis It was great meeting you at FOSDEM and your talk was very good!

We can use something like the following patch to run the existing ZDTM tests with --skip-zero-pages: rst0git@45a8ca1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@avagin Thank you for the review! Would it be sufficient to run the following tests?

diff --git a/scripts/ci/run-ci-tests.sh b/scripts/ci/run-ci-tests.sh
index ef7e869e0..8f5e25d03 100755
--- a/scripts/ci/run-ci-tests.sh
+++ b/scripts/ci/run-ci-tests.sh
@@ -268,6 +268,9 @@ make -C test/others/rpc/ run
 ./test/zdtm.py run -t zdtm/transition/maps007 --pre 2 --page-server --dedup
 ./test/zdtm.py run -t zdtm/transition/maps007 --pre 2 --pre-dump-mode read
 
+# Run tests with --skip-zero-pages
+./test/zdtm.py run --skip-zero-pages -T '.*maps0.*'
+
 ./test/zdtm.py run -t zdtm/transition/pid_reuse --pre 2 # start time based pid reuse detection
 ./test/zdtm.py run -t zdtm/transition/pidfd_store_sk --rpc --pre 2 # pidfd based pid reuse detection

@rppt
Copy link
Member

rppt commented Feb 3, 2024 via email

Copy link

github-actions bot commented Mar 8, 2024

A friendly reminder that this PR had no activity for 30 days.

@avagin avagin added no-auto-close Don't auto-close as a stale issue and removed stale-pr labels Mar 8, 2024
@simonis
Copy link
Author

simonis commented Mar 8, 2024

Still on my ToDo list, so post this to avoid auto-closing..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-auto-close Don't auto-close as a stale issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.