Merge branch 'lru-le9-5.18' into linux-cachyos-5.18
ptr1337 committed May 8, 2022
2 parents 471da00 + aa22648 commit 527ea7d
Showing 58 changed files with 5,354 additions and 491 deletions.
1 change: 1 addition & 0 deletions Documentation/admin-guide/mm/index.rst
@@ -32,6 +32,7 @@ the Linux memory management.
idle_page_tracking
ksm
memory-hotplug
multigen_lru
nommu-mmap
numa_memory_policy
numaperf
152 changes: 152 additions & 0 deletions Documentation/admin-guide/mm/multigen_lru.rst
@@ -0,0 +1,152 @@
.. SPDX-License-Identifier: GPL-2.0

=============
Multi-Gen LRU
=============

The multi-gen LRU is an alternative LRU implementation that optimizes
page reclaim and improves performance under memory pressure. Page
reclaim decides the kernel's caching policy and ability to overcommit
memory. It directly impacts the kswapd CPU usage and RAM efficiency.

Quick start
===========
Build the kernel with the following configurations.

* ``CONFIG_LRU_GEN=y``
* ``CONFIG_LRU_GEN_ENABLED=y``

All set!

Runtime options
===============
``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
following subsections.

Kill switch
-----------
``enabled`` accepts different values to enable or disable the following
components. Its default value depends on ``CONFIG_LRU_GEN_ENABLED``.
All the components should be enabled unless some of them have
unforeseen side effects. Writing to ``enabled`` has no effect when a
component is not supported by the hardware, and valid values will be
accepted even when the main switch is off.

====== ===============================================================
Values Components
====== ===============================================================
0x0001 The main switch for the multi-gen LRU.
0x0002 Clearing the accessed bit in leaf page table entries in large
       batches, when MMU sets it (e.g., on x86). This behavior can
       theoretically worsen lock contention (mmap_lock). If it is
       disabled, the multi-gen LRU will suffer a minor performance
       degradation.
0x0004 Clearing the accessed bit in non-leaf page table entries as
       well, when MMU sets it (e.g., on x86). This behavior was not
       verified on x86 varieties other than Intel and AMD. If it is
       disabled, the multi-gen LRU will suffer a negligible
       performance degradation.
[yYnN] Apply to all the components above.
====== ===============================================================

E.g.,
::

    echo y >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0007
    echo 5 >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0005
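
The bitmask read back from ``enabled`` can be decoded mechanically. The following is a minimal sketch, not part of the kernel interface: the bit values come from the table above, while the function name ``decode_enabled`` and the component labels are made up for illustration.

```python
# Bit values are taken from the table above; the labels are illustrative only.
LRU_GEN_COMPONENTS = {
    0x0001: "main switch",
    0x0002: "batched accessed-bit clearing in leaf page table entries",
    0x0004: "accessed-bit clearing in non-leaf page table entries",
}


def decode_enabled(text):
    """Return the labels of the components set in an ``enabled`` readout.

    ``text`` is the hexadecimal string read from the file, e.g. "0x0005".
    """
    mask = int(text.strip(), 16)
    return [label for bit, label in LRU_GEN_COMPONENTS.items() if mask & bit]


if __name__ == "__main__":
    for readout in ("0x0007", "0x0005"):
        print(readout, decode_enabled(readout))
```

With the readout ``0x0005`` from the example above, the sketch reports the main switch and the non-leaf component, but not the leaf batching component.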

Thrashing prevention
--------------------
Personal computers are more sensitive to thrashing because it can
cause janks (lags when rendering UI) and negatively impact user
experience. The multi-gen LRU offers thrashing prevention to the
majority of laptop and desktop users who do not have ``oomd``.

Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
``N`` milliseconds from getting evicted. The OOM killer is triggered
if this working set cannot be kept in memory. In other words, this
option works as an adjustable pressure relief valve, and when open, it
terminates applications that are hopefully not being used.

Based on the average human detectable lag (~100ms), ``N=1000`` usually
eliminates intolerable janks due to thrashing. Larger values like
``N=3000`` make janks less noticeable at the risk of premature OOM
kills.

The default value ``0`` means disabled.
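
As a rough illustration of setting this option from a script, the sketch below validates ``N`` and writes it to ``min_ttl_ms``. The helper name ``set_min_ttl_ms`` and its ``path`` parameter are assumptions made for this example; the default path is derived from the ``/sys/kernel/mm/lru_gen/`` directory named above, and writing to it normally requires root privileges.

```python
# Path derived from the "Runtime options" section; overridable so the helper
# can be exercised against an ordinary file.
MIN_TTL_PATH = "/sys/kernel/mm/lru_gen/min_ttl_ms"


def set_min_ttl_ms(n, path=MIN_TTL_PATH):
    """Write N (milliseconds) to min_ttl_ms. N=0 disables thrashing prevention."""
    if not isinstance(n, int) or n < 0:
        raise ValueError("min_ttl_ms must be a non-negative integer")
    with open(path, "w") as f:
        f.write(str(n))
    return n


if __name__ == "__main__":
    # Protect the working set of the last second (needs root on a real system).
    set_min_ttl_ms(1000)
```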

Experimental features
=====================
``/sys/kernel/debug/lru_gen`` accepts commands described in the
following subsections. Multiple command lines are supported, as is
concatenation with the delimiters ``,`` and ``;``.

``/sys/kernel/debug/lru_gen_full`` provides additional stats for
debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
evicted generations in this file.

Working set estimation
----------------------
Working set estimation measures how much memory an application
requires in a given time interval, and it is usually done with little
impact on the performance of the application. E.g., data centers want
to optimize job scheduling (bin packing) to improve memory
utilization. When a new job comes in, the job scheduler needs to find
out whether each server it manages can allocate a certain amount of
memory for this new job before it can pick a candidate. To do so, this
job scheduler needs to estimate the working sets of the existing jobs.

When it is read, ``lru_gen`` returns a histogram of numbers of pages
accessed over different time intervals for each memcg and node.
``MAX_NR_GENS`` decides the number of bins for each histogram.
::

    memcg  memcg_id  memcg_path
       node  node_id
           min_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
           ...
           max_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages

Each generation contains an estimated number of pages that have been
accessed within ``age_in_ms`` non-cumulatively. E.g., ``min_gen_nr``
contains the coldest pages and ``max_gen_nr`` contains the hottest
pages, since ``age_in_ms`` of the former is the largest and that of
the latter is the smallest.
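
A reader of this file might parse the histogram as sketched below. This is a hedged example that assumes whitespace-separated fields laid out exactly as in the template above; the function name ``parse_lru_gen`` and the returned data layout are illustrative, not part of the ABI.

```python
def parse_lru_gen(text):
    """Parse lru_gen output into a dict keyed by (memcg_id, node_id).

    Each value is a list of (gen_nr, age_in_ms, nr_anon_pages, nr_file_pages)
    tuples in the order they appear, i.e. from min_gen_nr to max_gen_nr.
    """
    result = {}
    memcg_id = None
    key = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "memcg":
            memcg_id = int(fields[1])
            key = None
        elif fields[0] == "node":
            key = (memcg_id, int(fields[1]))
            result[key] = []
        elif key is not None and len(fields) >= 4:
            gen_nr, age_ms, anon, file_pages = (int(x) for x in fields[:4])
            result[key].append((gen_nr, age_ms, anon, file_pages))
    return result


if __name__ == "__main__":
    # Made-up sample data in the documented layout.
    sample = """memcg 1 /
 node 0
   4 12000 100 2000
   5 8000 50 300
   6 3000 20 40
   7 500 10 5
"""
    print(parse_lru_gen(sample))
```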

Users can write ``+ memcg_id node_id max_gen_nr
[can_swap [full_scan]]`` to ``lru_gen`` to create a new generation
``max_gen_nr+1``. ``can_swap`` defaults to the swap setting and, if it
is set to ``1``, it forces the scan of anon pages when swap is off.
``full_scan`` defaults to ``1`` and, if it is set to ``0``, it reduces
the overhead as well as the coverage when scanning page tables.
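
The command string described above could be assembled as in the following sketch. The helper name ``aging_command`` is hypothetical; it only formats the text to be written to ``lru_gen`` and treats the optional arguments as positional, so ``full_scan`` cannot be given without ``can_swap``.

```python
def aging_command(memcg_id, node_id, max_gen_nr, can_swap=None, full_scan=None):
    """Format the "+" command that creates generation max_gen_nr+1."""
    if full_scan is not None and can_swap is None:
        # The optional arguments are positional, so full_scan needs can_swap.
        raise ValueError("full_scan requires can_swap to be given as well")
    parts = ["+", str(memcg_id), str(node_id), str(max_gen_nr)]
    if can_swap is not None:
        parts.append(str(int(can_swap)))
    if full_scan is not None:
        parts.append(str(int(full_scan)))
    return " ".join(parts)


if __name__ == "__main__":
    # Force an anon scan even when swap is off, with reduced page table coverage.
    print(aging_command(1, 0, 7, can_swap=True, full_scan=False))
```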

A typical use case is that a job scheduler writes to ``lru_gen`` at a
certain time interval to create new generations, and it ranks the
servers it manages based on the sizes of their cold memory defined by
this time interval.
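
The ranking step could look roughly like the sketch below. It assumes the scheduler has already collected, for each server, a list of ``(gen_nr, age_in_ms, nr_anon_pages, nr_file_pages)`` tuples from ``lru_gen``; the function names, the server names, and the rule "a generation is cold if its age is at least the chosen interval" are illustrative assumptions, not kernel behavior.

```python
def cold_pages(generations, min_age_ms):
    """Sum anon and file pages in generations at least min_age_ms old."""
    return sum(
        anon + file_pages
        for _gen_nr, age_ms, anon, file_pages in generations
        if age_ms >= min_age_ms
    )


def rank_servers(histograms, min_age_ms):
    """Return server names ordered from most to least cold memory."""
    return sorted(
        histograms,
        key=lambda server: cold_pages(histograms[server], min_age_ms),
        reverse=True,
    )


if __name__ == "__main__":
    # Made-up data for two hypothetical servers.
    histograms = {
        "server-a": [(4, 12000, 100, 2000), (5, 500, 10, 5)],
        "server-b": [(2, 9000, 30, 40), (3, 100, 900, 900)],
    }
    print(rank_servers(histograms, min_age_ms=5000))
```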

Proactive reclaim
-----------------
Proactive reclaim induces memory reclaim when there is no memory
pressure and usually targets cold memory only. E.g., when a new job
comes in, the job scheduler wants to proactively reclaim memory on the
server it has selected to improve the chance of successfully landing
this new job.

Users can write ``- memcg_id node_id min_gen_nr [swappiness
[nr_to_reclaim]]`` to ``lru_gen`` to evict generations less than or
equal to ``min_gen_nr``. Note that ``min_gen_nr`` should be less than
``max_gen_nr-1`` as ``max_gen_nr`` and ``max_gen_nr-1`` are not fully
aged and therefore cannot be evicted. ``swappiness`` overrides the
default value in ``/proc/sys/vm/swappiness``. ``nr_to_reclaim`` limits
the number of pages to evict.
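
The eviction command and the constraint on ``min_gen_nr`` can be sketched as follows. The helper name ``eviction_command`` is hypothetical; ``max_gen_nr`` is passed only so that the constraint above can be checked before formatting and is not part of the command itself.

```python
def eviction_command(memcg_id, node_id, min_gen_nr, max_gen_nr,
                     swappiness=None, nr_to_reclaim=None):
    """Format the "-" command that evicts generations <= min_gen_nr."""
    if min_gen_nr >= max_gen_nr - 1:
        # max_gen_nr and max_gen_nr-1 are not fully aged and cannot be evicted.
        raise ValueError("min_gen_nr must be less than max_gen_nr-1")
    if nr_to_reclaim is not None and swappiness is None:
        # The optional arguments are positional.
        raise ValueError("nr_to_reclaim requires swappiness to be given as well")
    parts = ["-", str(memcg_id), str(node_id), str(min_gen_nr)]
    if swappiness is not None:
        parts.append(str(swappiness))
    if nr_to_reclaim is not None:
        parts.append(str(nr_to_reclaim))
    return " ".join(parts)


if __name__ == "__main__":
    # Evict up to 1024 pages from generations <= 4 when max_gen_nr is 7.
    print(eviction_command(1, 0, 4, max_gen_nr=7, swappiness=60, nr_to_reclaim=1024))
```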

A typical use case is that a job scheduler writes to ``lru_gen``
before it tries to land a new job on a server, and if it fails to
materialize the cold memory without impacting the existing jobs on
this server, it retries on the next server according to the ranking
result obtained from the working set estimation step described
earlier.
66 changes: 66 additions & 0 deletions Documentation/admin-guide/sysctl/vm.rst
@@ -25,6 +25,9 @@ files can be found in mm/swap.c.
Currently, these files are in /proc/sys/vm:

- admin_reserve_kbytes
- anon_min_kbytes
- clean_low_kbytes
- clean_min_kbytes
- compact_memory
- compaction_proactiveness
- compact_unevictable_allowed
@@ -105,6 +108,61 @@ On x86_64 this is about 128MB.
Changing this takes effect whenever an application requests memory.


anon_min_kbytes
===============

This knob provides *hard* protection of anonymous pages. The anonymous pages
on the current node won't be reclaimed under any conditions when their amount
is below vm.anon_min_kbytes.

This knob may be used to prevent excessive swap thrashing when anonymous
memory is low (for example, when memory is about to be filled with
compressed data from the zram module).

Setting this value too high (close to MemTotal) can result in an inability
to swap and can lead to early OOM under memory pressure.

The default value is defined by CONFIG_ANON_MIN_KBYTES.


clean_low_kbytes
================

This knob provides *best-effort* protection of clean file pages. The file pages
on the current node won't be reclaimed under memory pressure when the amount of
clean file pages is below vm.clean_low_kbytes *unless* the system is at risk
of running out of memory (OOM).

Protection of clean file pages using this knob may be used when swapping is
still possible to:

- prevent disk I/O thrashing under memory pressure;
- improve performance in disk cache-bound tasks under memory pressure.

Setting it to a high value may result in early eviction of anonymous pages
into the swap space, because the kernel attempts to hold the protected amount
of clean file pages in memory.

The default value is defined by CONFIG_CLEAN_LOW_KBYTES.


clean_min_kbytes
================

This knob provides *hard* protection of clean file pages. The file pages on the
current node won't be reclaimed under memory pressure when the amount of clean
file pages is below vm.clean_min_kbytes.

Hard protection of clean file pages using this knob may be used to:

- prevent disk I/O thrashing under memory pressure even with no free swap space;
- improve performance in disk cache-bound tasks under memory pressure;
- avoid high latency and prevent livelock in near-OOM conditions.

Setting it to a high value may result in an early out-of-memory condition due
to the inability to reclaim the protected amount of clean file pages when other
types of pages cannot be reclaimed.

The default value is defined by CONFIG_CLEAN_MIN_KBYTES.


compact_memory
==============

@@ -864,6 +922,14 @@ be 133 (x + 2x = 200, 2x = 133.33).
At 0, the kernel will not initiate swap until the amount of free and
file-backed pages is less than the high watermark in a zone.

This knob has no effect if the amount of clean file pages on the current
node is below vm.clean_low_kbytes or vm.clean_min_kbytes. In this case,
only anonymous pages can be reclaimed.

If the number of anonymous pages on the current node is below
vm.anon_min_kbytes, then only file pages can be reclaimed with
any vm.swappiness value.
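
The interaction between these knobs can be summarized with a small model. This is a simplified sketch of the prose above, not the kernel's reclaim logic: the function name ``reclaimable_types`` and the ``near_oom`` flag (standing in for the condition under which the best-effort ``clean_low_kbytes`` protection is lifted) are assumptions made for illustration.

```python
def reclaimable_types(anon_kb, clean_file_kb, anon_min_kb=0,
                      clean_low_kb=0, clean_min_kb=0, near_oom=False):
    """Return the set of page types ("anon", "file") that may be reclaimed."""
    types = {"anon", "file"}
    if anon_kb < anon_min_kb:
        # Hard protection of anonymous pages, regardless of vm.swappiness.
        types.discard("anon")
    if clean_file_kb < clean_min_kb:
        # Hard protection of clean file pages.
        types.discard("file")
    elif clean_file_kb < clean_low_kb and not near_oom:
        # Best-effort protection, lifted when the system is about to OOM.
        types.discard("file")
    return types


if __name__ == "__main__":
    # Clean file pages below clean_low_kbytes: only anon pages are candidates.
    print(reclaimable_types(anon_kb=500000, clean_file_kb=100000,
                            clean_low_kb=150000))
```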


unprivileged_userfaultfd
========================
1 change: 1 addition & 0 deletions Documentation/vm/index.rst
@@ -25,6 +25,7 @@ algorithms. If you are looking for advice on simply allocating memory, see the
ksm
memory-model
mmu_notifier
multigen_lru
numa
overcommit-accounting
page_migration