Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implemented : Simple LMK #15

Merged
merged 69 commits into from
Mar 16, 2024
Merged

Implemented : Simple LMK #15

merged 69 commits into from
Mar 16, 2024

Conversation

ravindu644
Copy link
Owner

in order for a third-party lmk to work normally in the kernel, we need to set ro.slmk.enable_userspace_lmk=false in vendor.prop or in magisk.

Disable-userspace-slmk - mod.zip

kerneltoast and others added 30 commits March 16, 2024 14:01
This is a complete low memory killer solution for Android that is small
and simple. Processes are killed according to the priorities that
Android gives them, so that the least important processes are always
killed first. Processes are killed until memory deficits are satisfied,
as observed from kswapd struggling to free up pages. Simple LMK stops
killing processes when kswapd finally goes back to sleep.

The only tunables are the desired amount of memory to be freed per
reclaim event and desired frequency of reclaim events. Simple LMK tries
to free at least the desired amount of memory per reclaim and waits
until all of its victims' memory is freed before proceeding to kill more
processes.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Previously, pages_found would be calculated using an uninitialized
variable. Fix it.

Reported-by: Julian Liu <wlootlxt123@gmail.com>
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Simple LMK's reclaim thread should never stop; there's no need to have
this check.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
cmpxchg() is only atomic with respect to the local CPU, so it cannot be
relied on with how it's used in Simple LMK. Switch to fully atomic
operations instead for full atomic guarantees.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When the reclaim thread writes to victims_to_kill on one CPU, it expects
the updated value to be immediately reflected on all CPUs in order for
simple_lmk_mm_freed() to work correctly. Due to the lack of memory
barriers to guarantee multicopy atomicity, simple_lmk_mm_freed() can be
given a victim's mm without knowing the correct victims_to_kill value,
which can cause the reclaim thread to remain stuck waiting forever for
all victims to be freed. This scenario, despite being rare, has been
observed.

Fix this by using proper atomic helpers with memory barriers.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The 20 ms delay in the reclaim thread is a hacky fudge factor that can
cause Simple LMK to behave wildly differently depending on the
circumstances of when it is invoked. When kswapd doesn't get enough CPU
time to finish up and go back to sleep within 20 ms, Simple LMK performs
superfluous reclaims.

This is suboptimal, so make Simple LMK more deterministic by eliminating
the delay and instead queuing up reclaim requests from kswapd.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Using a parameter to pass around a unmodified pointer to a global
variable is crufty; just use the `victims` variable directly instead.
Also, compress the code in simple_lmk_init_set() a bit to make it look
cleaner.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
After commit "simple_lmk: Make reclaim deterministic", Simple LMK's
behavior changed and thus requires some slight re-tuning to make it work
well again.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Queuing up reclaim requests while a reclaim is in progress doesn't make
sense, since the additional reclaims may not be needed after the
existing reclaim completes. This would cause Simple LMK to go berserk
during periods of high memory pressure where kswapd would fire off
reclaim requests nonstop.

Make Simple LMK ignore new reclaim requests until an existing reclaim is
finished to prevent a slaughter-fest.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Makes it clear that Simple LMK tried its best but there was nothing it
could do.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The OOM killer only serves to be a liability when Simple LMK is used.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The OOM killer sets the TIF_MEMDIE thread flag for its victims to alert
other kernel code that the current process was killed due to memory
pressure, and needs to finish whatever it's doing quickly. In the page
allocator this allows victim processes to quickly allocate memory using
emergency reserves. This is especially important when memory pressure is
high; if all processes are taking a while to allocate memory, then our
victim processes will face the same problem and can potentially get
stuck in the page allocator for a while rather than die expeditiously.

To ensure that victim processes die quickly, set TIF_MEMDIE for the
entire victim thread group.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
exit_mmap() is responsible for freeing the vast majority of an mm's
memory; in order to unblock Simple LMK faster, report an mm as freed as
soon as exit_mmap() finishes.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
set_user_nice() doesn't schedule, and although set_cpus_allowed_ptr()
can schedule, it will only do so when the specified task cannot run on
the new set of allowed CPUs. Since cpu_all_mask is used,
set_cpus_allowed_ptr() will never schedule. Therefore, both the priority
elevation and cpus_allowed change can be moved to inside the task lock
to simplify and speed things up.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Dying processes aren't going to help free memory, so ignore them.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Simple LMK tries to wait until all of the victims it kills have their
memory freed; however, sometimes victims can take a while to die, which
can block Simple LMK from killing more processes in time when needed.
After the specified timeout elapses, Simple LMK will stop waiting and
make itself available to kill more processes.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Just increasing the victim's priority to the maximum niceness isn't
enough to make it totally preempt everything in SCHED_FAIR, which is
important to make sure victims die quickly. Resource-wise, this isn't
very burdensome since the RT priority is just set to zero, and because
dying victims don't have much to do: they only need to finish whatever
they're doing quickly. SCHED_RR is used over SCHED_FIFO so that CPU time
between the victims is divided evenly to help them all finish at around
the same time, as fast as possible.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
wake_up() executes a full memory barrier when waking a process up, so
there's no need for the acquire in the wait event. Additionally,
because of this, the atomic_cmpxchg() only needs a read barrier.

The cmpxchg() in simple_lmk_mm_freed() is atomic when it doesn't need to
be, so replace it with an extra line of code.

The atomic_inc_return() in simple_lmk_mm_freed() lies within a lock, so
it doesn't need explicit memory barriers.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Swap memory usage is important when determining what to kill, so include
it in the victim size calculation.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Keeping kswapd running when all the failed allocations that invoked it
are satisfied incurs a high overhead due to unnecessary page eviction
and writeback, as well as spurious VM pressure events to various
registered shrinkers. When kswapd doesn't need to work to make an
allocation succeed anymore, stop it prematurely to save resources.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Resolve -Wenum-compare issue when comparing vmpressure level/model
against -1 (invalid state).

Change-Id: I1c76667ee8390e2d396c96e5ed73f30d0700ffa8
Signed-off-by: David Ng <dave@codeaurora.org>
Currently, vmpressure is tied to memcg and its events are
available only to userspace clients. This patch removes
the dependency on CONFIG_MEMCG and adds a mechanism for
in-kernel clients to subscribe for vmpressure events (in
fact raw vmpressure values are delivered instead of vmpressure
levels, to provide clients more flexibility to take actions
on custom pressure levels which are not currently defined
by vmpressure module).

Change-Id: I38010f166546e8d7f12f5f355b5dbfd6ba04d587
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
The existing calculation of vmpressure takes into account only
the ratio of reclaimed to scanned pages, but not the time spent
or the difficulty in reclaiming those pages. For e.g. when there
are quite a number of file pages in the system, an allocation
request can be satisfied by reclaiming the file pages alone. If
such a reclaim is successful, the vmpressure value will remain low
irrespective of the time spent by the reclaim code to free up the
file pages. With a feature like lowmemorykiller, killing a task
can be faster than reclaiming the file pages alone. So if the
vmpressure values reflect the reclaim difficulty level, clients
can make a decision based on that, for e.g. to kill a task early.

This patch monitors the number of pages scanned in the direct
reclaim path and scales the vmpressure level according to that.

Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Change-Id: I6e643d29a9a1aa0814309253a8b690ad86ec0b13
At present any vmpressure value is scaled up if the pages are
reclaimed through direct reclaim. This can result in false
vmpressure values. Consider a case where a device is booted up
and most of the memory is occuppied by file pages. kswapd will
make sure that high watermark is maintained. Now when a sudden
huge allocation request comes in, the system will definitely
have to get into direct reclaims. The vmpressures can be very low,
but because of allocstall accounting logic even these low values
will be scaled to values nearing 100. This can result in
unnecessary LMK kills for example. So define a tunable threshold
for vmpressure above which the allocstalls will be accounted.

Change-Id: Idd7c6724264ac89f1f68f2e9d70a32390ffca3e5
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Right now the vmpressure window is of constant size 2MB, which
works well with the following exceptions.
1) False vmpressure triggers are seen when the RAM size is greater
than 3GB. This results in lowmemorykiller, which uses vmpressure
events, killing tasks unnecessarily.
2) Vmpressure events are received late under memory pressure. This
behaviour is seen prominently in <=2GB RAM targets. This results in
lowmemorykiller kicking in late to kill tasks resulting in avoidable
page cache reclaim.

The problem analysis shows that the issue is with the constant size
of the vmpressure window which does not adapt to the varying memory
conditions. This patch recalculates the vmpressure window size at
the end of each window. The chosen window size is proportional to
the total of free and cached memory at that point.

Change-Id: I7e9ef4ddd82e2c2dd04ce09ec8d58a8829cfb64d
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Using kswapd's scan depth to trigger task kills is inconsistent and
unreliable. When memory pressure quickly spikes, the kswapd scan depth
trigger fails to kick off Simple LMK fast enough, causing severe lag.
Additionally, kswapd could stop scanning prematurely before reaching the
desired scan depth to trigger Simple LMK, which could also cause stalls.

To remedy this, use the vmpressure framework instead, since it provides
more consistent and accurate readings on memory pressure. This is not
very tunable though, so remove CONFIG_ANDROID_SIMPLE_LMK_AGGRESSION.
Triggering Simple LMK to kill when the reported memory pressure is 100
should yield good results on all setups.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Android 10 changed its adj assignments. Update Simple LMK to use the
new adjs, which also requires looking at each pair of adjs as a range.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Although userspace processes can't directly help with kernel memory
pressure, killing userspace processes can relieve kernel memory if they
are responsible for that pressure in the first place. It doesn't make
sense to exclude any allocation types knowing that userspace can indeed
affect all memory pressure, so don't exclude any allocation types from
the pressure calculations.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
kerneltoast and others added 29 commits March 16, 2024 14:20
This aids in selecting an adequate timeout. If the timeout is hit often
and Simple LMK is killing too much, then the timeout should be
lengthened. If the timeout is rarely hit and Simple LMK is not killing
fast enough under pressure, then the timeout should be shortened.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When PSI is enabled, lmkd in userspace will use PSI notifications to
perform low memory kills. Therefore, to ensure that Simple LMK is the
only active LMK implementation, add a !PSI dependency.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Simple LMK uses VM pressure now, not a kswapd hook like before. Update
the Kconfig description to reflect such.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Userspace could change these tunables and make Simple LMK function
poorly. Don't export them.

Reported-by: attack11 <fernandobouchet@gmail.com>
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When the mm_free_lock write lock is held, it means that reclaim is
either starting or ending, in which case there's nothing that needs to
be done in simple_lmk_mm_freed(). We can use a trylock here instead to
avoid blocking.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
There's no reason to pass this constant around in a parameter.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When there aren't enough pages found, it means all of the victims that
were found need to be killed. The additional processing that attempts to
reduce the number of victims can be skipped in this case.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When sort() isn't provided with a custom swap function, it falls back
onto its generic implementation of just swapping one byte at a time,
which is quite slow. Since we know the type of the objects being sorted,
we can provide our own swap function which simply uses the swap() macro.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The victims array and mm_free_lock data structures can be used very
heavily in parallel on SMP, in which case they would benefit from being
cacheline-aligned. Make it so for SMP.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Hard-coding adj ranges to search for victims results in a few problems.
Firstly, the hard-coded adjs must be vigilantly updated to match what
userspace uses, which makes long-term support a headache. Secondly, a
full traversal of every running process must be done for each adj range,
which can turn out to be quite expensive, especially if userspace
assigns many different adj values and we want to enumerate them all.
This leads us to the final problem, which is that processes with
different adjs within the same hard-coded adj range will be treated the
same, even though they're not: the process with a higher adj is less
important, and the process with a lower adj is more important. This
could be fixed by enumerating every possible adj, but again, that would
necessitate several scans through the active process list, which is bad
for performance, especially since latency is critical here.

Since adjs are only 16 bits, and we only care about positive adjs, that
leaves us with 15 bits of the adj that matter. This is a relatively
small number of potential adjs (32,768), which makes it possible to
allocate a static array that's indexed using the adj. Each entry in this
array is a pointer to the first task_struct in a singly-linked list of
task_structs sharing an adj. A `simple_lmk_next` member is added to
task_struct to accommodate this linked list. The victim finder now
iterates downward through the array searching for linked lists of tasks,
starting from the highest adj found, so that the lowest-priority
processes are always considered first for reclaim. This fixes all of the
problems mentioned above, and now there is only one traversal through
every running process. The array itself only takes up 256 KiB of memory
on 64-bit, which is a very small price to pay for the advantages gained.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The direct reclaim vmpressure path was erroneously excluded from the
PAGE_ALLOC_COSTLY_ORDER check which was added in commit "mm: vmpressure:
Ignore allocation orders above PAGE_ALLOC_COSTLY_ORDER".

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Since the code that determines whether data should be cleared and the
code that actually clears the data are in separate spin-locked critical
sections, new data could be generated on another CPU after it is
determined that the existing data should be cleared, but before the
current CPU clears the existing data. This would cause the new data
reported by the other CPU to be lost.

Fix the race by clearing accumulated data within the same spin-locked
critical section that determines whether or not data should be cleared.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
After a period of intense memory pressure is over, it's common for
vmpressure to still have old reclaim efficiency data accumulated from
this time. When memory pressure starts to rise again, this stale data
will factor into vmpressure's calculations, and can cause vmpressure to
report an erroneously high pressure. The reverse is possible, too:
vmpressure may report pressures that are erroneously low due to stale
data that's been stored.

Furthermore, since kswapd can still be performing reclaim when there are
no failed memory allocations stuck in the page allocator's slow path,
vmpressure may still report pressures when there aren't any memory
allocations to satisfy. This can cause last-resort memory reclaimers to
kill processes to free memory when it's not needed.

To fix the rampant stale data, keep track of when there are processes
utilizing reclaim in the page allocator's slow path, and reset the
accumulated data in vmpressure when a new period of elevated memory
pressure begins. Extra measures are taken for the kswapd issue mentioned
above by ignoring all reclaim efficiency data reported by kswapd when
there aren't any failed memory allocations in the page allocator which
utilize reclaim.

Note that since sr_lock can now be used from IRQ context, IRQs must be
disabled whenever sr_lock is used to prevent deadlocks.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Instead of adding weird retry logic in that function, utilize
__GFP_NOFAIL to ensure that the vm takes care of handling any
potential retries appropriately. This means we don't have to
call free_more_memory() from here.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently use it for find_or_create_page(), which means that it
cannot fail. Ensure we also pass in 'retry == true' to
alloc_page_buffers(), which also ensure that it cannot fail.

After this, there are no failure cases in grow_dev_page() that
occur because of a failed memory allocation.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since the previous commit removed any case where grow_buffers()
would return failure due to memory allocations, we can safely
remove the case where we have to call free_more_memory() in
this function.

Since this is also the last user of free_more_memory(), kill
it off completely.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Throttled direct reclaimers will wake up kswapd and wait for kswapd to
satisfy their page allocation request, even when the failed allocation
lacks the __GFP_KSWAPD_RECLAIM flag in its gfp mask. As a result, kswapd
may think that there are no waiters and thus exit prematurely, causing
throttled direct reclaimers lacking __GFP_KSWAPD_RECLAIM to stall on
waiting for kswapd to wake them up. Incrementing the kswapd_waiters
counter when such direct reclaimers become throttled fixes the problem.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
If it's possible for a task to have no pages, then there could be a case
where `pages_found` is zero while `nr_found` isn't, which would cause
the found tasks' locks to never be unlocked, and thus mayhem. We can
change the `pages_found` check to use `nr_found` instead in order to
naturally defend against this scenario, in case it is indeed possible.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
There are two problems with the current uninterruptible wait used in the
reclaim thread: the hung task detector is upset about an uninterruptible
thread being asleep for so long, and killing processes can generate I/O.

Since killing a process can generate I/O, the reclaim thread should
participate in system-wide suspend operations. This neatly solves the
hung task detector issue since wait_event_freezable() puts the current
process into an interruptible sleep.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
With freezable cgroups and their recent utilization in Android, it's
possible for some of Simple LMK's victims to be frozen at the time that
they're selected for killing. The forced SIGKILL used for killing
victims can only wake up processes containing TASK_WAKEKILL and/or
TASK_INTERRUPTIBLE, not TASK_UNINTERRUPTIBLE, which is the state used on
frozen tasks. In order to wake frozen tasks from their uninterruptible
slumber so that they can die, we must thaw them. Leaving victims frozen
can otherwise make them take an indefinite amount of time to process our
SIGKILL and thus free memory.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
As it turns out, victim scheduling priority elevation has always been
broken for two reasons:
1. The minimum valid RT priority is 1, not 0. As a result,
   sched_setscheduler_nocheck() always fails with -EINVAL.
2. The thread within a victim thread group which happens to hold the mm is
   not necessarily the only thread with references to the mm, and isn't
   necessarily the thread which will release the final mm reference. As a
   result, victim threads which hold mm references may take a while to
   release them, and the unlucky thread which puts the final mm reference
   may take a very long time to release all memory if it doesn't have RT
   scheduling priority.

These issues cause victims to often take a very long time to release their
memory, possibly up to several seconds depending on system load. This, in
turn, causes Simple LMK to constantly hit the reclaim timeout and kill more
processes, with Simple LMK being rather ineffective since victims may not
release any memory for several seconds.

Fix the broken scheduling priority elevation by changing the RT priority to
the valid lowest priority of 1 and applying it to all threads in the thread
group, instead of just the thread which holds the mm.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Under extreme simulated memory pressure, the 'no processes available to
kill' message can be spammed hundreds of thousands of times, which is not
productive. Ratelimit it.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
We can check if the waitqueue is actually active before calling wake_up()
in order to avoid an unnecessary wake_up() if the reclaim thread is already
running. Furthermore, the release barrier when zeroing needs_reclaim is
unnecessary, so remove it.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The OOM reaper makes it possible to immediately release anonymous memory
from a dying process in order to free up memory faster. This provides
immediate relief under heavy memory pressure instead of waiting for victim
processes to naturally release their memory.

Utilize the OOM reaper by creating another kthread in Simple LMK to perform
victim reaping. Similar to the OOM reaper kthread (which is unused with
Simple LMK), this new kthread allows reaping to race with exit_mmap() in
order to preclude the need to take a reference to an mm's address space and
thus potentially mmput() an mm's last reference. Doing so would stall the
reaper kthread, preventing it from being able to quickly reap new victims.

Reaping is done on victims one at a time by descending order of anonymous
pages, so that the most promising victims with the most anonymous pages
are reaped first. Victims are also marked for reaping via MMF_OOM_VICTIM so
that they reap themselves first in exit_mmap(). Even if a victim isn't
reaped by the reaper thread, it'll free its anonymous memory first thing in
exit_mmap() as a small win towards making memory available sooner.

By relieving memory pressure faster via reaping, Simple LMK not only
doesn't need to kill as many processes, but also improves system
responsiveness when memory is low since memory pressure is relieved sooner.

Although not strictly required, Simple LMK should be the only one utilizing
the OOM reaper. Any other code that may utilize the OOM reaper, such as
patches that invoke the OOM reaper for all SIGKILLs, should be disabled.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When Simple LMK is enabled, the page allocator slowpath always thinks that
no OOM kill progress is made because out_of_memory() returns false. As a
result, spurious page allocation failures are observed when memory is low
and Simple LMK is killing tasks, simply because the page allocator slowpath
doesn't think that any OOM killing is taking place.

Fix this by simply making out_of_memory() always return true when Simple
LMK is enabled.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Dark-Matter7232 <kerneldeveloper7232@gmail.com>
@ravindu644 ravindu644 merged commit 9c2069c into LPoS Mar 16, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants