Implemented: Simple LMK #15
Merged
Conversation
This is a complete low memory killer solution for Android that is small and simple. Processes are killed according to the priorities that Android gives them, so that the least important processes are always killed first. Processes are killed until memory deficits are satisfied, as observed from kswapd struggling to free up pages. Simple LMK stops killing processes when kswapd finally goes back to sleep. The only tunables are the desired amount of memory to be freed per reclaim event and desired frequency of reclaim events. Simple LMK tries to free at least the desired amount of memory per reclaim and waits until all of its victims' memory is freed before proceeding to kill more processes. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Previously, pages_found would be calculated using an uninitialized variable. Fix it. Reported-by: Julian Liu <wlootlxt123@gmail.com> Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Simple LMK's reclaim thread should never stop; there's no need to have this check. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
cmpxchg() is only atomic with respect to the local CPU, so it cannot be relied on with how it's used in Simple LMK. Switch to fully atomic operations instead for full atomic guarantees. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When the reclaim thread writes to victims_to_kill on one CPU, it expects the updated value to be immediately reflected on all CPUs in order for simple_lmk_mm_freed() to work correctly. Due to the lack of memory barriers to guarantee multicopy atomicity, simple_lmk_mm_freed() can be given a victim's mm without knowing the correct victims_to_kill value, which can cause the reclaim thread to remain stuck waiting forever for all victims to be freed. This scenario, despite being rare, has been observed. Fix this by using proper atomic helpers with memory barriers. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The 20 ms delay in the reclaim thread is a hacky fudge factor that can cause Simple LMK to behave wildly differently depending on the circumstances of when it is invoked. When kswapd doesn't get enough CPU time to finish up and go back to sleep within 20 ms, Simple LMK performs superfluous reclaims. This is suboptimal, so make Simple LMK more deterministic by eliminating the delay and instead queuing up reclaim requests from kswapd. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Using a parameter to pass around an unmodified pointer to a global variable is crufty; just use the `victims` variable directly instead. Also, compress the code in simple_lmk_init_set() a bit to make it look cleaner. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>

After commit "simple_lmk: Make reclaim deterministic", Simple LMK's behavior changed and thus requires some slight re-tuning to make it work well again. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Queuing up reclaim requests while a reclaim is in progress doesn't make sense, since the additional reclaims may not be needed after the existing reclaim completes. This would cause Simple LMK to go berserk during periods of high memory pressure where kswapd would fire off reclaim requests nonstop. Make Simple LMK ignore new reclaim requests until an existing reclaim is finished to prevent a slaughter-fest. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
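The ignore-while-busy behavior reduces to a single atomic flag, sketched here with C11 atomics (function and variable names are illustrative, not Simple LMK's actual identifiers):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of request coalescing: kswapd may fire many reclaim requests,
 * but only one needs to be remembered; duplicates arriving while a
 * reclaim is pending or in progress are ignored. */
static atomic_bool reclaim_pending;

/* Returns true if this call actually queued new work. */
static bool queue_reclaim(void)
{
	/* atomic_exchange() returns the previous value: only the caller
	 * that flips false -> true owns the new reclaim request. */
	return !atomic_exchange(&reclaim_pending, true);
}

/* Called by the reclaim thread once the reclaim is finished. */
static void reclaim_done(void)
{
	atomic_store(&reclaim_pending, false);
}
```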
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Makes it clear that Simple LMK tried its best but there was nothing it could do. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The OOM killer only serves to be a liability when Simple LMK is used. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The OOM killer sets the TIF_MEMDIE thread flag for its victims to alert other kernel code that the current process was killed due to memory pressure, and needs to finish whatever it's doing quickly. In the page allocator this allows victim processes to quickly allocate memory using emergency reserves. This is especially important when memory pressure is high; if all processes are taking a while to allocate memory, then our victim processes will face the same problem and can potentially get stuck in the page allocator for a while rather than die expeditiously. To ensure that victim processes die quickly, set TIF_MEMDIE for the entire victim thread group. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
exit_mmap() is responsible for freeing the vast majority of an mm's memory; in order to unblock Simple LMK faster, report an mm as freed as soon as exit_mmap() finishes. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
set_user_nice() doesn't schedule, and although set_cpus_allowed_ptr() can schedule, it will only do so when the specified task cannot run on the new set of allowed CPUs. Since cpu_all_mask is used, set_cpus_allowed_ptr() will never schedule. Therefore, both the priority elevation and cpus_allowed change can be moved to inside the task lock to simplify and speed things up. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Dying processes aren't going to help free memory, so ignore them. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Simple LMK tries to wait until all of the victims it kills have their memory freed; however, sometimes victims can take a while to die, which can block Simple LMK from killing more processes in time when needed. After the specified timeout elapses, Simple LMK will stop waiting and make itself available to kill more processes. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Just increasing the victim's priority to the maximum niceness isn't enough to make it totally preempt everything in SCHED_FAIR, which is important to make sure victims die quickly. Resource-wise, this isn't very burdensome since the RT priority is just set to zero, and because dying victims don't have much to do: they only need to finish whatever they're doing quickly. SCHED_RR is used over SCHED_FIFO so that CPU time between the victims is divided evenly to help them all finish at around the same time, as fast as possible. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
wake_up() executes a full memory barrier when waking a process up, so there's no need for the acquire in the wait event. Additionally, because of this, the atomic_cmpxchg() only needs a read barrier. The cmpxchg() in simple_lmk_mm_freed() is atomic when it doesn't need to be, so replace it with an extra line of code. The atomic_inc_return() in simple_lmk_mm_freed() lies within a lock, so it doesn't need explicit memory barriers. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Swap memory usage is important when determining what to kill, so include it in the victim size calculation. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Keeping kswapd running when all the failed allocations that invoked it are satisfied incurs a high overhead due to unnecessary page eviction and writeback, as well as spurious VM pressure events to various registered shrinkers. When kswapd doesn't need to work to make an allocation succeed anymore, stop it prematurely to save resources. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Resolve -Wenum-compare issue when comparing vmpressure level/mode against -1 (invalid state). Change-Id: I1c76667ee8390e2d396c96e5ed73f30d0700ffa8 Signed-off-by: David Ng <dave@codeaurora.org>
Currently, vmpressure is tied to memcg and its events are available only to userspace clients. This patch removes the dependency on CONFIG_MEMCG and adds a mechanism for in-kernel clients to subscribe for vmpressure events (in fact raw vmpressure values are delivered instead of vmpressure levels, to provide clients more flexibility to take actions on custom pressure levels which are not currently defined by vmpressure module). Change-Id: I38010f166546e8d7f12f5f355b5dbfd6ba04d587 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
The existing calculation of vmpressure takes into account only the ratio of reclaimed to scanned pages, but not the time spent or the difficulty in reclaiming those pages. For e.g. when there are quite a number of file pages in the system, an allocation request can be satisfied by reclaiming the file pages alone. If such a reclaim is successful, the vmpressure value will remain low irrespective of the time spent by the reclaim code to free up the file pages. With a feature like lowmemorykiller, killing a task can be faster than reclaiming the file pages alone. So if the vmpressure values reflect the reclaim difficulty level, clients can make a decision based on that, for e.g. to kill a task early. This patch monitors the number of pages scanned in the direct reclaim path and scales the vmpressure level according to that. Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Change-Id: I6e643d29a9a1aa0814309253a8b690ad86ec0b13
At present any vmpressure value is scaled up if the pages are reclaimed through direct reclaim. This can result in false vmpressure values. Consider a case where a device is booted up and most of the memory is occupied by file pages. kswapd will make sure that high watermark is maintained. Now when a sudden huge allocation request comes in, the system will definitely have to get into direct reclaims. The vmpressures can be very low, but because of allocstall accounting logic even these low values will be scaled to values nearing 100. This can result in unnecessary LMK kills for example. So define a tunable threshold for vmpressure above which the allocstalls will be accounted. Change-Id: Idd7c6724264ac89f1f68f2e9d70a32390ffca3e5 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Right now the vmpressure window is of constant size 2MB, which works well with the following exceptions. 1) False vmpressure triggers are seen when the RAM size is greater than 3GB. This results in lowmemorykiller, which uses vmpressure events, killing tasks unnecessarily. 2) Vmpressure events are received late under memory pressure. This behaviour is seen prominently in <=2GB RAM targets. This results in lowmemorykiller kicking in late to kill tasks resulting in avoidable page cache reclaim. The problem analysis shows that the issue is with the constant size of the vmpressure window which does not adapt to the varying memory conditions. This patch recalculates the vmpressure window size at the end of each window. The chosen window size is proportional to the total of free and cached memory at that point. Change-Id: I7e9ef4ddd82e2c2dd04ce09ec8d58a8829cfb64d Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Using kswapd's scan depth to trigger task kills is inconsistent and unreliable. When memory pressure quickly spikes, the kswapd scan depth trigger fails to kick off Simple LMK fast enough, causing severe lag. Additionally, kswapd could stop scanning prematurely before reaching the desired scan depth to trigger Simple LMK, which could also cause stalls. To remedy this, use the vmpressure framework instead, since it provides more consistent and accurate readings on memory pressure. This is not very tunable though, so remove CONFIG_ANDROID_SIMPLE_LMK_AGGRESSION. Triggering Simple LMK to kill when the reported memory pressure is 100 should yield good results on all setups. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Android 10 changed its adj assignments. Update Simple LMK to use the new adjs, which also requires looking at each pair of adjs as a range. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Although userspace processes can't directly help with kernel memory pressure, killing userspace processes can relieve kernel memory if they are responsible for that pressure in the first place. It doesn't make sense to exclude any allocation types knowing that userspace can indeed affect all memory pressure, so don't exclude any allocation types from the pressure calculations. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This aids in selecting an adequate timeout. If the timeout is hit often and Simple LMK is killing too much, then the timeout should be lengthened. If the timeout is rarely hit and Simple LMK is not killing fast enough under pressure, then the timeout should be shortened. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When PSI is enabled, lmkd in userspace will use PSI notifications to perform low memory kills. Therefore, to ensure that Simple LMK is the only active LMK implementation, add a !PSI dependency. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Simple LMK uses VM pressure now, not a kswapd hook like before. Update the Kconfig description to reflect this. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Userspace could change these tunables and make Simple LMK function poorly. Don't export them. Reported-by: attack11 <fernandobouchet@gmail.com> Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When the mm_free_lock write lock is held, it means that reclaim is either starting or ending, in which case there's nothing that needs to be done in simple_lmk_mm_freed(). We can use a trylock here instead to avoid blocking. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
There's no reason to pass this constant around in a parameter. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When there aren't enough pages found, it means all of the victims that were found need to be killed. The additional processing that attempts to reduce the number of victims can be skipped in this case. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When sort() isn't provided with a custom swap function, it falls back onto its generic implementation of just swapping one byte at a time, which is quite slow. Since we know the type of the objects being sorted, we can provide our own swap function which simply uses the swap() macro. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The victims array and mm_free_lock data structures can be used very heavily in parallel on SMP, in which case they would benefit from being cacheline-aligned. Make it so for SMP. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Hard-coding adj ranges to search for victims results in a few problems. Firstly, the hard-coded adjs must be vigilantly updated to match what userspace uses, which makes long-term support a headache. Secondly, a full traversal of every running process must be done for each adj range, which can turn out to be quite expensive, especially if userspace assigns many different adj values and we want to enumerate them all. This leads us to the final problem, which is that processes with different adjs within the same hard-coded adj range will be treated the same, even though they're not: the process with a higher adj is less important, and the process with a lower adj is more important. This could be fixed by enumerating every possible adj, but again, that would necessitate several scans through the active process list, which is bad for performance, especially since latency is critical here. Since adjs are only 16 bits, and we only care about positive adjs, that leaves us with 15 bits of the adj that matter. This is a relatively small number of potential adjs (32,768), which makes it possible to allocate a static array that's indexed using the adj. Each entry in this array is a pointer to the first task_struct in a singly-linked list of task_structs sharing an adj. A `simple_lmk_next` member is added to task_struct to accommodate this linked list. The victim finder now iterates downward through the array searching for linked lists of tasks, starting from the highest adj found, so that the lowest-priority processes are always considered first for reclaim. This fixes all of the problems mentioned above, and now there is only one traversal through every running process. The array itself only takes up 256 KiB of memory on 64-bit, which is a very small price to pay for the advantages gained. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
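The bucket-array structure described above can be modeled in a few lines of userspace C (array size, type, and field names are illustrative; the kernel array covers all 32,768 positive adjs and chains real task_structs):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of the adj-indexed bucket array: each slot heads a
 * singly-linked list of tasks sharing that adj, and the victim finder
 * walks the array downward so the least important tasks come first. */
#define NR_ADJS 1000 /* illustrative; the kernel uses 32768 slots */

struct task {
	int adj;
	struct task *simple_lmk_next;
};

static struct task *adj_bucket[NR_ADJS];

/* O(1) insert; one pass over all running tasks fills every bucket. */
static void bucket_task(struct task *t)
{
	t->simple_lmk_next = adj_bucket[t->adj];
	adj_bucket[t->adj] = t;
}

/* Returns the head of the highest-adj (least important) list. */
static struct task *first_victim(void)
{
	for (int adj = NR_ADJS - 1; adj >= 0; adj--)
		if (adj_bucket[adj])
			return adj_bucket[adj];
	return NULL;
}
```

Because the array index *is* the adj, tasks with different adjs are never conflated the way hard-coded ranges conflated them.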
The direct reclaim vmpressure path was erroneously excluded from the PAGE_ALLOC_COSTLY_ORDER check which was added in commit "mm: vmpressure: Ignore allocation orders above PAGE_ALLOC_COSTLY_ORDER". Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Since the code that determines whether data should be cleared and the code that actually clears the data are in separate spin-locked critical sections, new data could be generated on another CPU after it is determined that the existing data should be cleared, but before the current CPU clears the existing data. This would cause the new data reported by the other CPU to be lost. Fix the race by clearing accumulated data within the same spin-locked critical section that determines whether or not data should be cleared. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
After a period of intense memory pressure is over, it's common for vmpressure to still have old reclaim efficiency data accumulated from this time. When memory pressure starts to rise again, this stale data will factor into vmpressure's calculations, and can cause vmpressure to report an erroneously high pressure. The reverse is possible, too: vmpressure may report pressures that are erroneously low due to stale data that's been stored. Furthermore, since kswapd can still be performing reclaim when there are no failed memory allocations stuck in the page allocator's slow path, vmpressure may still report pressures when there aren't any memory allocations to satisfy. This can cause last-resort memory reclaimers to kill processes to free memory when it's not needed. To fix the rampant stale data, keep track of when there are processes utilizing reclaim in the page allocator's slow path, and reset the accumulated data in vmpressure when a new period of elevated memory pressure begins. Extra measures are taken for the kswapd issue mentioned above by ignoring all reclaim efficiency data reported by kswapd when there aren't any failed memory allocations in the page allocator which utilize reclaim. Note that since sr_lock can now be used from IRQ context, IRQs must be disabled whenever sr_lock is used to prevent deadlocks. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Instead of adding weird retry logic in that function, utilize __GFP_NOFAIL to ensure that the vm takes care of handling any potential retries appropriately. This means we don't have to call free_more_memory() from here. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently use it for find_or_create_page(), which means that it cannot fail. Ensure we also pass in 'retry == true' to alloc_page_buffers(), which also ensure that it cannot fail. After this, there are no failure cases in grow_dev_page() that occur because of a failed memory allocation. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since the previous commit removed any case where grow_buffers() would return failure due to memory allocations, we can safely remove the case where we have to call free_more_memory() in this function. Since this is also the last user of free_more_memory(), kill it off completely. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Throttled direct reclaimers will wake up kswapd and wait for kswapd to satisfy their page allocation request, even when the failed allocation lacks the __GFP_KSWAPD_RECLAIM flag in its gfp mask. As a result, kswapd may think that there are no waiters and thus exit prematurely, causing throttled direct reclaimers lacking __GFP_KSWAPD_RECLAIM to stall on waiting for kswapd to wake them up. Incrementing the kswapd_waiters counter when such direct reclaimers become throttled fixes the problem. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
If it's possible for a task to have no pages, then there could be a case where `pages_found` is zero while `nr_found` isn't, which would cause the found tasks' locks to never be unlocked, and thus mayhem. We can change the `pages_found` check to use `nr_found` instead in order to naturally defend against this scenario, in case it is indeed possible. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
There are two problems with the current uninterruptible wait used in the reclaim thread: the hung task detector is upset about an uninterruptible thread being asleep for so long, and killing processes can generate I/O. Since killing a process can generate I/O, the reclaim thread should participate in system-wide suspend operations. This neatly solves the hung task detector issue since wait_event_freezable() puts the current process into an interruptible sleep. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
With freezable cgroups and their recent utilization in Android, it's possible for some of Simple LMK's victims to be frozen at the time that they're selected for killing. The forced SIGKILL used for killing victims can only wake up processes containing TASK_WAKEKILL and/or TASK_INTERRUPTIBLE, not TASK_UNINTERRUPTIBLE, which is the state used on frozen tasks. In order to wake frozen tasks from their uninterruptible slumber so that they can die, we must thaw them. Leaving victims frozen can otherwise make them take an indefinite amount of time to process our SIGKILL and thus free memory. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
As it turns out, victim scheduling priority elevation has always been broken for two reasons: 1. The minimum valid RT priority is 1, not 0. As a result, sched_setscheduler_nocheck() always fails with -EINVAL. 2. The thread within a victim thread group which happens to hold the mm is not necessarily the only thread with references to the mm, and isn't necessarily the thread which will release the final mm reference. As a result, victim threads which hold mm references may take a while to release them, and the unlucky thread which puts the final mm reference may take a very long time to release all memory if it doesn't have RT scheduling priority. These issues cause victims to often take a very long time to release their memory, possibly up to several seconds depending on system load. This, in turn, causes Simple LMK to constantly hit the reclaim timeout and kill more processes, with Simple LMK being rather ineffective since victims may not release any memory for several seconds. Fix the broken scheduling priority elevation by changing the RT priority to the valid lowest priority of 1 and applying it to all threads in the thread group, instead of just the thread which holds the mm. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Under extreme simulated memory pressure, the 'no processes available to kill' message can be spammed hundreds of thousands of times, which is not productive. Ratelimit it. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
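The mechanism behind ratelimiting can be sketched in userspace (a simplified model of what printk_ratelimited() does in the kernel; the interval and burst values here are arbitrary, not the kernel defaults):

```c
#include <stdbool.h>
#include <time.h>

/* Sketch of message ratelimiting: allow a burst of messages per time
 * window and suppress the rest until the window rolls over. */
struct ratelimit {
	time_t window_start;
	int interval_sec;
	int burst;
	int printed;
};

/* Returns true if the caller may emit its message now. */
static bool ratelimit_ok(struct ratelimit *rl)
{
	time_t now = time(NULL);

	if (now - rl->window_start >= rl->interval_sec) {
		rl->window_start = now;	/* new window: reset the budget */
		rl->printed = 0;
	}
	if (rl->printed >= rl->burst)
		return false;		/* budget exhausted: suppress */
	rl->printed++;
	return true;
}
```

Under the spam scenario in the commit, only the first few "no processes available to kill" lines per window would reach the log.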
We can check if the waitqueue is actually active before calling wake_up() in order to avoid an unnecessary wake_up() if the reclaim thread is already running. Furthermore, the release barrier when zeroing needs_reclaim is unnecessary, so remove it. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The OOM reaper makes it possible to immediately release anonymous memory from a dying process in order to free up memory faster. This provides immediate relief under heavy memory pressure instead of waiting for victim processes to naturally release their memory. Utilize the OOM reaper by creating another kthread in Simple LMK to perform victim reaping. Similar to the OOM reaper kthread (which is unused with Simple LMK), this new kthread allows reaping to race with exit_mmap() in order to preclude the need to take a reference to an mm's address space and thus potentially mmput() an mm's last reference. Doing so would stall the reaper kthread, preventing it from being able to quickly reap new victims. Reaping is done on victims one at a time by descending order of anonymous pages, so that the most promising victims with the most anonymous pages are reaped first. Victims are also marked for reaping via MMF_OOM_VICTIM so that they reap themselves first in exit_mmap(). Even if a victim isn't reaped by the reaper thread, it'll free its anonymous memory first thing in exit_mmap() as a small win towards making memory available sooner. By relieving memory pressure faster via reaping, Simple LMK not only doesn't need to kill as many processes, but also improves system responsiveness when memory is low since memory pressure is relieved sooner. Although not strictly required, Simple LMK should be the only one utilizing the OOM reaper. Any other code that may utilize the OOM reaper, such as patches that invoke the OOM reaper for all SIGKILLs, should be disabled. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When Simple LMK is enabled, the page allocator slowpath always thinks that no OOM kill progress is made because out_of_memory() returns false. As a result, spurious page allocation failures are observed when memory is low and Simple LMK is killing tasks, simply because the page allocator slowpath doesn't think that any OOM killing is taking place. Fix this by simply making out_of_memory() always return true when Simple LMK is enabled. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Dark-Matter7232 <kerneldeveloper7232@gmail.com>
In order for a third-party LMK to work normally in the kernel, we need to set ro.slmk.enable_userspace_lmk=false in vendor.prop or in Magisk.
Disable-userspace-slmk - mod.zip