Commit b6e6edc

hnaz authored and torvalds committed
mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage

Setting the original memory.limit_in_bytes hardlimit is subject to a race condition when the desired value is below the current usage. The code tries a few times to first reclaim and then see if the usage has dropped to where we would like it to be, but there is no locking, and the workload is free to continue making new charges up to the old limit. Thus, attempting to shrink a workload relies on pure luck and hope that the workload happens to cooperate.

To fix this in the cgroup2 memory.max knob, do it the other way round: set the limit first, then try enforcement. And if reclaim is not able to succeed, trigger OOM kills in the group. Keep going until the new limit is met, we run out of OOM victims and there's only unreclaimable memory left, or the task writing to memory.max is killed. This allows users to shrink groups reliably, and the behavior is consistent with what happens when new charges are attempted in excess of memory.max.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1 parent 588083b commit b6e6edc

2 files changed: +40 -4 lines changed

Diff for: Documentation/cgroup-v2.txt (+6)

@@ -1387,6 +1387,12 @@ system than killing the group. Otherwise, memory.max is there to
 limit this type of spillover and ultimately contain buggy or even
 malicious applications.
 
+Setting the original memory.limit_in_bytes below the current usage was
+subject to a race condition, where concurrent charges could cause the
+limit setting to fail. memory.max on the other hand will first set the
+limit to prevent new charges, and then reclaim and OOM kill until the
+new limit is met - or the task writing to memory.max is killed.
+
 The combined memory+swap accounting and limiting is replaced by real
 control over swap space.
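
From userspace, the documented behavior means a write to memory.max installs the limit first and may then block while the kernel reclaims or OOM kills. A minimal sketch of a caller follows, assuming a cgroup2 hierarchy is mounted and the caller passes full paths to the group's memory.max and memory.current files. The helper names `write_cgroup_value` and `read_cgroup_value` are hypothetical and not part of this commit. Since the commit message says enforcement also stops once there are no OOM victims left, the sketch is meant to be paired with a re-read of memory.current rather than treating a successful write as proof that usage is under the limit.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a decimal byte count to a cgroup2 control
 * file (for example a group's memory.max). Returns 0 on success or a
 * negative errno value. The write may block while the kernel enforces
 * the new limit.
 */
static int write_cgroup_value(const char *path, unsigned long long bytes)
{
	char buf[32];
	int len, fd, ret = 0;
	ssize_t n;

	assert(path != NULL);

	len = snprintf(buf, sizeof(buf), "%llu", bytes);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, buf, (size_t)len);
	if (n < 0)
		ret = -errno;
	else if (n != len)
		ret = -EIO;	/* short write: treat as an error in this sketch */

	close(fd);
	return ret;
}

/*
 * Hypothetical helper: read a decimal value from a cgroup2 control file
 * (for example memory.current). Returns 0 on success or a negative errno.
 * It does not handle the literal string "max".
 */
static int read_cgroup_value(const char *path, unsigned long long *out)
{
	FILE *f;
	int ret = 0;

	assert(path != NULL && out != NULL);

	f = fopen(path, "r");
	if (!f)
		return -errno;
	if (fscanf(f, "%llu", out) != 1)
		ret = -EINVAL;
	fclose(f);
	return ret;
}
```

A caller would write the new limit with `write_cgroup_value()`, then call `read_cgroup_value()` on memory.current and compare the result against the limit it just requested.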

Diff for: mm/memcontrol.c (+34 -4)

@@ -1236,7 +1236,7 @@ static unsigned long mem_cgroup_get_limit(struct mem_cgroup *memcg)
 	return limit;
 }
 
-static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
+static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 				     int order)
 {
 	struct oom_control oc = {
@@ -1314,6 +1314,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	}
 unlock:
 	mutex_unlock(&oom_lock);
+	return chosen;
 }
 
 #if MAX_NUMNODES > 1
@@ -5029,6 +5030,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned int nr_reclaims = MEM_CGROUP_RECLAIM_RETRIES;
+	bool drained = false;
 	unsigned long max;
 	int err;
 
@@ -5037,9 +5040,36 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	if (err)
 		return err;
 
-	err = mem_cgroup_resize_limit(memcg, max);
-	if (err)
-		return err;
+	xchg(&memcg->memory.limit, max);
+
+	for (;;) {
+		unsigned long nr_pages = page_counter_read(&memcg->memory);
+
+		if (nr_pages <= max)
+			break;
+
+		if (signal_pending(current)) {
+			err = -EINTR;
+			break;
+		}
+
+		if (!drained) {
+			drain_all_stock(memcg);
+			drained = true;
+			continue;
+		}
+
+		if (nr_reclaims) {
+			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
+							  GFP_KERNEL, true))
+				nr_reclaims--;
+			continue;
+		}
+
+		mem_cgroup_events(memcg, MEMCG_OOM, 1);
+		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
+			break;
+	}
 
 	memcg_wb_domain_size_changed(memcg);
 	return nbytes;
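
The control flow of the new loop can be followed outside the kernel with a toy model. The sketch below is a userspace illustration only, not kernel code: `struct toy_group`, `toy_charge`, `toy_reclaim`, `toy_oom_kill`, and `toy_set_max` are hypothetical names, `TOY_RECLAIM_RETRIES` merely stands in for `MEM_CGROUP_RECLAIM_RETRIES` (its value here is an assumption), and the signal check and per-CPU stock drain from the real function are left out. It keeps the ordering the commit introduces: the limit is installed first, reclaim is retried until it fails a bounded number of times, and OOM kills continue until the limit is met or no victim remains.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_RECLAIM_RETRIES 5	/* assumed stand-in for MEM_CGROUP_RECLAIM_RETRIES */

/* Toy stand-in for a memory cgroup; every field is hypothetical. */
struct toy_group {
	unsigned long usage;		/* pages currently charged */
	unsigned long limit;		/* hard limit in pages */
	unsigned long reclaimable;	/* pages that reclaim can still free */
	unsigned long victims;		/* OOM-killable tasks left */
	unsigned long victim_pages;	/* pages released per OOM kill */
};

/* A new charge succeeds only if it fits under the current limit. */
static bool toy_charge(struct toy_group *g, unsigned long nr)
{
	if (g->usage + nr > g->limit)
		return false;
	g->usage += nr;
	return true;
}

/* Free up to 'want' pages; returns the number actually freed. */
static unsigned long toy_reclaim(struct toy_group *g, unsigned long want)
{
	unsigned long freed = want < g->reclaimable ? want : g->reclaimable;

	g->reclaimable -= freed;
	g->usage -= freed;
	return freed;
}

/* Kill one victim if any is left; returns false when none remains. */
static bool toy_oom_kill(struct toy_group *g)
{
	unsigned long freed;

	if (!g->victims)
		return false;
	g->victims--;
	freed = g->victim_pages < g->usage ? g->victim_pages : g->usage;
	g->usage -= freed;
	return true;
}

/*
 * Set the limit first, then enforce it. Returns true if usage ended up
 * at or below 'max', false if enforcement ran out of options.
 */
static bool toy_set_max(struct toy_group *g, unsigned long max)
{
	unsigned int nr_reclaims = TOY_RECLAIM_RETRIES;

	g->limit = max;		/* from here on, charges above max fail */

	for (;;) {
		if (g->usage <= max)
			return true;

		if (nr_reclaims) {
			/* Only a reclaim pass that frees nothing uses up a retry. */
			if (!toy_reclaim(g, g->usage - max))
				nr_reclaims--;
			continue;
		}

		if (!toy_oom_kill(g))
			return false;	/* only unreclaimable memory left */
	}
}
```

In this model a group with 100 pages charged, 30 of them reclaimable, and two victims holding 25 pages each reaches a limit of 60 after reclaim plus one kill. A group with no victims and too little reclaimable memory keeps the new limit but stays above it.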
