
Make cache reclaim support v2 #8497

Closed · Tracked by #8350 · Fixed by #8629
utam0k opened this issue Mar 1, 2022 · 9 comments

Labels: team: workspace (Issue belongs to the Workspace team)

utam0k (Contributor) commented Mar 1, 2022

No description provided.

@utam0k utam0k added the team: workspace Issue belongs to the Workspace team label Mar 1, 2022
utam0k commented Mar 1, 2022

I researched how to implement cache reclaim. First of all, cgroup v2 does not provide a way to reset the page cache per task, so we have to come up with a new algorithm for cache reclaim.
Next, a description of the memory subsystem of cgroup v2:
https://github.com/giuseppe/enhancements/blob/5b4d3d5ec07b8e2ee7d231c6f99c09b0da04a48a/keps/sig-node/20191118-cgroups-v2.md

| File | Description |
| --- | --- |
| memory.min | memory.min specifies a minimum amount of memory the cgroup must always retain, i.e., memory that can never be reclaimed by the system. If the cgroup's memory usage reaches this low limit and can't be increased, the system OOM killer will be invoked. We map it to requests.memory. |
| memory.max | memory.max is the memory usage hard limit, acting as the final protection mechanism: if a cgroup's memory usage reaches this limit and can't be reduced, the system OOM killer is invoked on the cgroup. Under certain circumstances, usage may go over the memory.high limit temporarily. When the high limit is used and monitored properly, memory.max serves mainly to provide the final safety net. The default is max. We map it to limits.memory, consistent with the existing memory.limit_in_bytes for cgroup v1. |
| memory.low | memory.low is the best-effort memory protection, a "soft guarantee" that if the cgroup and all its descendants are below this threshold, the cgroup's memory won't be reclaimed unless memory can't be reclaimed from any unprotected cgroups. Not yet considered for now. |
| memory.high | memory.high is the memory usage throttle limit. This is the main mechanism to control a cgroup's memory use. If a cgroup's memory use goes over the high boundary specified here, the cgroup's processes are throttled and put under heavy reclaim pressure. The default is max, meaning there is no limit. We use a formula to calculate memory.high depending on limits.memory/node allocatable memory and a memory throttling factor. |

In other words, the kernel itself reclaims memory down to the low protection (or, under heavier pressure, down to min). We need to think a bit about how to set the min and max values, but we may be better off relying on these knobs instead of implementing our own memory reclaim. And there is currently nothing else in cgroup v2 we could rely on to implement cache reclaim.
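For reference, these knobs are plain files under the unified cgroup hierarchy. A minimal sketch of setting them from a shell (the workspace cgroup path and the byte values below are made-up examples, not Gitpod defaults):

#!/bin/bash
# Minimal sketch only: /sys/fs/cgroup/workspace123 is a hypothetical cgroup path.
set -euo pipefail

cg=/sys/fs/cgroup/workspace123

echo $((1 * 1024 * 1024 * 1024))  > "$cg/memory.min"   # protect 1GB: never reclaimed
echo $((8 * 1024 * 1024 * 1024))  > "$cg/memory.high"  # throttle and reclaim above 8GB
echo $((10 * 1024 * 1024 * 1024)) > "$cg/memory.max"   # hard limit: OOM kill above 10GB

cat "$cg/memory.current"                               # current usage in bytes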

Also, with cgroup v2, the Kubernetes kubelet makes its decisions based on memory usage that includes the page cache.

Reference

@kylos101 kylos101 moved this to Scheduled in 🌌 Workspace Team Mar 1, 2022
@kylos101 kylos101 moved this from Scheduled to In Progress in 🌌 Workspace Team Mar 1, 2022
utam0k commented Mar 2, 2022

I've researched the effects of cgroup v2 memory.max. When I set memory.max, the kernel appeared to try to keep usage from exceeding the configured value. For testing, I opened gitpod-io/gitpod in a workspace and ran the following command:

$ tar cvvfz $HOME/backup.tar.gz /workspace

Then, in another terminal, I ran the following measurement command.

$ while true; do sleep 1; (echo $(date +%s), $(cat /sys/fs/cgroup/memory.current)) >> /tmp/log.txt; done

Details are here.
[images: graphs of memory.current over time during the test]

I'm going to investigate the following a bit more, with the basic approach of leaving reclaim to the Linux kernel:

  • How the memory cache behaves while the kernel is reclaiming memory
  • The performance impact while memory is being reclaimed

Also, if there are no particular problems after investigating these, I would like to consider what values would be appropriate to set. For example, 80% of the actual limit of the workspace.
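For the "80% of the limit" idea, a rough sketch of what setting it could look like (the cgroup path is again just a placeholder; memory.max holds the literal string "max" when no limit is set):

#!/bin/bash
# Sketch: derive memory.high as 80% of the workspace's hard limit.
set -euo pipefail

cg=${1:-/sys/fs/cgroup/workspace123}   # hypothetical workspace cgroup

limit=$(cat "$cg/memory.max")
if [ "$limit" = "max" ]; then
    echo "no hard limit set; leaving memory.high untouched"
else
    echo $((limit * 80 / 100)) > "$cg/memory.high"
fi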

@csweichel @Furisto WDYT?

Furisto (Member) commented Mar 2, 2022

  • Wondering why the process was not killed by the OOM killer when it reached max. Do we set oom_score_adj to -1000 somewhere?
  • How does it behave with memory.high? memory.max only prevents the process from going over the limit but does not try to reclaim.

utam0k commented Mar 3, 2022

@Furisto I had made a big mistake 😭 All of the above results were with memory.high set, not memory.max. The data itself is correct, but I had mistakenly written max where I meant high.

utam0k commented Mar 3, 2022

I did some additional research. I set the workspace's memory.high to 10GB and used stress-ng to repeatedly apply 20GB of memory pressure for 30 seconds at a time.

#!/bin/bash
set -e

while true; do
	stress-ng -m 1 --vm-bytes $1 --timeout 30 # $1 = 20GB
	sleep 30
done
A script for logging:
#!/bin/bash
set -e

file=/tmp/$(date +%s)_log.txt
echo start logging to $file
# header row: "memory.current" followed by the keys of memory.stat
cat /sys/fs/cgroup/memory.stat | awk '{ print $1 }' | xargs echo memory.current >> $file
# one data row per second: the memory.current value followed by the values of memory.stat
while true; do
    sleep 1
    echo $(cat /sys/fs/cgroup/memory.current) $(cat /sys/fs/cgroup/memory.stat | awk '{ print $2 }' | xargs echo) >> $file
done

[image: graph of memory.current and memory.stat values over time during the stress test]

If memory usage exceeds the value of memory.high, the kernel tries to shrink the file cache and other caches and to use swap. Setting this value too low is therefore a bad idea, as it will hurt performance. However, reducing the cache is exactly the result we want, and it may work better than our current cache reclaim. I basically found it best to use this kernel feature. Since the behavior is left to the kernel, it may differ between kernel versions, which is a drawback because we need to support many different kernel versions. Still, I felt it would be a sufficient mechanism for our cache reclaim.

Detailed survey results are here.
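One way to observe this throttling (a sketch using standard cgroup v2 interface files, nothing Gitpod-specific): memory.events has a high counter that increments whenever the cgroup exceeds memory.high, and the file entry in memory.stat shows the page cache size.

#!/bin/bash
# Sketch: log how often the cgroup hits memory.high and how large its page cache is.
set -euo pipefail

cg=${1:-/sys/fs/cgroup}   # cgroup to observe (inside the workspace this is the workspace's own cgroup)

while true; do
    sleep 1
    high_events=$(awk '$1 == "high" { print $2 }' "$cg/memory.events")
    file_bytes=$(awk '$1 == "file" { print $2 }' "$cg/memory.stat")
    echo "$(date +%s) high_events=$high_events page_cache_bytes=$file_bytes"
done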

I would like to present three paths:

  1. Rely on kubernetes memory QoS
  2. Set a fixed value to memory.high
  3. Implement the memory limit, like CPU Limit

Rely on kubernetes memory QoS

Kubernetes has a memory QoS feature that uses cgroup v2.
https://kubernetes.io/blog/2021/11/26/qos-memory-resources/
kubernetes/kubernetes#102970

The default seems to be to set memory.high to 80% of the limit. This can only be set to a fixed value, but we can let Kubernetes do the work for us, which means less implementation on our side.
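This is my reading of the alpha behavior, so treat the exact formula as an assumption (see the links above): memory.high is derived from the memory limit (or node allocatable memory when there is no limit) and a memory throttling factor that defaults to 0.8. Roughly:

# Rough sketch of the Memory QoS calculation; the exact formula may differ per Kubernetes version.
limit_bytes=$((10 * 1024 * 1024 * 1024))   # e.g. a workspace with a 10GB memory limit
throttling_factor_pct=80                   # default memory throttling factor of 0.8
memory_high=$((limit_bytes * throttling_factor_pct / 100))
echo "$memory_high bytes"                  # 8589934592 bytes = 8GB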

👍 good

  • Less implementation on our side (though I haven't tried it, so it could turn out to be tricky)

🤔 not good

  • Once a pod is started, the value cannot be changed

Set a fixed value to memory.high

👍 good

  • Simple

🤔 not good

  • Once a pod is started, the value cannot be changed

Implement the memory limit, like CPU Limit

We can set memory.high and memory.max ourselves, without relying on the Kubernetes feature, similar to what we do for the CPU limit:
#8036

For example, suppose the overall memory budget is 50GB and the hard limit for a single workspace is 10GB (memory.max). We could then raise the amount of memory a workspace may use (memory.max) in 5GB steps and, respecting the value Kubernetes uses, set memory.high to 80% of memory.max. The assumed behavior would be:
The workspace tries to use 3GB of memory; nothing happens. The workspace then tries to use 4.5GB of memory, which exceeds memory.high (5GB * 0.8 = 4GB), so the workspace dispatcher raises memory.max to 10GB (memory.high = 8GB) for the next stage, while making sure that all workspaces on the node together do not exceed 50GB.
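A minimal sketch of that staged scheme, using the numbers from the example above; the cgroup path is a placeholder and the node-wide 50GB budget check is omitted for brevity:

#!/bin/bash
# Sketch: grow a workspace's memory.max in 5GB steps, keeping memory.high at 80% of it,
# whenever usage crosses the current memory.high boundary.
set -euo pipefail

cg=${1:-/sys/fs/cgroup/workspace123}   # hypothetical workspace cgroup
GB=$((1024 * 1024 * 1024))
step=$((5 * GB))                       # grow in 5GB stages
ceiling=$((10 * GB))                   # per-workspace maximum from the example

current_max=$step
echo "$current_max"              > "$cg/memory.max"
echo $((current_max * 80 / 100)) > "$cg/memory.high"

while true; do
    sleep 1
    usage=$(cat "$cg/memory.current")
    high=$(cat "$cg/memory.high")
    if [ "$usage" -gt "$high" ] && [ "$current_max" -lt "$ceiling" ]; then
        current_max=$((current_max + step))
        echo "$current_max"              > "$cg/memory.max"
        echo $((current_max * 80 / 100)) > "$cg/memory.high"
    fi
done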

👍 good

  • We have flexible control over memory limits

🤔 not good

  • Implementation is harder than for the other proposals (and of course it comes with ongoing maintenance)

Finally

The caveat to all of these is that they function completely differently from cache reclaim in cgroup v1. cgroup v2 does not have memory.force_empty, so this is unavoidable. (It is possible to empty the caches of the whole node, but that is not very nice.)
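(For reference, "empty the caches of the whole node" refers to the global drop_caches knob, which affects every workload on the node, hence not very nice:)

# Node-wide, not per-cgroup: 1 drops the clean page cache, 2 drops slab objects, 3 drops both.
sync
echo 3 > /proc/sys/vm/drop_caches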

I think it would be a good idea to take this opportunity to implement the memory limit; it should not be too difficult since there is a good example in the CPU limit.

utam0k commented Mar 3, 2022

@Furisto @csweichel I'd like to hear what you think. And if you have any other good ideas, I'd love to hear them.

csweichel (Contributor) commented:
Thanks @utam0k for the level of detail - much appreciated.

It strikes me that option 1 would be the way forward. We could just live with the 80% setting and see how things behave in prod.

Re implementing the memory limit: how would the system behave if we lowered memory.max? The great thing about "bandwidth controlled resources" like CPU is that you can always reduce the bandwidth without adverse side effects (other than the performance penalty of course). For "space controlled resources" like memory or disk that's much harder. Would the Kernel start killing processes if the cgroup exceeded memory.max?

Furisto commented Mar 3, 2022

> It strikes me that option 1 would be the way forward. We could just live with the 80% setting and see how things behave in prod.

Agreed, this option would be the most straightforward solution. We can still switch to another option if it turns out not to be sufficient for us.

> Re implementing the memory limit: how would the system behave if we lowered memory.max? The great thing about "bandwidth controlled resources" like CPU is that you can always reduce the bandwidth without adverse side effects (other than the performance penalty of course). For "space controlled resources" like memory or disk that's much harder. Would the Kernel start killing processes if the cgroup exceeded memory.max?

Yes, depending on the value of memory.oom.group it would either kill some or all processes inside the cgroup. Processes with an oom_score_adj set to -1000 will not be killed.
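For completeness, a sketch of the two knobs mentioned above (the cgroup path and PID are placeholders):

cg=/sys/fs/cgroup/workspace123        # hypothetical workspace cgroup

# 1 = treat the cgroup as a single unit: an OOM kill takes out all of its processes.
echo 1 > "$cg/memory.oom.group"

# Per-process escape hatch: -1000 makes the OOM killer skip this PID entirely.
echo -1000 > /proc/1234/oom_score_adj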

utam0k commented Mar 4, 2022

@csweichel @Furisto
Thanks for the great input. Yes, I think it would be great to start with Kubernetes Memory QoS and see how it behaves in production. I'd also like to recommend this one:
https://github.com/gitpod-io/gitpod-packer-gcp-image/pull/54

Repository owner moved this from In Progress to Done in 🌌 Workspace Team Mar 8, 2022