
Make cache reclaim support v2 #8497

Closed · Tracked by #8350 · Fixed by #8629
utam0k opened this issue Mar 1, 2022 · 9 comments

Labels: team: workspace (Issue belongs to the Workspace team)

utam0k (Contributor) commented Mar 1, 2022

No description provided.

@utam0k utam0k added the team: workspace Issue belongs to the Workspace team label Mar 1, 2022
utam0k commented Mar 1, 2022

I researched how to implement cache reclaim. First of all, cgroup v2 does not provide a way to reset the page cache per task, so we have to come up with a new algorithm for cache reclaim.
Next, a description of the memory subsystem of cgroup v2:
https://github.com/giuseppe/enhancements/blob/5b4d3d5ec07b8e2ee7d231c6f99c09b0da04a48a/keps/sig-node/20191118-cgroups-v2.md

| File | Description |
| --- | --- |
| memory.min | memory.min specifies a minimum amount of memory the cgroup must always retain, i.e., memory that can never be reclaimed by the system. If the cgroup's memory usage reaches this low limit and can't be increased, the system OOM killer will be invoked. We map it to requests.memory. |
| memory.max | memory.max is the memory usage hard limit, acting as the final protection mechanism: if a cgroup's memory usage reaches this limit and can't be reduced, the system OOM killer is invoked on the cgroup. Under certain circumstances, usage may go over the memory.high limit temporarily. When the high limit is used and monitored properly, memory.max serves mainly to provide the final safety net. The default is max. We map it to limits.memory, consistent with the existing memory.limit_in_bytes for cgroup v1. |
| memory.low | memory.low is the best-effort memory protection, a "soft guarantee" that if the cgroup and all its descendants are below this threshold, the cgroup's memory won't be reclaimed unless memory can't be reclaimed from any unprotected cgroups. Not yet considered for now. |
| memory.high | memory.high is the memory usage throttle limit. This is the main mechanism to control a cgroup's memory use. If a cgroup's memory use goes over the high boundary specified here, the cgroup's processes are throttled and put under heavy reclaim pressure. The default is max, meaning there is no limit. We use a formula to calculate memory.high depending on limits.memory/node allocatable memory and a memory throttling factor. |

In other words, the kernel itself reclaims memory down to the low protection (or, under heavier pressure, down to min). We need to think a bit about how to set the min and max values, but we may be better off relying on these knobs instead of implementing our own memory reclaim. And there is currently nothing else in cgroup v2 we could rely on to implement cache reclaim.
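For reference, these knobs are plain files under the unified cgroup hierarchy. A minimal sketch of setting them from a shell (the workspace cgroup path and the byte values below are made-up examples, not Gitpod defaults):

#!/bin/bash
# Minimal sketch only: /sys/fs/cgroup/workspace123 is a hypothetical cgroup path.
set -euo pipefail

cg=/sys/fs/cgroup/workspace123

echo $((1 * 1024 * 1024 * 1024))  > "$cg/memory.min"   # protect 1GB: never reclaimed
echo $((8 * 1024 * 1024 * 1024))  > "$cg/memory.high"  # throttle and reclaim above 8GB
echo $((10 * 1024 * 1024 * 1024)) > "$cg/memory.max"   # hard limit: OOM kill above 10GB

cat "$cg/memory.current"                               # current usage in bytes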

Also, with cgroup v2, the Kubernetes kubelet makes its decisions based on memory usage that includes the page cache.

Reference

@kylos101 kylos101 moved this to Scheduled in 🌌 Workspace Team Mar 1, 2022
@kylos101 kylos101 moved this from Scheduled to In Progress in 🌌 Workspace Team Mar 1, 2022
utam0k commented Mar 2, 2022

I've researched the effects of cgroup v2 memory.max. When I set memory.max, the kernel appeared to try to keep usage from exceeding the configured value. For testing, I opened gitpod-io/gitpod in a workspace and ran the following command:

$ tar cvvfz $HOME/backup.tar.gz /workspace

Then, in another terminal, I ran the following measurement command.

$ while true; do sleep 1; (echo $(date +%s), $(cat /sys/fs/cgroup/memory.current)) >> /tmp/log.txt; done

Details are here.
[images: graphs of memory.current over time during the test]

I'm going to investigate the following a bit more, with the basic approach of leaving reclaim to the Linux kernel:

  • How the memory cache behaves while the kernel is reclaiming memory
  • The performance impact while memory is being reclaimed

Also, if there are no particular problems after investigating these, I would like to consider what values would be appropriate to set. For example, 80% of the actual limit of the workspace.
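For the "80% of the limit" idea, a rough sketch of what setting it could look like (the cgroup path is again just a placeholder; memory.max holds the literal string "max" when no limit is set):

#!/bin/bash
# Sketch: derive memory.high as 80% of the workspace's hard limit.
set -euo pipefail

cg=${1:-/sys/fs/cgroup/workspace123}   # hypothetical workspace cgroup

limit=$(cat "$cg/memory.max")
if [ "$limit" = "max" ]; then
    echo "no hard limit set; leaving memory.high untouched"
else
    echo $((limit * 80 / 100)) > "$cg/memory.high"
fi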

@csweichel @Furisto WDYT?

Furisto (Member) commented Mar 2, 2022

  • Wondering why the process was not killed by the OOM killer when it reached max. Do we set oom_score_adj to -1000 somewhere?
  • How does it behave with memory.high? memory.max only prevents the process from going over the limit but does not try to reclaim.

utam0k commented Mar 3, 2022

@Furisto I had made a big mistake 😭 All of the above results were with memory.high set, not memory.max. The data itself is correct, but I had mistakenly written max where I meant high.

utam0k commented Mar 3, 2022

I did some additional research. I set the workspace's memory.high to 10GB and used stress-ng to repeatedly apply 20GB of memory pressure for 30 seconds at a time.

#!/bin/bash
set -e

while true; do
	stress-ng -m 1 --vm-bytes $1 --timeout 30 # $1 = 20GB
	sleep 30
done
A script for logging:
#!/bin/bash
set -e

file=/tmp/$(date +%s)_log.txt
echo start logging to $file
# header row: "memory.current" followed by the keys of memory.stat
cat /sys/fs/cgroup/memory.stat | awk '{ print $1 }' | xargs echo memory.current >> $file
# one data row per second: the memory.current value followed by the values of memory.stat
while true; do
    sleep 1
    echo $(cat /sys/fs/cgroup/memory.current) $(cat /sys/fs/cgroup/memory.stat | awk '{ print $2 }' | xargs echo) >> $file
done

[image: graph of memory.current and memory.stat values over time during the stress test]

If memory usage exceeds the value of memory.high, the kernel tries to shrink the file cache and other caches and to use swap. Setting this value too low is therefore a bad idea, as it will hurt performance. However, reducing the cache is exactly the result we want, and it may work better than our current cache reclaim. I basically found it best to use this kernel feature. Since the behavior is left to the kernel, it may differ between kernel versions, which is a drawback because we need to support many different kernel versions. Still, I felt it would be a sufficient mechanism for our cache reclaim.

Detailed survey results are here.
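One way to observe this throttling (a sketch using standard cgroup v2 interface files, nothing Gitpod-specific): memory.events has a high counter that increments whenever the cgroup exceeds memory.high, and the file entry in memory.stat shows the page cache size.

#!/bin/bash
# Sketch: log how often the cgroup hits memory.high and how large its page cache is.
set -euo pipefail

cg=${1:-/sys/fs/cgroup}   # cgroup to observe (inside the workspace this is the workspace's own cgroup)

while true; do
    sleep 1
    high_events=$(awk '$1 == "high" { print $2 }' "$cg/memory.events")
    file_bytes=$(awk '$1 == "file" { print $2 }' "$cg/memory.stat")
    echo "$(date +%s) high_events=$high_events page_cache_bytes=$file_bytes"
done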

I would like to present three paths:

  1. Rely on kubernetes memory QoS
  2. Set a fixed value to memory.high
  3. Implement the memory limit, like CPU Limit

Rely on kubernetes memory QoS

Kubernetes has a memory QoS feature that uses cgroup v2.
https://kubernetes.io/blog/2021/11/26/qos-memory-resources/
kubernetes/kubernetes#102970

The default seems to be to set memory.high to 80% of the limit. This can only be set to a fixed value, but we can let Kubernetes do the work for us, which means less implementation on our side.
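This is my reading of the alpha behavior, so treat the exact formula as an assumption (see the links above): memory.high is derived from the memory limit (or node allocatable memory when there is no limit) and a memory throttling factor that defaults to 0.8. Roughly:

# Rough sketch of the Memory QoS calculation; the exact formula may differ per Kubernetes version.
limit_bytes=$((10 * 1024 * 1024 * 1024))   # e.g. a workspace with a 10GB memory limit
throttling_factor_pct=80                   # default memory throttling factor of 0.8
memory_high=$((limit_bytes * throttling_factor_pct / 100))
echo "$memory_high bytes"                  # 8589934592 bytes = 8GB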

👍 good

  • Less implementation on our side (though I haven't tried it, so it could turn out to be tricky)

🤔 not good

  • Once a pod is started, the value cannot be changed

Set a fixed value to memory.high

👍 good

  • Simple

🤔 not good

  • Once a pod is started, the value cannot be changed

Implement the memory limit, like CPU Limit

We can set memory.high and memory.max ourselves, without relying on the Kubernetes feature, similar to what we do for the CPU limit:
#8036

For example, suppose the overall memory budget is 50GB and the hard limit for a single workspace is 10GB (memory.max). We could then raise the amount of memory a workspace may use (memory.max) in 5GB steps and, respecting the value Kubernetes uses, set memory.high to 80% of memory.max. The assumed behavior would be:
The workspace tries to use 3GB of memory; nothing happens. The workspace then tries to use 4.5GB of memory, which exceeds memory.high (5GB * 0.8 = 4GB), so the workspace dispatcher raises memory.max to 10GB (memory.high = 8GB) for the next stage, while making sure that all workspaces on the node together do not exceed 50GB.
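A minimal sketch of that staged scheme, using the numbers from the example above; the cgroup path is a placeholder and the node-wide 50GB budget check is omitted for brevity:

#!/bin/bash
# Sketch: grow a workspace's memory.max in 5GB steps, keeping memory.high at 80% of it,
# whenever usage crosses the current memory.high boundary.
set -euo pipefail

cg=${1:-/sys/fs/cgroup/workspace123}   # hypothetical workspace cgroup
GB=$((1024 * 1024 * 1024))
step=$((5 * GB))                       # grow in 5GB stages
ceiling=$((10 * GB))                   # per-workspace maximum from the example

current_max=$step
echo "$current_max"              > "$cg/memory.max"
echo $((current_max * 80 / 100)) > "$cg/memory.high"

while true; do
    sleep 1
    usage=$(cat "$cg/memory.current")
    high=$(cat "$cg/memory.high")
    if [ "$usage" -gt "$high" ] && [ "$current_max" -lt "$ceiling" ]; then
        current_max=$((current_max + step))
        echo "$current_max"              > "$cg/memory.max"
        echo $((current_max * 80 / 100)) > "$cg/memory.high"
    fi
done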

👍 good

  • We have flexible control over memory limits

🤔 not good

  • Implementation is harder than for the other proposals (and of course it comes with ongoing maintenance)

Finally

The caveat to all of these is that they function completely differently from cache reclaim in cgroup v1. cgroup v2 does not have memory.force_empty, so this is unavoidable. (It is possible to empty the caches of the whole node, but that is not very nice.)
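(For reference, "empty the caches of the whole node" refers to the global drop_caches knob, which affects every workload on the node, hence not very nice:)

# Node-wide, not per-cgroup: 1 drops the clean page cache, 2 drops slab objects, 3 drops both.
sync
echo 3 > /proc/sys/vm/drop_caches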

I think it would be a good idea to take this opportunity to implement the memory limit; it should not be too difficult since there is a good example in the CPU limit.

utam0k commented Mar 3, 2022

@Furisto @csweichel I'd like to hear what you think. And if you have any other good ideas, I'd love to hear them.

csweichel (Contributor) commented:
Thanks @utam0k for the level of detail - much appreciated.

It strikes me that option 1 would be the way forward. We could just live with the 80% setting and see how things behave in prod.

Re implementing the memory limit: how would the system behave if we lowered memory.max? The great thing about "bandwidth controlled resources" like CPU is that you can always reduce the bandwidth without adverse side effects (other than the performance penalty of course). For "space controlled resources" like memory or disk that's much harder. Would the Kernel start killing processes if the cgroup exceeded memory.max?

Furisto commented Mar 3, 2022

> It strikes me that option 1 would be the way forward. We could just live with the 80% setting and see how things behave in prod.

Agreed, this option would be the most straightforward solution. We can still switch to another option if it turns out not to be sufficient for us.

> Re implementing the memory limit: how would the system behave if we lowered memory.max? The great thing about "bandwidth controlled resources" like CPU is that you can always reduce the bandwidth without adverse side effects (other than the performance penalty of course). For "space controlled resources" like memory or disk that's much harder. Would the Kernel start killing processes if the cgroup exceeded memory.max?

Yes, depending on the value of memory.oom.group it would either kill some or all processes inside the cgroup. Processes with an oom_score_adj set to -1000 will not be killed.
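For completeness, a sketch of the two knobs mentioned above (the cgroup path and PID are placeholders):

cg=/sys/fs/cgroup/workspace123        # hypothetical workspace cgroup

# 1 = treat the cgroup as a single unit: an OOM kill takes out all of its processes.
echo 1 > "$cg/memory.oom.group"

# Per-process escape hatch: -1000 makes the OOM killer skip this PID entirely.
echo -1000 > /proc/1234/oom_score_adj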

utam0k commented Mar 4, 2022

@csweichel @Furisto
Thanks for the great input. Yes, I think it would be great to start with Kubernetes Memory QoS and see how it behaves in production. I'd also like to recommend this one:
https://github.com/gitpod-io/gitpod-packer-gcp-image/pull/54

Repository owner moved this from In Progress to Done in 🌌 Workspace Team Mar 8, 2022