
[ws-daemon] Add page-cache reclaim #8139

Merged: 1 commit merged into main on Feb 10, 2022

Conversation

@csweichel csweichel (Contributor) commented Feb 10, 2022

Description

This PR introduces a page-cache reclaim mechanism for workspace cgroups. The kubelet computes memory pressure using the cgroup memory controller, which includes page caches. This means that when a workspace does a lot of file IO, we're likely to overestimate the memory pressure and evict pods when we should not, or fail the scheduling process.

With this change, we regularly check the page-cache use of each workspace on the node. If caches consume more than 15% of the cgroup memory limit (1.8GiB), we trigger a page-cache reclaim by writing to memory.force_empty. This should keep the cgroup's reported memory closer to actual memory consumption, at the expense of file IO and kernel CPU time.
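The check described above can be sketched in Go as follows. This is an illustrative sketch, not the actual ws-daemon code: the names shouldReclaim and forceEmpty, and the 12GiB example limit, are assumptions for the example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// shouldReclaim reports whether page-cache usage exceeds the 15%
// threshold of the cgroup memory limit described in the PR.
func shouldReclaim(cacheBytes, limitBytes uint64) bool {
	return cacheBytes > uint64(float64(limitBytes)*0.15)
}

// forceEmpty asks the kernel to drop the cgroup's page cache by
// writing to the cgroup v1 memory.force_empty file.
func forceEmpty(memCgroupPath string) error {
	return os.WriteFile(filepath.Join(memCgroupPath, "memory.force_empty"), []byte("1"), 0644)
}

func main() {
	// Example: a 12GiB limit puts the 15% threshold at ~1.8GiB of cache.
	limit := uint64(12 << 30)
	fmt.Println(shouldReclaim(2<<30, limit)) // true: 2GiB of cache exceeds 1.8GiB
	fmt.Println(shouldReclaim(1<<30, limit)) // false: 1GiB of cache is below 1.8GiB
}
```

In a real daemon this check would run on a timer per workspace cgroup, calling forceEmpty only when shouldReclaim returns true.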

Related Issue(s)

Fixes #7969

How to test

Make sure your workspace has a memory limit set, e.g. by inspecting/modifying ws-manager's config. Beware: on core-dev we currently don't set workspace resource limits.

while true; do sleep 1; (echo N; date +%s; cat /sys/fs/cgroup/memory/memory.stat | grep cache) >> /tmp/log.txt; done

in a new terminal

tar cvvfz $HOME/backup.tar.gz /workspace

Once the tar is done, look at the entries in /tmp/log.txt. The cache values should climb for a few seconds (at most 15) and then drop sharply, after which they'll climb again.

You would expect something like this behaviour:
[viz: graph of cache usage over time, attached to the PR]

Release Notes

Improved workspace memory-pressure eviction resilience

@codecov codecov bot commented Feb 10, 2022

Codecov Report

Merging #8139 (64c7ea2) into main (149ca15) will increase coverage by 5.12%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##             main    #8139      +/-   ##
==========================================
+ Coverage   10.03%   15.15%   +5.12%     
==========================================
  Files          24       45      +21     
  Lines        1944     4375    +2431     
==========================================
+ Hits          195      663     +468     
- Misses       1742     3655    +1913     
- Partials        7       57      +50     
Flag Coverage Δ
components-gitpod-cli-app 10.82% <ø> (ø)
components-local-app-app-darwin-amd64 ?
components-local-app-app-darwin-arm64 ?
components-local-app-app-linux-amd64 ?
components-local-app-app-linux-arm64 ?
components-local-app-app-windows-386 ?
components-local-app-app-windows-amd64 ?
components-local-app-app-windows-arm64 ?
components-workspacekit-app ?
components-ws-daemon-app 22.45% <ø> (?)
components-ws-daemon-lib 22.45% <ø> (?)
installer-raw-app 5.63% <ø> (?)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
components/workspacekit/cmd/rings.go
components/workspacekit/cmd/lift.go
components/local-app/pkg/auth/pkce.go
components/local-app/pkg/auth/auth.go
components/workspacekit/cmd/nsenter.go
components/workspacekit/cmd/root.go
installer/pkg/components/ws-manager/deployment.go 0.00% <0.00%> (ø)
installer/pkg/common/objects.go 0.00% <0.00%> (ø)
components/ws-daemon/pkg/quota/size.go 87.30% <0.00%> (ø)
installer/pkg/components/ws-manager/tlssecret.go 0.00% <0.00%> (ø)
... and 23 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 149ca15...64c7ea2.

@csweichel csweichel force-pushed the cw/cache-reclaim branch 2 times, most recently from 3dc4b76 to 71fdf9c Compare February 10, 2022 17:52
@csweichel csweichel marked this pull request as ready for review February 10, 2022 18:28
@csweichel csweichel requested a review from a team February 10, 2022 18:28
@github-actions github-actions bot added the "team: workspace" (Issue belongs to the Workspace team) label Feb 10, 2022
@csweichel csweichel force-pushed the cw/cache-reclaim branch 2 times, most recently from 467790b to 251cef7 Compare February 10, 2022 19:28
if cache > uint64(float64(limit)*0.15) {
	err := ioutil.WriteFile(filepath.Join(memCgroupPath, "memory.force_empty"), []byte("1"), 0644)
	if err != nil {
		return nil, xerrors.Errorf("cannot read memory.force_empty: %v", err)
	}
}
@sagor999 sagor999 (Contributor) commented Feb 10, 2022
this should be "cannot write", I think


p, err := strconv.ParseInt(s, 10, 64)
if err != nil {
	return 0, xerrors.Errorf("cannot parse memory.limit_in_bytes: %v", err)
}
A reviewer (Contributor) commented:
Can you also add the original value that we are trying to parse here?
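A sketch of what that could look like, with the offending input included in the error via %q. Assumptions: parseLimit is a hypothetical helper name, and fmt.Errorf stands in for xerrors to keep the example dependency-free.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseLimit parses a memory.limit_in_bytes value, quoting the
// original string in the error as the review requests.
func parseLimit(s string) (int64, error) {
	p, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("cannot parse memory.limit_in_bytes %q: %w", s, err)
	}
	return p, nil
}

func main() {
	fmt.Println(parseLimit("12884901888")) // 12884901888 <nil>
	fmt.Println(parseLimit("garbage"))     // 0 and an error quoting "garbage"
}
```

Quoting the input makes failures like a stray newline or an empty file immediately visible in the log line.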

@roboquat roboquat merged commit 1daa0db into main Feb 10, 2022
@roboquat roboquat deleted the cw/cache-reclaim branch February 10, 2022 20:38
@roboquat roboquat added the "deployed: workspace" and "deployed" labels Feb 14, 2022
Labels
deployed: workspace (Workspace team change is running in production)
deployed (Change is completely running in production)
release-note
size/L
team: workspace (Issue belongs to the Workspace team)

Projects: None yet
3 participants