
[ws-daemon] Add page-cache reclaim #8139

Merged: 1 commit merged into main on Feb 10, 2022

Conversation

@csweichel csweichel (Contributor) commented Feb 10, 2022

Description

This PR introduces a page-cache reclaim mechanism for workspace cgroups. The kubelet computes memory pressure using the cgroup memory controller, which includes page caches. This means that when a workspace does a lot of file IO, we're likely to overestimate the memory pressure and evict pods when we should not, or fail the scheduling process.

With this change, we regularly check the page-cache use of each workspace on the node. If caches consume more than 15% of the cgroup memory limit (1.8GiB), we trigger a page-cache reclaim by writing to memory.force_empty. This should keep the cgroup's reported memory closer to actual memory consumption, at the expense of file IO and kernel CPU time.
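The check described above can be sketched in Go as follows. This is an illustrative sketch, not the actual ws-daemon code: the names shouldReclaim and forceEmpty, and the 12GiB example limit, are assumptions for the example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// shouldReclaim reports whether page-cache usage exceeds the 15%
// threshold of the cgroup memory limit described in the PR.
func shouldReclaim(cacheBytes, limitBytes uint64) bool {
	return cacheBytes > uint64(float64(limitBytes)*0.15)
}

// forceEmpty asks the kernel to drop the cgroup's page cache by
// writing to the cgroup v1 memory.force_empty file.
func forceEmpty(memCgroupPath string) error {
	return os.WriteFile(filepath.Join(memCgroupPath, "memory.force_empty"), []byte("1"), 0644)
}

func main() {
	// Example: a 12GiB limit puts the 15% threshold at ~1.8GiB of cache.
	limit := uint64(12 << 30)
	fmt.Println(shouldReclaim(2<<30, limit)) // true: 2GiB of cache exceeds 1.8GiB
	fmt.Println(shouldReclaim(1<<30, limit)) // false: 1GiB of cache is below 1.8GiB
}
```

In a real daemon this check would run on a timer per workspace cgroup, calling forceEmpty only when shouldReclaim returns true.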

Related Issue(s)

Fixes #7969

How to test

Make sure your workspace has a memory limit set, e.g. by inspecting/modifying ws-manager's config. Beware: on core-dev we currently don't set workspace resource limits.

while true; do sleep 1; (echo N; date +%s; cat /sys/fs/cgroup/memory/memory.stat | grep cache) >> /tmp/log.txt; done

in a new terminal

tar cvvfz $HOME/backup.tar.gz /workspace

Once the tar is done, look at the entries in /tmp/log.txt. The cache values should climb for a few seconds (at most 15) and then drop sharply, after which they'll climb again.

You would expect something like this behaviour:
[viz: graph of cache usage over time, attached to the PR]

Release Notes

Improved workspace memory-pressure eviction resilience

@codecov codecov bot commented Feb 10, 2022

Codecov Report

Merging #8139 (64c7ea2) into main (149ca15) will increase coverage by 5.12%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##             main    #8139      +/-   ##
==========================================
+ Coverage   10.03%   15.15%   +5.12%     
==========================================
  Files          24       45      +21     
  Lines        1944     4375    +2431     
==========================================
+ Hits          195      663     +468     
- Misses       1742     3655    +1913     
- Partials        7       57      +50     
Flag Coverage Δ
components-gitpod-cli-app 10.82% <ø> (ø)
components-local-app-app-darwin-amd64 ?
components-local-app-app-darwin-arm64 ?
components-local-app-app-linux-amd64 ?
components-local-app-app-linux-arm64 ?
components-local-app-app-windows-386 ?
components-local-app-app-windows-amd64 ?
components-local-app-app-windows-arm64 ?
components-workspacekit-app ?
components-ws-daemon-app 22.45% <ø> (?)
components-ws-daemon-lib 22.45% <ø> (?)
installer-raw-app 5.63% <ø> (?)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
components/workspacekit/cmd/rings.go
components/workspacekit/cmd/lift.go
components/local-app/pkg/auth/pkce.go
components/local-app/pkg/auth/auth.go
components/workspacekit/cmd/nsenter.go
components/workspacekit/cmd/root.go
installer/pkg/components/ws-manager/deployment.go 0.00% <0.00%> (ø)
installer/pkg/common/objects.go 0.00% <0.00%> (ø)
components/ws-daemon/pkg/quota/size.go 87.30% <0.00%> (ø)
installer/pkg/components/ws-manager/tlssecret.go 0.00% <0.00%> (ø)
... and 23 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 149ca15...64c7ea2.

@csweichel csweichel force-pushed the cw/cache-reclaim branch 2 times, most recently from 3dc4b76 to 71fdf9c Compare February 10, 2022 17:52
@csweichel csweichel marked this pull request as ready for review February 10, 2022 18:28
@csweichel csweichel requested a review from a team February 10, 2022 18:28
@github-actions github-actions bot added the "team: workspace" (Issue belongs to the Workspace team) label Feb 10, 2022
@csweichel csweichel force-pushed the cw/cache-reclaim branch 2 times, most recently from 467790b to 251cef7 Compare February 10, 2022 19:28
if cache > uint64(float64(limit)*0.15) {
	err := ioutil.WriteFile(filepath.Join(memCgroupPath, "memory.force_empty"), []byte("1"), 0644)
	if err != nil {
		return nil, xerrors.Errorf("cannot read memory.force_empty: %v", err)
	}
}
@sagor999 sagor999 (Contributor) commented Feb 10, 2022
this should be "cannot write", I think


p, err := strconv.ParseInt(s, 10, 64)
if err != nil {
	return 0, xerrors.Errorf("cannot parse memory.limit_in_bytes: %v", err)
}
A reviewer (Contributor) commented:
Can you also add the original value that we are trying to parse here?
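A sketch of what that could look like, with the offending input included in the error via %q. Assumptions: parseLimit is a hypothetical helper name, and fmt.Errorf stands in for xerrors to keep the example dependency-free.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseLimit parses a memory.limit_in_bytes value, quoting the
// original string in the error as the review requests.
func parseLimit(s string) (int64, error) {
	p, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("cannot parse memory.limit_in_bytes %q: %w", s, err)
	}
	return p, nil
}

func main() {
	fmt.Println(parseLimit("12884901888")) // 12884901888 <nil>
	fmt.Println(parseLimit("garbage"))     // 0 and an error quoting "garbage"
}
```

Quoting the input makes failures like a stray newline or an empty file immediately visible in the log line.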

@roboquat roboquat merged commit 1daa0db into main Feb 10, 2022
@roboquat roboquat deleted the cw/cache-reclaim branch February 10, 2022 20:38
@roboquat roboquat added the "deployed: workspace" and "deployed" labels Feb 14, 2022
Labels
deployed: workspace (Workspace team change is running in production)
deployed (Change is completely running in production)
release-note
size/L
team: workspace (Issue belongs to the Workspace team)

Projects: None yet
3 participants