Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ws-daemon] Refactor CPU limiting #8036

Merged
merged 3 commits into from
Feb 8, 2022
Merged

[ws-daemon] Refactor CPU limiting #8036

merged 3 commits into from
Feb 8, 2022

Conversation

csweichel
Copy link
Contributor

@csweichel csweichel commented Feb 4, 2022

Description

This PR refactors the dynamic CPU limiting performed by ws-daemon. More details regarding motivation and solution can be found in this loom video: https://www.loom.com/share/0922bf11430c4ac084ebab6616e40b7e

This PR also adds a set of metrics to observe the new CPU limits:

  • gitpod_ws_daemon_cpulimit_workspaces_added_total: Number of workspaces added to CPU control
  • gitpod_ws_daemon_cpulimit_workspaces_removed_total Number of workspaces removed from CPU control
  • gitpod_ws_daemon_cpulimit_workspaces_burst_total: Number of workspaces which received burst - gitpod_ws_daemon_cpulimit_workspaces_cputime_seconds CPU time of all observed workspaces
  • gitpod_ws_daemon_cpulimit_workspaces_throttled_total Number of workspaces which ran with throttled CPU

Related Issue(s)

Fixes #8035

How to test

By default dynamic CPU limiting is disabled. It can be enabled using the experimental config in the installer, or by modifying the ws-daemon configmap. Therein, set the cpuLimiting.enabled to true, and configure the total available bandwidth on the node (e.g. 12 CPUs), limit and burst limit.

Then, start a few workspaces, observe their /sys/fs/cgroup/cpu/cpu.cfs_quota_us value while producing some CPU load.

In addition, one could try this version on a single VM and experiment with workspaces exhibiting different load profiles.

Chaveats

  • this PR removes the process priority management we had prior. It's unclear how much that actually helped, considering that it very outdated and focused on bash and theia and gitpod-node only. If we find user experience deteriorated too much by this removal, we'll want to bring it back in a future change.
  • this change does not yet introduce QoS classes, but that would be straight forward to add down the road using annotations on the pod.
  • this change removes the fixed CPU limiting that was in place as agent-smith action (limit CPU), and that was meant to afford paying customers more CPU. It's doubtful that the latter had much of an effect, but we'll want to follow up quickly with the QoS change.

Release Notes

Refactor dynamic CPU limiting to provide fairer scheduling.

@roboquat
Copy link
Contributor

roboquat commented Feb 4, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from csweichel and additionally assign mrsimonemms after the PR has been reviewed.
You can assign the PR to them by writing /assign @mrsimonemms in a comment when ready.

Associated issue: #8035

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@roboquat roboquat added team: delivery Issue belongs to the self-hosted team team: workspace Issue belongs to the Workspace team size/XXL labels Feb 4, 2022
@codecov
Copy link

codecov bot commented Feb 4, 2022

Codecov Report

Merging #8036 (1ad8304) into main (5b74af7) will increase coverage by 3.19%.
The diff coverage is 33.70%.

❗ Current head 1ad8304 differs from pull request most recent head cf3b4c6. Consider uploading reports for the commit cf3b4c6 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #8036      +/-   ##
==========================================
+ Coverage   12.01%   15.21%   +3.19%     
==========================================
  Files          20       45      +25     
  Lines        1190     4352    +3162     
==========================================
+ Hits          143      662     +519     
- Misses       1043     3633    +2590     
- Partials        4       57      +53     
Flag Coverage Δ
components-gitpod-cli-app 10.86% <ø> (ø)
components-local-app-app-darwin-amd64 ?
components-local-app-app-darwin-arm64 ?
components-local-app-app-linux-amd64 ?
components-local-app-app-linux-arm64 ?
components-local-app-app-windows-386 ?
components-local-app-app-windows-amd64 ?
components-local-app-app-windows-arm64 ?
components-ws-daemon-app 22.61% <33.70%> (?)
components-ws-daemon-lib 22.61% <33.70%> (?)
components-ws-manager-bridge-api-go-lib ?
installer-raw-app 5.63% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
components/ws-daemon/pkg/cpulimit/cfs.go 0.00% <0.00%> (ø)
components/ws-daemon/pkg/cpulimit/dispatch.go 0.00% <0.00%> (ø)
components/ws-daemon/pkg/cpulimit/cpulimit.go 69.07% <69.07%> (ø)
components/ws-daemon/pkg/cpulimit/bandwidth.go 71.42% <71.42%> (ø)
components/local-app/pkg/auth/auth.go
components/local-app/pkg/auth/pkce.go
installer/pkg/common/common.go 4.45% <0.00%> (ø)
installer/pkg/components/ws-manager/rolebinding.go 0.00% <0.00%> (ø)
installer/pkg/components/ws-manager/tlssecret.go 0.00% <0.00%> (ø)
...onents/ws-daemon/pkg/internal/session/workspace.go 50.73% <0.00%> (ø)
... and 19 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b74af7...cf3b4c6. Read the comment docs.

@csweichel csweichel force-pushed the cw/cpulimit branch 4 times, most recently from a217998 to 03e64e1 Compare February 4, 2022 13:54
@csweichel csweichel force-pushed the cw/cpulimit branch 2 times, most recently from 982022d to 206e844 Compare February 4, 2022 15:25
@csweichel csweichel marked this pull request as ready for review February 4, 2022 17:23
@csweichel csweichel requested review from a team February 4, 2022 17:23
Copy link
Contributor

@corneliusludmann corneliusludmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Installer changes look good to me.

/approve

components/ws-daemon/pkg/cpulimit/cfs.go Outdated Show resolved Hide resolved
components/ws-daemon/pkg/cpulimit/cfs.go Outdated Show resolved Hide resolved
components/ws-daemon/pkg/cpulimit/cfs.go Outdated Show resolved Hide resolved
components/ws-daemon/pkg/cpulimit/cfs.go Outdated Show resolved Hide resolved
components/ws-daemon/pkg/cpulimit/cpulimit.go Outdated Show resolved Hide resolved
components/ws-daemon/pkg/cpulimit/cpulimit.go Outdated Show resolved Hide resolved
@csweichel
Copy link
Contributor Author

@sagor999 I've incorporated your feedback - please take another look

@roboquat roboquat merged commit 462e269 into main Feb 8, 2022
@roboquat roboquat deleted the cw/cpulimit branch February 8, 2022 15:36
@roboquat roboquat added the deployed: workspace Workspace team change is running in production label Feb 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deployed: workspace Workspace team change is running in production release-note size/XXL team: delivery Issue belongs to the self-hosted team team: workspace Issue belongs to the Workspace team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement globally optimal dynamic CPU limiting in ws-daemon
5 participants