Provide pressure stall information for workspaces #13703

Furisto · 2022-10-10T08:45:20Z

Description

Retrieves pressure stall information (PSI) for workspaces. Follow-up to #13539, which retrieved PSI at the node level.
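For readers unfamiliar with PSI: on cgroup v2 the kernel exposes pressure files (cpu.pressure, memory.pressure, io.pressure) containing "some" and "full" lines with avg10, avg60, avg300 and total fields. The following is a minimal sketch of parsing such a file, not the code from this PR; the names PSILine and ParsePSI are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// PSILine holds one line ("some" or "full") of a cgroup v2 pressure file.
// Hypothetical type for illustration; not taken from the PR.
type PSILine struct {
	Avg10, Avg60, Avg300 float64
	Total                uint64 // cumulative stall time in microseconds
}

// ParsePSI parses the contents of a pressure file such as cpu.pressure.
// The returned map is keyed by the line kind: "some" or "full".
func ParsePSI(content string) (map[string]PSILine, error) {
	result := make(map[string]PSILine)
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 0 {
			continue // skip empty lines
		}
		var line PSILine
		for _, field := range fields[1:] {
			key, value, ok := strings.Cut(field, "=")
			if !ok {
				return nil, fmt.Errorf("malformed field %q", field)
			}
			var err error
			switch key {
			case "avg10":
				line.Avg10, err = strconv.ParseFloat(value, 64)
			case "avg60":
				line.Avg60, err = strconv.ParseFloat(value, 64)
			case "avg300":
				line.Avg300, err = strconv.ParseFloat(value, 64)
			case "total":
				line.Total, err = strconv.ParseUint(value, 10, 64)
			}
			if err != nil {
				return nil, fmt.Errorf("parse %q: %w", field, err)
			}
		}
		result[fields[0]] = line
	}
	return result, scanner.Err()
}

func main() {
	// Sample content in the format the kernel writes to cpu.pressure.
	sample := "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456\n" +
		"full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
	psi, err := ParsePSI(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("some avg10=%.2f total=%d\n", psi["some"].Avg10, psi["some"].Total)
}
```

In a real collector the content would come from reading the workspace's cgroup directory; the sketch takes a string so it can be exercised without a cgroup v2 host.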

Related Issue(s)

n.a.

How to test

Start workspace in preview environment
kubectl port-forward ds/ws-daemon 9500
curl -XGET localhost:9500/metrics

Release Notes

NONE

Werft options:

/werft with-local-preview
If enabled this will build install/preview
/werft with-preview
/werft with-integration-tests=all
Valid options are all, workspace, webapp, ide

werft-gitpod-dev-com · 2022-10-10T08:45:29Z

started the job as gitpod-build-fo-workspace-psi.12 because the annotations in the pull request description changed
(with .werft/ from main)

components/common-go/cgroups/cgroup.go

easyCZ · 2022-10-10T12:11:49Z

components/common-go/cgroups/v2/io.go

+// Licensed under the GNU Affero General Public License (AGPL).
+// See License-AGPL.txt in the project root for license information.
+
+package v2


I think you'll want to name it cgroups_v2, because otherwise you'd import it as v2.IO, which isn't very descriptive.

Why?
Shouldn't the other package import it as common-go/cgroups/v2 and call either v2.NewIOControllerWithMount or v2.NewIOController?

kylos101 · 2022-10-10T15:07:16Z

/werft run with-integration-tests=workspace with-large-vm=true

👍 started the job as gitpod-build-fo-workspace-psi.13
(with .werft/ from main)

kylos101 · 2022-10-10T16:29:57Z

/werft run with-integration-tests=workspace with-large-vm=true

👍 started the job as gitpod-build-fo-workspace-psi.14
(with .werft/ from main)

kylos101 · 2022-10-10T16:30:10Z

Running tests again, the prior job failed setup.

kylos101 · 2022-10-10T21:35:38Z

@Furisto you might want to rebase with main and run another build? The integration tests are timing out after 2 hours.

Furisto · 2022-10-10T21:48:17Z

Might be due to this: https://gitpod.slack.com/archives/C032A46PWR0/p1665414537023909

components/ws-daemon/pkg/cgroup/plugin_psi.go

components/common-go/cgroups/cgroup.go

Furisto · 2022-10-14T09:36:00Z

/werft run with-integration-tests=workspace with-large-vm=true with-clean-slate-deployment=true

👍 started the job as gitpod-build-fo-workspace-psi.16
(with .werft/ from main)

Furisto · 2022-10-14T09:45:20Z

/werft run with-integration-tests=workspace with-large-vm=true with-clean-slate-deployment=true

👍 started the job as gitpod-build-fo-workspace-psi.17
(with .werft/ from main)

Furisto · 2022-10-16T10:14:45Z

/werft run with-integration-tests=workspace with-large-vm=true with-clean-slate-deployment=true

👍 started the job as gitpod-build-fo-workspace-psi.23
(with .werft/ from main)

Furisto · 2022-10-16T11:16:32Z

@easyCZ @utam0k PTAL

ArthurSens · 2022-10-18T13:45:05Z

@ArthurSens the new PSI metrics are available in our ws-daemon component (given the How to test notes above). Perhaps that is enough to resolve your concern?

Oh, I misunderstood the implementation, but unfortunately this doesn't solve the problem 😕. I understood that it was exposed by the workspace pods, so when Prometheus scrapes workspaces, a label called pod would introduce high cardinality. While exposing everything through ws-daemon solves the problem of multiple label values for each pod, you're adding the label workspace, which has exactly the same behavior.

What you need to do here is let go of the premise that you need to monitor every single workspace individually; at least, metrics won't work for this use case 😬. I don't have knowledge about PSI, but would it be possible to aggregate those values when exposing those metrics?
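To make the aggregation question concrete, here is a rough sketch of collapsing per-workspace PSI values into a fixed number of node-level values (count, max, mean), so that the number of exposed series no longer grows with the number of workspaces. This is an illustration only and not something the thread settles on; the Aggregate type and aggregate function are hypothetical.

```go
package main

import "fmt"

// Aggregate summarizes per-workspace PSI avg10 values for one node.
// Hypothetical type for illustration.
type Aggregate struct {
	Count int
	Max   float64
	Mean  float64
}

// aggregate reduces a map of workspace ID -> avg10 to three values.
// PSI averages are non-negative, so starting Max at zero is safe.
func aggregate(avg10ByWorkspace map[string]float64) Aggregate {
	agg := Aggregate{Count: len(avg10ByWorkspace)}
	if agg.Count == 0 {
		return agg
	}
	sum := 0.0
	for _, v := range avg10ByWorkspace {
		if v > agg.Max {
			agg.Max = v
		}
		sum += v
	}
	agg.Mean = sum / float64(agg.Count)
	return agg
}

func main() {
	node := map[string]float64{"ws-a": 1.0, "ws-b": 3.0}
	agg := aggregate(node)
	fmt.Printf("count=%d max=%.2f mean=%.2f\n", agg.Count, agg.Max, agg.Mean)
}
```

The trade-off is the one raised in the replies: aggregated values cannot explain why one particular workspace misbehaved.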

Furisto · 2022-10-18T13:53:07Z

What you need to do here is let go of the premise that you need to monitor every single workspace individually, at least metrics won't work for this use-case 😬. I don't have knowledge about PSI, but would it be possible to aggregate those values when exposing those metrics?

The idea is deliberately not to aggregate, because per-workspace values allow us to investigate why a workspace exhibited a certain behavior. I am struggling to understand why this would introduce so much additional load that it could break our metrics. Are we not already collecting a significant number of metrics for every pod on the cluster, e.g. everything in here? Compared to that, this PR does not introduce that many additional metrics from my point of view, but I am happy to discuss this.

ArthurSens · 2022-10-18T14:10:08Z

The idea behind this is not to aggregate it because it would allow us to investigate why a workspace exhibited a certain behavior.

Hmm I see, would it be possible to expose these metrics only for workspaces from paying customers? Thinking of self-hosted, maybe this metric could be turned on/off in the admin panel (Gitpod admins can choose teams that will have this metric exposed), so we can also choose our own customers there.

The problem with the current implementation is that every single workspace will introduce new time series, and we're definitely not analyzing PSI metrics for all of them. This means that we're wasting a huge amount of resources on something that we'll never use.

I am struggling to understand why this would introduce so much additional load that it could break our metrics. Are we not already collecting a significant number of metrics for every pod on the cluster, e.g. everything in here? Compared to that, this PR does not introduce that many additional metrics from my point of view, but I am happy to discuss this.

Indeed, compared to container metrics that are exposed by cAdvisor and Kubelet the costs are probably the same. This is the type of metric that we're really looking forward to removing 😅, they are indeed expensive.

The difference here is that we don't have much control over the kubelet nor cAdvisor metrics, while we do have control over our own metrics and we can be more conscious about them 🙂

Furisto · 2022-10-18T14:25:28Z

Hmm I see, would it be possible to expose these metrics only for workspaces from paying customers? Thinking of self-hosted, maybe this metric could be turned on/off in the admin panel (Gitpod admins can choose teams that will have this metric exposed), so we can also choose our own customers there.

Yes, that is certainly possible! I expect that this will be used more for paying customers anyway.
Thank you for your explanation!

iQQBot · 2022-10-18T16:09:07Z

It is possible that you can reuse the IDE metrics endpoints. We could consider moving them behind the shared proxy if that is generally useful.

Since the main purpose here is to collect workspace-specific metrics, this does not apply to ide-metrics (because in ide-metrics, we mainly collect aggregated metrics)

Furisto · 2022-10-19T13:33:34Z

Metrics are now only retrieved for workspaces of paying users.
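As a rough illustration of what conditional retrieval can look like, the sketch below filters workspaces by a per-workspace flag before any PSI is collected. It is not the code from the commits in this PR; the Workspace type, the PSIEnabled field and workspacesToScrape are hypothetical names.

```go
package main

import "fmt"

// Workspace is a hypothetical, simplified view of a running workspace.
type Workspace struct {
	ID         string
	PSIEnabled bool // hypothetical flag, e.g. set for workspaces of paying users
}

// workspacesToScrape returns the IDs of workspaces whose PSI should be collected.
// Workspaces without the flag produce no series at all.
func workspacesToScrape(all []Workspace) []string {
	var ids []string
	for _, ws := range all {
		if ws.PSIEnabled {
			ids = append(ids, ws.ID)
		}
	}
	return ids
}

func main() {
	running := []Workspace{
		{ID: "ws-free", PSIEnabled: false},
		{ID: "ws-paid", PSIEnabled: true},
	}
	fmt.Println(workspacesToScrape(running))
}
```

Gating at collection time, rather than dropping series later, keeps the number of exposed series proportional to the flagged workspaces only.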

Furisto · 2022-10-21T08:41:40Z

/unhold

atduarte · 2022-11-28T14:48:23Z

Hey @Furisto! 👋 Removed release notes as this does not affect the Gitpod end user (developer).

utam0k · 2022-11-28T23:19:49Z

Hey @Furisto! 👋 Removed release notes as this does not affect the Gitpod end user (developer).

@atduarte
You are right, but some users may be interested to see in the Changelog that Gitpod has started using PSI, because PSI is a recent feature ;) But it doesn't actually affect the user.

Furisto · 2022-11-29T09:58:24Z

@atduarte I am actually unsure why I wrote a release note 😆 As you said it does not affect the end user.

utam0k · 2022-11-29T11:42:01Z

@atduarte I am actually unsure why I wrote a release note 😆 As you said it does not affect the end user.

I guess release notes are used for creating the monthly Changelog.

atduarte · 2022-11-29T12:03:18Z

@utam0k yap, that's their sole purpose 😁 Thank you both!

Furisto added the team: workspace Issue belongs to the Workspace team label Oct 10, 2022

Furisto self-assigned this Oct 10, 2022

roboquat added release-note do-not-merge/work-in-progress labels Oct 10, 2022

roboquat added the size/L label Oct 10, 2022

Furisto added the feature: psi Pressure Stall Information label Oct 10, 2022

Furisto marked this pull request as ready for review October 10, 2022 11:14

Furisto requested review from a team October 10, 2022 11:14

roboquat removed the do-not-merge/work-in-progress label Oct 10, 2022

github-actions bot added the team: webapp Issue belongs to the WebApp team label Oct 10, 2022

easyCZ reviewed Oct 10, 2022


utam0k reviewed Oct 12, 2022

components/ws-daemon/pkg/cgroup/plugin_psi.go (resolved)

utam0k reviewed Oct 12, 2022

components/common-go/cgroups/cgroup.go (resolved)

Furisto force-pushed the fo/workspace-psi branch from d979b73 to 337ce0c Compare October 14, 2022 09:33

roboquat added size/XL and removed size/L labels Oct 14, 2022

Furisto force-pushed the fo/workspace-psi branch from 337ce0c to 568aa5d Compare October 16, 2022 07:47

jenting requested review from easyCZ and utam0k October 17, 2022 01:58

Furisto requested review from aledbf and sagor999 as code owners October 19, 2022 09:40

Furisto added 4 commits October 19, 2022 13:08

[server] Enable psi for paying users

1bfc9e4

[ws-daemon] Scrape psi conditionally

c1074b9

[ws-manager-api] Generate grpc for psi

f82baf1

[ws-manager] Handle psi feature flag

4d1b476

Furisto force-pushed the fo/workspace-psi branch from 3871511 to 4d1b476 Compare October 19, 2022 13:09

sagor999 approved these changes Oct 20, 2022


ArthurSens approved these changes Oct 20, 2022


aledbf approved these changes Oct 20, 2022


roboquat removed the do-not-merge/hold label Oct 21, 2022

roboquat merged commit 1cf84ad into main Oct 21, 2022

roboquat deleted the fo/workspace-psi branch October 21, 2022 08:42

roboquat added the deployed: webapp Meta team change is running in production label Oct 21, 2022

roboquat added deployed: workspace Workspace team change is running in production deployed Change is completely running in production labels Nov 3, 2022

roboquat added release-note-none and removed release-note labels Nov 28, 2022
