Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide pressure stall information for workspaces #13703

Merged
merged 10 commits into from
Oct 21, 2022
Merged

Conversation

Furisto
Copy link
Member

@Furisto Furisto commented Oct 10, 2022

Description

Retrieves pressure stall information for workspaces. Followup to #13539 which retrieved PSI on node level.

Related Issue(s)

n.a.

How to test

  • Start workspace in preview environment
  • kubectl port-forward ds/ws-daemon 9500
  • curl XGET localhost:9500/metrics

Release Notes

NONE

Werft options:

  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-integration-tests=all
    Valid options are all, workspace, webapp, ide

@Furisto Furisto added the team: workspace Issue belongs to the Workspace team label Oct 10, 2022
@Furisto Furisto self-assigned this Oct 10, 2022
@werft-gitpod-dev-com
Copy link

started the job as gitpod-build-fo-workspace-psi.12 because the annotations in the pull request description changed
(with .werft/ from main)

@Furisto Furisto added the feature: psi Pressure Stall Information label Oct 10, 2022
@Furisto Furisto marked this pull request as ready for review October 10, 2022 11:14
@Furisto Furisto requested review from a team October 10, 2022 11:14
@github-actions github-actions bot added the team: webapp Issue belongs to the WebApp team label Oct 10, 2022
components/common-go/cgroups/cgroup.go Outdated Show resolved Hide resolved
// Licensed under the GNU Affero General Public License (AGPL).
// See License-AGPL.txt in the project root for license information.

package v2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'll want to name it cgroups_v2 because otherwise you'd import it as v2.IO which isn't very descriptive

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?
Shouldn't the other package import as common-go/cgroups/v2 and calls either v2.NewIOControllerWithMount or v2.NewIOController?

@kylos101
Copy link
Contributor

kylos101 commented Oct 10, 2022

/werft run with-integration-tests=workspace with-large-vm=true

👍 started the job as gitpod-build-fo-workspace-psi.13
(with .werft/ from main)

@kylos101
Copy link
Contributor

kylos101 commented Oct 10, 2022

/werft run with-integration-tests=workspace with-large-vm=true

👍 started the job as gitpod-build-fo-workspace-psi.14
(with .werft/ from main)

@kylos101
Copy link
Contributor

Running tests again, the prior job failed setup.

@kylos101
Copy link
Contributor

@Furisto you might want to rebase with main and run another build? The integration tests are timing out after 2 hours.

@Furisto
Copy link
Member Author

Furisto commented Oct 10, 2022

@Furisto
Copy link
Member Author

Furisto commented Oct 14, 2022

/werft run with-integration-tests=workspace with-large-vm=true with-clean-slate-deployment=true

👍 started the job as gitpod-build-fo-workspace-psi.16
(with .werft/ from main)

@Furisto
Copy link
Member Author

Furisto commented Oct 14, 2022

/werft run with-integration-tests=workspace with-large-vm=true with-clean-slate-deployment=true

👍 started the job as gitpod-build-fo-workspace-psi.17
(with .werft/ from main)

@Furisto
Copy link
Member Author

Furisto commented Oct 16, 2022

/werft run with-integration-tests=workspace with-large-vm=true with-clean-slate-deployment=true

👍 started the job as gitpod-build-fo-workspace-psi.23
(with .werft/ from main)

@Furisto
Copy link
Member Author

Furisto commented Oct 16, 2022

@easyCZ @utam0k PTAL

@jenting jenting requested review from easyCZ and utam0k October 17, 2022 01:58
@ArthurSens
Copy link
Contributor

ArthurSens commented Oct 18, 2022

@ArthurSens the new PSI metrics are available in our ws-daemon component (given the how to test nodes above). Perhaps that is enough to resolve your concern?

Oh I misunderstood the implementation, but unfortunately this doesn't solve the problem 😕. I understood it was exposed by workspaces pod, so when prometheus scrapes workspaces, a label called pod would introduce high-cardinality. While exposing everything by ws-daemon will solve the problem of multiple label values for each pod, you're adding the label workspace that have the exactly same behavior.

What you need to do here is let go of the premise that you need to monitor every single workspace individually, at least metrics won't work for this use-case 😬. I don't have knowledge about PSI, but would it be possible to aggregate those values when exposing those metrics?

@Furisto
Copy link
Member Author

Furisto commented Oct 18, 2022

What you need to do here is let go of the premise that you need to monitor every single workspace individually, at least metrics won't work for this use-case grimacing. I don't have knowledge about PSI, but would it be possible to aggregate those values when exposing those metrics?

The idea behind this is not to aggregate it because it would allow us to investigate why a workspace exhibited a certain behavior. I am struggling to understand why this would introduce so much additional load that it could break our metrics. Are we not already collecting a significant number of metrics for every pod on the cluster e.g. everything in here. In comparison to that, this PR does not introduce that much additional metrics from my point of view but I am happy to discuss this.

@ArthurSens
Copy link
Contributor

ArthurSens commented Oct 18, 2022

The idea behind this is not to aggregate it because it would allow us to investigate why a workspace exhibited a certain behavior.

Hmm I see, would it be possible to expose these metrics only for workspaces from paying customers? Thinking of self-hosted, maybe this metric could be turned on/off in the admin panel (Gitpod admins can choose teams that will have this metric exposed), so we can also choose our own customers there.

The problem with the current implementation is that every single workspace will introduce a new metric and we're definitely not analyzing PSI metrics for all of them, this means that we're wasting a huge amount of resources on something that we'll never use.

I am struggling to understand why this would introduce so much additional load that it could break our metrics. Are we not already collecting a significant number of metrics for every pod on the cluster e.g. everything in here. In comparison to that, this PR does not introduce that much additional metrics from my point of view but I am happy to discuss this.

Indeed, compared to container metrics that are exposed by cAdvisor and Kubelet the costs are probably the same. This is the type of metric that we're really looking forward to removing 😅, they are indeed expensive.

The difference here is that we don't have much control over the kubelet nor cAdvisor metrics, while we do have control over our own metrics and we can be more conscious about them 🙂

@Furisto
Copy link
Member Author

Furisto commented Oct 18, 2022

Hmm I see, would it be possible to expose these metrics only for workspaces from paying customers? Thinking of self-hosted, maybe this metric could be turned on/off in the admin panel (Gitpod admins can choose teams that will have this metric exposed), so we can also choose our own customers there.

Yes, that is certainly possible! I expect that this will be used more for paying customers anyway.
Thank you for your explanation!

@iQQBot
Copy link
Contributor

iQQBot commented Oct 18, 2022

It is possible that you can reuse IDE metrics endpoints. We could consider to move them behind the shared proxy if it is useful generally

Since the main purpose here is to collect workspace-specific metrics, this does not apply to ide-metrics (because in ide-metrics, we mainly collect aggregated metrics)

@Furisto
Copy link
Member Author

Furisto commented Oct 19, 2022

Metrics are now only retrieved for workspaces of paying users.

@Furisto
Copy link
Member Author

Furisto commented Oct 21, 2022

/unhold

@roboquat roboquat merged commit 1cf84ad into main Oct 21, 2022
@roboquat roboquat deleted the fo/workspace-psi branch October 21, 2022 08:42
@roboquat roboquat added the deployed: webapp Meta team change is running in production label Oct 21, 2022
@roboquat roboquat added deployed: workspace Workspace team change is running in production deployed Change is completely running in production labels Nov 3, 2022
@atduarte
Copy link
Contributor

atduarte commented Nov 28, 2022

Hey @Furisto! 👋 Removed release notes as this does not affect the Gitpod end user (developer).

@utam0k
Copy link
Contributor

utam0k commented Nov 28, 2022

Hey @Furisto! wave Removed release notes as this does not affect the Gitpod end user (developer).

@atduarte
You are right, but some users may be interested in that gitpod starting using PSI if it is on ChangeLog because PSI is lastest feature ;) But it actually doesn't affect the user.

@Furisto
Copy link
Member Author

Furisto commented Nov 29, 2022

@atduarte I am actually unsure why I wrote a release note 😆 As you said it does not affect the end user.

@utam0k
Copy link
Contributor

utam0k commented Nov 29, 2022

@atduarte I am actually unsure why I wrote a release note laughing As you said it does not affect the end user.

I guess release notes are used creating monthly Changelog

@atduarte
Copy link
Contributor

atduarte commented Nov 29, 2022

@utam0k yap, that's their sole purpose 😁 Thank you both!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deployed: webapp Meta team change is running in production deployed: workspace Workspace team change is running in production deployed Change is completely running in production feature: psi Pressure Stall Information release-note-none size/XL team: webapp Issue belongs to the WebApp team team: workspace Issue belongs to the Workspace team
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.