[observability] Add SLI numbers to the Workspace Success Criteria Dashboard #10382

Merged
merged 1 commit into from
Jun 1, 2022

Conversation

Contributor

@atduarte atduarte commented May 31, 2022

Description

Currently the dashboard shows only the time series. To evaluate whether we are meeting our SLOs, it should also present each SLI as a single number.

See this for an example.

Related Issue(s)

N/A

How to test

Load this json in Grafana by:

  1. Creating a new dashboard
  2. Opening Dashboard Settings > JSON Model
  3. Replacing the JSON model with this file's contents, keeping the existing uid and version

Or see https://grafana.gitpod.io/d/Qljo7br7z/success-criteria-temporary-10382
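
Step 3 of the manual route amounts to swapping in the new model while keeping the identity of the dashboard that was just created. A minimal sketch of that merge, assuming both models are already parsed into dictionaries; the function name and the sample values are hypothetical and not part of this PR:

```python
import copy


def replace_model_keeping_identity(existing: dict, incoming: dict) -> dict:
    """Return a copy of `incoming` that carries the `uid` and `version` of `existing`."""
    merged = copy.deepcopy(incoming)
    for key in ("uid", "version"):
        if key in existing:
            merged[key] = existing[key]
        else:
            # The new dashboard has no such field, so do not import a foreign one.
            merged.pop(key, None)
    return merged


# Hypothetical models, for illustration only.
existing = {"uid": "abc123", "version": 1, "title": "New dashboard", "panels": []}
incoming = {"uid": "zzz999", "version": 7, "title": "Success Criteria", "panels": [{"type": "stat"}]}
merged = replace_model_keeping_identity(existing, incoming)
```

The result can then be pasted into Dashboard Settings > JSON Model; `incoming` itself is left unmodified.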

Release Notes

NONE

Documentation

@atduarte atduarte added the team: workspace label May 31, 2022
@atduarte atduarte requested a review from a team May 31, 2022 12:20
@atduarte atduarte self-assigned this May 31, 2022
@kylos101
Contributor

@atduarte I'm not able to test this in the preview environment or prod (by loading the JSON). I must be doing something wrong! Can you share a Loom showing how you're loading the JSON? 🙏

@ArthurSens I tried testing this in the preview environment, but I think the cluster label not having an actual value makes many of the dashboards not function correctly (including this one). Would it make sense to populate the cluster value for preview environments? I figure that would make testing dashboard changes easier (because we do test them in preview environments, assuming we have the data).

@atduarte
Contributor Author

atduarte commented Jun 1, 2022

@kylos101 I updated the "How to test" section to be clearer. The easier way is to just visit https://grafana.gitpod.io/d/Qljo7br7z/success-criteria-temporary-10382

Contributor

@kylos101 kylos101 left a comment

Beautiful

@roboquat roboquat merged commit 135a7de into main Jun 1, 2022
@roboquat roboquat deleted the ad-observability-ws-sc-sli branch June 1, 2022 12:14
@ArthurSens
Contributor

ArthurSens commented Jun 1, 2022

> @ArthurSens I tried testing this in the preview environment, but I think the cluster label not having an actual value makes many of the dashboards not function correctly (including this one). Would it make sense to populate the cluster value for preview environments? I figure that would make testing dashboard changes easier (because we do test them in preview environments, assuming we have the data).

The lack of a cluster label should not be a problem, because the filter says "anything that is different from ephemeral or meta" and an empty value satisfies this regex. You can confirm this by removing the cluster filter in the preview environment: the panel will keep showing "NO DATA".
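
To illustrate why an empty label still passes: Prometheus regex matchers are fully anchored, and a label that is absent is treated as an empty string. Below is a small Python sketch of a negative regex matcher under those two rules. The pattern `ephemeral|meta` is an assumption based on the description above, not the dashboard's actual filter, and Python's `re` stands in for Prometheus's RE2 engine:

```python
import re


def passes_negative_matcher(labels: dict, name: str, pattern: str) -> bool:
    """Mimic a `name!~"pattern"` matcher: keep the series unless the value fully matches."""
    value = labels.get(name, "")  # a missing label behaves like an empty value
    return re.fullmatch(pattern, value) is None


# Assumed pattern, for illustration only.
PATTERN = "ephemeral|meta"

preview_series = {"job": "ws-manager"}                   # no cluster label at all
meta_series = {"job": "ws-manager", "cluster": "meta"}   # excluded by the filter
prod_series = {"job": "ws-manager", "cluster": "prod"}   # hypothetical cluster name

kept = [passes_negative_matcher(s, "cluster", PATTERN)
        for s in (preview_series, meta_series, prod_series)]
```

Under these assumptions the series without a cluster label is kept, so the filter alone would not explain an empty panel.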

The problem here is that your query measures workspace stop failures, but not a single failure has happened. You need to generate this data somehow 😅. Maybe by doing some load tests?
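
A rough sketch of why this shows "NO DATA" rather than zero, assuming the panel divides a failure counter by a total counter and that a counter series which was never incremented may not exist at all. The metric names below are hypothetical, not the ones used by the dashboard:

```python
from typing import Optional


def stop_failure_ratio(increases: dict) -> Optional[float]:
    """Compute failed stops / total stops from per-metric increases over a window.

    `increases` maps a (hypothetical) metric name to its increase over the window.
    A metric that was never exported is simply absent from the mapping.
    """
    failed = increases.get("workspace_stops_failed")
    total = increases.get("workspace_stops_total")
    if failed is None or not total:
        # No failure series (or no stops at all): nothing to divide, so no value to show.
        return None
    return failed / total


# Only regular stops happened: the failure series does not exist yet.
only_clean_stops = stop_failure_ratio({"workspace_stops_total": 20.0})

# After one stop has failed, the ratio becomes computable.
with_one_failure = stop_failure_ratio(
    {"workspace_stops_failed": 1.0, "workspace_stops_total": 20.0}
)
```

In this sketch, starting and stopping workspaces cleanly only grows the total, which is why it never produces a value on its own.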

@ArthurSens
Contributor

ArthurSens commented Jun 1, 2022

I agree that performing load tests should be easier for preview environments... Maybe a feature request for our newest previewctl CLI? cc @gitpod-io/platform 👀

@kylos101
Contributor

kylos101 commented Jun 1, 2022

@ArthurSens I started and stopped workspaces many times - no data showed - even on the overview dashboard. I should have seen that at least 1 workspace was in the Running phase or seen an entry on the startup time heatmap, I think.

@ArthurSens
Contributor

ArthurSens commented Jun 1, 2022

> @ArthurSens I started and stopped workspaces many times - no data showed - even on the overview dashboard. I should have seen that at least 1 workspace was in the Running phase or seen an entry on the startup time heatmap, I think.

But your query depends on a workspace failing, not just stopping 😬

Stopping workspaces in the regular way won't produce the data that you want.

@kylos101
Contributor

kylos101 commented Jun 2, 2022

@ArthurSens ignore the workspace success criteria dashboard in this PR for a moment. In general, I was looking at the overview dashboard for the start-time heatmap and the count of workspaces by phase, and I didn't see any related data there. If I start workspaces, should the overview dashboard show me anything?

@ArthurSens
Contributor

Hmmm that is super awkward... Gitpod overview has always worked for me on previews, let me double check

@ArthurSens
Contributor

ArthurSens commented Jun 2, 2022

Yep, I could reproduce this problem on the preview from the main branch. The problem is not that metrics aren't being produced; something changed in Gitpod that makes Prometheus unable to scrape metrics from ws-manager, and from some other components as well.

image

This problem is also happening in production/staging... not really a problem with preview environments, but with Gitpod itself 😬
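
One way to spot this kind of scrape failure is to look at target health. The sketch below filters the JSON returned by Prometheus's `/api/v1/targets` endpoint for targets that are not `up`. The sample payload, job names, and error string are made up for illustration and are not taken from the screenshot above:

```python
def unhealthy_targets(payload: dict) -> list:
    """Return (job, last_error) pairs for active targets whose health is not "up"."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job", "<unknown>"), t.get("lastError", ""))
        for t in targets
        if t.get("health") != "up"
    ]


# Made-up response body in the shape of /api/v1/targets, for illustration only.
sample_payload = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "ws-manager"}, "health": "down",
             "lastError": "connection refused"},
            {"labels": {"job": "server"}, "health": "up", "lastError": ""},
        ]
    },
}

failing = unhealthy_targets(sample_payload)
```

An empty result would suggest the scrape side is healthy and the missing data has another cause.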

@kylos101
Contributor

kylos101 commented Jun 2, 2022

Now this makes sense given our conversation from earlier, thank you so much, @ArthurSens ! 🙏

@roboquat roboquat added the deployed: workspace and deployed labels Jun 21, 2022
Labels
deployed: workspace, deployed, release-note-none, size/L, team: workspace
4 participants