[alerts] Group by cluster, where relevant, ahead of centralizing rule evaluation #13766

Merged 2 commits on Oct 26, 2022

Member

@easyCZ easyCZ commented Oct 11, 2022

Description

To ensure we continue to receive alerts with cluster information once rule evaluation is centralized, we need to group by cluster.

Right now, the cluster label does not exist on the leaf Prometheus instances (in monitoring-satellite), but grouping by it doesn't affect the metrics there: it's aggregated away.

Related Issue(s)

Fixes https://github.com/gitpod-io/ops/issues/5598

How to test

Release Notes

NONE

Documentation

Werft options:

  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-integration-tests=all
    Valid options are all, workspace, webapp, ide

@easyCZ easyCZ requested review from ArthurSens and a team October 11, 2022 12:54
@github-actions github-actions bot added the "team: webapp" (Issue belongs to the WebApp team) label Oct 11, 2022
@easyCZ easyCZ force-pushed the mp/webapp-rules-cluster branch from 92a0b0f to adf8ed2 Compare October 11, 2022 12:55
@@ -87,20 +87,20 @@ spec:
   # description: db-sync pod not running

   - alert: MessagebusNotRunning
-    expr: up{job="messagebus"} < 1
+    expr: sum(up{job="messagebus"}) by (cluster) < 1
Contributor

We use sum now because with the centralization, the metric can be aggregated across all Webapp clusters?

Member Author

@easyCZ easyCZ Oct 11, 2022

Yep, with the centralization there would be 2 things happening:

  1. There would be multiple series for the raw metric - 1 for each cluster
  2. Grouping with by (cluster) is only available as a modifier on an aggregation operator such as sum, so we need to sum anyway to be able to group by cluster
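
The difference between the two expressions can be sketched outside Prometheus. The following is a minimal Python model, not PromQL itself: the sample values, the cluster names (`eu`, `us`), and the helper names `raw_below` and `sum_by_below` are all made up for illustration. It assumes that a series lacking the grouping label is treated as having an empty value for that label, which is how a missing label behaves in a `by` clause.

```python
from collections import defaultdict

# Hypothetical samples of the `up` metric as a central Prometheus might see
# them: one series per instance, each carrying a `cluster` label.
# Cluster names and instances are invented for this sketch.
samples = [
    ({"job": "messagebus", "cluster": "eu", "instance": "a"}, 1.0),
    ({"job": "messagebus", "cluster": "eu", "instance": "b"}, 0.0),
    ({"job": "messagebus", "cluster": "us", "instance": "c"}, 0.0),
]


def raw_below(samples, threshold):
    """Model `up{job="messagebus"} < 1`: compare each series on its own."""
    return [labels for labels, value in samples if value < threshold]


def sum_by_below(samples, label, threshold):
    """Model `sum(up{job="messagebus"}) by (<label>) < 1`.

    Series are grouped by the value of `label`. A series without that label
    lands in the group keyed by the empty string (assumption stated above).
    """
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels.get(label, "")] += value
    return {group: total for group, total in totals.items() if total < threshold}


firing_raw = raw_below(samples, 1)
firing_grouped = sum_by_below(samples, "cluster", 1)

# On a leaf Prometheus the `cluster` label is absent, so all series collapse
# into a single group with an empty cluster value.
leaf_samples = [
    ({"job": "messagebus", "instance": "a"}, 1.0),
    ({"job": "messagebus", "instance": "b"}, 0.0),
]
leaf_grouped = sum_by_below(leaf_samples, "cluster", 1)
```

In this model the raw comparison returns every instance that is down, while the grouped form returns at most one entry per cluster, and only when that cluster's sum falls below 1, i.e. when no instance in the cluster reports up. On a leaf without the `cluster` label, everything is summed into one group.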

Contributor

@laushinka laushinka left a comment

Approving and holding in case @ArthurSens has comments.
Feel free to unhold anytime, @easyCZ

/hold

Contributor

@ArthurSens ArthurSens left a comment

Looks perfect from my side!

@easyCZ, we discussed the problem with the namespace during the call today, right? I think I have a better approach!

In ArgoCD we specify the namespace where we want to deploy things (example), so we could just delete the namespace from the YAML files here, and our ArgoCD App definitions would take care of choosing which namespace to deploy to 🙂

Do you mind updating this PR to remove the namespace as well?
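
To illustrate why deleting the namespace from the manifests hands the decision to the Application definition, here is a minimal Python sketch. It only models the documented Argo CD rule that the destination namespace applies to namespaced resources that do not set `metadata.namespace` themselves; it is not Argo CD code. The function names, the `PrometheusRule` kind, the rule name, and the `monitoring-central` namespace are assumptions made for the example.

```python
import copy


def effective_namespace(manifest, destination_namespace):
    """Return the namespace a namespaced resource would be deployed to.

    Model: a namespace pinned in the manifest wins; otherwise the
    Application's destination namespace is used.
    """
    return manifest.get("metadata", {}).get("namespace") or destination_namespace


def drop_namespace(manifest):
    """Return a copy of the manifest without metadata.namespace."""
    stripped = copy.deepcopy(manifest)
    stripped.get("metadata", {}).pop("namespace", None)
    return stripped


# Hypothetical rule manifest; names and namespaces are illustrative only.
rule = {
    "kind": "PrometheusRule",
    "metadata": {"name": "messagebus-rules", "namespace": "monitoring-satellite"},
}

# With the namespace pinned in the YAML, the destination has no say.
pinned = effective_namespace(rule, "monitoring-central")

# With the namespace removed, the Application definition decides.
unpinned = effective_namespace(drop_namespace(rule), "monitoring-central")
```

Under these assumptions, the same manifest can be deployed by two different Application definitions into two different namespaces without editing the rule files.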

Member Author

easyCZ commented Oct 12, 2022

@ArthurSens Thanks for that pointer. I'm guessing the intent is to drop the namespace from our alerts and keep the monitoring-satellite here. When we go to switch it to central, we update the namespace there?

If that's the case, how do we ensure that for the transitional period (when we run rule eval both in central and in leafs) it gets deployed to both central and leafs? Would we add a second Application definition?

Member

@geropl geropl left a comment

Just for the record: ✔️

@ArthurSens
Contributor

> @ArthurSens Thanks for that pointer. I'm guessing the intent is to drop the namespace from our alerts and keep the monitoring-satellite here. When we go to switch it to central, we update the namespace there?

Correct! Not only the namespace, actually: we'll also need to change the destination (i.e. the cluster where it is deployed).

> If that's the case, how do we ensure that for the transitional period (when we run rule eval both in central and in leafs) it gets deployed to both central and leafs? Would we add a second Application definition?

I was planning to create a whole new app for central, verify that it works, then start removing the satellite apps.

@easyCZ easyCZ force-pushed the mp/webapp-rules-cluster branch from adf8ed2 to 4ee1f2c Compare October 14, 2022 12:27
Member Author

easyCZ commented Oct 14, 2022

@ArthurSens I've now also removed the namespace definitions. If this looks correct to you, please unhold the PR; otherwise, let me know where I went wrong :)

Member Author

easyCZ commented Oct 18, 2022

@ArthurSens ping on this. Could you please have a look and let me know if this PR is as expected? If so, please unhold.

@ArthurSens
Contributor

> @ArthurSens ping on this. Could you please have a look and let me know if this PR is as expected? If so, please unhold.

Sorry for the ghosting, we got hit by a few problems with ArgoCD and I got distracted. I'll get back to it this week!

Member Author

easyCZ commented Oct 24, 2022

@ArthurSens Ping on this.

@ArthurSens
Contributor

Alright, things are very slowly progressing, but we can merge this one without problems now

/unhold

@roboquat roboquat merged commit 72771c0 into main Oct 26, 2022
@roboquat roboquat deleted the mp/webapp-rules-cluster branch October 26, 2022 12:32
@roboquat roboquat added the "deployed: webapp" (Meta team change is running in production) and "deployed" (Change is completely running in production) labels Oct 28, 2022
Labels
deployed: webapp (Meta team change is running in production), deployed (Change is completely running in production), release-note-none, size/S, team: webapp (Issue belongs to the WebApp team)
5 participants