-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[alerts] Group by cluster, where relevant, ahead of centralizing rule evaluation #13766
Conversation
92a0b0f
to
adf8ed2
Compare
@@ -87,20 +87,20 @@ spec: | |||
# description: db-sync pod not running | |||
|
|||
- alert: MessagebusNotRunning | |||
expr: up{job="messagebus"} < 1 | |||
expr: sum(up{job="messagebus"}) by (cluster) < 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use sum
now because with the centralization, the metric can be aggregated across all Webapp clusters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, with the centralization there would be 2 things happening:
- There would be multiple series for the raw metric - 1 for each cluster
- Aggregations can only be applied to summary series (vectors), so we need to sum anyway to be able to group by cluster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving and holding in case @ArthurSens has comments.
Feel free to unhold anytime, @easyCZ
/hold
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks perfect from my side!
@easyCZ, we discussed the problem with the namespace during the call today, right? I think I have a better approach!
In ArgoCD we specify the namespace where we want to deploy things (example), we could just delete the namespace from the YAML files here, and our ArgoCD App definitions would take care of choosing which namespace to deploy to 🙂
Do you mind updating this PR to the move the namespace as well?
@ArthurSens Thanks for that pointer. I'm guessing the intent is to drop the namespace from our alerts and keep the If that's the case, how do we ensure that for the transitionary period (when we run both rule eval in central and in leafs) it gets deployed to both central and leafs? Would we we add a second Application definition? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for the record: ✔️
Correct! Not only namespace actually, but we'll also need to change
I was planning to create a whole new app for central verify that it works, then start removing the satellite apps |
adf8ed2
to
4ee1f2c
Compare
@ArthurSens I've now also removed the namespace definitions. If this looks correct to you, please unhold the PR otherwise let me know where I went wrong, please :) |
@ArthurSens ping on this. Could you please have a look and let me know if this PR is as expected? If so, please unhold. |
Sorry for the ghosting, we got hit by a few problems with ArgoCD and I got distracted. I'll get back to it this week! |
@ArthurSens Ping on this. |
Alright, things are very slowly progressing, but we can merge this one without problems now /unhold |
Description
To ensure we continue to receive the alerts with Cluster information, we've gotta group by cluster.
Right now, the cluster label does not exist on the leaf Prometheus instances (in monitoring-satellite) but it doesn't affect the metrics - it's aggregated away
Related Issue(s)
Fixes https://github.com/gitpod-io/ops/issues/5598
How to test
Release Notes
Documentation
Werft options:
If enabled this will build
install/preview
Valid options are
all
,workspace
,webapp
,ide