enhancements/monitoring: User-defined alerting rules proposal #958
Conversation
Force-pushed from 6532a7e to 865de6e.
cc: @openshift/openshift-team-monitoring @openshift/sre-alert-sme
- key:
    name: "Watchdog" # Alerts are identified by name and severity tuple.
    severity: "none"
The `name` and `severity` together are not enough to uniquely identify rules (for example, 2 rules in different rule groups with the same name and severity).
In order to uniquely identify rules in the Console UI, we use all of: (group file, group name, rule name, rule duration, rule query, rule labels). I think we will need to do the same or something similar here.
To expand on this a bit: I think the enhancement proposal could benefit from explicitly stating whether overrides must uniquely identify a single alert or not. Overriding or dropping multiple alerts with one spec could certainly be useful.
However, this adds significant complexity from the UI perspective: how can a user ensure their override does what they intend?
If we want explicit one-to-one matching, let's call that out and sketch what a user interaction would look like if they provide an insufficient spec (i.e. one that doesn't uniquely identify one alert).
Yeah, this is a point that absolutely needs clarification, thanks.
On a fresh 4.10-nightly cluster on AWS, I only see two alerts with this problem. One is `AlertmanagerClusterFailedToSendAlerts`, which is due to cluster-monitoring-operator explicitly bringing the `critical` version of the alert down to `warning`, and the other is `KubeAPIErrorBudgetBurn`, which is a bit more complicated. That one legitimately seems to have two versions each at the `warning` and `critical` severities. Of course there could be other instances shipped by add-ons.
I suppose this is a trade-off between preventing an override matching multiple rules, and requiring the user to specify a bunch of fields to match against -- ones that often aren't the most visible to users. Given the above, I don't think either of these options are unreasonable:
1. Consider multiple matches of name + severity as an error condition. Can't override that, sorry. We could couple this with amending the style-guide to disallow this, adding e2e tests for it in `origin`, and fixing the offending alerts.
2. Just apply the override to all matching alerts. Though, like you said, that may present UX challenges.
I guess I would lean towards the first option personally.
Having a clear and concise failure mode definitely beats a murky UI, so +1 for option 1. Though it might be valuable to at least mention both approaches in this document?
Another example here is Kube* rules that are generated with a `namespace=` field. SRE use this currently to determine what should/could be ignored.
> Another example here is Kube* rules that are generated with a `namespace=` field. SRE use this currently to determine what should/could be ignored.

Sorry, not sure I follow. Do you mean that you'd like to be able to match on the `namespace` label in addition to name and severity when overriding an alert?
I've added some content about this failure mode.
`KubeAPIErrorBudgetBurn` is interesting because it is a "valid" case and it spans 2 PrometheusRules resources (`kube-apiserver-slos-basic` and `kube-apiserver-slos-extended`, both in the `openshift-kube-apiserver` namespace).
Maybe we should allow an arbitrary set of static labels for defining the alert's key and patch the existing rule only if there's a unique match? The `name` of the alerting rule would be required at least.
For instance, to patch all `KubeAPIErrorBudgetBurn` rules, you'd have to write:
```yaml
overrides:
  - key:
      name: "KubeAPIErrorBudgetBurn"
      severity: "critical"
      long: "1h"
  - key:
      name: "KubeAPIErrorBudgetBurn"
      severity: "warning"
      long: "6h"
  - key:
      name: "KubeAPIErrorBudgetBurn"
      severity: "warning"
      long: "1d"
  - key:
      name: "KubeAPIErrorBudgetBurn"
      severity: "warning"
      long: "3d"
```
This makes sense to me. I've updated the proposal with a CRD structure that allows matching against an arbitrary set of static labels, but requires an exact match.
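For illustration only, here is a rough sketch of what one of those overrides might look like under label-set matching; the exact field names (in particular the rule-name field and the patched fields) are assumptions extrapolated from the snippets quoted later in this thread, not the final API:

```yaml
overrides:
  - name: "KubeAPIErrorBudgetBurn"   # assumed required rule-name field
    matchLabels:                     # optional static labels; a non-unique match is an error
      severity: "critical"
      long: "1h"
    action: "patch"
    labels:
      severity: "warning"            # example patch: lower the severity of this one variant
```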
This leaves the original alerting rules present and untouched. These, as well
as any rules listed with the `drop` action, must be ignored by the platform
Prometheus instance. A `Secret` containing [alert_relabel_configs][1] data for
Prometheus is generated to accomplish this:
Could you explain the reasons for keeping the overridden rules?
Conceptually it seems simpler to actually remove an overridden rule and replace it with the new rule. That would avoid needing the `alert_relabel_configs` secret and needing to handle a magic label for dropped and patched rules, and should reduce load on the system too. What are the benefits that make it worth keeping the old rules?
We want to allow limited overrides of existing rules without the operators having to do anything. We can't modify or remove existing alerts shipped by any OpenShift operator, because the operator will just revert the changes. So the only option is to generate a new alert and somehow ignore the original one.
Thanks. I think it would be good to briefly explain that background in the proposal.
Added a note about this to the proposal section.
The other reason is that we want to collect the "product" alerts via Telemetry. Otherwise we would lose the ability to catch regressions and such.
Added a note about telemetry. Thanks.
Force-pushed from 865de6e to 8370017.
Nice work!

TBD

## Alternatives
It would be useful to describe here why a ConfigMap is being used, instead of defining a new CRD for a regular API.
Yeah, to be honest I'm not sure there's a reason other than that's what we prototyped with. It's called out as an open question. If the consensus is to just go with a CRD from the start, we can do that.
We ended up deciding a CRD is best. I've added some detail about the proposed structure.
Nice job here @bison! This stuff seemingly gets really complicated quickly. Maybe too many options here become unproductive. I know for OSD we simply do:
- Silence an alert
- Create an alert (if the silence needs replacing/customizing)
This has served us well thus far and we do this via https://github.com/openshift/configure-alertmanager-operator/blob/master/pkg/controller/secret/secret_controller.go#L136-L249 . This is not pretty :) but gives you an example of what SRE is silencing based on X, Y, Z values that are available within alert metadata.
For usability + console, I think two functionalities:
- silence alert
- create alert (the create could enable the user to pick a current alert to template from, but this in no way links them together)
- created alerts could be stored in a PrometheusRule called 'custom-admin-alerts' (a sketch follows below)
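A minimal sketch of what such a 'custom-admin-alerts' `PrometheusRule` might contain (the alert shown is purely hypothetical):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-admin-alerts
  namespace: openshift-monitoring
spec:
  groups:
    - name: custom-admin-alerts
      rules:
        - alert: ClusterOperatorDegradedTooLong   # hypothetical SRE-authored alert
          expr: cluster_operator_conditions{condition="Degraded"} == 1
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "A cluster operator has reported Degraded for 30 minutes."
```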
found, and the alert extracted.

For each override a new alerting rule is generated with the overrides applied on
top of the existing rule, and this new rule is added to the "alert-overrides"
I am having a hard time grokking this. Could you please clarify?
- A ConfigMap exists with identifiers and actions
- Existing PrometheusRules are identified using the identifiers found in the ConfigMap
- Actions (patch, drop?) are applied to these found PrometheusRules
- Replacement rules are added to a PrometheusRule object called "alert-overrides"
- A label is added to each new replacement rule

Is that right?
Yeah, that's basically it. For each entry in the `overrides` section of the `ConfigMap`, find the matching rule in the existing `PrometheusRule` objects, then apply the action -- that is, either `drop` or `patch`.
If the action is `patch`, apply the user's changes on top of the existing rule, and add the result to the "alert-overrides" `PrometheusRule`. These always get a static label identifying them as a generated override rule. That label is what allows us to drop the original rule in the relabel config, because we can match only rules that don't have the label.
So, the last step is to then generate the `alert_relabel_configs` which drops the original version of any overridden alerts and any alerts the user specified the `drop` action for.
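To make that concrete, a minimal sketch of the kind of `alert_relabel_configs` data that could be generated; the override label name and the alert names here are placeholders, not what the implementation actually uses:

```yaml
alert_relabel_configs:
  # Drop the original copy of a patched alert. The generated replacement carries
  # the override label, so only the original (where the label is absent/empty) matches.
  - source_labels: [alertname, monitoring_openshift_io_override]
    regex: "KubeAPIErrorBudgetBurn;"
    action: drop
  # Drop an alert the user listed with the `drop` action.
  - source_labels: [alertname]
    regex: "Watchdog"
    action: drop
```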
I've added a bit trying to summarize this better in the proposal section.
### User Stories

- As an OpenShift user, I want to raise the severity of an alert from `Warning`
Suggested change: "As an OpenShift user, I want to manage the severity of an alert, choosing between `Info`, `Warning`, and `Critical`."
Thanks, I made this a bit more general.
### API Extensions

None, but this is an open question. Is there good reason to make this a CRD
+1 to CRD; could we build status into it for suppressed/changed alerts so that, as a user, I have a summary of what I have changed?
A CRD also provides better UX since it has built-in validation and documentation mechanisms
Agreed. I've added a proposed structure for a CRD now.
Force-pushed from 8370017 to d250283.
@jan--f, yeah, this is definitely something to think about. I guess in the case of OpenShift simply removing an alert, it's no worse than it would have been. The original alert and the generated alert with the overrides both go away. I think the UX is a little stranger if OpenShift renames an alert or changes its severity. In that case, my overrides are no longer applied, and I guess I have to go look at the cluster-monitoring-operator (or the CRD if we do that) to know why. In the worst case, that could lead to a critical alert not being routed the way I expect. Though, again, that's maybe not too different from the case without this feature.
Thanks for the feedback, @dofinn. So, does this feature sound useful to you? Or are you saying you think silences and copying / creating new alerts is enough, and this is too complicated?
I think the ability to silence and create alerts, combined with ease of management:
As mentioned previously, SRE already does this using workarounds, and those workarounds are not pretty. Managing "what's replacing what" or "what alert have I modified" (especially across releases) seems like it would introduce large complexities.
OpenShift platform alerting rules -- there seem to be only two cases. We may
want to add the requirement that these be unique to the [style-guide][2]
however, and enforce this in e2e tests going forward.
As I wrote above, the use case of multiple alerting rules with the same name and severity is valid IMHO. In practice there should be at least 1 static label that differentiates between rules with the same name/severity.
Have changed this to a CRD that allows matching an arbitrary set of static labels, but requires a unique match.
Force-pushed from d250283 to b8723f7.
  namespace: openshift-monitoring
spec:
  # List of overrides for alerts in platform namespaces.
  overrides:
I think this might be redundant because the resource name is already alertoverrides
Yeah, it's basically just there because it differentiates the list of overrides from the list of new additional rules. Not sure what else to call it.
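For what it's worth, a sketch of how the two lists might sit side by side in the spec; the `additionalRules` key name is made up here purely to illustrate the distinction:

```yaml
spec:
  overrides:               # changes to existing platform alerting rules
    - name: "Watchdog"     # assumed rule-name field, as in the earlier sketch
      action: "drop"
  additionalRules:         # hypothetical key for brand-new user-defined rules
    - alert: "MyCustomAlert"
      expr: 'up{job="my-service"} == 0'
      for: "10m"
      labels:
        severity: "warning"
```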
Ah ok, I forgot about the additional rules. Then this makes sense 👍
    matchLabels: # optional, but multiple matches is an error reported in status.
      severity: "critical"
      long: "1h"
    action: "patch"
Instead of having the action as a value, have we considered having it as a key? This would allow us to define different parameters for different actions in the future if such a need arises. E.g.
```yaml
spec:
  patch:
    ...
  drop:
    ...
```
I don't have a super strong opinion on this. The way it is now somehow feels like a better middle ground between being Kubernetes-ish, and looking a bit like the existing Prometheus / Alertmanager configs to me. But if the consensus is to have a separate key for each action, that works too.
Good point on changing the parameters. Though, I guess we could already add new optional parameters in the existing version, and anything else probably needs a bump to the API resource version anyway, right?
I think the advantage of separate keys is that we can have different configuration options for each operation. This makes it more visible to the user which option can be used for each action, and also adds some "typing" to action parameters.
But anyway, it's just an option to consider. I don't think it's really a show stopper to have the action as a value.
Agree. This is easier to read when we divide the list into patch, drop, add sections.
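A rough sketch of how that action-as-key layout might read, purely for comparison (field names are illustrative, not the structure the proposal settled on):

```yaml
spec:
  patch:
    - name: "KubeAPIErrorBudgetBurn"
      matchLabels:
        severity: "critical"
        long: "1h"
      labels:
        severity: "warning"
  drop:
    - name: "Watchdog"
  add:
    - alert: "MyCustomAlert"
      expr: 'up{job="my-service"} == 0'
      for: "10m"
      labels:
        severity: "warning"
```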
I also wanted to ping @openshift/team-cnv and @openshift/openshift-team-storage -- I think there were people on those teams that had expressed interest in this.
From the storage team perspective, we're interested for sure. This proposal looks OK. It would be nice if users could see, in some "advanced view":
I think both are really hard to achieve, just leaving it here for consideration.
It would be great if more details could be given on validation of the rule specification, for example:
action: "patch" | ||
labels: | ||
patched: "true" |
AFAIU any patched rule will have an automatic label added to it, so that we can identify it as a patched alert. If a patch directive attempts to overwrite this generated label, we'll run into issues where users can mess with the ability to discern whether an alert stems from a shipped rule or from an overwritten one. For example, assume the generated label name is `patched` and a patch rule adds `patched: "false"`.
The proposal should specifically call out this (error) scenario.
This is a good point, but having thought about it again, we add the label identifying the alert as patched to the newly generated rule, and we do it after applying the user-supplied overrides. So in practice, it's not possible to change it.
Also, in the relabel config, we match the original alerts by testing for the absence of the label. We don't actually care what the label's value is. So I don't think this presents a problem in the current design.
Added a note about the label being added after patches are applied.
Force-pushed from b8723f7 to 033ab2e.
Good points, @raptorsun. I added a note about validation with a link to the code Prometheus uses. Should be easy to reuse and allow validation of each generated rule. I'm not sure I fully understand what you mean about dropped rules being referenced. Though, keep in mind dropped rules are only prevented from being sent to Alertmanager. They still exist in Prometheus. Can you maybe give an example?
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle stale`. If this proposal is safe to close now please do so with `/close`. /lifecycle stale
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle rotten`. If this proposal is safe to close now please do so with `/close`. /lifecycle rotten
- Watch for a user-defined list of overrides and new rules.
- For each override, find the existing rule matching the name and labels, and
  apply the action specified by the user, i.e. `patch` or `drop`.
  - If the action is `patch`, generate a new rule with supplied overrides.
  - If the action is `drop`, simply add the alert to the alert relabel config.
  - If multiple alerts match the name and labels, notify the user and skip
    this override.
- Generate new `PrometheusRule` objects for any new user-defined rules.
- Generate the [alert_relabel_configs][1] to drop original alerts that have been
  replaced with a patched version, and any that were explicitly dropped.
I like the operator approach. I wonder if a scenario is missing here:
Should the operator not also watch `PrometheusRule` objects for changes? The original object might change, requiring the generated object to be regenerated. I'm not really sure if the `PrometheusRule` object that CMO creates is an input to the object generated by this new operator.
In any case it might be worth calling this out either way here.
Good point, that is already happening in the proof-of-concept branch. I added a note here about it:
> Processing of overrides is also triggered by `PrometheusRule` object changes, and changes to the set of platform namespaces.
Rotten enhancement proposals close after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Reopen the proposal by commenting `/reopen`. /close
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@bison: Reopened this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/remove-lifecycle rotten
This adds an enhancement proposal for supporting user-defined alerting rules and limited alert overrides in the OpenShift monitoring stack.
Force-pushed from 033ab2e to c2d3da0.
I've updated openshift/cluster-monitoring-operator#1473 to match the implementation described here -- using a CRD and matching on label sets.
@bison: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
I started a possible alternative proposal here: #1030 -- Not sure if that's the best way to go about that, but I didn't know what else to do.
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle stale`. If this proposal is safe to close now please do so with `/close`. /lifecycle stale
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle rotten`. If this proposal is safe to close now please do so with `/close`. /lifecycle rotten
Rotten enhancement proposals close after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Reopen the proposal by commenting `/reopen`. /close
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
#1030 supersedes this
This adds an enhancement proposal for supporting user-defined alerting
rules and limited alert overrides in the OpenShift monitoring stack.