enhancements/monitoring: User-defined alerting rules proposal #958
Conversation
Force-pushed from 6532a7e to 865de6e.
cc: @openshift/openshift-team-monitoring @openshift/sre-alert-sme
- key:
    name: "Watchdog" # Alerts are identified by name and severity tuple.
    severity: "none"
The `name` and `severity` together are not enough to uniquely identify rules (for example, 2 rules in different rule groups with the same name and severity).
In order to uniquely identify rules in the Console UI, we use all of: (group file, group name, rule name, rule duration, rule query, rule labels). I think we will need to do the same or something similar here.
To expand on this a bit: I think the enhancement proposal could benefit from explicitly stating whether overrides must uniquely identify a single alert or not. Overriding or dropping multiple alerts with one spec could certainly be useful.
However, this adds significant complexity from the UI perspective: how can a user ensure their override does what they intend?
If we want explicit one-to-one matching, let's call that out and sketch what a user interaction would look like if they provide an insufficient spec (i.e. one that doesn't uniquely identify one alert).
Yeah, this is a point that absolutely needs clarification, thanks.
On a fresh 4.10-nightly cluster on AWS, I only see two alerts with this problem. One is `AlertmanagerClusterFailedToSendAlerts`, which is due to cluster-monitoring-operator explicitly bringing the `critical` version of the alert down to `warning`, and the other is `KubeAPIErrorBudgetBurn`, which is a bit more complicated. That one legitimately seems to have two versions each at the `warning` and `critical` severities. Of course there could be other instances shipped by add-ons.
I suppose this is a trade-off between preventing an override matching multiple rules, and requiring the user to specify a bunch of fields to match against -- ones that often aren't the most visible to users. Given the above, I don't think either of these options are unreasonable:
1. Consider multiple matches of name + severity as an error condition. Can't override that, sorry. We could couple this with amending the style-guide to disallow this, adding e2e tests for it in `origin`, and fixing the offending alerts.
2. Just apply the override to all matching alerts. Though, like you said, that may present UX challenges.
I guess I would lean towards the first option personally.
Having a clear and concise failure mode definitely beats a murky UI, so +1 for option 1. Though it might be valuable to at least mention both approaches in this document?
Another example here is Kube* rules that are generated with a `namespace=` field. SRE use this currently to determine what should/could be ignored.
> Another example here is Kube* rules that are generated with a `namespace=` field. SRE use this currently to determine what should/could be ignored.

Sorry, not sure I follow. Do you mean that you'd like to be able to match on the `namespace` label in addition to name and severity when overriding an alert?
I've added some content about this failure mode.
`KubeAPIErrorBudgetBurn` is interesting because it is a "valid" case and it spans 2 PrometheusRules resources (`kube-apiserver-slos-basic` and `kube-apiserver-slos-extended`, both in the `openshift-kube-apiserver` namespace).
Maybe we should allow an arbitrary set of static labels for defining the alert's key and patch the existing rule only if there's a unique match? The `name` of the alerting rule would be required at least.
For instance, to patch all `KubeAPIErrorBudgetBurn` rules, you'd have to write:
```yaml
overrides:
  - key:
      name: "KubeAPIErrorBudgetBurn"
      severity: "critical"
      long: "1h"
  - key:
      name: "KubeAPIErrorBudgetBurn"
      severity: "warning"
      long: "6h"
  - key:
      name: "KubeAPIErrorBudgetBurn"
      severity: "warning"
      long: "1d"
  - key:
      name: "KubeAPIErrorBudgetBurn"
      severity: "warning"
      long: "3d"
```
This makes sense to me. I've updated the proposal with a CRD structure that allows matching against an arbitrary set of static labels, but requires an exact match.
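For illustration only, here is a rough sketch of what one of those overrides might look like under label-set matching; the exact field names (in particular the rule-name field and the patched fields) are assumptions extrapolated from the snippets quoted later in this thread, not the final API:

```yaml
overrides:
  - name: "KubeAPIErrorBudgetBurn"   # assumed required rule-name field
    matchLabels:                     # optional static labels; a non-unique match is an error
      severity: "critical"
      long: "1h"
    action: "patch"
    labels:
      severity: "warning"            # example patch: lower the severity of this one variant
```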
This leaves the original alerting rules present and untouched. These, as well
as any rules listed with the `drop` action, must be ignored by the platform
Prometheus instance. A `Secret` containing [alert_relabel_configs][1] data for
Prometheus is generated to accomplish this:
Could you explain the reasons for keeping the overridden rules?
Conceptually it seems simpler to actually remove an overridden rule and replace it with the new rule. That would avoid needing the `alert_relabel_configs` secret and needing to handle a magic label for dropped and patched rules, and should reduce load on the system too. What are the benefits that make it worth keeping the old rules?
We want to allow limited overrides of existing rules without the operators having to do anything. We can't modify or remove existing alerts shipped by any OpenShift operator, because the operator will just revert the changes. So the only option is to generate a new alert and somehow ignore the original one.
Thanks. I think it would be good to briefly explain that background in the proposal.
Added a note about this to the proposal section.
The other reason is that we want to collect the "product" alerts via Telemetry. Otherwise we would lose the ability to catch regressions and such.
Added a note about telemetry. Thanks.
Force-pushed from 865de6e to 8370017.
Nice work!

TBD

## Alternatives
It would be useful to describe here why a ConfigMap is being used, instead of defining a new CRD for a regular API.
Yeah, to be honest I'm not sure there's a reason other than that's what we prototyped with. It's called out as an open question. If the consensus is to just go with a CRD from the start, we can do that.
We ended up deciding a CRD is best. I've added some detail about the proposed structure.
Nice job here @bison! This stuff seemingly gets really complicated quickly. Maybe too many options here become unproductive. I know for OSD we simply do:
- Silence an alert
- Create an alert (if the silence needs replacing/customizing)
This has served us well thus far and we do this via https://github.com/openshift/configure-alertmanager-operator/blob/master/pkg/controller/secret/secret_controller.go#L136-L249 . This is not pretty :) but gives you an example of what SRE is silencing based on X, Y, Z values that are available within alert metadata.
For usability + console, I think two functionalities:
- silence alert
- create alert (the create could enable the user to pick a current alert to template from, but this in no way links them together)
- created alerts could be stored in a PrometheusRule called 'custom-admin-alerts' (a sketch follows below)
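A minimal sketch of what such a 'custom-admin-alerts' `PrometheusRule` might contain (the alert shown is purely hypothetical):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-admin-alerts
  namespace: openshift-monitoring
spec:
  groups:
    - name: custom-admin-alerts
      rules:
        - alert: ClusterOperatorDegradedTooLong   # hypothetical SRE-authored alert
          expr: cluster_operator_conditions{condition="Degraded"} == 1
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "A cluster operator has reported Degraded for 30 minutes."
```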
found, and the alert extracted.

For each override a new alerting rule is generated with the overrides applied on
top of the existing rule, and this new rule is added to the "alert-overrides"
I am having a hard time grokking this. Could you please clarify?
- A ConfigMap exists with identifiers and actions
- Existing PrometheusRules are identified using the identifiers found in the ConfigMap
- Actions (patch, drop?) are applied to these found PrometheusRules
- Replacement rules are added to a PrometheusRule object called "alert-overrides"
- A label is added to each new replacement rule

Is that right?
Yeah, that's basically it. For each entry in the `overrides` section of the `ConfigMap`, find the matching rule in the existing `PrometheusRule` objects, then apply the action -- that is, either `drop` or `patch`.
If the action is `patch`, apply the user's changes on top of the existing rule, and add the result to the "alert-overrides" `PrometheusRule`. These always get a static label identifying them as a generated override rule. That label is what allows us to drop the original rule in the relabel config, because we can match only rules that don't have the label.
So, the last step is to then generate the `alert_relabel_configs` which drops the original version of any overridden alerts and any alerts the user specified the `drop` action for.
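To make that concrete, a minimal sketch of the kind of `alert_relabel_configs` data that could be generated; the override label name and the alert names here are placeholders, not what the implementation actually uses:

```yaml
alert_relabel_configs:
  # Drop the original copy of a patched alert. The generated replacement carries
  # the override label, so only the original (where the label is absent/empty) matches.
  - source_labels: [alertname, monitoring_openshift_io_override]
    regex: "KubeAPIErrorBudgetBurn;"
    action: drop
  # Drop an alert the user listed with the `drop` action.
  - source_labels: [alertname]
    regex: "Watchdog"
    action: drop
```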
I've added a bit trying to summarize this better in the proposal section.
### User Stories

- As an OpenShift user, I want to raise the severity of an alert from `Warning`
Suggested change: "As an OpenShift user, I want to manage the severity of an alert, choosing between `Info`, `Warning`, and `Critical`."
Thanks, I made this a bit more general.
### API Extensions

None, but this is an open question. Is there good reason to make this a CRD
+1 to CRD; could we build status into it for suppressed/changed alerts so that, as a user, I have a summary of what I have changed?
A CRD also provides better UX since it has built-in validation and documentation mechanisms
Agreed. I've added a proposed structure for a CRD now.
Force-pushed from 8370017 to d250283.
@jan--f, yeah, this is definitely something to think about. I guess in the case of OpenShift simply removing an alert, it's no worse than it would have been. The original alert and the generated alert with the overrides both go away. I think the UX is a little stranger if OpenShift renames an alert or changes its severity. In that case, my overrides are no longer applied, and I guess I have to go look at the cluster-monitoring-operator (or the CRD if we do that) to know why. In the worst case, that could lead to a critical alert not being routed the way I expect. Though, again, that's maybe not too different from the case without this feature.
Thanks for the feedback, @dofinn. So, does this feature sound useful to you? Or are you saying you think silences and copying / creating new alerts is enough, and this is too complicated?
I think the ability to silence and create alerts, combined with ease of management:
As mentioned previously, SRE already does this using workarounds, and those workarounds are not pretty. Managing "what's replacing what" or "what alert have I modified" (especially across releases) seems like it would introduce large complexities.
OpenShift platform alerting rules -- there seem to be only two cases. We may
want to add the requirement that these be unique to the [style-guide][2]
however, and enforce this in e2e tests going forward.
As I wrote above, the use case of multiple alerting rules with the same name and severity is valid IMHO. In practice there should be at least 1 static label that differentiates between rules with the same name/severity.
Have changed this to a CRD that allows matching an arbitrary set of static labels, but requires a unique match.
Force-pushed from d250283 to b8723f7.
  namespace: openshift-monitoring
spec:
  # List of overrides for alerts in platform namespaces.
  overrides:
I think this might be redundant because the resource name is already alertoverrides
Yeah, it's basically just there because it differentiates the list of overrides from the list of new additional rules. Not sure what else to call it.
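For what it's worth, a sketch of how the two lists might sit side by side in the spec; the `additionalRules` key name is made up here purely to illustrate the distinction:

```yaml
spec:
  overrides:               # changes to existing platform alerting rules
    - name: "Watchdog"     # assumed rule-name field, as in the earlier sketch
      action: "drop"
  additionalRules:         # hypothetical key for brand-new user-defined rules
    - alert: "MyCustomAlert"
      expr: 'up{job="my-service"} == 0'
      for: "10m"
      labels:
        severity: "warning"
```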
Ah ok, I forgot about the additional rules. Then this makes sense 👍
    matchLabels: # optional, but multiple matches is an error reported in status.
      severity: "critical"
      long: "1h"
    action: "patch"
Instead of having the action as a value, have we considered having it as a key? This would allow us to define different parameters for different actions in the future if such a need arises. E.g.
```yaml
spec:
  patch:
    ...
  drop:
    ...
```
I don't have a super strong opinion on this. The way it is now somehow feels like a better middle ground between being Kubernetes-ish, and looking a bit like the existing Prometheus / Alertmanager configs to me. But if the consensus is to have a separate key for each action, that works too.
Good point on changing the parameters. Though, I guess we could already add new optional parameters in the existing version, and anything else probably needs a bump to the API resource version anyway, right?
I think the advantage of separate keys is that we can have different configuration options for each operation. This makes it more visible to the user which option can be used for each action, and also adds some "typing" to action parameters.
But anyway, it's just an option to consider. I don't think it's really a show stopper to have the action as a value.
Agree. This is easier to read when we divide the list into patch, drop, add sections.
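A rough sketch of how that action-as-key layout might read, purely for comparison (field names are illustrative, not the structure the proposal settled on):

```yaml
spec:
  patch:
    - name: "KubeAPIErrorBudgetBurn"
      matchLabels:
        severity: "critical"
        long: "1h"
      labels:
        severity: "warning"
  drop:
    - name: "Watchdog"
  add:
    - alert: "MyCustomAlert"
      expr: 'up{job="my-service"} == 0'
      for: "10m"
      labels:
        severity: "warning"
```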
I also wanted to ping @openshift/team-cnv and @openshift/openshift-team-storage -- I think there were people on those teams that had expressed interest in this.
From the storage team perspective, we're interested for sure. This proposal looks OK. It would be nice if users could see, in some "advanced view":
I think both are really hard to achieve, just leaving it here for consideration.
It would be great if more details could be given on validation of the rule specification, for example:
action: "patch" | ||
labels: | ||
patched: "true" |
AFAIU any patched rule will have an automatic label added to it, so that we can identify it as a patched alert. If a patch directive attempts to overwrite this generated label, we'll run into issues where users can mess with the ability to discern whether an alert stems from a shipped rule or from an overwritten one. For example, assume the generated label name is `patched` and a patch rule adds `patched: "false"`.
The proposal should specifically call out this (error) scenario.
This is a good point, but having thought about it again, we add the label identifying the alert as patched to the newly generated rule, and we do it after applying the user-supplied overrides. So in practice, it's not possible to change it.
Also, in the relabel config, we match the original alerts by testing for the absence of the label. We don't actually care what the label's value is. So I don't think this presents a problem in the current design.
Added a note about the label being added after patches are applied.
Force-pushed from b8723f7 to 033ab2e.
Good points, @raptorsun. I added a note about validation with a link to the code Prometheus uses. Should be easy to reuse and allow validation of each generated rule. I'm not sure I fully understand what you mean about dropped rules being referenced. Though, keep in mind dropped rules are only prevented from being sent to Alertmanager. They still exist in Prometheus. Can you maybe give an example?
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle stale`. If this proposal is safe to close now please do so with `/close`. /lifecycle stale
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle rotten`. If this proposal is safe to close now please do so with `/close`. /lifecycle rotten
- Watch for a user-defined list of overrides and new rules.
- For each override, find the existing rule matching the name and labels, and
  apply the action specified by the user, i.e. `patch` or `drop`.
  - If the action is `patch`, generate a new rule with supplied overrides.
  - If the action is `drop`, simply add the alert to the alert relabel config.
  - If multiple alerts match the name and labels, notify the user and skip
    this override.
- Generate new `PrometheusRule` objects for any new user-defined rules.
- Generate the [alert_relabel_configs][1] to drop original alerts that have been
  replaced with a patched version, and any that were explicitly dropped.
I like the operator approach. I wonder if a scenario is missing here:
Should the operator not also watch `PrometheusRule` objects for changes? The original object might change, requiring the generated object to be regenerated. I'm not really sure if the `PrometheusRule` object that CMO creates is an input to the object generated by this new operator.
In any case it might be worth calling this out either way here.
Good point, that is already happening in the proof-of-concept branch. I added a note here about it:
> Processing of overrides is also triggered by `PrometheusRule` object changes, and changes to the set of platform namespaces.
Rotten enhancement proposals close after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Reopen the proposal by commenting `/reopen`. /close
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@bison: Reopened this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/remove-lifecycle rotten
This adds an enhancement proposal for supporting user-defined alerting rules and limited alert overrides in the OpenShift monitoring stack.
Force-pushed from 033ab2e to c2d3da0.
I've updated openshift/cluster-monitoring-operator#1473 to match the implementation described here -- using a CRD and matching on label sets.
@bison: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
I started a possible alternative proposal here: #1030 -- Not sure if that's the best way to go about that, but I didn't know what else to do.
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle stale`. If this proposal is safe to close now please do so with `/close`. /lifecycle stale
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle rotten`. If this proposal is safe to close now please do so with `/close`. /lifecycle rotten
Rotten enhancement proposals close after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Reopen the proposal by commenting `/reopen`. /close
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
#1030 supersedes this
This adds an enhancement proposal for supporting user-defined alerting
rules and limited alert overrides in the OpenShift monitoring stack.