
alerting as a feature #682

Closed · wants to merge 3 commits

Conversation

dofinn (Contributor) commented Mar 9, 2021

No description provided.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign dhellmann after the PR has been reviewed.
You can assign the PR to them by writing /assign @dhellmann in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci bot commented Mar 9, 2021

@dofinn: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/markdownlint | 59cf9e9 | link | /test markdownlint |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.


Cause-Based Alert: describes the direct cause of an issue. Example: etcdMemoryUtilization @ 100%.

Symptom-Based Alert: describes a symptom whose source is a cause-based alert. Example: etcd connection latency > 200ms. This may be caused by etcdMemoryUtilization (or not?)
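
For concreteness, here is a minimal PrometheusRule sketch of one alert of each type, assuming the Prometheus Operator API that OpenShift monitoring already uses; the metric names, thresholds, and severities below are illustrative and are not taken from the proposal:

```yaml
# Illustrative only: one cause-based and one symptom-based rule.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cause-vs-symptom-example
  namespace: openshift-monitoring
spec:
  groups:
  - name: example.rules
    rules:
    # Cause-based: names the direct cause (etcd memory exhaustion).
    - alert: EtcdMemoryUtilizationHigh
      expr: process_resident_memory_bytes{job="etcd"} > 8e9  # illustrative absolute bound
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: etcd resident memory on {{ $labels.instance }} is above the expected bound.
    # Symptom-based: names what callers of etcd actually experience (latency),
    # which may or may not be caused by the rule above.
    - alert: EtcdRequestLatencyHigh
      expr: histogram_quantile(0.99, sum(rate(etcd_request_duration_seconds_bucket[5m])) by (le)) > 0.2
      for: 15m
      labels:
        severity: critical
      annotations:
        summary: 99th percentile etcd request latency is above 200ms.
```

The pairing matters because the symptom rule is the one worth paging on, while the cause rule mostly helps explain why the symptom fired.
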
Member:

How high up the stack you go before a cause-based alert becomes a symptom-based alert depends on your particular SLAs and SLOs. The Prom docs link Rob Ewaschuk's excellent philosophy doc, talking about alerting from the spout. If you're managing etcd, etcd request latency is a reasonable spout. If you're managing the cluster, where folks can only get at etcd through the Kube API servers, etcd latency becomes a cause-based alert explaining one possible component of your spout Kube API latency alert.

Besides cause-based and symptom/spout alerting, I think it's also useful to have diagnostics that look at firing alerts and other in-cluster status and say "ah, the cluster of issues you're seeing looks like $ROOT_CAUSE". The insights folks have some tools in this space, but currently they are internal and don't run in-cluster.
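
As a sketch of that perspective shift (again with illustrative metric names and thresholds, not content from the PR): for a cluster admin the spout is Kube API latency, and the etcd latency rule sketched above would be demoted to a non-paging severity that only explains one possible component of it.

```yaml
# Illustrative only: the cluster-admin "spout" is Kube API request latency.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-admin-spout-example
  namespace: openshift-monitoring
spec:
  groups:
  - name: spout.rules
    rules:
    - alert: KubeAPIRequestLatencyHigh
      # The symptom cluster users actually feel; etcd latency sits underneath it
      # as one possible cause and would carry a non-paging severity here.
      expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])) by (le)) > 1
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: 99th percentile Kube API request latency is above 1s.
```
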

Contributor Author (dofinn):

> How high up the stack you go before a cause-based alert becomes a symptom-based

Start from as high up the stack as you can go, then. We should be alerting from the perspective of outside in, not inside out.

Contributor:

Side note: does this separation of alerts matter in the overall proposal? I think it's fair to say that alerting should be a feature of OpenShift, yes, but the content of the alerts could really be a choice for cluster admins, plus it depends on what is being run and with what SLA.

The reason is that cause/symptom alert statements are controversial, as noticed. The write-up about this from Google SRE that you mentioned, https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit#heading=h.6ammb5h32uqq, is excellent material.

> The rule is simple: don't write cause-based paging rules for symptoms you can catch otherwise.

We also need to identify the goal of alerting here. It is not for drilling down (!). It's for notifying you that users are or will be impacted (!)

> That said, if your debugging dashboards let you move quickly enough from symptom to cause to amelioration, you don't need to spend time on cause-based rules anyway.
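
A minimal sketch of what "notify when users are or will be impacted" can look like in practice, assuming a 99.9% availability SLO on the API server and the multi-window burn-rate pattern from the Google SRE material; metric names, windows, and burn-rate factors are illustrative:

```yaml
# Illustrative only: page when the error budget of a 99.9% availability SLO
# is being consumed ~14x faster than sustainable, on both a long and a short
# window so the alert is neither slow to fire nor slow to resolve.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-burn-rate-example
  namespace: openshift-monitoring
spec:
  groups:
  - name: slo.rules
    rules:
    - alert: APIServerErrorBudgetBurn
      expr: |
        (
          sum(rate(apiserver_request_total{code=~"5.."}[1h]))
            / sum(rate(apiserver_request_total[1h])) > 14.4 * 0.001
        )
        and
        (
          sum(rate(apiserver_request_total{code=~"5.."}[5m]))
            / sum(rate(apiserver_request_total[5m])) > 14.4 * 0.001
        )
      labels:
        severity: critical
      annotations:
        summary: The API server is burning its 30-day availability error budget too fast.
```

The long window ties the page to real, sustained user impact; the short window lets the alert fire and clear quickly.
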

Contributor Author (dofinn):

+1 to the above thoughts. That's why I proposed alerting profiles to enable admins/SREs to either:
a) configure alerting from scratch
b) configure alerting with cause-based alerts (as a starting point)
c) configure alerting with symptom-based alerts (as a starting point)
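
Purely as a strawman to make options a/b/c concrete: one hypothetical shape such a profile could take. `AlertingProfile` is not an existing OpenShift API, and every field name below is an assumption.

```yaml
# Hypothetical resource, not an existing API: selects which built-in alerting
# methodology a cluster starts from, and lets admins/SREs layer their own rules.
apiVersion: alerting.example.openshift.io/v1alpha1
kind: AlertingProfile
metadata:
  name: cluster
spec:
  # One of: None (option a, start from scratch),
  #         CauseBased (option b), SymptomBased (option c).
  baseline: SymptomBased
  # Additional PrometheusRule objects layered on top of the chosen baseline.
  additionalRuleSelectors:
  - matchLabels:
      alerting.example.openshift.io/owner: sre
```

The point is only that the choice of starting methodology becomes declarative and lifecycled, rather than something each team reinvents.
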

#### Provide two life cycles that deliver Alerting as a feature.

* Alerting standards are:
  * defined by engineering, monitoring team and SREs (SRE assist on critical severity only).
Member:

Alerts are aimed at cluster admins, which means SREs + customers. There should absolutely be SRE and customer feedback on the utility and actionability of all alerts, not just critical alerts. But the critical alerts are the loudest, so that's the most impactful place for folks interested in improving alert quality in general but who only have limited time to invest.

Another commenter:

+1
This can probably be reworded to incorporate SREs generally giving guidance on the alert severity.

Contributor Author (dofinn):

+1. I was trying to illustrate that critical severity involvement should be heavily encouraged, but never as a blocker.

* defined by monitoring team and SREs.
* Implemented by SRE
* Lifecycled by monitoring team and SRE.
* Developed, tested and validated by SRE and eventually committed back to the product.
wking (Member), Mar 9, 2021:

This distinct life-cycle is fine for any case where the engineering team is slower than SRE needs or the SRE needs are more specific than the general engineering case. But I don't think it's an alerting-standards vs. SLO-based thing. I think it's good to do the broadest consensus-building we can around any new alerts, so that, even though whoever is not driving the new alert is probably too busy to seriously weigh in right then, they can at least raise any early warnings, and have a heads up so they aren't blindsided later when they do have time. And in cases where engineering says "sounds great, but we can't commit to that SLO because $WEIRD_CORNER_CASE", then the SREs can move forward understanding that they are on the hook for keeping their clusters out of the weird-but-supported situation that engineering has to support.

Contributor Author (dofinn):

You have missed the point if you think it's a standards vs. SLO-based thing. They work in parallel. The component experts help drive the standards for the specific causes of their components, while SRE helps drive the highest abstractions of symptom-based alerts... potentially back into the product.

There are two current efforts in the space at the moment:

* [alerting standards](https://github.com/openshift/enhancements/pull/637) -> this will have component teams own their own components' alerts, as they are the subject matter experts (cause-based alerts).
* SLO-driven [symptom based alerting](https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit#heading=h.1upja8jlnnwp) is being explored by OSD/SRE -> this will enable meaningful alerting from a customer perspective in an SLA environment.

Another commenter:

We need to find a good place for this to live. The OEPs are public and for all of OpenShift, so the docs referred to should probably be publicly available as well.


bwplotka (Contributor) left a comment:

Thanks for this proposal, it is a nice starting point, but I am missing technical details on what example input and output we expect: what APIs you want to have, what data the alerts should be sourced from, how we deploy them, who can change them, etc.

Do we have a plan for this? 🤗

@@ -0,0 +1,302 @@
---
title: Alerting as a feature
Contributor:

Can you describe your title more clearly (and update the PR title): a feature of what, particularly? 🤗

Contributor Author (dofinn):

It's built in, it's not an afterthought. OCP no longer comes with just alerts; it comes with alerts and an out-of-the-box methodology that can be deployed as easily as a pod for an end user.


## Summary

Alerting as a feature is a holistic composition of [alerting standards](https://github.com/openshift/enhancements/pull/637) coupled with built-in alerting methodologies (none, symptom-based and cause-based), with an option to deliver SLOs for a single cluster and/or a fleet of clusters.
Contributor:

As a feature where? Who would use it? What would be the source of data? Is it multi-cluster or per single cluster? 🤗

Contributor Author (dofinn):

* As a feature of the cluster and, by extension, a fleet.
* SREs/admins.
* What data are you referring to exactly?
* Built for single-cluster, multi-cluster and fleet scale.




There are two current efforts in the space at the moment:

* [alerting standards](https://github.com/openshift/enhancements/pull/637) -> this will have component teams own their own components' alerts, as they are the subject matter experts (cause-based alerts).
Contributor:

I would be careful here; it sounds like a path to noisy alerts and alert burden. Instead of alerting, we might want to invest in tools and practices to drill into such components. As a cluster admin responsible for a bigger service (e.g. the cluster), you should be able to alert on symptoms, and one instance of etcd being slow is not a symptom; it might not even impact the cluster at all! It is a potentially misleading signal that you should have a dashboard for, not screaming alerts 🤔

Contributor Author (dofinn):

The alerting standards define how component teams will create alerts for their components. They are the experts in their domains. These alerts should aid in troubleshooting the cause behind a symptom, which an admin/SRE can overlay on the cluster: the alerting standards are the cause-based alerting layer that will live under the symptom layer. IMO this should be lifecycled and developed to reduce MTTR.
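
To make the layering concrete, a minimal Alertmanager routing sketch, assuming rules carry a hypothetical `alert_layer` label that marks whether they belong to the symptom layer or the cause-based layer underneath it; the label name and receiver names are illustrative, not part of the proposal:

```yaml
# Illustrative only: symptom-layer alerts page, cause-based alerts stay
# available as troubleshooting context without waking anyone up.
route:
  receiver: default
  routes:
  - match:
      alert_layer: symptom
      severity: critical
    receiver: sre-pager
  - match:
      alert_layer: cause
    receiver: troubleshooting-dashboard
receivers:
- name: default
- name: sre-pager
- name: troubleshooting-dashboard
```

Cause-based alerts then shorten MTTR when an SRE is already investigating a symptom, without adding paging noise of their own.
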

dofinn (Contributor Author) commented Mar 17, 2021

> Thanks for this proposal, it is a nice starting point, but I am missing technical details on what example input and output we expect: what APIs you want to have, what data the alerts should be sourced from, how we deploy them, who can change them, etc.
>
> Do we have a plan for this?

Thanks for feedback.

I think I would like to address these finer issues after buy-in on the proposed lifecycling and outcomes.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Jun 15, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci bot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label on Jul 17, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close


openshift-ci bot commented Aug 16, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot closed this on Aug 16, 2021
Labels: lifecycle/rotten

6 participants