alerting as a feature #682
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing the approval command in a comment.
@dofinn: The following test failed.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
> Cause-Based Alert: Describes the direct cause of an issue. Example: etcdMemoryUtilization @ 100%
>
> Symptom-Based Alert: Describes a symptom whose source is a cause-based alert. Example: etcd connection latency > 200ms. This may be caused by etcdMemoryUtilization (or not?)
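To make the quoted distinction concrete, here is a minimal sketch of the two flavours written as Prometheus alerting rules. The alert names, metric names and thresholds are illustrative only (lifted from the examples above), not rules shipped with the product:

```yaml
groups:
  - name: etcd-examples
    rules:
      # Cause-based: fires on the direct cause (memory exhaustion on an etcd member).
      - alert: EtcdMemoryUtilizationHigh             # hypothetical alert name
        expr: etcd_memory_utilization_ratio >= 1.0   # illustrative metric; "100%" from the example above
        for: 5m
        labels:
          severity: warning        # cause-based alerts are usually tickets/dashboards, not pages
        annotations:
          summary: "etcd member {{ $labels.instance }} is at 100% of its memory budget"

      # Symptom-based: fires on what clients actually experience (latency), whatever the cause.
      - alert: EtcdRequestLatencyHigh                # hypothetical alert name
        expr: |
          histogram_quantile(0.99,
            sum(rate(etcd_request_duration_seconds_bucket[5m])) by (le)) > 0.2
        for: 10m
        labels:
          severity: critical       # symptom-based alerts are the ones worth paging on
        annotations:
          summary: "99th percentile etcd request latency is above 200ms"
```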
How high up the stack you go before a cause-based alert becomes a symptom-based alert depends on your particular SLAs and SLOs. The Prom docs link Rob Ewaschuk's excellent philosophy doc, talking about alerting from the spout. If you're managing etcd, etcd request latency is a reasonable spout. If you're managing the cluster, where folks can only get at etcd through the Kube API servers, etcd latency becomes a cause-based alert explaining one possible component of your spout Kube API latency alert.

Besides cause-based and symptom/spout alerting, I think it's also useful to have diagnostics that look at firing alerts and other in-cluster status and say "ah, the cluster of issues you're seeing looks like $ROOT_CAUSE". The Insights folks have some tools in this space, but currently they are internal and don't run in-cluster.
> How high up the stack you go before a cause-based alert becomes a symptom-based alert depends on your particular SLAs and SLOs.

Start from as high up the stack as you can go, then. We should be alerting from the outside in, not the inside out.
Side note: does this separation of alerts matter in the overall proposal? I think it's fair to say that alerting should be a feature of OpenShift, yes, but the content of the alerts could really be a choice for cluster admins, and it also depends on what is being run and under what SLA.
The reason is that cause/symptom alert statements are controversial, as noted. As you mentioned, the write-up about this from Google SRE here is excellent material: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit#heading=h.6ammb5h32uqq

> The rule is simple: don't write cause-based paging rules for symptoms you can catch otherwise.

We also need to identify the goal of alerting here. It is not for drilling down (!). It's for notifying you that users are or will be impacted (!)

> That said, if your debugging dashboards let you move quickly enough from symptom to cause to amelioration, you don't need to spend time on cause-based rules anyway.
+1 to the above thoughts. That's why I proposed alerting profiles (sketched below) to enable admins/SREs to either:

a) configure alerting from scratch
b) configure alerting with cause-based alerts (as a starting point)
c) configure alerting with symptom-based alerts (as a starting point)
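To sketch what selecting such a profile could look like, below is a purely hypothetical configuration; the API group, the AlertingConfiguration kind and the alertingProfile field do not exist today and are only meant to illustrate options (a)–(c):

```yaml
# Hypothetical resource, not an existing OpenShift API.
apiVersion: monitoring.example.openshift.io/v1alpha1   # made-up group/version
kind: AlertingConfiguration
metadata:
  name: cluster
spec:
  # One of: None (a - from scratch), CauseBased (b), SymptomBased (c)
  alertingProfile: SymptomBased
  # Extra rules the admin/SRE layers on top of the chosen profile.
  additionalRuleSelector:
    matchLabels:
      openshift.io/user-alerts: "true"
```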
> #### Provide two life cycles that deliver Alerting as a feature.
>
> * Alerting standards are:
>   * defined by engineering, monitoring team and SREs (SRE assist on critical severity only).
Alerts are aimed at cluster admins, which means SREs + customers. There should absolutely be SRE and customer feedback on the utility and actionability of all alerts, not just `critical` alerts. But the `critical` alerts are the loudest, so that's the most impactful place for folks interested in improving alert quality in general but who only have limited time to invest.
+1

This can probably be reworded to incorporate SREs generally giving guidance on the alert severity.
+1. I was trying to illustrate that critical severity involvement should be heavily encouraged, but never as a blocker.
> * defined by monitoring team and SREs.
> * Implemented by SRE.
> * Lifecycled by monitoring team and SRE.
> * Developed, tested and validated by SRE and eventually committed back to the product.
This distinct life-cycle is fine for any case where the engineering team is slower than SRE needs or the SRE needs are more specific than the general engineering case. But I don't think it's an alerting-standards vs. SLO-based thing. I think it's good to do the broadest consensus-building we can around any new alerts, so that, even though whoever is not driving the new alert is probably too busy to seriously weigh in right then, they can at least raise any early warnings, and have a heads up so they aren't blindsided later when they do have time. And in cases where engineering says "sounds great, but we can't commit to that SLO because $WEIRD_CORNER_CASE", then the SREs can move forward understanding that they are on the hook for keeping their clusters out of the weird-but-supported situation that engineering has to support.
You have missed the point if you think it's a standards vs. SLO-based thing. They work in parallel. The component experts help drive the standards for the specific causes of their components, while SRE helps drive the highest abstractions of symptom-based alerts... potentially back into the product.
> There are two current efforts in the space at the moment:
>
> * [alerting standards](https://github.com/openshift/enhancements/pull/637) -> this will have component teams own their own components' alerts, as they are the subject matter experts (cause-based alerts).
> * SLO-driven [symptom based alerting](https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit#heading=h.1upja8jlnnwp) is being explored by OSD/SRE -> this will enable meaningful alerting from a customer perspective in an SLA environment.
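For the second bullet, SLO-driven symptom alerts are commonly expressed as multi-window error-budget burn-rate rules (the pattern described in the Google SRE material). A minimal sketch, assuming a 99.9% availability SLO and hypothetical recording rules named `slo:api_request_error_ratio:rate<window>`:

```yaml
groups:
  - name: slo-burn-rate-example
    rules:
      # Fast burn: roughly 2% of a 30-day error budget consumed within an hour.
      - alert: APIErrorBudgetBurnFast       # hypothetical alert name
        expr: |
          (slo:api_request_error_ratio:rate1h > 14.4 * 0.001)
          and
          (slo:api_request_error_ratio:rate5m > 14.4 * 0.001)
        labels:
          severity: critical

      # Slow burn: sustained burn that would exhaust the budget over a few days.
      - alert: APIErrorBudgetBurnSlow       # hypothetical alert name
        expr: |
          (slo:api_request_error_ratio:rate6h > 6 * 0.001)
          and
          (slo:api_request_error_ratio:rate30m > 6 * 0.001)
        labels:
          severity: warning
```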
We need to find a good place for this to live. The OEPs are public and for all of OpenShift, so docs referred to should probably be publicly available as well.
Thanks for this proposal, it is a nice starting point, but I am missing technical details on what example input and output we expect: what APIs you want to have, what data alerts should be sourced from, how we deploy them, who can change them, etc.

Do we have a plan for this? 🤗
> @@ -0,0 +1,302 @@
> ---
> title: Alerting as a feature
Can you describe your title more clearly (and update the PR title): a feature of what, particularly? 🤗
It's built in, it's not an afterthought. OCP no longer comes with just alerts; it comes with alerts and an out-of-the-box methodology that can be deployed as easily as a pod for an end user.
> ## Summary
>
> Alerting as a feature is a holistic composition of [alerting standards](https://github.com/openshift/enhancements/pull/637) coupled with built-in alerting methodologies (none, symptom-based and cause-based), with an option to deliver SLOs for a single cluster and/or a fleet of clusters.
As a feature where? Who would use it? What would be the source of data? Is it multi-cluster or per single cluster? 🤗
- As a feature of the cluster and, extensibly, of a fleet.
- SREs/admins.
- What data are you referring to exactly?
- Built for single, multi and fleet scale.
> There are two current efforts in the space at the moment:
>
> * [alerting standards](https://github.com/openshift/enhancements/pull/637) -> this will have component teams own their own components' alerts, as they are the subject matter experts (cause-based alerts).
I would be careful here, it sounds like a path to noisy alerts and alert burden. Instead of alerting, we might want to invest in tools and practices to drill into such components. As a cluster admin responsible for a bigger service (e.g. the cluster), you should be able to alert on symptoms, and one instance of etcd being slow is not a symptom; it might not even impact the cluster at all! It is a potentially misleading piece that you should have a dashboard for, not screaming alerts 🤔
The alerting standards define how component teams will create alerts for their components. They are the experts in their domains. These alerts should aid in troubleshooting the causes behind a symptom, and an admin/SRE can overlay them on the cluster; the alerting standards become the cause-based alerting layer that lives under the symptom layer. IMO this should be lifecycled and developed to reduce MTTR.
Thanks for the feedback. I think I would like to address these finer issues after buy-in on the proposed lifecycling and outcomes.
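One way the "cause-based layer under the symptom layer" idea could show up operationally is in Alertmanager routing, where symptom alerts page and cause-based alerts only land on a dashboard or ticket queue. A minimal sketch with placeholder receiver names and endpoints:

```yaml
route:
  receiver: dashboard-only          # default: cause-based / informational alerts do not page
  group_by: ['alertname', 'namespace']
  routes:
    # Symptom-based alerts carry severity=critical and go to the pager.
    - matchers:
        - severity="critical"
      receiver: pager

receivers:
  - name: pager
    pagerduty_configs:
      - service_key: <redacted>               # placeholder
  - name: dashboard-only
    webhook_configs:
      - url: https://example.com/alert-sink   # placeholder endpoint
```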
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close
@openshift-bot: Closed this PR in response to the above.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.