
alerting as a feature #682

Closed · wants to merge 3 commits

Conversation

dofinn (Contributor) commented Mar 9, 2021

No description provided.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign dhellmann after the PR has been reviewed.
You can assign the PR to them by writing /assign @dhellmann in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci bot commented Mar 9, 2021

@dofinn: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/markdownlint | 59cf9e9 | link | /test markdownlint |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.


Cause-Based Alert: describes the direct cause of an issue. Example: etcdMemoryUtilization @ 100%.

Symptom-Based Alert: describes a symptom whose source is a cause-based alert. Example: etcd connection latency > 200ms. This may be caused by etcdMemoryUtilization (or not?)
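
For concreteness, here is a minimal PrometheusRule sketch of one alert of each type, assuming the Prometheus Operator API that OpenShift monitoring already uses; the metric names, thresholds, and severities below are illustrative and are not taken from the proposal:

```yaml
# Illustrative only: one cause-based and one symptom-based rule.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cause-vs-symptom-example
  namespace: openshift-monitoring
spec:
  groups:
  - name: example.rules
    rules:
    # Cause-based: names the direct cause (etcd memory exhaustion).
    - alert: EtcdMemoryUtilizationHigh
      expr: process_resident_memory_bytes{job="etcd"} > 8e9  # illustrative absolute bound
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: etcd resident memory on {{ $labels.instance }} is above the expected bound.
    # Symptom-based: names what callers of etcd actually experience (latency),
    # which may or may not be caused by the rule above.
    - alert: EtcdRequestLatencyHigh
      expr: histogram_quantile(0.99, sum(rate(etcd_request_duration_seconds_bucket[5m])) by (le)) > 0.2
      for: 15m
      labels:
        severity: critical
      annotations:
        summary: 99th percentile etcd request latency is above 200ms.
```

The pairing matters because the symptom rule is the one worth paging on, while the cause rule mostly helps explain why the symptom fired.
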
Member:

How high up the stack you go before a cause-based alert becomes a symptom-based alert depends on your particular SLAs and SLOs. The Prom docs link Rob Ewaschuk's excellent philosophy doc, talking about alerting from the spout. If you're managing etcd, etcd request latency is a reasonable spout. If you're managing the cluster, where folks can only get at etcd through the Kube API servers, etcd latency becomes a cause-based alert explaining one possible component of your spout Kube API latency alert.

Besides cause-based and symptom/spout alerting, I think it's also useful to have diagnostics that look at firing alerts and other in-cluster status and say "ah, the cluster of issues you're seeing looks like $ROOT_CAUSE". The insights folks have some tools in this space, but currently they are internal and don't run in-cluster.
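
As a sketch of that perspective shift (again with illustrative metric names and thresholds, not content from the PR): for a cluster admin the spout is Kube API latency, and the etcd latency rule sketched above would be demoted to a non-paging severity that only explains one possible component of it.

```yaml
# Illustrative only: the cluster-admin "spout" is Kube API request latency.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-admin-spout-example
  namespace: openshift-monitoring
spec:
  groups:
  - name: spout.rules
    rules:
    - alert: KubeAPIRequestLatencyHigh
      # The symptom cluster users actually feel; etcd latency sits underneath it
      # as one possible cause and would carry a non-paging severity here.
      expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])) by (le)) > 1
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: 99th percentile Kube API request latency is above 1s.
```
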

Contributor Author (dofinn):

> How high up the stack you go before a cause-based alert becomes a symptom-based

Start from as high up the stack as you can go, then. We should be alerting from the perspective of outside in, not inside out.

Contributor:

Side note: does this separation of alerts matter in the overall proposal? I think it's fair to say that alerting should be a feature of OpenShift, yes, but the content of the alerts could really be a choice for cluster admins, plus it depends on what is being run and with what SLA.

The reason is that cause/symptom alert statements are controversial, as noticed. The write-up about this from Google SRE that you mentioned, https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit#heading=h.6ammb5h32uqq, is excellent material.

> The rule is simple: don't write cause-based paging rules for symptoms you can catch otherwise.

We also need to identify the goal of alerting here. It is not for drilling down (!). It's for notifying you that users are or will be impacted (!)

> That said, if your debugging dashboards let you move quickly enough from symptom to cause to amelioration, you don't need to spend time on cause-based rules anyway.
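
A minimal sketch of what "notify when users are or will be impacted" can look like in practice, assuming a 99.9% availability SLO on the API server and the multi-window burn-rate pattern from the Google SRE material; metric names, windows, and burn-rate factors are illustrative:

```yaml
# Illustrative only: page when the error budget of a 99.9% availability SLO
# is being consumed ~14x faster than sustainable, on both a long and a short
# window so the alert is neither slow to fire nor slow to resolve.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-burn-rate-example
  namespace: openshift-monitoring
spec:
  groups:
  - name: slo.rules
    rules:
    - alert: APIServerErrorBudgetBurn
      expr: |
        (
          sum(rate(apiserver_request_total{code=~"5.."}[1h]))
            / sum(rate(apiserver_request_total[1h])) > 14.4 * 0.001
        )
        and
        (
          sum(rate(apiserver_request_total{code=~"5.."}[5m]))
            / sum(rate(apiserver_request_total[5m])) > 14.4 * 0.001
        )
      labels:
        severity: critical
      annotations:
        summary: The API server is burning its 30-day availability error budget too fast.
```

The long window ties the page to real, sustained user impact; the short window lets the alert fire and clear quickly.
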

Contributor Author (dofinn):

+1 to the above thoughts. That's why I proposed alerting profiles to enable admins/SREs to either:
a) configure alerting from scratch
b) configure alerting with cause-based alerts (as a starting point)
c) configure alerting with symptom-based alerts (as a starting point)
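
Purely as a strawman to make options a/b/c concrete: one hypothetical shape such a profile could take. `AlertingProfile` is not an existing OpenShift API, and every field name below is an assumption.

```yaml
# Hypothetical resource, not an existing API: selects which built-in alerting
# methodology a cluster starts from, and lets admins/SREs layer their own rules.
apiVersion: alerting.example.openshift.io/v1alpha1
kind: AlertingProfile
metadata:
  name: cluster
spec:
  # One of: None (option a, start from scratch),
  #         CauseBased (option b), SymptomBased (option c).
  baseline: SymptomBased
  # Additional PrometheusRule objects layered on top of the chosen baseline.
  additionalRuleSelectors:
  - matchLabels:
      alerting.example.openshift.io/owner: sre
```

The point is only that the choice of starting methodology becomes declarative and lifecycled, rather than something each team reinvents.
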

#### Provide two life cycles that deliver Alerting as a feature.

* Alerting standards are:
  * defined by engineering, monitoring team and SREs (SRE assist on critical severity only).
Member:

Alerts are aimed at cluster admins, which means SREs + customers. There should absolutely be SRE and customer feedback on the utility and actionability of all alerts, not just critical alerts. But the critical alerts are the loudest, so that's the most impactful place for folks interested in improving alert quality in general but who only have limited time to invest.

Another commenter:

+1
This can probably be reworded to incorporate SREs generally giving guidance on the alert severity.

Contributor Author (dofinn):

+1. I was trying to illustrate that critical severity involvement should be heavily encouraged, but never as a blocker.

* defined by monitoring team and SREs.
* Implemented by SRE
* Lifecycled by monitoring team and SRE.
* Developed, tested and validated by SRE and eventually committed back to the product.
wking (Member), Mar 9, 2021:

This distinct life-cycle is fine for any case where the engineering team is slower than SRE needs or the SRE needs are more specific than the general engineering case. But I don't think it's an alerting-standards vs. SLO-based thing. I think it's good to do the broadest consensus-building we can around any new alerts, so that, even though whoever is not driving the new alert is probably too busy to seriously weigh in right then, they can at least raise any early warnings, and have a heads up so they aren't blindsided later when they do have time. And in cases where engineering says "sounds great, but we can't commit to that SLO because $WEIRD_CORNER_CASE", then the SREs can move forward understanding that they are on the hook for keeping their clusters out of the weird-but-supported situation that engineering has to support.

Contributor Author (dofinn):

You have missed the point if you think it's a standards vs. SLO-based thing. They work in parallel. The component experts help drive the standards for the specific causes of their components, while SRE helps drive the highest abstractions of symptom-based alerts... potentially back into the product.

There are two current efforts in the space at the moment:

* [alerting standards](https://github.com/openshift/enhancements/pull/637) -> this will have component teams own their own components' alerts, as they are the subject matter experts (cause-based alerts).
* SLO-driven [symptom based alerting](https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit#heading=h.1upja8jlnnwp) is being explored by OSD/SRE -> this will enable meaningful alerting from a customer perspective in an SLA environment.

Another commenter:

We need to find a good place for this to live. The OEPs are public and for all of OpenShift, so the docs referred to should probably be publicly available as well.


bwplotka (Contributor) left a comment:

Thanks for this proposal, it is a nice starting point, but I am missing technical details on what example input and output we expect: what APIs you want to have, what data the alerts should be sourced from, how we deploy them, who can change them, etc.

Do we have a plan for this? 🤗

@@ -0,0 +1,302 @@
---
title: Alerting as a feature
Contributor:

Can you describe your title more clearly (and update the PR title): a feature of what, particularly? 🤗

Contributor Author (dofinn):

It's built in, it's not an afterthought. OCP no longer comes with just alerts; it comes with alerts and an out-of-the-box methodology that can be deployed as easily as a pod for an end user.


## Summary

Alerting as a feature is a holistic composition of [alerting standards](https://github.com/openshift/enhancements/pull/637) coupled with built-in alerting methodologies (none, symptom-based and cause-based), with an option to deliver SLOs for a single cluster and/or a fleet of clusters.
Contributor:

As a feature where? Who would use it? What would be the source of data? Is it multi-cluster or per single cluster? 🤗

Contributor Author (dofinn):

* As a feature of the cluster and, by extension, a fleet.
* SREs/admins.
* What data are you referring to exactly?
* Built for single-cluster, multi-cluster and fleet scale.




There are two current efforts in the space at the moment:

* [alerting standards](https://github.com/openshift/enhancements/pull/637) -> this will have component teams own their own components' alerts, as they are the subject matter experts (cause-based alerts).
Contributor:

I would be careful here; it sounds like a path to noisy alerts and alert burden. Instead of alerting, we might want to invest in tools and practices to drill into such components. As a cluster admin responsible for a bigger service (e.g. the cluster), you should be able to alert on symptoms, and one instance of etcd being slow is not a symptom; it might not even impact the cluster at all! It is a potentially misleading signal that you should have a dashboard for, not screaming alerts 🤔

Contributor Author (dofinn):

The alerting standards define how component teams will create alerts for their components. They are the experts in their domains. These alerts should aid in troubleshooting the cause behind a symptom, which an admin/SRE can overlay on the cluster: the alerting standards are the cause-based alerting layer that will live under the symptom layer. IMO this should be lifecycled and developed to reduce MTTR.
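
To make the layering concrete, a minimal Alertmanager routing sketch, assuming rules carry a hypothetical `alert_layer` label that marks whether they belong to the symptom layer or the cause-based layer underneath it; the label name and receiver names are illustrative, not part of the proposal:

```yaml
# Illustrative only: symptom-layer alerts page, cause-based alerts stay
# available as troubleshooting context without waking anyone up.
route:
  receiver: default
  routes:
  - match:
      alert_layer: symptom
      severity: critical
    receiver: sre-pager
  - match:
      alert_layer: cause
    receiver: troubleshooting-dashboard
receivers:
- name: default
- name: sre-pager
- name: troubleshooting-dashboard
```

Cause-based alerts then shorten MTTR when an SRE is already investigating a symptom, without adding paging noise of their own.
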

dofinn (Contributor Author) commented Mar 17, 2021

> Thanks for this proposal, it is a nice starting point, but I am missing technical details on what example input and output we expect: what APIs you want to have, what data the alerts should be sourced from, how we deploy them, who can change them, etc.
>
> Do we have a plan for this?

Thanks for feedback.

I think I would like to address these finer issues after buy-in on the proposed lifecycling and outcomes.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Jun 15, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci bot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label on Jul 17, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close


openshift-ci bot commented Aug 16, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot closed this on Aug 16, 2021
Labels: lifecycle/rotten

6 participants