Add: Alerting Standards #637
Conversation
Hey, thanks for the proposal, is there a backstory on where this is coming from? Thanks!
Would love to see some more around the test plan and some actual examples of how you imagine runbooks would look.
cc @openshift/openshift-team-monitoring
> ---
> title: alerting-consistency
Would be great if we could link to some upstream Prometheus docs and blog posts on this topic, so we adhere to best practices and align with them.
Do you have any good links or info you'd like to incorporate here?
Yes, there is plenty; @s-urbaniak wrote a blog post on this. A couple of things are:
- https://prometheus.io/docs/practices/alerting/
- https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
- google SRE book chapter on alerting
- https://www.usenix.org/sites/default/files/conference/protected-files/srecon16europe_slides_rabenstein.pdf also anything by Björn really
To name a few; not sure if @s-urbaniak's blog post on this topic is out yet?
I wouldn't refer to those resources as 'upstream' but rather just some content from the web. I think they are probably helpful for establishing a rationale, but I'm not sure that including them is better or worse.
The blog post is being prepared to be published, but I think I will have to nudge again.
Yeah, include a bulleted list of references in the summary section. The Monitoring section in the Google SRE book includes the majority of the guidance needed to support readers of this document. https://sre.google/sre-book/monitoring-distributed-systems/
> Frankly, having just "Critical" and "Warning" is not helpful.
>
> We really need "Critical", "Major", "Minor" and "Info." Can/Should we do this?
We already have info-level alerts shipped with the cluster, but they are often non-actionable ones that are there for historical reasons, so agreed that info should be the last resort.
> ### Open Questions [optional]
>
> Frankly, having just "Critical" and "Warning" is not helpful.
Can you give some reasons here why not? :)
Inevitably, there are going to be some 'warning' level alerts that people won't care about. We need a lower level that indicates that there is a problem, but the impact is low. This would improve the signal to noise ratio of warning level alerts.
Warning level should already create a ticket, so this is low-level enough; it should not page anyone, so in an ideal scenario there is really no noise. If folks don't care about those warning alerts, they can silence them after getting the ticket.
The problem is, we want all warning level alerts to be cared about and fixed in most situations. This is about signal to noise. If people are trained to ignore warning level alerts by default, they will miss important alerts. This is why we need a 'minor' alert.
Alternatively, we can just not alert for 'minor' alerts.
IMO with minor severity alerts, we're not avoiding the issue of folks not paying attention to alerts because they're noisy... If a warning alert is too noisy, it is a bug: the alert should either be removed or modified to be more accurate.
The Prometheus community has discussed this for a long time: the (relative) consensus is that there should be a level for "everything's on fire" alerts and "please have a look at this" alerts. Informative alerts can be useful sometimes but should probably be the exception.
I have thought about this overnight, and for me it comes back to triage. The group of critical alerts should be small, very well defined, highly documented and polished.
The group of warning alerts are the next priority. If you find yourself managing many clusters with many alerts, you need a sensible way to triage these. Clearly an API replica being down is more important than a daemonset not running on an individual worker. They're both problems, but one should probably be investigated before the other.
I think, another way to state it is, think of a single alert in the context of the cluster, rather than the context of a single operator. We all know there are 'minor' alerts, we just don't label them as minor.
"Everything else can wait" yes, that's true. But if you're managing dozens of clusters, you need a way to triage alerts, and you probably want to send a lot of stuff to /dev/null. Our operation teams probably already have rules about which rules to care about and which rules to ignore. Some of this particularly revolves around upgrades and asking "Should I fix this prior to upgrade?" I want to remove that question.
Even if we introduce a "minor" level, there would be people complaining that alert X should be "warning" while it's "minor" (and vice-versa). At some point you will leverage additional alert labels to group/route the alerts to the correct destination (this is what OSD is doing, from what I understand).
And triaging doesn't necessarily happen at the cluster level: if you run significant infrastructure, you probably have a PagerDuty-like service for dealing with on-call activities.
I'm not talking about triaging on-call events (eg, critical). Those are the ones that get paged and hopefully resolved. I'm talking about all the stuff that fits into 'warning.'
I agree, some people will want alerts to change from one severity to another. That's the entire point of this document: to establish clear guidelines. Definitely the generic alerts like the Kube* ones I mentioned elsewhere: if we don't disable those entirely (we probably should), every one of them should be 'minor.' It's a problem, but it's likely caused by a warning or higher level alert (or should be). If something is down, and the service is not impacted, and losing more of them doesn't jeopardize the cluster's health, then it's probably okay to be minor.
I would like this sentence added:
"The group of critical alerts should be small, very well defined, highly documented, polished and with a high bar set for entry. This includes a mandatory review of a proposed critical alert by the Red Hat SRE team."
Having this would be an important checkpoint for SRE, and one we already have a workflow in place to support.
> ## Drawbacks
>
> People might have rules around the existing broken alerts. They will have to
I believe it would be best to help folks with this; we have tried in the monitoring team, but it's often not as easy as just saying they will have to change some alerts. So agreed, this is a drawback, but it's also the actual correct result of this.
Indeed, and my experience so far is that teams are happy to have a 2nd set of eyes. So I hope this ends up being a non-issue.
> ## Alternatives
>
> We'll just mark everything critical so users can guess what is an urgent
Hmm, not sure why this is an alternative; at least in the monitoring team we do not ship all alerts as critical.
This is just some placeholder text; I don't actually think this is a viable alternative.
One alternative would be to document policies, but not make any backwards-incompatible changes to the existing alerts and only apply the policies to new alerts. I don't think we should do that, but it does address the drawback mentioned above.
Good suggestion!
> We really need "Critical", "Major", "Minor" and "Info." Can/Should we do this?
>
> ### Test Plan
I would actually love to see a test plan here; adding tests to OpenShift's origin e2e suite that check some best practices for alerts would be good. Otherwise we are in the same spot in 6 months :) WDYT?
I'm not opposed to a test plan, but I don't have one ATM. I'm open to suggestion here.
Take out a few compute machines. Wait an hour. Confirm that no critical alerts are firing, or were ever firing.
A few ideas:
- Ensure that all alerts have a `severity` label and comply with the list of allowed values.
- Ensure that no critical alert gets added without a sign-off from teams that are on-call (e.g. OSD).
- Ensure that all alerts have mandatory annotations (e.g. `description` and `summary`).
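A rough sketch of how the first check could be automated against a release payload, reusing the `oc adm release extract` + `yaml2json` + `jq` pipeline quoted later in this thread; the allowed severity set here is an assumption:

```
# Assumption: allowed severities are critical, warning and info.
# Extract the release manifests, then list any alerting rule whose
# severity label is missing or outside the allowed set.
oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.7.0-rc.0-x86_64
for x in $(grep -lr '^kind: PrometheusRule' manifests); do
  yaml2json < "${x}" | jq -r '
    if (. | type) == "object" then [.] else . end
    | .[] | select(.kind == "PrometheusRule").spec.groups[].rules[]
    | select(.alert != null)
    | select(.labels.severity != "critical" and .labels.severity != "warning" and .labels.severity != "info")
    | (.labels.severity // "<missing>") + " " + .alert'
done | sort | uniq
```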
Do you actually need to test the alert condition duration or can this be sped up, eg, do I actually need to take out some machines to see if the alert would fire? If I do need to take them out, do I have to wait for the hour period?
In terms of testing for alerts, I was wondering if we could have some more static-analysis-style rules, but perhaps that's a bit naive at this level.
I would like folks' opinions on whether we need to resolve a test plan approach for alerts in the context of merging this enhancement.
Depending on what we agree about rule conventions and guidelines, we should translate as much as we can into origin e2e tests. E.g. there should be a test validating that all alert rules have a `severity` label matching the set of allowed values. Same goes for the `summary` and `description` annotations if we agree that they are mandatory.
Asking teams to write unit tests for all rules using `promtool` is going to have a small benefit IMHO. Unit tests are mostly useful when joining metrics from different sources, and I expect that most of the rules don't fall into this category.
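For the subset of rules where unit tests do make sense, `promtool` already supports them; a minimal sketch, assuming the `spec.groups` content of a PrometheusRule has been exported to a plain Prometheus rule file (the file names are hypothetical):

```
# Validate rule file syntax (the rule file must be plain Prometheus
# rule format, i.e. the PrometheusRule spec.groups content).
promtool check rules my-operator.rules.yaml

# Run unit tests that feed synthetic series and assert which alerts fire.
promtool test rules my-operator.rules.test.yaml
```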
I would like a test plan, because policies that are not backed up by automated tests are hard to preserve. Doesn't mean we need to have the tests implemented before we can start pushing towards new policy, but we should at least have an idea of what the tests will look like.
New e2e tests in openshift/origin:
- Test validating that all alerting rules have a severity label matching the set of allowed values.
- Test validating that all alerting rules have summary and description annotations.
- Test validating that all runbook_url annotations point to an existing web page.
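A rough sketch of the `runbook_url` check, again piggybacking on the manifest-extraction pipeline quoted elsewhere in this thread; the `curl` probe is only illustrative:

```
# List every runbook_url annotation from the extracted manifests and
# verify that it resolves to an existing page.
for x in $(grep -lr '^kind: PrometheusRule' manifests); do
  yaml2json < "${x}" | jq -r '
    if (. | type) == "object" then [.] else . end
    | .[] | select(.kind == "PrometheusRule").spec.groups[].rules[]
    | select(.alert != null and .annotations.runbook_url != null)
    | .annotations.runbook_url'
done | sort | uniq | while read -r url; do
  curl -sSfL -o /dev/null "${url}" || echo "broken runbook link: ${url}"
done
```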
@simonpasquier link to these tests?
> ### Documentation Required
>
> Every alert needs to have a URL to relevant documentation to resolve the issue.
There is an RFE for this already, but it is not possible to do this right now due to disconnected envs, from the last time I spoke with the monitoring PM. This is essentially runbooks, no?
I don't know what disconnected has to do with having a URL? Even if the cluster is in an isolated environment, that doesn't mean the person looking at the alerts is. In the worst case, someone can just copy and paste the URL.
This was the issue at the time; it also was a problem due to a lot of alerts coming from upstream mixins, which means URLs would be different. There should be a ticket for runbooks on the monitoring board with more details on this; it's being worked on, or at least it was last I heard. Best to contact Christian, our PM, for more details here.
We shouldn't use upstream alert mixins as part of the core alerting. How many alerts are we even talking about? Can't be more than a few dozen, probably some of which we don't want or need. We need to trim down the superfluous stuff.
All alerts for prometheus, alertmanager, prometheus-operator, kubernetes, etcd, and the majority for the k8s api, k8s scheduler, and k8s controller manager. All in all, around 150 alerts and recording rules. The majority of alerts in OpenShift ARE coming from upstream, and this is aligned with the "upstream first" mentality.
We should use alerts from upstream for the same reason we use kubernetes as a base. We shouldn't invent everything ourselves.
IMO, this is reactive and not proactive. There's a reason most projects in upstream are alpha/beta statuses. We should selectively choose and groom the alerts we care about to deliver the best experience to our users. If we're just bulk-importing the alerting rules, it means we (probably) don't have resolution docs to go with them.
#298 has me poking around in this space for conditions, since we have more control over there. I like the link idea, and they don't have to be external. We can have alert messages link to local, console-hosted docs or UXes that help unpack the issue being reported in more context, like this. And for a paragraph-or-two level of discussion on likely impact and suggested recovery procedures, I don't see why we couldn't either grow out the alert message or attach a new field/annotation to alerts to carry that information, even for alerts defined in upstream mixins.
> IMO, this is reactive and not proactive.
All alerts I mentioned are created upstream in projects largely managed by OpenShift engineers. Most of them are tweaked because of OpenShift. I don't see how this is reactive.
Just because something exists upstream doesn't mean it's good and doesn't mean it belongs downstream. We should bring in the quality pieces and leave the rest.
Here's an example: "KubePodNotReady"
That is a warning level alert. I'm assuming that is coming from upstream. Is having this alert helping many people, or is it contributing to the noise part of signal-to-noise?
A better alert would be for the specific application (a user-created alert) or for a trend, e.g. if the number of non-ready pods is increasing over time.
In the SRE team's orthogonal effort to reduce alert noise (which has been ongoing for many months and predates this PR, yadda yadda), we will be forced to improve the upstream alerts. That is a circular workflow that we worked out with the monitoring team. The SRE team knows what to do when something like this comes up.
/hold
I see the intention here; however, some things in that regard are already in flight, and most of this is not really defined very well so far, hence the comments.
On a little side note: calling the alerts as a whole "a mess" doesn't really do justice to the amount of work that people have put into them, and is not really a formulation that should be used in an OEP, and rarely ever when talking about other people's work.
> ## Alternatives
>
> We'll just mark everything critical so users can guess what is an urgent
Not sure an OEP is the right place for sarcasm
> ### Open Questions [optional]
>
> Frankly, having just "Critical" and "Warning" is not helpful.
`Info` exists and is encouraged to be used.
Personally, I think "Info" is not something that should be in 'alerts.' We need a different channel for informational messaging.
The community isn't fully decided on the way to go; however, with us inheriting a major set of OCP alerts from https://github.com/kubernetes-monitoring/kubernetes-mixin, it makes sense to obey the alert severities defined there.
Which alerts are we inheriting?
> Every alert needs to have a URL to relevant documentation to resolve the issue.
>
> If there's not documentation, there shouldn't be an alert.
Suggested change: "If there's not documentation, there shouldn't be an alert." → "If there's no documentation, there shouldn't be an alert."
This sentence re-states the previous sentence. For emphasis, I guess? Drop it
> Every alert needs to have a URL to relevant documentation to resolve the issue.
>
> If there's not documentation, there shouldn't be an alert.
That's in fact true; hence we're bootstrapping github.com/openshift/runbooks.
> #### Example 3: I have an operator that only needs 1 replica to function
>
> Well, if the cluster can upgrade with only 1 replica, and the the service is
Suggested change: "Well, if the cluster can upgrade with only 1 replica, and the the service is" → "If the cluster can upgrade with only 1 replica, and the service is"
> if not immediately addressed or if the cluster is already in a critical state.
>
> Some critical states are:
> * loss or impending loss of etcd-quorum.
That's a very special case and doesn't really fit the bigger picture of "cluster unhealthy".
I agree that etcd-quorum loss is a special case, that's why it should be critical. If it's something similar to impact to etcd-quorum loss, it should be critical. If it's dissimilar, it should not be critical.
> were to restart or need to be rescheduled, it would not able to start on another
> host
>
> In otherwords, critical alerts are something that require someone to get out
Suggested change: "In otherwords, critical alerts are something that require someone to get out" → "In other words, critical alerts are something that require someone to get out"
> In otherwords, critical alerts are something that require someone to get out
> of bed in the middle of the night and fix something **right now** or they will
> be faced with a disaster.
clarify "disaster"
Something that requires use of disaster recovery docs.
Clarify in the proposal if possible please
> be faced with a disaster.
>
> An example of something that is NOT a critical alert:
> * MCO and/or related components are completely dead/crash looping.
That's to say that the MCO itself is not relevant to the cluster's health, which is not true for all cluster setups.
The MCO being down does not create a disaster situation. The MCO does not need to be alive to run existing workloads or to ensure the API is available (schedule new/replacement workloads), and it being down does not jeopardize etcd-quorum.
The MCO might need to be alive to run existing workloads. For example, if your pull-secret needs rotating, or your Kube-API CA is getting rolled, etc., those are things that get pushed out to nodes via the machine-config operator. And if that isn't getting pushed out, the nodes may die. And possibly be destroyed. And new nodes (replacements or autoscaling increases) will not come up without sufficient machine-config stack health. So lots of vulnerabilities. But vulnerabilities are not necessarily page-at-midnight until the lack of capacity or stale-pull-secret or other fallout is actually biting you. And "core MCO operator is dead" shouldn't happen all that often anyway. Is this really something where noise is a problem? Or are you extrapolating from a different machine-config alert that you do find too noisy?
It sounds like in this case there should be independent monitoring of this particular situation for kube-api CA's getting rolled. So, the alert would be something along the lines of "CA not synced && MCO dead".
"&& MCO dead" doesn't sound like something the CA-generator should be taught to care about. But kubelets could provide metrics about remaining CA time, and there could be a new alert about those expiration times dropping under a given threshold? I think it's fine to leave it to the user or local Insights-style rules to consolidate a collection of distinct alerts into diagnoses. There is a distinction between "there is a problem" (alerts) and "I have a theory for the root cause" (diagnoses), and if we attempt to only target diagnoses with alerts, we risk silence when there is some new failure mode or presentation that doesn't match an existing diagnosis fingerprint.
I think this is the wrong example. The MCO for sure is on the critical path, and it should alert appropriately as a critical component for all the reasons above, and because it is the actioner of a lot of control-plane actions. And alerting in the kube manager with "Cert rotation stuck" without telling me where to look is almost useless. But alerting in the kube manager with "Cert rotation stuck. Rotation path: Manager -> MCO -> node" gives me a hint about what to check and which alerts to look for next. So if I don't see an MCO alert in critical, I know it works fine and go to the next in line.
Every component and alert needs to understand its dependency flow in the system overall.
s/MCO/samples operator/
and your example should be good.
I disagree. If cert rotation is the thing that will drag the cluster down, then we need to set an alert for that specifically. We also aren't (shouldn't be?) waiting until the last possible minute to do cert rotation; we have plenty of time to take action before the certs expire, on the order of weeks IIRC. So, if you have an MCO that goes down, it's not immediately critical. Yes, if you completely neglect your cluster for weeks on end, it will fail, but the MCO itself going down is not going to prevent application communication.
> * Define clear criteria for designating an alert 'critical'
> * Define clear criteria for designating an alert 'warning'
> * Define minimum time thresholds before prior to triggering an alert
That's unrealistic. Certain conditions can't be recovered from on their own and/or need instant action, because in a high number of cases the component doesn't recover without interaction. There doesn't need to be an arbitrary wait time.
Generally, arbitrary wait times are hard to define all over OpenShift.
I disagree. This is about consistency. The point of this document is to discuss what we think alerting and related elements should look like. There's no reason to not define a standard set of criteria. Anyone that thinks they have a special use case is probably wrong.
this seems like it would be easy to audit. For example, alerts installed via the CVO that do not set `for`:

```
$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.7.0-rc.0-x86_64
$ for x in $(grep -lr '^kind: PrometheusRule' manifests); do yaml2json < "${x}" | jq -r 'if (. | type) == "object" then [.] else . end | .[] | select(.kind == "PrometheusRule").spec.groups[].rules[] | select(.alert != null and .for == null) | .labels.severity + " " + .alert + " " + .expr'; done | sort | uniq
critical MCDRebootError mcd_reboot_err > 0
warning FailingOperator csv_abnormal{phase="Failed"}
warning ImageRegistryStorageReconfigured increase(image_registry_operator_storage_reconfigured_total[30m]) > 0
warning KubeletHealthState mcd_kubelet_state > 2
warning MCDPivotError mcd_pivot_err > 0
warning SystemMemoryExceedsReservation sum by (node) (container_memory_rss{id="/system.slice"}) > ((sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"})) * 0.9)
```
How do we feel about those?
"SystemMemoryExceedsReservation" is definitely one that needs a threshold. I have some questions about the utility of this particular alert, but that's outside the scope of this discussion.
`SystemMemoryExceedsReservation` is here from openshift/machine-config-operator#2033, if you want to follow up on that alert.
pretty much any alert that spuriously fires (in the sense that no action is needed, and it's going to be fine once the upgrade finishes) during upgrade today is a candidate for "this alert needs to wait longer before firing" imho.
i'm also very interested to see examples where the time it takes an admin to see, process, and take action to resolve an alert doesn't significantly dominate the time between the condition happening and the alert firing. That is to say:
if it takes an admin 30 minutes to fix something they were alerted to, then does it matter whether we alerted them 0 minutes or 5 minutes after it started happening? Particularly if by waiting 5 minutes we may be able to avoid alerting them at all (because the situation does resolve itself, such as in the case of most alerts caused by node reboots).
I tried to capture some timeline standards in the individual alerting sections. I would say, for critical alerts, they can probably be added to a 0m, 5m or 15m bucket.
0m bucket: etcd quorum is lost, we don't know how long this cluster might be alive so fire right away and hope for the best.
5m bucket: etcd quorum member lost (1 of 3). If it's a small network blip, hopefully this is covered by etcd's internal timeout, and 5 minutes should hopefully be enough to recover.
15m bucket: Pretty much all other critical alerts, like 1/3 healthy API servers.
Back to the upgrade bit, we shouldn't fire any alerts during upgrade; however, I don't think timing is the core issue, at least for the critical alerts. The vast majority of alerts could be Warning or lower severity, with a 60m timeout. This achieves a number of things. First, it gives time for the situation to self-correct, or for a helper like MachineHealthChecks to remediate a problem. Second, it allows us to know that it's probably not going to self-correct. Third, it keeps your alert screen from having tons of alerts at once; especially if there are critical alerts firing, we don't need the extra distractions.
Many of our alerts today are 'raw' alerts, IMO. Things like 9/10 daemonsets are running for MCD. Does this need an alert? It would be nice if we could aggregate some things into node-specific issues. E.g., if sdn-daemon is down on a given node, we should have a 'NodeUnhealthy' alert, and aggregate some conditions onto that screen. I'm not sure how feasible this is today. At the very least, each component's operator could detect the specifics about its operands and a given node. For instance, if a given node is under maintenance, and the daemonset is missing, that's expected. The operator is the layer we can utilize to understand this state.
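To make the bucket idea auditable, here is a rough sketch (reusing the extraction pipeline from the `for` audit above) that prints each alert's severity alongside its `for` duration, so rules could be spot-checked against whatever thresholds we agree on:

```
# Print severity, alert name and `for` duration for every alerting rule
# in the extracted release manifests; a missing `for` shows up as "none".
for x in $(grep -lr '^kind: PrometheusRule' manifests); do
  yaml2json < "${x}" | jq -r '
    if (. | type) == "object" then [.] else . end
    | .[] | select(.kind == "PrometheusRule").spec.groups[].rules[]
    | select(.alert != null)
    | (.labels.severity // "none") + " " + .alert + " " + (.for // "none")'
done | sort | uniq
```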
I do think we need to provide numbers as guidance, if only from a hygiene standpoint. Otherwise we're not going to make progress on standardizing, which is the purpose of this PR. Even if those numbers need tweaking in the end, or if they don't make sense on a particular alert, we can't reasonably ask a component team to write alerts without this basic guidance. This whole process is iterative and we have to start with something.
Isn't the wording on this a bit doubled now?
The alerts are a mess. This isn't a reflection of any particular person or team's effort, it's because all of our teams are on different pages about alert levels, documentation, and having a consistent UX. I've heard multiple conversations along the lines of "What alerts can I ignore for upgrades?" and that is a huge problem. I want to fix this problem, I think other people want to fix this problem. I haven't seen any process improvement to address this problem.
> ## Summary
>
> Alerts are a mess. There's not really formal guidelines to follow when
These docs are supposed to be aspirational, right? I think this is probably not what we aspire to have. :-)
I think you're right, probably Summary should be more future oriented, thanks!
> ### Goals
>
> * Define clear criteria for designating an alert 'critical'
> * Define clear criteria for designating an alert 'warning'
Do we only care about those 2 levels, or is the goal to describe criteria for all of the levels?
You tell me. I just wanted to get the conversation started. I would prefer to have 3 alert levels, "Critical" "Major" and "Minor". Since that ship might have already sailed, I am focusing on what we have today.
Other levels are mentioned in the text below, so I wasn't sure if you were just focusing on the top 2 as a start or if there was another reason they weren't included here. I'd be OK with starting by nailing down the criteria for critical and warning and seeing where that takes us.
Same here. It is listed as TBD below. Not a blocker for initial merge of this IMO.
> ### Critical Alerts
>
> TL/DR: Fix this now or cluster dies.
Are we talking just about the cluster falling over, or would an alert about a situation that might lead to data loss without the cluster halting also qualify?
I think the TLDR doesn't capture all the situations we might need a critical alert, I expanded upon some of these below.
If the purpose of this doc is to provide guidance, I think it would help to phrase these sections more imperatively, as instructions to someone designing or implementing an alert. So here we might say
Reserve critical level alerts only for reporting conditions that may lead to loss of data or inability to deliver service for the cluster as a whole. Failures of most individual components should not trigger critical level alerts, unless they would result in either of those conditions. Configure critical level alerts so they fire before the situation becomes irrecoverable. Expect users to be notified of a critical alert within a short period of time after it fires so they can respond with corrective action quickly.
and then below expand on that with the sorts of details that you have already with examples.
> ### Warning Alerts
>
> TL/DR: fix this soon, or some things won't work and upgrades will probably be
> blocked.
Taking a similar stab at rephrasing this section:
Use warning level alerts for reporting conditions that may lead to inability to deliver individual features of the cluster, but not service for the cluster as a whole. Most alerts are likely to be warnings. Configure warning level alerts so that they do not fire until components have sufficient time to try to recover from the interruption automatically. Expect users to be notified of a warning, but for them not to respond with corrective action immediately.
Great wording here, thanks!
agree, thanks doug
We (the monitoring team) took the available materials and guidelines published by the monitoring community, materials that were created over the years with the cooperation of administrators/SRE people who actually use monitoring on a day-to-day basis. That data was then converted into a blog post, which should be published soon and is meant to provide guidance on alert creation for OpenShift internal stakeholders as well as OpenShift customers. AFAIK it is also aligned with SREP internal practices. All this means we already have materials on how alerts should be created, what they should look like, and what severities should be used. The matter is how to enforce those practices, not what practices should be created. This document is lacking the critical aspect in the form of the HOW question.
We need an easy-to-consume, 1-page, simple document for OpenShift developers to create alerts. We need a document that is about OpenShift alerts, and not alerts generally. In the context of OpenShift, we need to determine what types of scenarios map to which alert levels. Since the monitoring team doesn't own all of the alerts, it needs to be consumable by all the different teams. While blog posts and similar content are valuable, we need something formalized into developer documentation.
Here you go, OpenShift docs for 4.6, section
When it comes to monitoring and alerting, OpenShift is not different from any other microservice environment, hence our blog post as well as the documents listed earlier by other folks from the monitoring team still hold.
I agree that we need docs. But I would agree more on the enforcement of such docs, and this again goes into the HOW question. If you have ideas on HOW, then I am all ears; if not, then I have no idea what this OEP is for, as things are already documented. It is just that this documentation needs to be read and applied.
I find the wording pretty suspect for "Warning":
This sentence alone demonstrates the confusion. Warning notifications should require attention to correct a problem. If it "might require attention" to "prevent a problem" then it shouldn't be a warning level alert at all. This leads to a sea of un-actionable alarms for administrators.
I fundamentally disagree with this statement, and this is sort of the problem. We need to be thinking about monitoring the platform holistically rather than in small pieces, at least when it comes to critical and warning alerts. This is why we have a bad signal-to-noise ratio. In any case, the nature of the enhancement process is to enhance. So even if we have these things in some current form, this enhancement allows us to discuss them and decide if the process is working, and if not, how we should change it. We have evidence that alerting in its current state is flawed.
This enhancement doesn't allow us to do that very well, since the author didn't put a lot, if any, research into the in-cluster monitoring stack. A lot of things are based on the author's personal assumptions and opinions. The author didn't make any efforts up front to find out about potential initiatives that are already started or in progress.
This only works if the PR author is open to discussion, which involves accepting opinions other than their own.
It would make sense to present this for context. IMHO the best way would be to close this PR, conduct the aforementioned research, and then open a new one.
I'm sorry you feel negatively about this enhancement. I opened this enhancement to get the discussion started, not to say I have all the right ideas. You are welcome to open your own enhancement to counter this one, or we can work on this one collaboratively. This enhancement is a place to share ideas and opinions. I don't prefer how we're managing alerts today, and I want to address it.
> * loss or impending loss of etcd-quorum.
> * inability to route application traffic externally or internally
> (dataplane disruption)
> * inability to start any new pod other than capacity. EG, if the API server
"other than capacity" is not very clear. Are you suggesting that pods waiting on capacity are not critical? Or that that's the only situation in which pods failing to start are critical? Personally, I don't think "pods failing to start" would ever be critical. One fewer API-server pod is a bummer, but the other two will be able to handle the necessary traffic. Similarly for other components. A critical alert is "we are down to one, surviving API-server pod. Wake up, Admin, and save us before the cluster goes dark!" or "the auth component is Available=False, so users can't log in".
I'm talking about a cluster-wide inability to start a new pod due to some degraded condition other than capacity. Capacity meaning you don't have RAM/CPU to run more pods.
Said another way, if a currently running pod were to fail, could it be restarted and become ready on this host or another? One example might be SDN is totally down.
Rolling into #637 (comment), which is also quibbling about this line. Feel free to mark this thread resolved.
> 60 minutes? That seems high! That's because it is high. We want to reduce
> the noise. We've done a lot to make clusters and operators auto-heal
> themselves. The whole idea is that if a condition has persisted for more than
Upstream alert levels from here:
- Critical: An issue, that needs to page a person to take instant action
- Warning: An issue, that needs to be worked on but in the regular work queue or for during office hours rather than paging the oncall
- Info: Is meant to support a trouble shooting process by informing about a non-normal situation for one or more systems but not worth a page or ticket on its own.
For warnings, I'd be happier with "if you are very confident that this will not auto-heal, even if it hasn't been 60m". An example would be `AlertmanagerFailedReload`, which, as I read it, will fire after 15m without a successful load. The expectation is that someone's fumbled the config, and only an admin correcting the config will recover things. I'd personally be happier if the alert manager copied verified configs over into some private location, so "admin fumbles a config touch" didn't leave you exposed to losing all alert manager functionality if the pod restarted. The fact that config fumbles leave you vulnerable to alert-manager loss today, and the alert manager being the thing paging you at midnight for other issues, makes the current `critical` there make sense to me. If the alert manager did have valid-config caching that survived pod restarts/reschedules, I'd rather have the alert be a warning, so no midnight page, but I'd still be comfortable waiting only 15 minutes before firing, because you want the admin to come back and fix the config they broke before they clock out for the night.
In practice, `AlertmanagerFailedReload` would only fire if someone touches the Alertmanager configuration, which means that there's someone already awake. It might change when we allow users to manage their own `AlertmanagerConfig` resources, but the operator should be able to validate that the configs aren't broken.

> if the alert manager copied verified configs over into some private location

In principle yes, but given the current design of Alertmanager, it isn't trivial. As stated above, the Prometheus operator should take care of checking that at least the configuration is syntactically valid.
> ...only fire if someone touches the Alertmanager configuration which means that there's someone already awake.

I used that same argument internally when I tried to argue that `CannotRetrieveUpdates` made sense at critical, because it was firing because OSD tooling had set a bogus `spec.channel` leading to `VersionNotFound`. It still got demoted to `warning` in openshift/cluster-version-operator#509. Assuming that an awake human is the one touching your config is apparently a leaky argument ;).

> In principle yes but given the current design of Alertmanager, it isn't trivial...

The way it works today doesn't seem to be bothering folks too often, and it makes sense to have "lots of work to solidify guards for a rare situation" be lower priority than lots of other things.
`CannotRetrieveUpdates` is a bit different IMHO since it can probably wait until tomorrow morning (unless I'm wrong about the scope). If the Alertmanager config is broken, you're only one step away from losing all alert notifications in case all your Alertmanager pods get restarted.
But I don't think we need to argue about this specific alert here, and I'd be fine if someone wants it to be demoted to "warning" because it causes them pain :)
Is alertManager rolled out and gated on readiness during changes? Seems like catastrophe is already being prevented here.
> ...can probably wait until tomorrow morning (unless I'm wrong about the scope)...
Depends on the alerts you're missing. If there are no available updates, not being able to fetch them doesn't matter. If someone is DoSing your network to keep you from hearing about a critical CVE update, then not hearing about them means attackers have longer to exploit the flaw before you wake up and start working the update-retrieval issue.
> Is alertManager rolled out and gated on readiness during changes?
You are currently safe as long as there is an alertmanager around which was running the old config. But that's not a resilient position. For example, someone bumps a MachineConfig and the machine-config operator rolls your compute pool, and now your Alertmanagers are all dead. Maybe PDBs protect you from some of that? But OSD allows customers to configure limited respect for PDBs on the ~hours scale, and sometimes machines go down hard, without allowing time for graceful PDB guards.
> That seems significant
> and it probably is, but nobody has to get out of bed to fix this **right now**.
> But, what if a machine dies, I won't be able to get a replacement? Yeah, that
> is probably true, but how often does that happen? Also, that's an entirely
"doesn't happen often" is not an argument for "not critical". The argument for this not being critical is that there is no risk of data-loss or customer impact inherent in a machine dying, or failing to provision a new one, or rolling out new secrets to nodes, or all the other things that the machine-config does. The critical impact would be "my customer-facing workload is failing to schedule because I have no capacity" or "I am on the edge of etcd-quorum loss because I cannot revive my third control-plane node". Those are critical. Maybe they are due to the machine-config stack being sad, so it's good to have warning-level machine-config alerts. But I don't think that they need to be critical-level machine-config alerts. But "happens rarely" is an argument for "even if the severity is wrong, fixing it is not all that important", not an argument for what the severity should be.
"my customer-facing workload is failing to schedule because I have no capacity"
No, we can't own this. Users should have their own monitoring for apps that fail to schedule. There are all sorts of capacity user stories, and users need to apply careful consideration in this area.
The machine-api does not replace failed instances anyway; MHC does that.
I disagree that frequency is not a component of severity. If clusters routinely lost 1 instance an hour (in some hypothetical reality... there's a joke to be made about a cloud provider in here lol) just as a matter of course, then having a functioning machine-api would be critical.
Users should have their own monitoring for apps that fail to schedule.
User workloads should have "failed to schedule" monitoring so they can stay under the current caps and quotas. But cluster admins should have "limited capacity is impacting user workloads" monitoring so they can decide when they want to grow quota. This is the same as the autoscaler, which definitely cares about user workload capacity today, despite being managed by cluster admins. The alerts would fill the manually-scaled use case, and also cover the "autoscaler is maxed / dead" use case.
The machine-api does not replace failed instances anyway, MHC does that.
I thought the MHC killed dead/degraded machines/nodes, but that the machine API then provisioned the replacement Machine to get the MachineSet back up to its `spec.replicas`. If you have an MHC on a machine/node that is not part of a MachineSet, is a replacement created? I'd have expected no replacement to be created.
If clusters routinely lost 1 instance an hour (in some hypothetical reality... then having a function machine-api would be critical.
But it would not be a `critical` alert. Losing your customer-facing workloads or living on the edge of etcd quorum loss would be a `critical` alert. And then there would be `machine API is dead`, `high machine death rate`, and `unable to schedule many pods due to limited overall cluster capacity` `warning` alerts to help point the responding admin at likely root causes.
"limited capacity is impacting user workloads" monitoring so they can decide when they want to grow quota
This assumes having pending pods is a problem. Some clusters might schedule very large batches of jobs; having limited capacity is not a problem for them, they just want to chew through the jobs. Like I said, there are all sorts of capacity user stories. A cluster can run indefinitely with 0 capacity.
If you have a MHC on a machine/node
I can't recall MHC specifics, but it's not intended to delete machines that aren't part of a machineset today. The machineset creates the replacement machine, but in effect you can say that the MHC replaces the failed machine.
Clear and actionable alerts are a key component of a smooth operational
experience. Ensuring we have clear and concise guidelines for our alerts
will allow developers to better inform users of problem situations and how to
resolve them.
+1, I really like the idea of creating consistent guidelines for not only the alerts, but the documentation to support them. I am also hugely in favor of bringing the focus of generating those artifacts closer to the project development.
* Define ownership and responsibilities of existing and new alerts
* Establish clear documentation guidelines
* Outline operational triage of firing alerts
* Explore additional alerting severity (e.g., 'minor')
Chime in here if you're for or against a new alert level. My feeling is 'info' doesn't seem much like an alert (indeed, we include stuff like 'cluster update available' in this level) and we could better communicate a problem with something like 'minor'. For as long as I can remember, monitoring systems other than ours have had at least 3 levels of alerts.
I think it makes sense to have multiple levels (info, warning, critical) if we plan to use alerts to give signal about upcoming problems, e.g., at x minutes you get info, at 5x minutes you get warning. A use case I can think of for this is certificate expiry: if we expect certificates to exist for 30 days and to be renewed with 7 days left, you may want to fire an info alert once it's past the renewal period, a warning halfway through the remaining window, and a critical when it gets close to expiry (see the sketch after this list).
In this case, to a user this means:
- Info: Something hasn't happened that should have, but it might fix itself, so no need to be concerned yet
- Warning: This isn't breaking your system, but it has failed multiple times and probably needs human intervention; don't wake up, fix it tomorrow during normal working hours
- Critical: Your system is now broken or will imminently be broken, and therefore human intervention is required now
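As a rough illustration of that tiering, this could translate into rules along the following lines (a sketch only: `my_cert_expiry_timestamp_seconds` is a hypothetical gauge holding the certificate's expiry time as a Unix timestamp, and the thresholds assume the 30-day/7-day lifecycle above):

```yaml
groups:
  - name: certificate-expiry-example
    rules:
      - alert: CertificateRenewalOverdue
        # Renewal should already have happened (7 days left); nothing is broken yet.
        expr: my_cert_expiry_timestamp_seconds - time() < 7 * 24 * 3600
        labels:
          severity: info
      - alert: CertificateRenewalLate
        # Half of the remaining window is gone; needs a human, but not at 3am.
        expr: my_cert_expiry_timestamp_seconds - time() < 3.5 * 24 * 3600
        labels:
          severity: warning
      - alert: CertificateExpiryImminent
        # Expiry is about a day away; intervene now.
        expr: my_cert_expiry_timestamp_seconds - time() < 24 * 3600
        labels:
          severity: critical
```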
I don't have a strong feeling about what is a better experience for users. I am willing to extend trust to those who have more direct experience in this area. I would like to hear whether the proposed 4 levels would be more helpful to the people who are operating OpenShift, and would like to see some SRE folks chime in here (sorry if you have and I didn't recognize it).
My thoughts about the various levels:
- info: informational only. I feel like even calling this an "alert" is strongly worded; this sounds more like "exposed log records". But this is valuable, as it can save users time from looking at actual logs, or having to dig for information like updates being available.
- warning: something bad is happening and needs to be fixed. It will not crash the cluster immediately but should be investigated, as it may lead to cluster instability.
- critical: wake someone up, the cluster is dead or dying. Should be self-explanatory.
To me, "minor" would mean "this is a problem, but it does not need to be prioritized for investigation and will never lead to a cluster disruption". I can see the value in having something at that level, I'm just not sure how this fits into an operational team's experience. @michaelgugino's reasoning for having the extra level makes sense to me; I can see the value in adding a "minor" level alert.
Hope that helps, just my two cents =)
I suggest we scope an additional level out of this enhancement and follow up later.
From an SRE perspective, there are basically only two levels:
- paging,
- everything else.
Paging should be criticals, so "something's likely very bad and needs attention asap". If there's a repeated situation where the criticals are unactionable, or multiple criticals are produced for a single underlying common issue (and therefore multiple pages), those alerts need review.
For the "everything else" bucket… that's a whole doc's worth of considerations. Basically, if it doesn't indicate an underlying issue to be fixed by a human a high percentage of the time, it should be out.
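For what it's worth, that two-bucket model maps directly onto Alertmanager routing on the `severity` label. A minimal sketch (receiver names and integrations are placeholders, not what any managed fleet actually ships):

```yaml
route:
  receiver: default
  routes:
    # Only critical alerts page a human.
    - match:
        severity: critical
      receiver: pager
    # Everything else lands in a queue reviewed during working hours.
    - match_re:
        severity: warning|info
      receiver: ticket-queue

receivers:
  - name: default
  - name: pager
    pagerduty_configs:
      - service_key: '<secret>'
  - name: ticket-queue
    webhook_configs:
      - url: 'https://ticketing.example.com/hooks/alertmanager'
```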
A: It's a warning level alert.

### Alert Ownership
I'm strongly in favor of this aspect of the proposal. If the proposal remains contentious/unmerged overall, I'd like to see this aspect split out. I bet the monitoring team has an interest in helping this along.
This is indeed the only way to quickly improve and have product alerting be sustainable into the future. However, I have not seen an explicit ack on this from OCP engineering at a high enough level, with documentation and sign-off.
Spreading out alert ownership without clear and simple guidelines on how to create alerts might lead to a lot of headaches for alert consumers down the line.
much effort by each component team. As we mature our alerting system, we should
ensure all teams take ownership of alerts for their individual components.

Going forward, teams will be expected to own alerting rules within their own
We should enumerate the ownership responsibilities, including
- first in line to receive the bug for an alert firing
- responsible for describing, "what does it mean" and "what does it do"
- responsible for choosing the level of alert
- responsible for deciding if the alert even matters
I think we could get some improvement by categorizing alerts by owner so we can push for better refinement, but that could be a later step.
👍 totally agree with what David wrote, assuming we have clear guidelines about the critical/warning/info alert levels described above in this document.
As I said earlier, I don't think writing this in this PR constitutes active acceptance and acknowledgement by the stakeholders. Want to avoid surprising folks with new responsibilities they had no heads-up about.
* etcd corruption
* inability to route application traffic externally or internally
  (data plane disruption)
* inability to start/restart any pod for reasons other than capacity. E.g., if the API server
Does this imply that the kube-apiserver is considered critical? Perhaps we should enumerate those binaries which are "special".
Anything that can affect an SLA is critical. kube-apiserver availability is the only SLA our managed product is contracted to uphold. However, I don't think every alert for kube-apiserver should be set to critical.
Anything that can affect an SLO (which are often company- and SRE-team-specific) should be a warning, because SLO thresholds will be set more aggressively than the SLA that they support.
Collapsing discussion from this earlier thread about this same line, I still don't understand the capacity carve-out. If a pod failing to schedule is a problem, then the reason the pod failed to schedule helps admins figure out how to resolve/mitigate, but the problem exists regardless of the reason. E.g. "I have no API-server pods, and am failing to schedule more" is a disaster. If the reason is "because I'm out of CPU", it's still a disaster.
E.g. "I have no API-server pods, and am failing to schedule more" is a disaster. If the reason is "because I'm out of CPU", it's still a disaster.
I think this example is contrary to symptom-based alerting. Also, the API-server pods have a very high priority, so capacity should never be an issue on a properly (e.g., default or larger) sized node.
Pods failing to schedule due to lack of capacity is not a problem. Sometimes clusters fill up, and sometimes this is desired. One use case: run 1 million copies of pod X, low priority, preemptible.
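To make the "low priority, preemptible" part concrete, such batch pods would typically reference a PriorityClass along these lines (an illustrative sketch; the class name and value are made up):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: -100            # well below the default of 0, so these pods are preempted first
globalDefault: false
description: Low-priority, preemptible batch work; pods pending on capacity are expected here.
```

Pods that set `priorityClassName: batch-low` can sit Pending until capacity frees up without that being an operational problem, which is exactly why a blanket "pods are unschedulable" critical alert doesn't work.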
---
title: alerting-consistency
Yeah, include a bulleted list of references in the summary section. The Monitoring section in the Google SRE book includes the majority of the guidance to support readers of this document. https://sre.google/sre-book/monitoring-distributed-systems/
### Goals

* Define clear criteria for designating an alert 'critical'
* Define clear criteria for designating an alert 'warning'
Same here. It is listed as TBD below. Not a blocker for initial merge of this IMO.
* Define clear criteria for designating an alert 'critical'
* Define clear criteria for designating an alert 'warning'
* Define minimum time thresholds prior to triggering an alert
I do think we need to provide numbers as guidance, if only from a hygiene standpoint. Otherwise we're not going to make progress on standardizing, which is the purpose of this PR. Even if those numbers need tweaking in the end, or if they don't make sense on a particular alert, we can't reasonably ask a component team to write alerts without this basic guidance. This whole process is iterative and we have to start with something.
* Define clear criteria for designating an alert 'warning'
* Define minimum time thresholds prior to triggering an alert
* Define ownership and responsibilities of existing and new alerts
* Establish clear documentation guidelines
I want to call out the runbooks repo here. The idea is that every alert comes with a runbook, and they're open source, part of the product code-base. https://github.com/openshift/runbooks.
We had some discussion with the support team on this and they ack'd use of the runbooks repo (the alternative suggested was KCS, but was decided against).
PLM strongly suggests the use of KCS vs. a GitHub repo, as it creates a more on-brand and controlled space (not influenced by the community).
* Define ownership and responsibilities of existing and new alerts
* Establish clear documentation guidelines
* Outline operational triage of firing alerts
* Explore additional alerting severity (e.g., 'minor')
I suggest we scope an additional level out of this enhancement and follow up later.
Sometimes an alert's impact is not obvious. We should state the impact of the
alert. Don't expect users to understand the context; we have too many moving
pieces.
Indeed. Context about why the alert exists and what was in the developer's head when they wrote it is SO IMPORTANT.
### Open Questions [optional]

Frankly, having just "Critical" and "Warning" is not helpful.
I would like this sentence added:
"The group of critical alerts should be small, very well defined, highly documented, polished and with a high bar set for entry. This includes a mandatory review of a proposed critical alert by the Red Hat SRE team."
Having this would be an important checkpoint for SRE, and one for which we already have a workflow in place to support you.
We really need "Critical", "Major", "Minor" and "Info." Can/Should we do this?

### Test Plan
I would like folks' opinions on whether we need to resolve a test plan approach for alerts in the context of merging this enhancement.
## Drawbacks

People might have rules around the existing broken alerts. They will have to
Indeed, and my experience so far is that teams are happy to have a 2nd set of eyes. So I hope this ends up being a non-issue.
## Alternatives

Document policies, but do not make any backwards-incompatible changes to the
existing alerts, and only apply the policies to new alerts.
Realizing you're listing this because it's in the template, but I feel compelled to say it is not an acceptable alternative from an SRE standpoint. We have to do two things: 1) improve existing noise, and 2) put in place a process (this doc) to ensure we don't regress.
* Define minimum time thresholds prior to triggering an alert
* Define ownership and responsibilities of existing and new alerts
* Establish clear documentation guidelines
* Outline operational triage of firing alerts
Does it cover runbooks as per @jeremyeder's comment above? If not, what is the intent?
We should pull this in as one of the major tenets of this enhancement. It's a major win from the SRE and general "alert consumer" PoV. It should be the responsibility of the alert author, so that it will be lifecycled along with the alert (e.g. runbook CRUD when alert CRUD).
of bed in the middle of the night and fix something **right now** or they will
be faced with a disaster.

An example of something that is NOT a critical alert:
Perhaps we need a less controversial example of an alert that shouldn't be critical? I can propose `ThanosQueryRangeLatencyHigh`: Thanos Query isn't critical to keeping user workloads running, so it shouldn't fire critical alerts.
I'd also like to see an example of an alert that is unambiguously critical (e.g. `etcdInsufficientMembers`).
@simonpasquier would you be able to supply the specific alert definitions in addition to the names, so that they can be included here as a reference?
Maybe PrometheusRemoteStorageFailures? The fact that stuff isn't getting pushed to telemeter is an internal problem for us, but customers don't really have a reason to care.
### Alert Ownership

Previously, the bulk of our alerting was handled directly by the monitoring
I've done the numbers and here is the current breakdown on a 4.7 cluster, which could be useful to add to the proposal as a datapoint.
- Alerts shipped by the cluster monitoring operator (CMO)
  - critical: 43
  - warning: 82
- Other alerts
  - critical: 14
  - warning: 35

Diving into the alerts shipped by CMO:
- alerts for etcd, Kubelet, API server, Kube controller and scheduler. They live in the CMO repository for historical reasons and the plan is to move them to their respective repositories.
  - etcd: 12 alerts
  - control plane: 12
  - kubelet: 6
- alerts for the monitoring components (Prometheus, Alertmanager, ...): 57 alerts
- alerts for node metrics (clock skew, filesystems filling up, ...): 17 alerts
- alerts based on kube-state-metrics metrics like `KubePodNotReady`, `KubePodCrashLooping`, ...: 25 alerts
@simonpasquier who owns that plan and where can I track it?
openshift/cluster-monitoring-operator#1076 for the control plane alerts and openshift/cluster-monitoring-operator#1085 for the etcd alerts. We need to synchronize with their respective teams for the handover.
enable component owners to control their alerts.

### Documentation Required
We need a clear specification and recommendations for writing alerting rules. Having a consistent format for alerting rules helps with alert triaging and notification dispatch. FWIW there are already guidelines in the monitoring-mixins repository that we should get inspiration from IMO.
- The name of the alerting rule should clearly identify the component impacted by the issue (for example `etcdInsufficientMembers` instead of `InsufficientMembers`, `MachineConfigDaemonDrainError` instead of `MCDDrainError`). It should be camel case, without whitespace, starting with a capital letter. The first part of the alert name should be the same for all alerts originating from the same component.
- Alerting rules should have a "severity" label whose value is either `info`, `warning` or `critical` (matching what we have today and staying out of the discussion of whether we want `minor` or not).
- Alerting rules should have a `description` annotation providing details about what is happening and how to resolve the issue.
- Alerting rules should have a `summary` annotation providing a high-level description (think of it as the first line of a commit message or email subject).
- If there's a runbook in https://github.com/openshift/runbooks, it should be linked in the `runbook_url` annotation.

Another guideline for alerts that work on gauge metrics: the rule expression needs to account for failed scrapes that can create false negatives. For example, say that `SamplesDegraded` (expressed as `openshift_samples_degraded_info == 1`) is firing and Prometheus fails to scrape the target; the alert would then resolve itself because the metric has been marked as stale. A better alerting query would be `max_over_time(openshift_samples_degraded_info[5m]) == 1` (see https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details).
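Pulling those conventions together, a rule that follows them might look roughly like this (a sketch only: the expression, `for` duration, annotation text and runbook path are illustrative, not the shipped cluster-samples-operator alert):

```yaml
groups:
  - name: openshift-samples-example
    rules:
      - alert: SamplesDegraded
        # max_over_time keeps the alert firing across a failed scrape instead of
        # letting the stale gauge silently resolve it.
        expr: max_over_time(openshift_samples_degraded_info[5m]) == 1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: Samples operator is degraded.
          description: The samples operator has been degraded for more than 30 minutes. Check the ClusterOperator status and the operator logs for details.
          runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/SamplesDegraded.md  # illustrative path
```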
This is fantastic - @michaelgugino can you pull this in?
We really need "Critical", "Major", "Minor" and "Info." Can/Should we do this?

### Test Plan
Depending on what we agree about rule conventions and guidelines, we should translate as much as we can into origin e2e tests. E.g. there should be a test validating that all alert rules have a `severity` label matching the set of allowed values. Same goes for the `summary` and `description` annotations if we agree that they are mandatory.
Asking teams to write unit tests for all rules using `promtool` is going to have a small benefit IMHO. Unit tests are mostly useful when joining metrics from different sources, and I expect that most of the rules don't fall in this category.
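For the rules where unit tests do pay off, the overhead is small. A minimal sketch of a `promtool test rules` file, assuming the illustrative `SamplesDegraded` rule above is saved as `samples-alerts.yaml`:

```yaml
# samples-alerts-test.yaml -- run with: promtool test rules samples-alerts-test.yaml
rule_files:
  - samples-alerts.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # The samples operator reports degraded for the whole hour.
      - series: 'openshift_samples_degraded_info'
        values: '1+0x60'
    alert_rule_test:
      - eval_time: 40m
        alertname: SamplesDegraded
        exp_alerts:
          - exp_labels:
              severity: warning
            exp_annotations:
              summary: Samples operator is degraded.
              description: The samples operator has been degraded for more than 30 minutes. Check the ClusterOperator status and the operator logs for details.
              runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/SamplesDegraded.md
```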
Well, if the cluster can upgrade with only 1 replica, and the service remains
available despite the other replicas being unavailable, this can probably
be just an info-level alert. We should have a "minor" alert, TBD. But for now,
List of `info` alerts in 4.7:
- MultipleContainersOOMKilled (Bug 1807139: jsonnet/rules.jsonnet: Lower threshold for MultipleContainersOOMKilled alert cluster-monitoring-operator#690 and https://bugzilla.redhat.com/show_bug.cgi?id=1807139)
- KubeQuotaAlmostFull (Split the alerts for quota usage reached and exceeded kubernetes-monitoring/kubernetes-mixin#494)
- KubeQuotaFullyUsed (KubeQuotaExceeded fires even if quota is not exceeded kubernetes-monitoring/kubernetes-mixin#441 and alerts/resource_alerts.libsonnet, runbook.md: Reduce KubeQuotaExceeded severity and adjust threshold kubernetes-monitoring/kubernetes-mixin#450)
- SamplesTBRInaccessibleOnBoot
- UpdateAvailable
TL;DR: we don't overuse this level and it serves a purpose (at least for the first 3 alerts from what I can tell).
that is probably true, but that doesn't make your alert "critical"; it just makes
it worthy of an alert.

### Warning Alerts
I propose to use `ClusterNotUpgradeable` as an example of a good warning alert.
@simonpasquier could you document and land an example origin e2e test in the actual CI system to point to?
@jeremyeder @michaelgugino Are we also thinking about incorporating https://monitoring.mixins.dev/#guidelines-for-alert-names-labels-and-annotations into the proposal? cc @paulfantom
Can we get a couple of review passes from folks? I see us starting to converge; don't want to lose momentum here.
I'd like to see something like this move ahead, but I would suggest being open to changes as we divide up ownership of existing alerts (there's a spreadsheet). Pinning something as a starting point (even as provisional) can help in that initial pass. As a starting point, this is pretty good.
I agree. On March 17 we had a discussion as to what was missing in this enhancement from the SRE standpoint, and those items have been included in the April 6 update that @michaelgugino pushed to this enhancement. /lgtm
1. Alerting rules should have a "severity" label whose value is either info,
warning or critical (matching what we have today and staying out of the
discussion of whether we want minor or not).
1. Alerting rules should have a description annotation providing details about
I would like to see a follow-up to this that adds examples that people can prescriptively follow, maybe with 1-3 cherry-picked examples of "good descriptions" including some guidance. Or we can put that in a "how to write alerts" doc and link it here.
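As one candidate for such a cherry-picked example, a description along these lines (illustrative wording, not copied from a shipped alert) states both what is happening and what the responder should do:

```yaml
annotations:
  summary: Alertmanager is failing to send notifications.
  description: >-
    Alertmanager {{ $labels.namespace }}/{{ $labels.pod }} failed to deliver
    {{ $value | humanizePercentage }} of notifications to {{ $labels.integration }}
    over the last 15 minutes. Notifications for firing alerts may be lost;
    check connectivity to the receiver and the Alertmanager logs.
```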
Squash please, fix the markdown, and if there are no other comments by Monday morning I will approve (thanks for all the hard work here, folks).
/approve Thanks for all the hard work on getting consensus here, I know this one was 🎆 and it's an important step forward.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: smarterclayton.
/unhold
Looks like the bot got you - can you add those sections?
/lgtm