Add: Alerting Standards #637
Conversation
Hey, thanks for the proposal, is there a backstory on where this is coming from? Thanks!
Would love to see some more around the test plan and some actual examples of how you imagine runbooks would look.
cc @openshift/openshift-team-monitoring
> ---
> title: alerting-consistency
Would be great if we could link to some upstream Prometheus docs and blog posts on this topic, so we adhere to best practices and align with them.
Do you have any good links or info you'd like to incorporate here?
Yes, there is plenty; @s-urbaniak wrote a blog post on this. A couple of things are:
- https://prometheus.io/docs/practices/alerting/
- https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
- google SRE book chapter on alerting
- https://www.usenix.org/sites/default/files/conference/protected-files/srecon16europe_slides_rabenstein.pdf also anything by Björn really
To name a few; not sure if @s-urbaniak's blog post on this topic is out yet?
I wouldn't refer to those resources as 'upstream' but rather just some content from the web. I think they are probably helpful for establishing a rationale, but I'm not sure that including them is better or worse.
The blog post is being prepared to be published, but I think I will have to nudge again.
Yeah, include a bulleted list of references in the summary section. The Monitoring section in the Google SRE book includes the majority of the guidance needed to support readers of this document. https://sre.google/sre-book/monitoring-distributed-systems/
> Frankly, having just "Critical" and "Warning" is not helpful.
>
> We really need "Critical", "Major", "Minor" and "Info." Can/Should we do this?
We already have info-level alerts shipped with the cluster, but they are often non-actionable ones that are there for historical reasons, so agreed that info should be the last resort.
> ### Open Questions [optional]
>
> Frankly, having just "Critical" and "Warning" is not helpful.
Can you give some reasons here why not? :)
Inevitably, there are going to be some 'warning' level alerts that people won't care about. We need a lower level that indicates that there is a problem, but the impact is low. This would improve the signal to noise ratio of warning level alerts.
Warning level should already create a ticket, so this is low-level enough; it should not page anyone, so in an ideal scenario there is really no noise. If folks don't care about those warning alerts, they can silence them after getting the ticket.
The problem is, we want all warning level alerts to be cared about and fixed in most situations. This is about signal to noise. If people are trained to ignore warning level alerts by default, they will miss important alerts. This is why we need a 'minor' alert.
Alternatively, we can just not alert for 'minor' alerts.
IMO with minor severity alerts, we're not avoiding the issue of folks not paying attention to alerts because they're noisy... If a warning alert is too noisy, it is a bug: the alert should either be removed or modified to be more accurate.
The Prometheus community has discussed this for a long time: the (relative) consensus is that there should be a level for "everything's on fire" alerts and "please have a look at this" alerts. Informative alerts can be useful sometimes but should probably be the exception.
I have thought about this overnight, and for me it comes back to triage. The group of critical alerts should be small, very well defined, highly documented and polished.
The group of warning alerts are the next priority. If you find yourself managing many clusters with many alerts, you need a sensible way to triage these. Clearly an API replica being down is more important than a daemonset not running on an individual worker. They're both problems, but one should probably be investigated before the other.
I think, another way to state it is, think of a single alert in the context of the cluster, rather than the context of a single operator. We all know there are 'minor' alerts, we just don't label them as minor.
"Everything else can wait" yes, that's true. But if you're managing dozens of clusters, you need a way to triage alerts, and you probably want to send a lot of stuff to /dev/null. Our operation teams probably already have rules about which rules to care about and which rules to ignore. Some of this particularly revolves around upgrades and asking "Should I fix this prior to upgrade?" I want to remove that question.
Even if we introduce a "minor" level, there would be people complaining that alert X should be "warning" while it's "minor" (and vice-versa). At some point you will leverage additional alert labels to group/route the alerts to the correct destination (this is what OSD is doing, from what I understand).
And triaging doesn't necessarily happen at the cluster level: if you run significant infrastructure, you probably have a PagerDuty-like service for dealing with on-call activities.
I'm not talking about triaging on-call events (eg, critical). Those are the ones that get paged and hopefully resolved. I'm talking about all the stuff that fits into 'warning.'
I agree, some people will want alerts to change from one severity to another. That's the entire point of this document: to establish clear guidelines. Definitely the generic alerts like the Kube* ones I mentioned elsewhere: if we don't disable those entirely (we probably should), every one of them should be 'minor.' It's a problem, but it's likely caused by a warning or higher level alert (or should be). If something is down, and the service is not impacted, and losing more of them doesn't jeopardize the cluster's health, then it's probably okay to be minor.
I would like this sentence added:
"The group of critical alerts should be small, very well defined, highly documented, polished and with a high bar set for entry. This includes a mandatory review of a proposed critical alert by the Red Hat SRE team."
Having this would be an important checkpoint for SRE, and one we already have a workflow in place to support.
> ## Drawbacks
>
> People might have rules around the existing broken alerts. They will have to
I believe it would be best to help folks with this; we have tried in the monitoring team, but it's often not as easy as just saying they will have to change some alerts. So agreed, this is a drawback, but it's also the actual correct result of this.
Indeed, and my experience so far is that teams are happy to have a 2nd set of eyes. So I hope this ends up being a non-issue.
> ## Alternatives
>
> We'll just mark everything critical so users can guess what is an urgent
Hmm, not sure why this is an alternative; at least in the monitoring team we do not ship all alerts as critical.
This is just some placeholder text; I don't actually think this is a viable alternative.
One alternative would be to document policies, but not make any backwards-incompatible changes to the existing alerts and only apply the policies to new alerts. I don't think we should do that, but it does address the drawback mentioned above.
Good suggestion!
> We really need "Critical", "Major", "Minor" and "Info." Can/Should we do this?
>
> ### Test Plan
I would actually love to see a test plan here; adding tests to OpenShift's origin e2e suite that check some best practices for alerts would be good. Otherwise we are in the same spot in 6 months :) WDYT?
I'm not opposed to a test plan, but I don't have one ATM. I'm open to suggestion here.
Take out a few compute machines. Wait an hour. Confirm that no critical alerts are firing, or were ever firing.
A few ideas:
- Ensure that all alerts have a `severity` label and comply with the list of allowed values.
- Ensure that no critical alert gets added without a sign-off from teams that are on-call (e.g. OSD).
- Ensure that all alerts have mandatory annotations (e.g. `description` and `summary`).
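A rough sketch of how the first check could be automated against a release payload, reusing the `oc adm release extract` + `yaml2json` + `jq` pipeline quoted later in this thread; the allowed severity set here is an assumption:

```
# Assumption: allowed severities are critical, warning and info.
# Extract the release manifests, then list any alerting rule whose
# severity label is missing or outside the allowed set.
oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.7.0-rc.0-x86_64
for x in $(grep -lr '^kind: PrometheusRule' manifests); do
  yaml2json < "${x}" | jq -r '
    if (. | type) == "object" then [.] else . end
    | .[] | select(.kind == "PrometheusRule").spec.groups[].rules[]
    | select(.alert != null)
    | select(.labels.severity != "critical" and .labels.severity != "warning" and .labels.severity != "info")
    | (.labels.severity // "<missing>") + " " + .alert'
done | sort | uniq
```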
Do you actually need to test the alert condition duration or can this be sped up, eg, do I actually need to take out some machines to see if the alert would fire? If I do need to take them out, do I have to wait for the hour period?
In terms of testing for alerts, I was wondering if we could have some more static-analysis-style rules, but perhaps that's a bit naive at this level.
I would like folks' opinions on whether we need to resolve a test plan approach for alerts in the context of merging this enhancement.
Depending on what we agree about rule conventions and guidelines, we should translate as much as we can into origin e2e tests. E.g. there should be a test validating that all alert rules have a `severity` label matching the set of allowed values. Same goes for the `summary` and `description` annotations if we agree that they are mandatory.
Asking teams to write unit tests for all rules using `promtool` is going to have a small benefit IMHO. Unit tests are mostly useful when joining metrics from different sources, and I expect that most of the rules don't fall into this category.
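For the subset of rules where unit tests do make sense, `promtool` already supports them; a minimal sketch, assuming the `spec.groups` content of a PrometheusRule has been exported to a plain Prometheus rule file (the file names are hypothetical):

```
# Validate rule file syntax (the rule file must be plain Prometheus
# rule format, i.e. the PrometheusRule spec.groups content).
promtool check rules my-operator.rules.yaml

# Run unit tests that feed synthetic series and assert which alerts fire.
promtool test rules my-operator.rules.test.yaml
```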
I would like a test plan, because policies that are not backed up by automated tests are hard to preserve. Doesn't mean we need to have the tests implemented before we can start pushing towards new policy, but we should at least have an idea of what the tests will look like.
New e2e tests in openshift/origin:
- Test validating that all alerting rules have a severity label matching the set of allowed values.
- Test validating that all alerting rules have summary and description annotations.
- Test validating that all runbook_url annotations point to an existing web page.
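A rough sketch of the `runbook_url` check, again piggybacking on the manifest-extraction pipeline quoted elsewhere in this thread; the `curl` probe is only illustrative:

```
# List every runbook_url annotation from the extracted manifests and
# verify that it resolves to an existing page.
for x in $(grep -lr '^kind: PrometheusRule' manifests); do
  yaml2json < "${x}" | jq -r '
    if (. | type) == "object" then [.] else . end
    | .[] | select(.kind == "PrometheusRule").spec.groups[].rules[]
    | select(.alert != null and .annotations.runbook_url != null)
    | .annotations.runbook_url'
done | sort | uniq | while read -r url; do
  curl -sSfL -o /dev/null "${url}" || echo "broken runbook link: ${url}"
done
```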
@simonpasquier link to these tests?
> ### Documentation Required
>
> Every alert needs to have a URL to relevant documentation to resolve the issue.
There is an RFE for this already, but it is not possible to do this right now due to disconnected envs, from the last time I spoke with the monitoring PM. This is essentially runbooks, no?
I don't know what disconnected has to do with having a URL? Even if the cluster is in an isolated environment, that doesn't mean the person looking at the alerts is. In the worst case, someone can just copy and paste the URL.
This was the issue at the time; it also was a problem due to a lot of alerts coming from upstream mixins, which means URLs would be different. There should be a ticket for runbooks on the monitoring board with more details on this; it's being worked on, or at least it was last I heard. Best to contact Christian, our PM, for more details here.
We shouldn't use upstream alert mixins as part of the core alerting. How many alerts are we even talking about? Can't be more than a few dozen, probably some of which we don't want or need. We need to trim down the superfluous stuff.
All alerts for prometheus, alertmanager, prometheus-operator, kubernetes, etcd, and the majority for the k8s api, k8s scheduler, and k8s controller manager. All in all, around 150 alerts and recording rules. The majority of alerts in OpenShift ARE coming from upstream, and this is aligned with the "upstream first" mentality.
We should use alerts from upstream for the same reason we use kubernetes as a base. We shouldn't invent everything ourselves.
IMO, this is reactive and not proactive. There's a reason most projects in upstream are alpha/beta statuses. We should selectively choose and groom the alerts we care about to deliver the best experience to our users. If we're just bulk-importing the alerting rules, it means we (probably) don't have resolution docs to go with them.
#298 has me poking around in this space for conditions, since we have more control over there. I like the link idea, and they don't have to be external. We can have alert messages link to local, console-hosted docs or UXes that help unpack the issue being reported in more context, like this. And for a paragraph-or-two level of discussion on likely impact and suggested recovery procedures, I don't see why we couldn't either grow out the alert message or attach a new field/annotation to alerts to carry that information, even for alerts defined in upstream mixins.
> IMO, this is reactive and not proactive.
All alerts I mentioned are created upstream in projects largely managed by OpenShift engineers. Most of them are tweaked because of OpenShift. I don't see how this is reactive.
Just because something exists upstream doesn't mean it's good and doesn't mean it belongs downstream. We should bring in the quality pieces and leave the rest.
Here's an example: "KubePodNotReady"
That is a warning level alert. I'm assuming that is coming from upstream. Is having this alert helping many people, or is it contributing to the noise part of signal-to-noise?
A better alert would be for the specific application (a user-created alert) or for a trend, e.g. if the number of non-ready pods is increasing over time.
In the SRE team's orthogonal effort to reduce alert noise (which has been ongoing for many months and predates this PR, yadda yadda), we will be forced to improve the upstream alerts. That is a circular workflow that we worked out with the monitoring team. The SRE team knows what to do when something like this comes up.
/hold
I see the intention here; however, some things in that regard are already in flight, and most of this is not really defined very well so far, hence the comments.
On a little side note: calling the alerts as a whole "a mess" doesn't really do justice to the amount of work that people have put into them, and is not really a formulation that should be used in an OEP, and rarely ever when talking about other people's work.
> ## Alternatives
>
> We'll just mark everything critical so users can guess what is an urgent
Not sure an OEP is the right place for sarcasm
> ### Open Questions [optional]
>
> Frankly, having just "Critical" and "Warning" is not helpful.
`Info` exists and is encouraged to be used.
Personally, I think "Info" is not something that should be in 'alerts.' We need a different channel for informational messaging.
The community isn't fully decided on the way to go; however, with us inheriting a major set of OCP alerts from https://github.com/kubernetes-monitoring/kubernetes-mixin, it makes sense to obey the alert severities defined there.
Which alerts are we inheriting?
> Every alert needs to have a URL to relevant documentation to resolve the issue.
>
> If there's not documentation, there shouldn't be an alert.
Suggested change: "If there's not documentation, there shouldn't be an alert." → "If there's no documentation, there shouldn't be an alert."
This sentence re-states the previous sentence. For emphasis, I guess? Drop it
> Every alert needs to have a URL to relevant documentation to resolve the issue.
>
> If there's not documentation, there shouldn't be an alert.
That's in fact true; hence we're bootstrapping github.com/openshift/runbooks.
> #### Example 3: I have an operator that only needs 1 replica to function
>
> Well, if the cluster can upgrade with only 1 replica, and the the service is
Suggested change: "Well, if the cluster can upgrade with only 1 replica, and the the service is" → "If the cluster can upgrade with only 1 replica, and the service is"
> if not immediately addressed or if the cluster is already in a critical state.
>
> Some critical states are:
> * loss or impending loss of etcd-quorum.
That's a very special case and doesn't really fit the bigger picture of "cluster unhealthy".
I agree that etcd-quorum loss is a special case, that's why it should be critical. If it's something similar to impact to etcd-quorum loss, it should be critical. If it's dissimilar, it should not be critical.
> were to restart or need to be rescheduled, it would not able to start on another
> host
>
> In otherwords, critical alerts are something that require someone to get out
Suggested change: "In otherwords, critical alerts are something that require someone to get out" → "In other words, critical alerts are something that require someone to get out"
> In otherwords, critical alerts are something that require someone to get out
> of bed in the middle of the night and fix something **right now** or they will
> be faced with a disaster.
clarify "disaster"
Something that requires use of disaster recovery docs.
Clarify in the proposal if possible please
> be faced with a disaster.
>
> An example of something that is NOT a critical alert:
> * MCO and/or related components are completely dead/crash looping.
That's to say that the MCO itself is not relevant to the cluster's health, which is not true for all cluster setups.
The MCO being down does not create a disaster situation. The MCO does not need to be alive to run existing workloads or to ensure the API is available (schedule new/replacement workloads), and it being down does not jeopardize etcd-quorum.
The MCO might need to be alive to run existing workloads. For example, if your pull-secret needs rotating, or your Kube-API CA is getting rolled, etc., those are things that get pushed out to nodes via the machine-config operator. And if that isn't getting pushed out, the nodes may die. And possibly be destroyed. And new nodes (replacements or autoscaling increases) will not come up without sufficient machine-config stack health. So lots of vulnerabilities. But vulnerabilities are not necessarily page-at-midnight until the lack of capacity or stale-pull-secret or other fallout is actually biting you. And "core MCO operator is dead" shouldn't happen all that often anyway. Is this really something where noise is a problem? Or are you extrapolating from a different machine-config alert that you do find too noisy?
It sounds like in this case there should be independent monitoring of this particular situation for kube-api CA's getting rolled. So, the alert would be something along the lines of "CA not synced && MCO dead".
"&& MCO dead" doesn't sound like something the CA-generator should be taught to care about. But kubelets could provide metrics about remaining CA time, and there could be a new alert about those expiration times dropping under a given threshold? I think it's fine to leave it to the user or local Insights-style rules to consolidate a collection of distinct alerts into diagnoses. There is a distinction between "there is a problem" (alerts) and "I have a theory for the root cause" (diagnoses), and if we attempt to only target diagnoses with alerts, we risk silence when there is some new failure mode or presentation that doesn't match an existing diagnosis fingerprint.
I think this is the wrong example. The MCO for sure is on the critical path, and it should alert appropriately as a critical component for all the reasons above, and because it is the actioner of a lot of control-plane actions. And alerting in the kube manager with "Cert rotation stuck" without telling me where to look is almost useless. But alerting in the kube manager with "Cert rotation stuck. Rotation path: Manager -> MCO -> node" gives me a hint about what to check and which alerts to look for next. So if I don't see an MCO alert in critical, I know it works fine and go to the next in line.
Every component and alert needs to understand its dependency flow in the system overall.
s/MCO/samples operator/
and your example should be good.
I disagree. If cert rotation is the thing that will drag the cluster down, then we need to set an alert for that specifically. We also aren't (shouldn't be?) waiting until the last possible minute to do cert rotation; we have plenty of time to take action before the certs expire, on the order of weeks IIRC. So, if you have an MCO that goes down, it's not immediately critical. Yes, if you completely neglect your cluster for weeks on end, it will fail, but the MCO itself going down is not going to prevent application communication.
> * Define clear criteria for designating an alert 'critical'
> * Define clear criteria for designating an alert 'warning'
> * Define minimum time thresholds before prior to triggering an alert
That's unrealistic. Certain conditions can't be recovered from on their own and/or need instant action, because in a high number of cases the component doesn't recover without interaction. There doesn't need to be an arbitrary wait time.
Generally, arbitrary wait times are hard to define all over OpenShift.
I disagree. This is about consistency. The point of this document is to discuss what we think alerting and related elements should look like. There's no reason to not define a standard set of criteria. Anyone that thinks they have a special use case is probably wrong.
this seems like it would be easy to audit. For example, alerts installed via the CVO that do not set `for`:

```
$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.7.0-rc.0-x86_64
$ for x in $(grep -lr '^kind: PrometheusRule' manifests); do yaml2json < "${x}" | jq -r 'if (. | type) == "object" then [.] else . end | .[] | select(.kind == "PrometheusRule").spec.groups[].rules[] | select(.alert != null and .for == null) | .labels.severity + " " + .alert + " " + .expr'; done | sort | uniq
critical MCDRebootError mcd_reboot_err > 0
warning FailingOperator csv_abnormal{phase="Failed"}
warning ImageRegistryStorageReconfigured increase(image_registry_operator_storage_reconfigured_total[30m]) > 0
warning KubeletHealthState mcd_kubelet_state > 2
warning MCDPivotError mcd_pivot_err > 0
warning SystemMemoryExceedsReservation sum by (node) (container_memory_rss{id="/system.slice"}) > ((sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"})) * 0.9)
```
How do we feel about those?
"SystemMemoryExceedsReservation" is definitely one that needs a threshold. I have some questions about the utility of this particular alert, but that's outside the scope of this discussion.
`SystemMemoryExceedsReservation` is here from openshift/machine-config-operator#2033, if you want to follow up on that alert.
pretty much any alert that spuriously fires (in the sense that no action is needed, and it's going to be fine once the upgrade finishes) during upgrade today is a candidate for "this alert needs to wait longer before firing" imho.
i'm also very interested to see examples where the time it takes an admin to see, process, and take action to resolve an alert doesn't significantly dominate the time between the condition happening and the alert firing. That is to say:
if it takes an admin 30 minutes to fix something they were alerted to, then does it matter whether we alerted them 0 minutes or 5 minutes after it started happening? Particularly if by waiting 5 minutes we may be able to avoid alerting them at all (because the situation does resolve itself, such as in the case of most alerts caused by node reboots).
I tried to capture some timeline standards in the individual alerting sections. I would say, for critical alerts, they can probably be added to a 0m, 5m or 15m bucket.
0m bucket: etcd quorum is lost, we don't know how long this cluster might be alive so fire right away and hope for the best.
5m bucket: etcd quorum member lost (1 of 3). If it's a small network blip, hopefully this is covered by etcd's internal timeout, and 5 minutes should hopefully be enough to recover.
15m bucket: Pretty much all other critical alerts, like 1/3 healthy API servers.
Back to the upgrade bit, we shouldn't fire any alerts during upgrade; however, I don't think timing is the core issue, at least for the critical alerts. The vast majority of alerts could be Warning or lower severity, with a 60m timeout. This achieves a number of things. First, it gives time for the situation to self-correct, or for a helper like MachineHealthChecks to remediate a problem. Second, it allows us to know that it's probably not going to self-correct. Third, it keeps your alert screen from having tons of alerts at once; especially if there are critical alerts firing, we don't need the extra distractions.
Many of our alerts today are 'raw' alerts, IMO. Things like 9/10 daemonsets are running for MCD. Does this need an alert? It would be nice if we could aggregate some things into node-specific issues. E.g., if sdn-daemon is down on a given node, we should have a 'NodeUnhealthy' alert, and aggregate some conditions onto that screen. I'm not sure how feasible this is today. At the very least, each component's operator could detect the specifics about its operands and a given node. For instance, if a given node is under maintenance, and the daemonset is missing, that's expected. The operator is the layer we can utilize to understand this state.
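To make the bucket idea auditable, here is a rough sketch (reusing the extraction pipeline from the `for` audit above) that prints each alert's severity alongside its `for` duration, so rules could be spot-checked against whatever thresholds we agree on:

```
# Print severity, alert name and `for` duration for every alerting rule
# in the extracted release manifests; a missing `for` shows up as "none".
for x in $(grep -lr '^kind: PrometheusRule' manifests); do
  yaml2json < "${x}" | jq -r '
    if (. | type) == "object" then [.] else . end
    | .[] | select(.kind == "PrometheusRule").spec.groups[].rules[]
    | select(.alert != null)
    | (.labels.severity // "none") + " " + .alert + " " + (.for // "none")'
done | sort | uniq
```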
I do think we need to provide numbers as guidance, if only from a hygiene standpoint. Otherwise we're not going to make progress on standardizing, which is the purpose of this PR. Even if those numbers need tweaking in the end, or if they don't make sense on a particular alert, we can't reasonably ask a component team to write alerts without this basic guidance. This whole process is iterative and we have to start with something.
Isn't the wording on this a bit doubled now?
The alerts are a mess. This isn't a reflection of any particular person or team's effort, it's because all of our teams are on different pages about alert levels, documentation, and having a consistent UX. I've heard multiple conversations along the lines of "What alerts can I ignore for upgrades?" and that is a huge problem. I want to fix this problem, I think other people want to fix this problem. I haven't seen any process improvement to address this problem.
> ## Summary
>
> Alerts are a mess. There's not really formal guidelines to follow when
These docs are supposed to be aspirational, right? I think this is probably not what we aspire to have. :-)
I think you're right, probably Summary should be more future oriented, thanks!
> ### Goals
>
> * Define clear criteria for designating an alert 'critical'
> * Define clear criteria for designating an alert 'warning'
Do we only care about those 2 levels, or is the goal to describe criteria for all of the levels?
You tell me. I just wanted to get the conversation started. I would prefer to have 3 alert levels, "Critical" "Major" and "Minor". Since that ship might have already sailed, I am focusing on what we have today.
Other levels are mentioned in the text below, so I wasn't sure if you were just focusing on the top 2 as a start or if there was another reason they weren't included here. I'd be OK with starting by nailing down the criteria for critical and warning and seeing where that takes us.
Same here. It is listed as TBD below. Not a blocker for initial merge of this IMO.
> ### Critical Alerts
>
> TL/DR: Fix this now or cluster dies.
Are we talking just about the cluster falling over, or would an alert about a situation that might lead to data loss without the cluster halting also qualify?
I think the TLDR doesn't capture all the situations we might need a critical alert, I expanded upon some of these below.
If the purpose of this doc is to provide guidance, I think it would help to phrase these sections more imperatively, as instructions to someone designing or implementing an alert. So here we might say
Reserve critical level alerts only for reporting conditions that may lead to loss of data or inability to deliver service for the cluster as a whole. Failures of most individual components should not trigger critical level alerts, unless they would result in either of those conditions. Configure critical level alerts so they fire before the situation becomes irrecoverable. Expect users to be notified of a critical alert within a short period of time after it fires so they can respond with corrective action quickly.
and then below expand on that with the sorts of details that you have already with examples.
> ### Warning Alerts
>
> TL/DR: fix this soon, or some things won't work and upgrades will probably be
> blocked.
Taking a similar stab at rephrasing this section:
Use warning level alerts for reporting conditions that may lead to inability to deliver individual features of the cluster, but not service for the cluster as a whole. Most alerts are likely to be warnings. Configure warning level alerts so that they do not fire until components have sufficient time to try to recover from the interruption automatically. Expect users to be notified of a warning, but for them not to respond with corrective action immediately.
Great wording here, thanks!
agree, thanks doug
We (the monitoring team) took the available materials and guidelines published by the monitoring community, materials that were created over the years with the cooperation of administrators/SRE people who actually use monitoring on a day-to-day basis. That data was then converted into a blog post, which should be published soon and is meant to provide guidance on alert creation for OpenShift internal stakeholders as well as OpenShift customers. AFAIK it is also aligned with SREP internal practices. All this means we already have materials on how alerts should be created, what they should look like, and what severities should be used. The matter is how to enforce those practices, not what practices should be created. This document is lacking the critical aspect in the form of the HOW question.
We need an easy-to-consume, 1-page, simple document for OpenShift developers to create alerts. We need a document that is about OpenShift alerts, and not alerts generally. In the context of OpenShift, we need to determine what types of scenarios map to which alert levels. Since the monitoring team doesn't own all of the alerts, it needs to be consumable by all the different teams. While blog posts and similar content are valuable, we need something formalized into developer documentation.
Here you go, OpenShift docs for 4.6, section
When it comes to monitoring and alerting, OpenShift is not different from any other microservice environment, hence our blog post as well as the documents listed earlier by other folks from the monitoring team still hold.
I agree that we need docs. But I would agree more on the enforcement of such docs, and this again goes into the HOW question. If you have ideas on HOW, then I am all ears; if not, then I have no idea what this OEP is for, as things are already documented. It is just that this documentation needs to be read and applied.
I find the wording pretty suspect for "Warning":
This sentence alone demonstrates the confusion. Warning notifications should require attention to correct a problem. If it "might require attention" to "prevent a problem" then it shouldn't be a warning level alert at all. This leads to a sea of un-actionable alarms for administrators.
I fundamentally disagree with this statement, and this is sort of the problem. We need to be thinking about monitoring the platform holistically rather than in small pieces, at least when it comes to critical and warning alerts. This is why we have a bad signal-to-noise ratio. In any case, the nature of the enhancement process is to enhance. So even if we have these things in some current form, this enhancement allows us to discuss them and decide if the process is working, and if not, how we should change it. We have evidence that alerting in its current state is flawed.
This enhancement doesn't allow us to do that very well, since the author didn't put a lot, if any, research into the in-cluster monitoring stack. A lot of things are based on the author's personal assumptions and opinions. The author didn't make any efforts up front to find out about potential initiatives that are already started or in progress.
This only works if the PR author is open to discussion, which involves accepting opinions other than their own.
It would make sense to present this for context. IMHO the best way would be to close this PR, conduct the aforementioned research, and then open a new one.
I'm sorry you feel negatively about this enhancement. I opened this enhancement to get the discussion started, not to say I have all the right ideas. You are welcome to open your own enhancement to counter this one, or we can work on this one collaboratively. This enhancement is a place to share ideas and opinions. I don't prefer how we're managing alerts today, and I want to address it.
> * loss or impending loss of etcd-quorum.
> * inability to route application traffic externally or internally
> (dataplane disruption)
> * inability to start any new pod other than capacity. EG, if the API server
"other than capacity" is not very clear. Are you suggesting that pods waiting on capacity are not critical? Or that that's the only situation in which pods failing to start are critical? Personally, I don't think "pods failing to start" would ever be critical. One fewer API-server pod is a bummer, but the other two will be able to handle the necessary traffic. Similarly for other components. A critical alert is "we are down to one, surviving API-server pod. Wake up, Admin, and save us before the cluster goes dark!" or "the auth component is Available=False, so users can't log in".
I'm talking about a cluster-wide inability to start a new pod due to some degraded condition other than capacity. Capacity meaning you don't have RAM/CPU to run more pods.
Said another way, if a currently running pod were to fail, could it be restarted and become ready on this host or another? One example might be SDN is totally down.
Rolling into #637 (comment), which is also quibbling about this line. Feel free to mark this thread resolved.
> 60 minutes? That seems high! That's because it is high. We want to reduce
> the noise. We've done a lot to make clusters and operators auto-heal
> themselves. The whole idea is that if a condition has persisted for more than
Upstream alert levels from here:
- Critical: An issue, that needs to page a person to take instant action
- Warning: An issue, that needs to be worked on but in the regular work queue or for during office hours rather than paging the oncall
- Info: Is meant to support a trouble shooting process by informing about a non-normal situation for one or more systems but not worth a page or ticket on its own.
For warnings, I'd be happier with "if you are very confident that this will not auto-heal, even if it hasn't been 60m". An example would be `AlertmanagerFailedReload`, which, as I read it, will fire after 15m without a successful load. The expectation is that someone's fumbled the config, and only an admin correcting the config will recover things. I'd personally be happier if the alert manager copied verified configs over into some private location, so "admin fumbles a config touch" didn't leave you exposed to losing all alert manager functionality if the pod restarted. The fact that config fumbles leave you vulnerable to alert-manager loss today, and the alert manager being the thing paging you at midnight for other issues, makes the current `critical` there make sense to me. If the alert manager did have valid-config caching that survived pod restarts/reschedules, I'd rather have the alert be a warning, so no midnight page, but I'd still be comfortable waiting only 15 minutes before firing, because you want the admin to come back and fix the config they broke before they clock out for the night.
In practice, `AlertmanagerFailedReload` would only fire if someone touches the Alertmanager configuration, which means that there's someone already awake. It might change when we allow users to manage their own `AlertmanagerConfig` resources, but the operator should be able to validate that the configs aren't broken.

> if the alert manager copied verified configs over into some private location

In principle yes, but given the current design of Alertmanager, it isn't trivial. As stated above, the Prometheus operator should take care of checking that at least the configuration is syntactically valid.
> ...only fire if someone touches the Alertmanager configuration which means that there's someone already awake.

I used that same argument internally when I tried to argue that `CannotRetrieveUpdates` made sense at critical, because it was firing because OSD tooling had set a bogus `spec.channel` leading to `VersionNotFound`. It still got demoted to `warning` in openshift/cluster-version-operator#509. Assuming that an awake human is the one touching your config is apparently a leaky argument ;).

> In principle yes but given the current design of Alertmanager, it isn't trivial...

The way it works today doesn't seem to be bothering folks too often, and it makes sense to have "lots of work to solidify guards for a rare situation" be lower priority than lots of other things.
`CannotRetrieveUpdates` is a bit different IMHO since it can probably wait until tomorrow morning (unless I'm wrong about the scope). If the Alertmanager config is broken, you're only one step away from losing all alert notifications in case all your Alertmanager pods get restarted.
But I don't think we need to argue about this specific alert here, and I'd be fine if someone wants it to be demoted to "warning" because it causes them pain :)
Is alertManager rolled out and gated on readiness during changes? Seems like catastrophe is already being prevented here.
> ...can probably wait until tomorrow morning (unless I'm wrong about the scope)...
Depends on the alerts you're missing. If there are no available updates, not being able to fetch them doesn't matter. If someone is DoSing your network to keep you from hearing about a critical CVE update, then not hearing about them means attackers have longer to exploit the flaw before you wake up and start working the update-retrieval issue.
> Is alertManager rolled out and gated on readiness during changes?
You are currently safe as long as there is an alertmanager around which was running the old config. But that's not a resilient position. For example, someone bumps a MachineConfig and the machine-config operator rolls your compute pool, and now your Alertmanagers are all dead. Maybe PDBs protect you from some of that? But OSD allows customers to configure limited respect for PDBs on the ~hours scale, and sometimes machines go down hard, without allowing time for graceful PDB guards.
> That seems significant
> and it probably is, but nobody has to get out of bed to fix this **right now**.
> But, what if a machine dies, I won't be able to get a replacement? Yeah, that
> is probably true, but how often does that happen? Also, that's an entirely
"doesn't happen often" is not an argument for "not critical". The argument for this not being critical is that there is no risk of data-loss or customer impact inherent in a machine dying, or failing to provision a new one, or rolling out new secrets to nodes, or all the other things that the machine-config does. The critical impact would be "my customer-facing workload is failing to schedule because I have no capacity" or "I am on the edge of etcd-quorum loss because I cannot revive my third control-plane node". Those are critical. Maybe they are due to the machine-config stack being sad, so it's good to have warning-level machine-config alerts. But I don't think that they need to be critical-level machine-config alerts. But "happens rarely" is an argument for "even if the severity is wrong, fixing it is not all that important", not an argument for what the severity should be.
"my customer-facing workload is failing to schedule because I have no capacity"
No, we can't own this. Users should have their own monitoring for apps that fail to schedule. There are all sorts of capacity user stories, and users need to apply careful consideration in this area.
The machine-api does not replace failed instances anyway; MHC does that.
I disagree that frequency is not a component of severity. If clusters routinely lost 1 instance an hour (in some hypothetical reality... there's a joke to be made about a cloud provider in here lol) just as a matter of course, then having a functioning machine-api would be critical.
Users should have their own monitoring for apps that fail to schedule.
User workloads should have "failed to schedule" monitoring so they can stay under the current caps and quotas. But cluster admins should have "limited capacity is impacting user workloads" monitoring so they can decide when they want to grow quota. This is the same as the autoscaler, which definitely cares about user workload capacity today, despite being managed by cluster admins. The alerts would fill the manually-scaled use case, and also cover the "autoscaler is maxed / dead" use case.
The machine-api does not replace failed instances anyway, MHC does that.
I thought the MHC killed dead/degraded machines/nodes, but that the machine API then provisioned the replacement Machine to get the MachineSet back up to its `spec.replicas`. If you have an MHC on a machine/node that is not part of a MachineSet, is a replacement created? I'd have expected no replacement to be created.
If clusters routinely lost 1 instance an hour (in some hypothetical reality... then having a function machine-api would be critical.
But it would not be a `critical` alert. Losing your customer-facing workloads or living on the edge of etcd quorum loss would be a `critical` alert. And then there would be `machine API is dead`, `high machine death rate`, and `unable to schedule many pods due to limited overall cluster capacity` `warning` alerts to help point the responding admin at likely root causes.
"limited capacity is impacting user workloads" monitoring so they can decide when they want to grow quota
This assumes having pending pods is a problem. Some clusters might schedule very large batches of jobs; having limited capacity is not a problem for them, they just want to chew through the jobs. Like I said, there are all sorts of capacity user stories. A cluster can run indefinitely with 0 capacity.
If you have a MHC on a machine/node
I can't recall MHC specifics, but it's not intended to delete machines that aren't part of a machineset today. The machineset creates the replacement machine, but in effect you can say that the MHC replaces the failed machine.
Clear and actionable alerts are a key component of a smooth operational
experience. Ensuring we have clear and concise guidelines for our alerts
will allow developers to better inform users of problem situations and how to
resolve them.
+1, I really like the idea of creating consistent guidelines for not only the alerts, but the documentation to support them. I am also hugely in favor of bringing the focus of generating those artifacts closer to the project development.
* Define ownership and responsibilities of existing and new alerts
* Establish clear documentation guidelines
* Outline operational triage of firing alerts
* Explore additional alerting severity (e.g., 'minor')
Chime in here if you're for or against a new alert level. My feeling is 'info' doesn't seem much like an alert (indeed, we include stuff like 'cluster update available' in this level) and we could better communicate a problem with something like 'minor'. For as long as I can remember, monitoring systems other than ours have had at least 3 levels of alerts.
I think it makes sense to have multiple levels (info, warning, critical) if we plan to use alerts to give signal about upcoming problems, e.g., at x minutes you get info, at 5x minutes you get warning. A use case I can think of for this is certificate expiry: if we expect certificates to exist for 30 days and to be renewed with 7 days left, you may want to fire an info alert once it's past the renewal period, a warning halfway through the remaining window, and a critical when it gets close to expiry (see the sketch after this list).
In this case, to a user this means:
- Info: Something hasn't happened that should have, but it might fix itself, so no need to be concerned yet
- Warning: This isn't breaking your system, but it has failed multiple times and probably needs human intervention; don't wake up, fix it tomorrow during normal working hours
- Critical: Your system is now broken or will imminently be broken, and therefore human intervention is required now
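As a rough illustration of that tiering, this could translate into rules along the following lines (a sketch only: `my_cert_expiry_timestamp_seconds` is a hypothetical gauge holding the certificate's expiry time as a Unix timestamp, and the thresholds assume the 30-day/7-day lifecycle above):

```yaml
groups:
  - name: certificate-expiry-example
    rules:
      - alert: CertificateRenewalOverdue
        # Renewal should already have happened (7 days left); nothing is broken yet.
        expr: my_cert_expiry_timestamp_seconds - time() < 7 * 24 * 3600
        labels:
          severity: info
      - alert: CertificateRenewalLate
        # Half of the remaining window is gone; needs a human, but not at 3am.
        expr: my_cert_expiry_timestamp_seconds - time() < 3.5 * 24 * 3600
        labels:
          severity: warning
      - alert: CertificateExpiryImminent
        # Expiry is about a day away; intervene now.
        expr: my_cert_expiry_timestamp_seconds - time() < 24 * 3600
        labels:
          severity: critical
```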
I don't have a strong feeling about what is a better experience for users. I am willing to extend trust to those who have more direct experience in this area. I would like to hear whether the proposed 4 levels would be more helpful to the people who are operating OpenShift, and would like to see some SRE folks chime in here (sorry if you have and I didn't recognize it).
My thoughts about the various levels:
- info: informational only. I feel like even calling this an "alert" is strongly worded; this sounds more like "exposed log records". But this is valuable, as it can save users time from looking at actual logs, or having to dig for information like updates being available.
- warning: something bad is happening and needs to be fixed. It will not crash the cluster immediately but should be investigated, as it may lead to cluster instability.
- critical: wake someone up, the cluster is dead or dying. Should be self-explanatory.
To me, "minor" would mean "this is a problem, but it does not need to be prioritized for investigation and will never lead to a cluster disruption". I can see the value in having something at that level, I'm just not sure how this fits into an operational team's experience. @michaelgugino's reasoning for having the extra level makes sense to me; I can see the value in adding a "minor" level alert.
Hope that helps, just my two cents =)
I suggest we scope an additional level out of this enhancement and follow up later.
From an SRE perspective, there are basically only two levels:
- paging,
- everything else.
Paging should be criticals, so "something's likely very bad and needs attention asap". If there's a repeated situation where the criticals are unactionable, or multiple criticals are produced for a single underlying common issue (and therefore multiple pages), those alerts need review.
For the "everything else" bucket… that's a whole doc's worth of considerations. Basically, if it doesn't indicate an underlying issue to be fixed by a human a high percentage of the time, it should be out.
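For what it's worth, that two-bucket model maps directly onto Alertmanager routing on the `severity` label. A minimal sketch (receiver names and integrations are placeholders, not what any managed fleet actually ships):

```yaml
route:
  receiver: default
  routes:
    # Only critical alerts page a human.
    - match:
        severity: critical
      receiver: pager
    # Everything else lands in a queue reviewed during working hours.
    - match_re:
        severity: warning|info
      receiver: ticket-queue

receivers:
  - name: default
  - name: pager
    pagerduty_configs:
      - service_key: '<secret>'
  - name: ticket-queue
    webhook_configs:
      - url: 'https://ticketing.example.com/hooks/alertmanager'
```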
A: It's a warning level alert.

### Alert Ownership
I'm strongly in favor of this aspect of the proposal. If the proposal remains contentious/unmerged overall, I'd like to see this aspect split out. I bet the monitoring team has an interest in helping this along.
This is indeed the only way to quickly improve and have product alerting be sustainable into the future. However, I have not seen an explicit ack on this from OCP engineering at a high enough level, with documentation and sign-off.
Spreading out alert ownership without clear and simple guidelines on how to create alerts might lead to a lot of headaches for alert consumers down the line.
much effort by each component team. As we mature our alerting system, we should
ensure all teams take ownership of alerts for their individual components.

Going forward, teams will be expected to own alerting rules within their own
We should enumerate the ownership responsibilities, including
- first in line to receive the bug for an alert firing
- responsible for describing, "what does it mean" and "what does it do"
- responsible for choosing the level of alert
- responsible for deciding if the alert even matters
I think we could get some improvement by categorizing alerts by owner so we can push for better refinement, but that could be a later step.
👍 totally agree with what David wrote, assuming we have clear guidelines about the critical/warning/info alert levels described above in this document.
As I said earlier, I don't think writing this in this PR constitutes active acceptance and acknowledgement by the stakeholders. Want to avoid surprising folks with new responsibilities they had no heads-up about.
* etcd corruption
* inability to route application traffic externally or internally
  (data plane disruption)
* inability to start/restart any pod for reasons other than capacity. E.g., if the API server
Does this imply that the kube-apiserver is considered critical? Perhaps we should enumerate those binaries which are "special".
Anything that can affect an SLA is critical. kube-apiserver availability is the only SLA our managed product is contracted to uphold. However, I don't think every alert for kube-apiserver should be set to critical.
Anything that can affect an SLO (which are often company- and SRE-team-specific) should be a warning, because SLO thresholds will be set more aggressively than the SLA that they support.
Collapsing discussion from this earlier thread about this same line, I still don't understand the capacity carve-out. If a pod failing to schedule is a problem, then the reason the pod failed to schedule helps admins figure out how to resolve/mitigate, but the problem exists regardless of the reason. E.g. "I have no API-server pods, and am failing to schedule more" is a disaster. If the reason is "because I'm out of CPU", it's still a disaster.
E.g. "I have no API-server pods, and am failing to schedule more" is a disaster. If the reason is "because I'm out of CPU", it's still a disaster.
I think this example is contrary to symptom-based alerting. Also, the API-server pods have a very high priority, so capacity should never be an issue on a properly (e.g., default or larger) sized node.
Pods failing to schedule due to lack of capacity is not a problem. Sometimes clusters fill up, and sometimes this is desired. One use case: run 1 million copies of pod X, low priority, preemptible.
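To make the "low priority, preemptible" part concrete, such batch pods would typically reference a PriorityClass along these lines (an illustrative sketch; the class name and value are made up):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: -100            # well below the default of 0, so these pods are preempted first
globalDefault: false
description: Low-priority, preemptible batch work; pods pending on capacity are expected here.
```

Pods that set `priorityClassName: batch-low` can sit Pending until capacity frees up without that being an operational problem, which is exactly why a blanket "pods are unschedulable" critical alert doesn't work.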
---
title: alerting-consistency
Yeah, include a bulleted list of references in the summary section. The Monitoring section in the Google SRE book includes the majority of the guidance to support readers of this document. https://sre.google/sre-book/monitoring-distributed-systems/
### Goals

* Define clear criteria for designating an alert 'critical'
* Define clear criteria for designating an alert 'warning'
Same here. It is listed as TBD below. Not a blocker for initial merge of this IMO.
* Define clear criteria for designating an alert 'critical'
* Define clear criteria for designating an alert 'warning'
* Define minimum time thresholds prior to triggering an alert
I do think we need to provide numbers as guidance, if only from a hygiene standpoint. Otherwise we're not going to make progress on standardizing, which is the purpose of this PR. Even if those numbers need tweaking in the end, or if they don't make sense on a particular alert, we can't reasonably ask a component team to write alerts without this basic guidance. This whole process is iterative and we have to start with something.
* Define clear criteria for designating an alert 'warning'
* Define minimum time thresholds prior to triggering an alert
* Define ownership and responsibilities of existing and new alerts
* Establish clear documentation guidelines
I want to call out the runbooks repo here. The idea is that every alert comes with a runbook, and they're open source, part of the product code-base. https://github.com/openshift/runbooks.
We had some discussion with the support team on this and they ack'd use of the runbooks repo (the alternative suggested was KCS, but was decided against).
PLM strongly suggests the use of KCS vs. a GitHub repo, as it creates a more on-brand and controlled space (not influenced by the community).
* Define ownership and responsibilities of existing and new alerts
* Establish clear documentation guidelines
* Outline operational triage of firing alerts
* Explore additional alerting severity (e.g., 'minor')
I suggest we scope an additional level out of this enhancement and follow up later.
Sometimes an alert's impact is not obvious. We should state the impact of the
alert. Don't expect users to understand the context; we have too many moving
pieces.
Indeed. Context about why the alert exists and what was in the developer's head when they wrote it is SO IMPORTANT.
### Open Questions [optional]

Frankly, having just "Critical" and "Warning" is not helpful.
I would like this sentence added:
"The group of critical alerts should be small, very well defined, highly documented, polished and with a high bar set for entry. This includes a mandatory review of a proposed critical alert by the Red Hat SRE team."
Having this would be an important checkpoint for SRE, and one for which we already have a workflow in place to support you.
We really need "Critical", "Major", "Minor" and "Info." Can/Should we do this?

### Test Plan
I would like folks' opinions on whether we need to resolve a test plan approach for alerts in the context of merging this enhancement.
## Drawbacks

People might have rules around the existing broken alerts. They will have to
Indeed, and my experience so far is that teams are happy to have a 2nd set of eyes. So I hope this ends up being a non-issue.
## Alternatives

Document policies, but do not make any backwards-incompatible changes to the
existing alerts, and only apply the policies to new alerts.
Realizing you're listing this because it's in the template, but I feel compelled to say it is not an acceptable alternative from an SRE standpoint. We have to do two things: 1) improve existing noise, and 2) put in place a process (this doc) to ensure we don't regress.
* Define minimum time thresholds prior to triggering an alert
* Define ownership and responsibilities of existing and new alerts
* Establish clear documentation guidelines
* Outline operational triage of firing alerts
Does it cover runbooks as per @jeremyeder's comment above? If not, what is the intent?
We should pull this in as one of the major tenets of this enhancement. It's a major win from the SRE and general "alert consumer" PoV. It should be the responsibility of the alert author, so that it will be lifecycled along with the alert (e.g. runbook CRUD when alert CRUD).
of bed in the middle of the night and fix something **right now** or they will
be faced with a disaster.

An example of something that is NOT a critical alert:
Perhaps we need a less controversial example of an alert that shouldn't be critical? I can propose `ThanosQueryRangeLatencyHigh`: Thanos Query isn't critical to keeping user workloads running, so it shouldn't fire critical alerts.
I'd also like to see an example of an alert that is unambiguously critical (e.g. `etcdInsufficientMembers`).
@simonpasquier would you be able to supply the specific alert definitions in addition to the names, so that they can be included here as a reference?
Maybe PrometheusRemoteStorageFailures? The fact that stuff isn't getting pushed to telemeter is an internal problem for us, but customers don't really have a reason to care.
### Alert Ownership

Previously, the bulk of our alerting was handled directly by the monitoring
I've done the numbers and here is the current breakdown on a 4.7 cluster, which could be useful to add to the proposal as a datapoint.
- Alerts shipped by the cluster monitoring operator (CMO)
  - critical: 43
  - warning: 82
- Other alerts
  - critical: 14
  - warning: 35

Diving into the alerts shipped by CMO:
- alerts for etcd, Kubelet, API server, Kube controller and scheduler. They live in the CMO repository for historical reasons and the plan is to move them to their respective repositories.
  - etcd: 12 alerts
  - control plane: 12
  - kubelet: 6
- alerts for the monitoring components (Prometheus, Alertmanager, ...): 57 alerts
- alerts for node metrics (clock skew, filesystems filling up, ...): 17 alerts
- alerts based on kube-state-metrics metrics like `KubePodNotReady`, `KubePodCrashLooping`, ...: 25 alerts
@simonpasquier who owns that plan and where can I track it?
openshift/cluster-monitoring-operator#1076 for the control plane alerts and openshift/cluster-monitoring-operator#1085 for the etcd alerts. We need to synchronize with their respective teams for the handover.
enable component owners to control their alerts.

### Documentation Required
We need a clear specification and recommendations for writing alerting rules. Having a consistent format for alerting rules helps with alert triaging and notification dispatch. FWIW there are already guidelines in the monitoring-mixins repository that we should get inspiration from IMO.
- The name of the alerting rule should clearly identify the component impacted by the issue (for example `etcdInsufficientMembers` instead of `InsufficientMembers`, `MachineConfigDaemonDrainError` instead of `MCDDrainError`). It should be camel case, without whitespace, starting with a capital letter. The first part of the alert name should be the same for all alerts originating from the same component.
- Alerting rules should have a "severity" label whose value is either `info`, `warning` or `critical` (matching what we have today and staying out of the discussion of whether we want `minor` or not).
- Alerting rules should have a `description` annotation providing details about what is happening and how to resolve the issue.
- Alerting rules should have a `summary` annotation providing a high-level description (think of it as the first line of a commit message or email subject).
- If there's a runbook in https://github.com/openshift/runbooks, it should be linked in the `runbook_url` annotation.

Another guideline for alerts that work on gauge metrics: the rule expression needs to account for failed scrapes that can create false negatives. For example, say that `SamplesDegraded` (expressed as `openshift_samples_degraded_info == 1`) is firing and Prometheus fails to scrape the target; the alert would then resolve itself because the metric has been marked as stale. A better alerting query would be `max_over_time(openshift_samples_degraded_info[5m]) == 1` (see https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details).
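Pulling those conventions together, a rule that follows them might look roughly like this (a sketch only: the expression, `for` duration, annotation text and runbook path are illustrative, not the shipped cluster-samples-operator alert):

```yaml
groups:
  - name: openshift-samples-example
    rules:
      - alert: SamplesDegraded
        # max_over_time keeps the alert firing across a failed scrape instead of
        # letting the stale gauge silently resolve it.
        expr: max_over_time(openshift_samples_degraded_info[5m]) == 1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: Samples operator is degraded.
          description: The samples operator has been degraded for more than 30 minutes. Check the ClusterOperator status and the operator logs for details.
          runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/SamplesDegraded.md  # illustrative path
```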
This is fantastic - @michaelgugino can you pull this in?
We really need "Critical", "Major", "Minor" and "Info." Can/Should we do this?

### Test Plan
Depending on what we agree about rule conventions and guidelines, we should translate as much as we can into origin e2e tests. E.g. there should be a test validating that all alert rules have a `severity` label matching the set of allowed values. Same goes for the `summary` and `description` annotations if we agree that they are mandatory.
Asking teams to write unit tests for all rules using `promtool` is going to have a small benefit IMHO. Unit tests are mostly useful when joining metrics from different sources, and I expect that most of the rules don't fall in this category.
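For the rules where unit tests do pay off, the overhead is small. A minimal sketch of a `promtool test rules` file, assuming the illustrative `SamplesDegraded` rule above is saved as `samples-alerts.yaml`:

```yaml
# samples-alerts-test.yaml -- run with: promtool test rules samples-alerts-test.yaml
rule_files:
  - samples-alerts.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # The samples operator reports degraded for the whole hour.
      - series: 'openshift_samples_degraded_info'
        values: '1+0x60'
    alert_rule_test:
      - eval_time: 40m
        alertname: SamplesDegraded
        exp_alerts:
          - exp_labels:
              severity: warning
            exp_annotations:
              summary: Samples operator is degraded.
              description: The samples operator has been degraded for more than 30 minutes. Check the ClusterOperator status and the operator logs for details.
              runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/SamplesDegraded.md
```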
Well, if the cluster can upgrade with only 1 replica, and the service remains
available despite the other replicas being unavailable, this can probably
be just an info-level alert. We should have a "minor" alert, TBD. But for now,
List of `info` alerts in 4.7:
- MultipleContainersOOMKilled (Bug 1807139: jsonnet/rules.jsonnet: Lower threshold for MultipleContainersOOMKilled alert cluster-monitoring-operator#690 and https://bugzilla.redhat.com/show_bug.cgi?id=1807139)
- KubeQuotaAlmostFull (Split the alerts for quota usage reached and exceeded kubernetes-monitoring/kubernetes-mixin#494)
- KubeQuotaFullyUsed (KubeQuotaExceeded fires even if quota is not exceeded kubernetes-monitoring/kubernetes-mixin#441 and alerts/resource_alerts.libsonnet, runbook.md: Reduce KubeQuotaExceeded severity and adjust threshold kubernetes-monitoring/kubernetes-mixin#450)
- SamplesTBRInaccessibleOnBoot
- UpdateAvailable
TL;DR: we don't overuse this level and it serves a purpose (at least for the first 3 alerts from what I can tell).
that is probably true, but that doesn't make your alert "critical"; it just makes
it worthy of an alert.

### Warning Alerts
I propose to use `ClusterNotUpgradeable` as an example of a good warning alert.
@simonpasquier could you document and land an example origin e2e test in the actual CI system to point to?
@jeremyeder @michaelgugino Are we also thinking about incorporating https://monitoring.mixins.dev/#guidelines-for-alert-names-labels-and-annotations into the proposal? cc @paulfantom
Can we get a couple of review passes from folks? I see us starting to converge; don't want to lose momentum here.
I'd like to see something like this move ahead, but I would suggest being open to changes as we divide up ownership of existing alerts (there's a spreadsheet). Pinning something as a starting point (even as provisional) can help in that initial pass. As a starting point, this is pretty good.
I agree. On March 17 we had a discussion as to what was missing in this enhancement from the SRE standpoint, and those items have been included in the April 6 update that @michaelgugino pushed to this enhancement. /lgtm
1. Alerting rules should have a "severity" label whose value is either info,
warning or critical (matching what we have today and staying out of the
discussion of whether we want minor or not).
1. Alerting rules should have a description annotation providing details about
I would like to see a follow-up to this that adds examples that people can prescriptively follow, maybe with 1-3 cherry-picked examples of "good descriptions" including some guidance. Or we can put that in a "how to write alerts" doc and link it here.
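As one candidate for such a cherry-picked example, a description along these lines (illustrative wording, not copied from a shipped alert) states both what is happening and what the responder should do:

```yaml
annotations:
  summary: Alertmanager is failing to send notifications.
  description: >-
    Alertmanager {{ $labels.namespace }}/{{ $labels.pod }} failed to deliver
    {{ $value | humanizePercentage }} of notifications to {{ $labels.integration }}
    over the last 15 minutes. Notifications for firing alerts may be lost;
    check connectivity to the receiver and the Alertmanager logs.
```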
Squash please, fix the markdown, and if there are no other comments by Monday morning I will approve (thanks for all the hard work here, folks).
/approve Thanks for all the hard work on getting consensus here, I know this one was 🎆 and it's an important step forward.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: smarterclayton.
/unhold
Looks like the bot got you - can you add those sections?
/lgtm