monitoring: support configuration of alerting notifications by application owners #937
Conversation
Force-pushed the branch: ddab0db → 597cd13
> As an OpenShift cluster admin, I want to exclude certain user namespaces from
> modifying the Platform Alertmanager configuration so that I can recover in case
> of breakage or bad behavior.
@simonpasquier I don't really understand this story. How would excluding specific namespaces allow us to recover from a bad configuration?
Say that there's an `AlertmanagerConfig` resource in namespace `foo` breaking the Platform Alertmanager. Then the cluster admin can temporarily exclude that particular namespace from UWM to recover the situation and get in touch with the team that provisioned the "bad" config.
Got it, that makes sense. So it is essentially a reactive solution. Drifting off slightly into implementation, would they have any means other than logs to find out which namespace has broken the config?
I see this is covered later with the `AlertmanagerBadConfig` alert, somewhat at least. But maybe we can make some improvements to help identify the issue.
Hmmm I see this, but I wonder if we can have a more ambitious goal:
- As an OpenShift cluster admin, I expect Alertmanager NOT to crash or stop refreshing its configuration if only one of the `AlertmanagerConfig` resources is malformed/invalid.

WDYT?
is that even possible? (:
I think that is the goal from the prometheus-operator perspective, yes. We will discard the invalid CR and it will be excluded from the merge.
The operator can only do so much to ensure that it generates a valid Alertmanager configuration; there will always be a theoretical risk that the operator has a bug in certain edge cases. Still, I'll try to think about more proactive strategies.
> As a UWM admin, I want to send user alerts to the Platform Alertmanager
> cluster because UWM alerts are managed by an external system (off-cluster Alertmanager for
> instance).
If managed by an external system, why would we want to send to platform AM?
typo :)
I think this is clarified further down, but they would be able to (if desired) send to both, right? If using some external system, they could forward the alerts to both the platform and the external AM?
Yes, it's already possible today (if you configure `additionalAlertmanagerConfigs` for UWM, alerts will always go to the Platform Alertmanager). I'll add another user story to make it clear.
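For context, a minimal sketch of what such a UWM configuration could look like; the endpoint is a placeholder and the exact field layout should be double-checked against the OpenShift monitoring documentation:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    thanosRuler:
      additionalAlertmanagerConfigs:
      - scheme: https
        apiVersion: v2
        staticConfigs:
        # placeholder endpoint for an off-cluster Alertmanager
        - external-alertmanager.example.com:9093
```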
Force-pushed the branch: 597cd13 → 62b4146
> Alertmanager configuration.
>
> Mitigation
> * The Prometheus operator should expose a validating admission webhook that should prevent invalid configurations.
This would require that the webhook and the operator share the same RBAC to read cluster-wide secrets, correct?
The operator implements the webhook endpoint, so it should have the proper RBAC, but maybe it's better if the webhook doesn't try to resolve external references?
More a discussion for prometheus-operator/prometheus-operator#4338 though ;)
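Purely as an illustration of the discussion above, a hypothetical `ValidatingWebhookConfiguration` wired to an operator-served endpoint; the service name, namespace, and path are assumptions, not the actual prometheus-operator API:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: alertmanagerconfig-validation            # hypothetical name
webhooks:
- name: alertmanagerconfigs.monitoring.coreos.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Ignore                          # avoid blocking writes if the webhook is down
  rules:
  - apiGroups: ["monitoring.coreos.com"]
    apiVersions: ["v1alpha1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["alertmanagerconfigs"]
  clientConfig:
    service:
      name: prometheus-operator                  # assumed service exposing the webhook
      namespace: openshift-user-workload-monitoring
      path: /admission-alertmanagerconfigs/validate   # assumed path
```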
Amazing! Some suggestions, otherwise looking good 👍🏽
> ---
> title: alert-routing
This is great. I would propose naming this PR and title like Multi-Tenant Alert Routing on OpenShift 🔥
Suggested change: `title: alert-routing` → `title: multi-tenant-alert-routing`
> This document describes a solution that allows OpenShift users to route alert
> notifications without cluster admin intervention. It complements the existing
> [user-workload monitoring stack][uwm-docs], enabling a full self-service experience for
> workload monitoring.
Suggested change: "route alert notifications without cluster admin intervention" → "route alert notifications from shared Alertmanager without cluster admin intervention".
> cluster admins.
> * Cluster admins can grant permissions to users and groups to manage alert
>   routing scoped to individual namespaces.
> * Namespace owners should be able to opt-out from Alertmanager
👍🏽 That was my worry: what if someone created their own AM and wants to watch a couple of namespaces on their own? That will do the job.
Yes, we already have a solution in place for service/pod monitors and rules. For consistency it should also apply to `AlertmanagerConfig` resources.
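For reference, the opt-out mechanism mentioned here boils down to labeling the namespace; a sketch using the `openshift.io/user-monitoring` label:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: foo                                # example namespace opting out of UWM
  labels:
    openshift.io/user-monitoring: "false"  # excludes this namespace from user-workload monitoring
```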
> #### Story 3
>
> As an application owner, I want to know if my AlertmanagerConfig custom
👍🏽 Is this a case for DeadMansSnitch?
Not specifically. First of all, I want to know whether or not the Prometheus operator reconciled my AlertmanagerConfig resource (e.g. it won't if it contains invalid secret references): this would be covered by exposing a `Status` subresource in the CRD.
Secondly, it would be nice if I had an easy way to trigger a "fake" alert (drill mode) to validate my receivers.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: bwplotka.
Yeah, this all makes sense to me.
Would it make sense to map out the additional changes we'd need if we opt for the optional UWM Alertmanager cluster? That might inform the decision in the first place, e.g. would we still retain the option to exclude some namespaces, and what would the resource overhead look like?
> routing scoped to individual namespaces.
> * Namespace owners should be able to opt-out from Alertmanager
>   configuration (similar to what exist for service/pod monitors and rules using the
>   `"openshisft.io/user-monitoring: false"` label on the namespace).
`"openshisft.io/user-monitoring: false"` label on the namespace). | |
`"openshift.io/user-monitoring: false"` label on the namespace). |
If it is more than theoretically possible for users to destabilize the Platform AM, then we have to at least consider this second AM being required. How likely are we to hit these cases where the prometheus operator does not correctly handle user input, leading to a Platform AM outage?
@openshift/sre-alert-sme please review (I also sent this to a wide distribution of SRE mailing lists)
This is a great feature design @simonpasquier! I have concerns over meshing resource usage between admins/SREs and users, as there are no guard rails preventing users from pushing AM into an unavailable state. This concerns me at scale: if it became an issue, SRE would be forced to disable this feature, leaving the customers back in their original position with regard to their application alerting requirements. Big + on UWM having its own AM.
> This role complements the existing `monitoring-edit`, `monitoring-rules-edit` and `monitoring-rules-view` roles.
>
> #### Resource impacts
How will the alert tree be structured? In the case of sharing resources with customers, are SREs able to prioritise platform alerts over user alerts?
For example:
- system/platform alerts
- user alerts (1st reconciled/added)
- ...
- user alerts (last reconciled/added)
In practice, the routing tree will be structured as:
- user AlertmanagerConfig from namespace X
- user AlertmanagerConfig from namespace Y
- ...
- user AlertmanagerConfig from namespace Z
- Alertmanager configuration defined by the cluster admins
But there's no priority per se in Alertmanager: as soon as an alert matches a routing entry, it gets dispatched to a goroutine and the tree evaluation goes on to check if other entries match.
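To make that structure concrete, a hedged sketch of the kind of routing section the operator could generate; receiver names and matcher spellings are illustrative, not the exact prometheus-operator output:

```yaml
route:
  receiver: platform-default               # default receiver defined by the cluster admins
  routes:
  # entry generated from the AlertmanagerConfig in namespace "x"
  - matchers:
    - namespace="x"
    receiver: x/example/team-x
    continue: true                         # keep evaluating sibling routes
  # entry generated from the AlertmanagerConfig in namespace "y"
  - matchers:
    - namespace="y"
    receiver: y/example/team-y
    continue: true
  # routes defined by the cluster admins come last
  - matchers:
    - severity="critical"
    receiver: platform-pagerduty
```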
> to a potential outage of the Platform Alertmanager cluster.
>
> Mitigations
> * The `AlertmanagerBadConfig` alert fires when Alertmanager can't reload its configuration.
I presume this alert will be destined for Admin/SRE routes? If so, this is unbounded overhead that has the potential to cause a lot of pain for admins/SREs. Either configs should be somehow pre-validated or automatically rolled back to prevent this.
Yes, since the Platform Alertmanager is managed by the SRE teams, the notification would be routed to them.
As SRE, we should not care about alerts that are caused by customers' wrong configs.
cc @mrbarge another candidate for ocm-agent :)
If a dedicated Alertmanager cluster is deployed for user-defined alerts, the platform SREs can decide where the `AlertmanagerBadConfig` alert for `namespace="openshift-user-workload-monitoring"` should be routed. I would guess that something similar exists today for alerts about the UWM components (Prometheus + Thanos Ruler).
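For example, a minimal sketch of such a routing entry in the Platform Alertmanager configuration; the receiver names and webhook endpoint are placeholders:

```yaml
route:
  receiver: platform-default
  routes:
  - matchers:
    - alertname="AlertmanagerBadConfig"
    - namespace="openshift-user-workload-monitoring"
    receiver: uwm-admins                        # placeholder receiver owned by the UWM admins
receivers:
- name: platform-default
- name: uwm-admins
  webhook_configs:
  - url: https://example.com/uwm-alert-hook     # placeholder endpoint
```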
> Mitigations
> * The `AlertmanagerBadConfig` alert fires when Alertmanager can't reload its configuration.
> * Cluster admins can turn off the support for `AlertmanagerConfig` globally so that the Platform Alertmanager cluster can process platform alerts again and the cluster admins have time to identify the "rogue" `AlertmanagerConfig` resource(s).
> * Cluster admins can exclude specific user namespaces (once the "rogue" `AlertmanagerConfig` resource(s) have been identified) to restore UWM functionality for good citizens.
How will the rogue `AlertmanagerConfig` be identified? Will it get a `.status` of invalid?
As of today, nothing exists and the operator can't report which one is invalid since it didn't catch it before. The identification has to rely on human intervention.
Eventually the operator will update the status field for AlertmanagerConfig resources that the operator doesn't reconcile (see "AlertmanagerConfig resources not being reconciled" below).
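Purely illustrative, one possible shape for that status, assuming a conditions-style subresource; the field names are hypothetical since the upstream design was not finalized at the time of this discussion:

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: example
  namespace: foo
# hypothetical status the operator could report when it skips the resource
status:
  conditions:
  - type: Reconciled
    status: "False"
    reason: InvalidSecretReference          # illustrative reason
    message: secret "slack-credentials" not found in namespace "foo"
```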
@simonpasquier will this render the co/monitoring degraded? (it would fail preupgrade health checks if so...)
As of today, CMO doesn't check for alerts so Alertmanager (or Prometheus) failing to reload the configuration doesn't change the co/monitoring conditions.
> Users may provide bad credentials for the receivers, the system receiving the
> notifications might be unreachable or the system might be unable to process the requests. These
> situations would trigger the `AlertmanagerFailedToSendAlerts` and/or
> `AlertmanagerClusterFailedToSendAlerts` alerts. The cluster admins have to act
Again, this is unbounded support/toil for Admins/SREs. If we open the openshift-monitoring AM up to customers/users then there need to be harder guard rails against the customers/users breaking the platform AM. Admins/SREs should not be involved in troubleshooting user configuration. This won't scale.
Ack
> ### Risks and Mitigations
>
> #### Disruption of the platform Alertmanager
This scares me; giving customers the power to configure unbounded alerting thresholds via their configs and applications creates a high risk of AM unavailability. Great potential for unintended user-generated DoS.
> I have concerns over meshing resource usage between admins/SREs and users, as there are no guard rails preventing users from pushing AM into an unavailable state. This concerns me at scale: if it became an issue, SRE would be forced to disable this feature, leaving the customers back in their original position with regard to their application alerting requirements. Big + on UWM having its own AM.

Thanks for the great feedback @dofinn! I kind of expected these concerns. For organizations where a single team manages everything, it might be less problematic, but I agree that they are very valid in the OSD context where the platform and UWM responsibilities are decoupled.
I'll update the proposal to describe the possibility of a standalone UWM Alertmanager.
> Even though the Prometheus operator prevents it as much as it can, it may be
> possible for users to create an `AlertmanagerConfig` resource that triggers the
> Prometheus operator to generate an invalid Alertmanager configuration, leading
Would it make more sense for the prometheus operator to reject any `AlertmanagerConfig` resources that would cause an overall bad config? Webhook validation?
The operator will do its best to reject invalid configurations, but bugs happen and the operator may generate a configuration that can't be parsed by Alertmanager. In short: unlikely to happen, but still possible.
Force-pushed the branch: 5012be5 → 2941e40
Force-pushed the branch: ac9a842 → ba3dc21
Distinguish between Platform and UWM Alertmanager deployments.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Force-pushed the branch: 2194aef → 78d6ddb
/remove-lifecycle stale
The proposal is in good shape now, I think all comments have been addressed.
/remove-lifecycle rotten
> #### Deployment models
>
> ##### Leveraging the Platform Alertmanager
Inhibition is a global setting... is it possible that multi-tenant Alertmanager CRs could unknowingly inhibit each other?
Not across namespaces, because the operator will always append a namespace matcher to the inhibition rules, e.g.
inhibit_rules:
- target_matchers:
  - alertname="HttpErrors"
  source_matchers:
  - alertname=~"DatabaseErrors"
  equal:
  - service

is turned into

inhibit_rules:
- target_matchers:
  - alertname="HttpErrors"
  - namespace="foo"
  source_matchers:
  - alertname=~"DatabaseErrors"
  - namespace="foo"
  equal:
  - service
> enabled: true
> enableUserAlertmanagerConfig: true
> logLevel: info
> nodeSelector: {...}
@simonpasquier will this be workers by default?
Yes, it should be similar to what happens for the other monitoring components. When nodeSelector and tolerations aren't defined, all worker nodes are eligible and master nodes are excluded (because they are tainted).
Excellent, the current prom-operator from UWM schedules itself on masters :)
$ oc -n openshift-user-workload-monitoring get pods prometheus-operator-68fd48b89d-6tp7s -o json | jq '.spec.tolerations[]'
{
  "effect": "NoSchedule",
  "key": "node-role.kubernetes.io/master",
  "operator": "Exists"
}
{
  "effect": "NoExecute",
  "key": "node.kubernetes.io/not-ready",
  "operator": "Exists",
  "tolerationSeconds": 300
}
{
  "effect": "NoExecute",
  "key": "node.kubernetes.io/unreachable",
  "operator": "Exists",
  "tolerationSeconds": 300
}
{
  "effect": "NoSchedule",
  "key": "node.kubernetes.io/memory-pressure",
  "operator": "Exists"
}
yes this is following the OpenShift convention for operators -> https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#taints-and-tolerations
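For completeness, a hedged sketch of how a cluster admin might pin the UWM Alertmanager to dedicated nodes with the fields proposed above; it assumes the snippet lives under an `alertmanager` section of the UWM config map, and the infra-node labels are illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true
      enableUserAlertmanagerConfig: true
      logLevel: info
      nodeSelector:
        node-role.kubernetes.io/infra: ""   # illustrative: run on infra nodes
      tolerations:
      - key: node-role.kubernetes.io/infra
        operator: Exists
        effect: NoSchedule
```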
Done! Awesome work on this @simonpasquier! /lgtm
/lgtm
@simonpasquier: all tests passed!