monitoring: support configuration of alerting notifications by application owners #937

Conversation

simonpasquier
Contributor

This document describes a solution that allows OpenShift users to route alert
notifications without cluster admin intervention. It complements the existing
user workload monitoring stack, enabling a full self-service experience for
workload monitoring.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 19, 2021
@openshift-ci openshift-ci bot requested review from markmc and sttts October 19, 2021 08:36
@simonpasquier simonpasquier force-pushed the alertmanagerconfig-support branch 2 times, most recently from ddab0db to 597cd13 Compare October 19, 2021 08:52
@simonpasquier simonpasquier changed the title [WIP] monitoring: support configuration of alerting settings by application owners [WIP] monitoring: support configuration of alerting notifications by application owners Oct 19, 2021
As an OpenShift cluster admin, I want to exclude certain user namespaces from
modifying the Platform Alertmanager configuration so that I can recover in case
of breakage or bad behavior.

@simonpasquier I don't really understand this story. How would excluding specific namespaces allow us to recover from a bad configuration?

Contributor Author

Say there's an AlertmanagerConfig resource in namespace foo breaking the Platform Alertmanager. The cluster admin can then temporarily exclude that particular namespace from UWM to recover the situation and get in touch with the team that provisioned the "bad" config.

Got it, that makes sense. So it is essentially a reactive solution. Drifting off slightly to implementation, would they have any means other than logging to identify which namespace has broken the configuration?


@philipgough philipgough Oct 19, 2021


I see this is covered later with the AlertmanagerBadConfig alert, somewhat at least. But maybe we can make some improvements to help identify the issue.

Contributor


Hmmm I see this, but I wonder if we can have a more ambitious goal:

  • As an OpenShift cluster admin, I expect Alertmanager to NOT crash or stop refreshing its configuration if only one of the AlertmanagerConfig resources is malformed/invalid.

WDYT?

Contributor


is that even possible? (:


I think that is the goal from the prometheus-operator perspective, yes. We will discard the invalid CR and it will be excluded from the merge.

Contributor Author


The operator can only do so much to ensure that it generates a valid Alertmanager configuration; there will always be a theoretical risk that the operator has a bug in certain edge cases. Still, I'll try to think about more proactive strategies.

As a UWM admin, I want to send user alerts to the Platform Alertmanager
cluster because UWM alerts are managed by an external system (off-cluster Alertmanager for
instance).


If managed by an external system, why would we want to send to platform AM?

Contributor Author


typo :)


I think this is clarified further down, but they would be able to (if desired) send to both, right? If using some external system, they could forward the alerts to both the platform and the external AM?

Contributor Author


Yes, it's already possible today (if you configure additionalAlertmanagerConfigs for UWM, alerts will always go to the Platform Alertmanager too). I'll add another user story to make it clear.
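
For reference, a minimal sketch of what such a UWM configuration could look like; the exact field layout and the external endpoint are assumptions for illustration, not part of this proposal:

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      # alerts are sent to the external Alertmanager in addition to
      # the Platform Alertmanager
      additionalAlertmanagerConfigs:
      - scheme: https
        apiVersion: v2
        staticConfigs:
        - external-alertmanager.example.com:9093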

Alertmanager configuration.

Mitigation
* The Prometheus operator should expose a validating admission webhook that should prevent invalid configurations.


This would require that the webhook and the operator share the same RBAC to read cluster-wide secrets, correct?

Contributor Author

@simonpasquier simonpasquier Oct 19, 2021


The operator implements the webhook endpoint, so it should have the proper RBAC, but maybe it's better if the webhook doesn't try to resolve external references?
More a discussion for prometheus-operator/prometheus-operator#4338 though ;)
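
For context, a minimal sketch of how such a validating webhook could be registered; the service name, namespace, and path below are assumptions, not something defined by this proposal:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: alertmanagerconfig-validation
webhooks:
- name: alertmanagerconfigs.monitoring.coreos.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  clientConfig:
    service:
      name: prometheus-operator                      # assumed service name
      namespace: openshift-user-workload-monitoring  # assumed namespace
      path: /admission-alertmanagerconfigs/validate  # assumed path
  rules:
  - apiGroups: ["monitoring.coreos.com"]
    apiVersions: ["v1alpha1"]
    resources: ["alertmanagerconfigs"]
    operations: ["CREATE", "UPDATE"]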

Contributor

@bwplotka bwplotka left a comment


Amazing! Some suggestions, otherwise looking good 👍🏽

@@ -0,0 +1,411 @@
---
title: alert-routing
Contributor


This is great. I would propose naming this PR and title something like Multi-Tenant Alert Routing on OpenShift 🔥

Suggested change
title: alert-routing
title: multi-tenant-alert-routing

Comment on lines 30 to 34
This document describes a solution that allows OpenShift users to route alert
notifications without cluster admin intervention. It complements the existing
[user-workload monitoring stack][uwm-docs], enabling a full self-service experience for
workload monitoring.

Contributor


Suggested change
This document describes a solution that allows OpenShift users to route alert
notifications without cluster admin intervention. It complements the existing
[user-workload monitoring stack][uwm-docs], enabling a full self-service experience for
workload monitoring.
This document describes a solution that allows OpenShift users to route alert
notifications from shared Alertmanager without cluster admin intervention. It complements the existing
[user-workload monitoring stack][uwm-docs], enabling a full self-service experience for
workload monitoring.

cluster admins.
* Cluster admins can grant permissions to users and groups to manage alert
routing scoped to individual namespaces.
* Namespace owners should be able to opt-out from Alertmanager
Contributor


👍🏽 That was my worry - what if someone creates their own AM and wants to watch a couple of namespaces on their own? That will do the job.

Contributor Author


Yes, we already have a solution in place for service/pod monitors and rules. For consistency it should also apply to AlertmanagerConfigs.


#### Story 3

As an application owner, I want to know if my AlertmanagerConfig custom
Contributor


👍🏽 Is this a case for DeadMansSnitch?

Contributor Author


Not specifically. First of all, I want to know whether or not the Prometheus operator reconciled my AlertmanagerConfig resource (it won't if the resource contains invalid secret references, for instance): this would be covered by exposing a Status subresource in the CRD.
Secondly, it would be nice to have an easy way to trigger a "fake" alert (drill mode) to validate my receivers.
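
Purely as an illustration of what such a Status subresource could report, a hypothetical sketch (the condition type, reason, and secret name are made up; nothing here is defined by the CRD today):

status:
  conditions:
  - type: Reconciled                     # hypothetical condition type
    status: "False"
    reason: InvalidSecretReference       # hypothetical reason
    message: secret "slack-api-token" referenced by receiver "slack" not found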


@openshift-ci
Contributor

openshift-ci bot commented Oct 21, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bwplotka


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 21, 2021
Contributor

@jan--f jan--f left a comment


Yeah this all makes sense to me.

Would it make sense to map out the additional changes we'd need if we opt for the optional UWM Alertmanager cluster? That might inform the decision in the first place. E.g. would we still retain the option to exclude some namespaces, and what would the resource overhead look like?

routing scoped to individual namespaces.
* Namespace owners should be able to opt-out from Alertmanager
configuration (similar to what exist for service/pod monitors and rules using the
`"openshisft.io/user-monitoring: false"` label on the namespace).
Contributor


Suggested change
`"openshisft.io/user-monitoring: false"` label on the namespace).
`"openshift.io/user-monitoring: false"` label on the namespace).

Contributor


If it is more than theoretically possible for users to destabilize the Platform AM, then we have to at least consider this second AM being required. How likely are we to hit these cases where the prometheus operator does not correctly handle a user input leading to a Platform AM outage?

@simonpasquier simonpasquier changed the title [WIP] monitoring: support configuration of alerting notifications by application owners monitoring: support configuration of alerting notifications by application owners Oct 21, 2021
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 21, 2021
@jeremyeder
Contributor

@openshift/sre-alert-sme please review (I also sent this to a wide distribution of SRE mailing lists)

Contributor

@dofinn dofinn left a comment


This is a great feature design @simonpasquier! I have concerns over meshing resource usage between admins/SREs and users, as there are no guard rails preventing users from pushing AM into an unavailable state. This concerns me at scale: if it became an issue, SRE would be forced to disable this feature, leaving customers back in their original position with regard to their application alerting requirements. Big + on UWM having its own AM.


This role complements the existing `monitoring-edit`, `monitoring-rules-edit` and `monitoring-rules-view` roles.

#### Resource impacts
Contributor


How will the alert tree be structured? In the case of sharing resources with customers, are SREs able to prioritise platform alerts over user alerts?

For example:

  1. system/platform alerts
  2. user alerts (1st reconciled/added)
  3. ...
  4. user alerts (last reconciled/added)?

Contributor Author


In practice, the routing tree will be structured as:

  1. user AlertmanagerConfig from namespace X
  2. user AlertmanagerConfig from namespace Y
  3. ...
  4. user AlertmanagerConfig from namespace Z
  5. Alertmanager configuration defined by the cluster admins

But there's no priority per se in Alertmanager: as soon as an alert matches a routing entry, it gets dispatched to a goroutine and the tree evaluation goes on to check if other entries match.
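
A rough sketch of the kind of routing tree this could produce (receiver names are illustrative; the continue flag is what lets evaluation proceed to the later routes, including the cluster admins' own configuration):

route:
  receiver: cluster-admin-default            # illustrative receiver name
  routes:
  - matchers:
    - 'namespace="x"'
    receiver: x/example-config/receiver-a    # illustrative
    continue: true
  - matchers:
    - 'namespace="y"'
    receiver: y/example-config/receiver-b    # illustrative
    continue: true
  # with continue: true, alerts keep being evaluated against the
  # cluster admins' routes defined below these entries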

to a potential outage of the Platform Alertmanager cluster.

Mitigations
* The `AlertmanagerBadConfig` alert fires when Alertmanager can't reload its configuration.
Contributor


I presume this alert will be destined for Admin/SRE routes? If so, this is unbounded overhead that has the potential to cause a lot of pain for admins/SREs. Either configs should be somehow prevalidated or automatically rolled back to prevent this.

Contributor Author

@simonpasquier simonpasquier Oct 28, 2021


Yes, since the Platform Alertmanager is managed by the SRE teams, the notification would be routed to them.

Member


As SRE, we should not care about alerts that are caused by customers' wrong configs.

Contributor


cc @mrbarge another candidate for ocm-agent :)

Contributor Author


If a dedicated Alertmanager cluster is deployed for user-defined alerts, the platform SREs can decide where the AlertmanagerBadConfig alert for namespace="openshift-user-workload-monitoring" should be routed. I would guess that something similar exists today for alerts about the UWM components (Prometheus + Thanos Ruler).
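
As a sketch, the SRE-managed Alertmanager configuration could carry a dedicated route along these lines (the receiver name is illustrative):

route:
  routes:
  - matchers:
    - 'alertname="AlertmanagerBadConfig"'
    - 'namespace="openshift-user-workload-monitoring"'
    receiver: sre-uwm-notifications      # illustrative receiver name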

Mitigations
* The `AlertmanagerBadConfig` alert fires when Alertmanager can't reload its configuration.
* Cluster admins can turn off the support for `AlertmanagerConfig` globally so that the Platform Alertmanager cluster can process platform alerts again and the cluster admins have time to identify the "rogue" `AlertmanagerConfig` resource(s).
* Cluster admins can exclude specific user namespaces (once the "rogue" `AlertmanagerConfig` resource(s) have been identified) to restore UWM functionality for good citizens.
Contributor


How will the rogue `AlertmanagerConfig` be identified? Will it get a `.status` of invalid?

Contributor Author


As of today, nothing exists and the operator can't report which one is invalid since it didn't catch it beforehand. The identification has to rely on human intervention.

Contributor Author


Eventually the operator will update the status field for AlertmanagerConfig resources that the operator doesn't reconcile (see "AlertmanagerConfig resources not being reconciled" below).

Contributor

@dofinn dofinn Mar 22, 2022


@simonpasquier will this render co/monitoring degraded? (it would fail pre-upgrade health checks if so...)

Contributor Author


As of today, CMO doesn't check for alerts, so Alertmanager (or Prometheus) failing to reload the configuration doesn't change the co/monitoring conditions.

Users may provide bad credentials for the receivers, the system receiving the
notifications might be unreachable or the system might be unable to process the requests. These
situations would trigger the `AlertmanagerFailedToSendAlerts` and/or
`AlertmanagerClusterFailedToSendAlerts` alerts. The cluster admins have to act
Contributor


Again, this is unbounded support/toil for Admins/SREs. If we open the openshift-monitoring AM up to customers/users, then there need to be harder guard rails against customers/users breaking the platform AM. Admins/SREs should not be involved in troubleshooting user configuration. This won't scale.

Contributor Author


Ack


### Risks and Mitigations

#### Disruption of the platform Alertmanager
Contributor


This scares me: giving customers the power to configure unbounded alerting thresholds via their configs and applications creates a high risk of AM unavailability. Great potential for unintended user-generated DoS.

enhancements/monitoring/multi-tenant-alerting.md (outdated, resolved)
Contributor Author

@simonpasquier simonpasquier left a comment


I have concerns over meshing resource usage between admins/SREs and users, as there are no guard rails preventing users from pushing AM into an unavailable state. This concerns me at scale: if it became an issue, SRE would be forced to disable this feature, leaving customers back in their original position with regard to their application alerting requirements. Big + on UWM having its own AM.

Thanks for the great feedback @dofinn! I kind of expected these concerns. For organizations where a single team manages everything, it might be less problematic, but I agree that they are very valid in the OSD context where the platform and UWM responsibilities are decoupled.
I'll update the proposal to describe the possibility of a standalone UWM Alertmanager.


Even though the Prometheus operator prevents it as much as it can, it may be
possible for users to create an `AlertmanagerConfig` resource that triggers the
Prometheus operator to generate an invalid Alertmanager configuration, leading
Member


Would it make more sense for the prometheus operator to reject any AlertmanagerConfig resources that would cause an overall bad config?
Webhook validation?

Contributor Author


The operator will do its best to reject invalid configurations, but bugs happen and the operator may generate a configuration that can't be parsed by Alertmanager. In short, unlikely but still possible.

@simonpasquier
Contributor Author

/remove-lifecycle stale

The proposal is in good shape now, I think all comments have been addressed.
@jeremyeder @dofinn can you give it a final review pls?

@simonpasquier
Contributor Author

/remove-lifecycle rotten
/assign @jeremyeder

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 16, 2022

#### Deployment models

##### Leveraging the Platform Alertmanager
Contributor


Inhibition is a global setting... is it possible that multi-tenant AlertmanagerConfig CRs could unknowingly inhibit each other?

Contributor Author


yes because the operator will always append a namespace matcher to the inhibition rules, e.g.

inhibit_rules:
- target_matchers:
  - alertname="HttpErrors"
  source_matchers:
  - alertname=~"DatabaseErrors"
  equal:
  - service

is turned into

inhibit_rules:
- target_matchers:
  - alertname="HttpErrors"
  - namespace="foo"
  source_matchers:
  - alertname=~"DatabaseErrors"
  - namespace="foo"
  equal:
  - service

enabled: true
enableUserAlertmanagerConfig: true
logLevel: info
nodeSelector: {...}
Contributor


@simonpasquier will this run on workers by default?

Contributor Author


Yes, it should be similar to what happens for the other monitoring components. When nodeSelector and tolerations aren't defined, all worker nodes are eligible and master nodes are excluded (because they are tainted).
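
For example, an admin who wanted to pin the UWM Alertmanager to dedicated nodes could do something along these lines (a sketch following the configuration snippet quoted above; the enclosing "alertmanager" key, the node role, and the toleration values are assumptions):

alertmanager:
  enabled: true
  enableUserAlertmanagerConfig: true
  nodeSelector:
    node-role.kubernetes.io/infra: ""    # illustrative node role
  tolerations:
  - key: node-role.kubernetes.io/infra
    operator: Exists
    effect: NoSchedule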

Contributor


Excellent, the current prom-operator from UWM schedules itself on masters :)

$ oc -n openshift-user-workload-monitoring get pods prometheus-operator-68fd48b89d-6tp7s -o json | jq '.spec.tolerations[]'
{
  "effect": "NoSchedule",
  "key": "node-role.kubernetes.io/master",
  "operator": "Exists"
}
{
  "effect": "NoExecute",
  "key": "node.kubernetes.io/not-ready",
  "operator": "Exists",
  "tolerationSeconds": 300
}
{
  "effect": "NoExecute",
  "key": "node.kubernetes.io/unreachable",
  "operator": "Exists",
  "tolerationSeconds": 300
}
{
  "effect": "NoSchedule",
  "key": "node.kubernetes.io/memory-pressure",
  "operator": "Exists"
}


@dofinn
Contributor

dofinn commented Apr 4, 2022

/remove-lifecycle stale

The proposal is in good shape now, I think all comments have been addressed. @jeremyeder @dofinn can you give it a final review pls?

Done! Awesome work on this @simonpasquier! /lgtm

@dofinn
Contributor

dofinn commented Apr 5, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 5, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 5, 2022

@simonpasquier: all tests passed!


@openshift-merge-robot openshift-merge-robot merged commit c296f93 into openshift:master Apr 5, 2022
@simonpasquier simonpasquier deleted the alertmanagerconfig-support branch May 5, 2022 12:21