From 1c15454bbdba5defb23bf3d981ea87c7c322b9fa Mon Sep 17 00:00:00 2001 From: Simon Pasquier Date: Thu, 14 Oct 2021 17:56:29 +0200 Subject: [PATCH 01/17] Add alert-routing-for-user-workload-monitoring.md Signed-off-by: Simon Pasquier --- ...rt-routing-for-user-workload-monitoring.md | 411 ++++++++++++++++++ 1 file changed, 411 insertions(+) create mode 100644 enhancements/monitoring/alert-routing-for-user-workload-monitoring.md diff --git a/enhancements/monitoring/alert-routing-for-user-workload-monitoring.md b/enhancements/monitoring/alert-routing-for-user-workload-monitoring.md new file mode 100644 index 0000000000..c68e81630b --- /dev/null +++ b/enhancements/monitoring/alert-routing-for-user-workload-monitoring.md @@ -0,0 +1,411 @@ +--- +title: alert-routing +authors: + - "@simonpasquier" +reviewers: + - "@openshift/openshift-team-monitoring" +approvers: + - TBD + - "@openshift/openshift-team-monitoring" +creation-date: 2021-10-11 +last-updated: 2021-10-11 +status: provisional +see-also: + - "/enhancements/monitoring/user-workload-monitoring.md" +--- + +# alert-routing-for-user-workload monitoring + +## Release Signoff Checklist + +- [X] Enhancement is `implementable` +- [ ] Design details are appropriately documented from clear requirements +- [ ] Test plan is defined +- [ ] Operational readiness criteria is defined +- [ ] Graduation criteria for dev preview, tech preview, GA +- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) + +## Summary + +This document describes a solution that allows OpenShift users to route alert +notifications without cluster admin intervention. It complements the existing +[user-workload monitoring stack][uwm-docs], enabling a full self-service experience for +workload monitoring. + +## Motivation + +Since OpenShift 4.6, application owners can collect metrics from their +applications and configure alerting rules by themselves as described in the +[User Workload Monitoring][user-workload-monitoring-enhancement] (UWM) +enhancement. The rules are defined as `PrometheusRule` resources and can be +based on platform and/or application metrics. They are evaluated by the Thanos +ruler instances (by default) or the Prometheus instances running in the +`openshift-user-workload` namespace. + +When a user alert fires, the Thanos Ruler (or UWM Prometheus) sends it to the +Plaform Alertmanager cluster (deployed in the `openshift-monitoring` namespace) +where it gets aggregated and dispatched to the correct destination (page, chat, +email, ticket, ...). + +The configuration of Alertmanager is done via a single configuration file that +only cluster admins have permissions to modify. If the cluster is shared by +multiple tenants and each tenant has different requirements to receive their +notifications then each tenant needs to ask and wait for the cluster admins to +adjust the Alertmanager configuration. + +To streamline the process and avoid cluster admins being the bottleneck, +application owners should be able to configure alert routing and notification +receivers in the Plaform Alertmanager without cluster admin intervention. + +[AlertmanagerConfig][alertmanagerconfig-crd] CRD fullfills this requirement and +is supported by the [Prometheus operator][prometheus-operator] since v0.43.0 +but it is explicitly called out in the OCP documentation as ["not +supported"][unsupported-resources]. 
+ +### Goals + +* Cluster users can configure alert notifications for applications being +monitored by the user-workload monitoring without requesting intervention from +cluster admins. +* Cluster admins can grant permissions to users and groups to manage alert +routing scoped to individual namespaces. +* Namespace owners should be able to opt-out from Alertmanager +configuration (similar to what exist for service/pod monitors and rules using the +`"openshisft.io/user-monitoring: false"` label on the namespace). +* Cluster admins should be able to opt-out from supporting `AlertmanagerConfig` +resources from user namespaces. + +### Non-Goals + +* Additional support for silencing user alerts (it is already supported by UWM in the OCP console). +* Specific integration in the OCP Console exposing the configuration of alert notifications. +* Support the configuration of alert notifications for platform alerts (e.g. +alerts originating from namespaces with the `openshift.io/cluster-monitoring: "true"` +label). + +## Proposal + +We plan to leverage the `AlertmanagerConfig` custom resource definition already +exposed by the Prometheus operator so that application owners can configure how +and where their alert notifications should be routed. + +### User Stories + +Personas: +* Application owners: manage a project with sufficient permissions to define monitoring resources. +* UWM admins: manage the configuration of the UWM components (edit permissions on the `openshift-user-workload-monitoring/user-workload-monitoring-config` configmap). +* Cluster admins: manage the configuration of the Platform monitoring components. + +#### Story 1 + +As an application owner, I want to use AlertmanagerConfig custom resources so +that Alertmanager can push alert notifications for my applications to the +receiver of my choice. + +#### Story 2 + +As an application owner, I want to use AlertmanagerConfig custom +resources so that Alertmanager can inhibit alerts based on other alerts firing +at the same time. + +#### Story 3 + +As an application owner, I want to know if my AlertmanagerConfig custom +resource is taken into account so that I am confident that I will receive alert +notifications. + +#### Story 4 + +As a OpenShift cluster admin, I want to allow some of my users to +create/update/delete AlertmanagerConfig custom resources and leverage the +platform Alertmanager cluster so that I don't have to configure alert routing +on their behalf. + +#### Story 5 + +As an OpenShift cluster admin, I don't want AlertmanagerConfig resources +defined by application owners to interfere with the routing of platform alerts. + +#### Story 6 + +As an OpenShift cluster admin, I want to exclude certain user namespaces from +modifying the Plaform Alertmanager configuration so that I can recover in case +of breakage or bad behavior. + +#### Story 7 + +As an OpenShift cluster admin, I don't want to support AlertmanagerConfig +resources for application owners so that the configuration of the Platform +Alertmanager cluster is completely under my control. + +### Story 8 + +As a UWM admin, I don't want to send user alerts to the Platform Alertmanager +cluster because these alerts are managed by an external system (off-cluster Alertmanager for +instance). + +### Implementation Details/Notes/Constraints + +The `AlertmanagerConfig` CRD is exposed by the `coreos.monitoring.com/v1alpha1` API group. 
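+
+For illustration only, a minimal `AlertmanagerConfig` resource could look like
+the following sketch (the resource name, namespace, receiver name and webhook
+URL are hypothetical placeholders, not part of this proposal):
+
+```yaml
+apiVersion: monitoring.coreos.com/v1alpha1
+kind: AlertmanagerConfig
+metadata:
+  name: example-routing
+  namespace: foo
+spec:
+  route:
+    # Alerts are grouped by alert name and dispatched to the receiver below.
+    receiver: team-webhook
+    groupBy: ['alertname']
+  receivers:
+  - name: team-webhook
+    webhookConfigs:
+    - url: 'https://example.com/alert-hook'
+```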
+ +The `Alertmanager` custom resource defines 2 LabelSelector fields +(`alertmanagerConfigSelector` and `alertmanagerConfigNamespaceSelector`) to +select which `AlertmanagerConfig` resources should be reconciled by the +Prometheus operator and from which namespace(s). Before this proposal, the +Plaform Alertmanager resource defines the 2 selectors as null, meaning that it +doesn't select any `AlertmanagerConfig` resources. + +We propose to add a new field `enableUserAlertmanagerConfig` to the +`openshift-montoring/cluster-monitoring-config` configmap. If +`enableUserAlertmanagerConfig` is missing, the default value is false. + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: cluster-monitoring-config + namespace: openshift-monitoring +data: + config.yaml: |- + enableUserWorkload: true + enableUserAlertmanagerConfig: true +``` + +When `enableUserAlertmanagerConfig` is true, the cluster monitoring operator +configures the Platform Alertmanager as follows. + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: Alertmanager +metadata: + name: main + namespace: openshift-monitoring +spec: + alertmanagerConfigSelector: {} + alertmanagerConfigNamespaceSelector: + matchExpressions: + - key: openshift.io/cluster-monitoring + operator: NotIn + values: + - "true" + - key: openshift.io/user-monitoring + operator: NotIn + values: + - "false" + ... +``` + +To be consistent with what exists already for service/pod monitors and rules, +the Prometheus operator doesn't reconcile `AlertmanagerConfig` resources from +namespaces with the `openshift.io/user-monitoring: "false"` label. It allows +application owners to opt out completely from UWM in case they deploy and run +their own monitoring infrastructure (for instance with the [Monitoring Stack +operator][monitoring-stack-operator]). + +In addition, the cluster admins can exclude specific user namespace(s) from UWM with the new `excludeUserNamespaces` field. + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: cluster-monitoring-config + namespace: openshift-monitoring +data: + config.yaml: |- + enableUserWorkload: true + enableUserAlertmanagerConfig: true + excludeUserNamespaces: [foo,bar] +``` + +The UWM admins can also define that UWM alerts shouldn't be forwarded to the +Platform Alertmanager. With this capability and the existing +`additionalAlertmanagerConfigs`, it is possible to externalize the alert +routing and notifications to an external Alertmanager instance when the cluster +admins don't want to share the Plaform Alertmanager for instance. + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: user-workload-monitoring-config + namespace: openshift-user-workload-monitoring +data: + config.yaml: |- + thanosRuler: + usePlatformAlertmanager: false + prometheus: + usePlatformAlertmanager: false + additionalAlertmanagerConfigs: [...] +``` + +When this option is chosen, the OCP console can't be used to manage silences for user alerts. + +### Tenancy + +By design, all alerts coming from UWM have a `namespace` label equal to the +`PrometheusRule` resource's namespace. The Prometheus operator relies on this +invariant to generate an Alertmanager configuration that ensures that a given +`AlertmanagerConfig` resource only matches alerts that have the same +`namespace` value. This means that an `AlertmanagerConfig` resource from +namespace `foo` only processes alerts with the `namespace="foo"` label (be it +for routing or inhibiting purposes). 
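+
+For example, an alert produced by the following (hypothetical) rule deployed in
+the `foo` namespace reaches Alertmanager with the `namespace="foo"` label, so
+only `AlertmanagerConfig` resources from `foo` can route or inhibit it:
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: example-rules
+  namespace: foo
+spec:
+  groups:
+  - name: example
+    rules:
+    - alert: HighErrorRate
+      # Metric name and threshold are illustrative only.
+      expr: rate(http_requests_total{code="500"}[5m]) > 1
+      for: 15m
+      labels:
+        severity: warning
+```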
+ +### RBAC + +The cluster monitoring operator ships a new cluster role +`alertmanager-config-edit` so that cluster admins can bind it with a +`RoleBinding` to grant permissions to users or groups on `AlertmanagerConfig` +custom resources within a given namespace. + +```yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: alertmanager-config-edit +rules: +- apiGroups: + - monitoring.coreos.com + resources: + - alertmanagerconfigs + verbs: + - '*' +``` + +This role complements the existing `monitoring-edit`, `monitoring-rules-edit` and `monitoring-rules-view` roles. + +#### Resource impacts + +The size of the Alertmanager routing tree can grow to thousands of entries +(possibly several `AlertmanagerConfig` resources per namespace) and this may +hinder the performances of the Plaform Alertmanager (increased latency of alert +notifications for instance). + +We already know from upstream users that Alertmanager can deal with many +routes. We plan to simulate environments with thousands of `AlertmanagerConfig` +resources and measure the impact on notification delivery. + +### Risks and Mitigations + +#### Disruption of the platform Alertmanager + +Even though the Prometheus operator prevents it as much as it can, it may be +possible for users to create an `AlertmanagerConfig` resource that triggers the +Prometheus operator to generate an invalid Alertmanager configuration, leading +to a potential outage of the Platform Alertmanager cluster. + +Mitigations +* The `AlertmanagerBadConfig` alert fires when Alertmanager can't reload its configuration. +* Cluster admins can turn off the support for `AlertmanagerConfig` globally so that the Platform Alertmanager cluster can process platform alerts again and the cluster admins have time to identiy the "rogue" `AlertmanagerConfig` resource(s). +* Cluster admins can exclude specific user namespaces (once the "rogue" `AlertmanagerConfig` resource(s) have been identified) to restore UWM functionality for good citizens. + +#### Misconfiguration of receivers + +Users may provide bad credentials for the receivers, the system receiving the +notifications might be unreachable or the system might be unable to process the requests. These +situations would trigger the `AlertmanagerFailedToSendAlerts` and/or +`AlertmanagerClusterFailedToSendAlerts` alerts. The cluster admins have to act +on upon the alerts and understand where the problem comes from. + +Mitigations +* Detailed runbook for the `AlertmanagerFailedToSendAlerts` and `AlertmanagerClusterFailedToSendAlerts` alerts. +* Ability to use a separate Alertmanager cluster to avoid messing up with the platform Alertmanager cluster. + +#### Non-optimal Alertmanager settings + +Users may use non-optimal settings for their alert notifications (such as +reevaluation of alert groups at high frequency). This may impede the +performances of Alertmanager globally because it would consume more CPU. It can +also trigger notification failures if an exteral integration limits the number +of requests a client IP address can do. + +Mitigation +* Improve the `Alertmanager` CRD to expose new fields enforcing minimum interval values for all associated `AlertmanagerConfig` resources. This would be similar to what exists at the `Prometheus` CRD level for scrape targets with `enforcedSampleLimit` for instance. + +#### AlertmanagerConfig resources not being reconciled + +An `AlertmanagerConfig` resource might require credentials (such as API keys) +which are referenced by secrets. 
If the Platform Prometheus operator doesn't +have permissions to read the secret or if the reference is incorrect (wrong +name or key), the operator doesn't reconcile the resource in the final +Alertmanager configuration. + +Mitigation +* The Prometheus operator should expose a validating admission webhook that should prevent invalid configurations. +* We can implement the `Status` subresource of the `AlertmanagerConfig` CRD to report whether or not the resource is reconciled or not (with a message). +* Users can validate that alerting routing works as expected by generating "fake" alerts triggering the notification system. _Users don't have permissions on the Alertmanager API endpoint so they would have to generate fake alerts from alerting rules themselves. We could also support the ability to craft an alert from the OCP console_. + +## Design Details + +### Open Questions + +* Should CMO allow UWM admins to deploy a separate Alertmanager cluster in the `openshift-user-workload-monitoring` namespace if the cluster admins don't want to share the Platform Alertmanager? + * Pros + * More flexibility. + * Cons + * Increased complexity. + * Redundancy with the upcoming Monitoring Stack operator. + +### Test Plan + +New tests are added to the cluster monitoring operator end-to-end test suites +to validate the different user stories. + +### Graduation Criteria + +We plan to have the feature released Tech Preview first. We assume that the +`AlertmanagerConfig` CRD graduates to `v1beta1` at least before we consider +exposing the feature. + +#### Dev Preview -> Tech Preview + +N/A + +#### Tech Preview -> GA + +- The `AlertmanagerConfig` CRD is exposed as `v1` API. +- More testing (upgrade, downgrade, scale) +- Sufficient time for feedback including signals from telemetry about the customer adoption (e.g. number of `AlertmanagerConfig` resources across the fleet). +- Counter-measures to avoid service degradation of the Platform Alertmanager. +- Conduct load testing +- Console integration? + +#### Removing a deprecated feature + +- Announce deprecation and support policy of the existing feature +- Deprecate the feature + +### Upgrade / Downgrade Strategy + +CMO continues to orchestrate and automate the deployment of all monitoring +components with the help of the Prometheus operator in this case. + +By default, CMO doesn't enable for user alert routing, hence upgrading to a +OpenShift release supporting `AlertmanagerConfig` doesn't change the behavior +of the monitoring components. + +### Version Skew Strategy + +N/A + +## Implementation History + +Major milestones in the life cycle of a proposal should be tracked in `Implementation +History`. + +## Drawbacks + +N/A + +## Alternatives + +An alternative is to keep the current status-quo and rely on cluster admins to +configure alert routing for their users. This proposal doesn't forbid this +model since cluster admins can decide to not reconcile user-defined +`AlertmanagerConfig` resources within the Platform Alertmanager. 
+ +[user-workload-monitoring-enhancement]: https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/user-workload-monitoring.md +[uwm-docs]: https://docs.openshift.com/container-platform/4.8/monitoring/enabling-monitoring-for-user-defined-projects.html +[prometheus-operator]: https://prometheus-operator.dev/ +[alertmanagerconfig-crd]: https://prometheus-operator.dev/docs/operator/api/#alertmanagerconfig +[unsupported-resources]: https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html#support-considerations_configuring-the-monitoring-stack +[monitoring-stack-operator]: https://github.com/openshift/enhancements/pull/866 From 6c592cda627b8108e44f226856c32765969daf92 Mon Sep 17 00:00:00 2001 From: Simon Pasquier Date: Thu, 21 Oct 2021 17:49:57 +0200 Subject: [PATCH 02/17] Address Bartek and Philipp comments --- ...rt-routing-for-user-workload-monitoring.md | 77 +++++++++++++++---- 1 file changed, 61 insertions(+), 16 deletions(-) diff --git a/enhancements/monitoring/alert-routing-for-user-workload-monitoring.md b/enhancements/monitoring/alert-routing-for-user-workload-monitoring.md index c68e81630b..8976ba4933 100644 --- a/enhancements/monitoring/alert-routing-for-user-workload-monitoring.md +++ b/enhancements/monitoring/alert-routing-for-user-workload-monitoring.md @@ -1,5 +1,5 @@ --- -title: alert-routing +title: multi-tenant-alerting authors: - "@simonpasquier" reviewers: @@ -14,7 +14,7 @@ see-also: - "/enhancements/monitoring/user-workload-monitoring.md" --- -# alert-routing-for-user-workload monitoring +# multi-tenant-alerting ## Release Signoff Checklist @@ -32,6 +32,7 @@ notifications without cluster admin intervention. It complements the existing [user-workload monitoring stack][uwm-docs], enabling a full self-service experience for workload monitoring. + ## Motivation Since OpenShift 4.6, application owners can collect metrics from their @@ -82,6 +83,8 @@ resources from user namespaces. * Support the configuration of alert notifications for platform alerts (e.g. alerts originating from namespaces with the `openshift.io/cluster-monitoring: "true"` label). +* Share alert receivers and routes across tenants. +* Deploy an additional UWM Alertmanager. ## Proposal @@ -111,8 +114,8 @@ at the same time. #### Story 3 As an application owner, I want to know if my AlertmanagerConfig custom -resource is taken into account so that I am confident that I will receive alert -notifications. +resource has been reconciled on the target Alertmanager so that I am confident +that I will receive alert notifications. #### Story 4 @@ -240,6 +243,20 @@ data: When this option is chosen, the OCP console can't be used to manage silences for user alerts. +Summary of the different combinations: + + +| `enableUserWorkload` | `enableUserAlertmanagerConfig` | `usePlatformAlertmanager` (UWM) | `additionalAlertmanagerConfigs` (UWM) | Outcome | +|:--------------------:|:------------------------------:|:------------------------------------:|:-------------------------------------:|---------| +| false | N/A | N/A | N/A | User worload monitoring not available. | +| true | false | false | empty | User alerts evaluated but sent nowhere. | +| true | false | true | empty | User alerts sent to Plaform Alertmanager. Configuration managed by cluster admins. | +| true | false | true | not empty | User alerts sent to Plaform Alertmanager and external Alertmanager(s). Configuration of Platform Alertmanager managed by cluster admins. 
| +| true | true | false | empty | User alerts evaluated but sent nowhere. | +| true | true | true | empty | User alerts sent to Plaform Alertmanager. Configuration managed by application owners. | +| true | true | true | not empty | User alerts sent to Plaform Alertmanager and external Alertmanager(s). Configuration of Plaform Alertmanager managed by application owners. | + + ### Tenancy By design, all alerts coming from UWM have a `namespace` label equal to the @@ -250,6 +267,28 @@ invariant to generate an Alertmanager configuration that ensures that a given namespace `foo` only processes alerts with the `namespace="foo"` label (be it for routing or inhibiting purposes). +Below is how the operator renders an AlertmanagerConfig resource in the final Alertmanager configuration. + +```yaml +... +route: +# The next item is generated from an AlertmanagerConfig resource named alertmanagerconfig1 in namespace foo. +- matchers: ['namespace="foo"'] + receiver: foo-alertmanagerconfig1-my-receiver + continue: true + routes: + - ... +inhibit_rules: +# The next item is generated from an AlertmanagerConfig resource named alertmanagerconfig1 in namespace foo. +- source_matchers: ['namespace="foo"', ...] + target_maTCHERS: ['namespace="foo"', ...] + equal: ['namespace', ...] +receivers: +# The next item is generated from an AlertmanagerConfig resource named alertmanagerconfig1 in namespace foo. +- name: foo-alertmanagerconfig1-my-receiver + ... +``` + ### RBAC The cluster monitoring operator ships a new cluster role @@ -323,27 +362,33 @@ Mitigation #### AlertmanagerConfig resources not being reconciled -An `AlertmanagerConfig` resource might require credentials (such as API keys) -which are referenced by secrets. If the Platform Prometheus operator doesn't -have permissions to read the secret or if the reference is incorrect (wrong -name or key), the operator doesn't reconcile the resource in the final -Alertmanager configuration. +An `AlertmanagerConfig` resource might not be valid for various reasons: +* An alerting route references a receiver which doesn't exist. +* Credentials (such as API keys) are referenced by secrets, the + operator doesn't have permissions to read the secret or the reference + is incorrect (wrong name or key). + +In such cases, the Prometheus operator discards the resource which isn't +reconciled in the final Alertmanager configuration. + +The operator might also be unable to reconcile the AlertmanagerConfig resources temporiraly. Mitigation * The Prometheus operator should expose a validating admission webhook that should prevent invalid configurations. -* We can implement the `Status` subresource of the `AlertmanagerConfig` CRD to report whether or not the resource is reconciled or not (with a message). +* The Prometheus operator implements the `Status` subresource of the `AlertmanagerConfig` CRD to report whether or not the resource is reconciled or not (with a message). * Users can validate that alerting routing works as expected by generating "fake" alerts triggering the notification system. _Users don't have permissions on the Alertmanager API endpoint so they would have to generate fake alerts from alerting rules themselves. We could also support the ability to craft an alert from the OCP console_. ## Design Details ### Open Questions -* Should CMO allow UWM admins to deploy a separate Alertmanager cluster in the `openshift-user-workload-monitoring` namespace if the cluster admins don't want to share the Platform Alertmanager? - * Pros - * More flexibility. 
- * Cons - * Increased complexity. - * Redundancy with the upcoming Monitoring Stack operator. +1. Should CMO allow UWM admins to deploy a separate UWM Alertmanager cluster in the `openshift-user-workload-monitoring` namespace if the cluster admins don't want to share the Platform Alertmanager? + +Pros +* More flexibility. +Cons +* Increased complexity in the CMO codebase. +* Redundancy with the upcoming [Monitoring Stack operator][monitoring-stack-operator]. ### Test Plan From b1b7a1a3fa40564eda785b117d4d9ca8a27cd19f Mon Sep 17 00:00:00 2001 From: Simon Pasquier Date: Thu, 21 Oct 2021 17:50:28 +0200 Subject: [PATCH 03/17] Rename proposal --- ...-user-workload-monitoring.md => multi-tenant-alert-routing.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename enhancements/monitoring/{alert-routing-for-user-workload-monitoring.md => multi-tenant-alert-routing.md} (100%) diff --git a/enhancements/monitoring/alert-routing-for-user-workload-monitoring.md b/enhancements/monitoring/multi-tenant-alert-routing.md similarity index 100% rename from enhancements/monitoring/alert-routing-for-user-workload-monitoring.md rename to enhancements/monitoring/multi-tenant-alert-routing.md From f774cb6bad138813259e7842cd4c515f348c7629 Mon Sep 17 00:00:00 2001 From: Simon Pasquier Date: Mon, 25 Oct 2021 11:44:22 +0200 Subject: [PATCH 04/17] Address Jan's comments --- .../monitoring/multi-tenant-alert-routing.md | 35 +++++++++++-------- 1 file changed, 20 insertions(+), 15 deletions(-) diff --git a/enhancements/monitoring/multi-tenant-alert-routing.md b/enhancements/monitoring/multi-tenant-alert-routing.md index 8976ba4933..cd6a276081 100644 --- a/enhancements/monitoring/multi-tenant-alert-routing.md +++ b/enhancements/monitoring/multi-tenant-alert-routing.md @@ -72,7 +72,7 @@ cluster admins. routing scoped to individual namespaces. * Namespace owners should be able to opt-out from Alertmanager configuration (similar to what exist for service/pod monitors and rules using the -`"openshisft.io/user-monitoring: false"` label on the namespace). +`"openshift.io/user-monitoring: false"` label on the namespace). * Cluster admins should be able to opt-out from supporting `AlertmanagerConfig` resources from user namespaces. @@ -245,16 +245,13 @@ When this option is chosen, the OCP console can't be used to manage silences for Summary of the different combinations: - -| `enableUserWorkload` | `enableUserAlertmanagerConfig` | `usePlatformAlertmanager` (UWM) | `additionalAlertmanagerConfigs` (UWM) | Outcome | -|:--------------------:|:------------------------------:|:------------------------------------:|:-------------------------------------:|---------| -| false | N/A | N/A | N/A | User worload monitoring not available. | -| true | false | false | empty | User alerts evaluated but sent nowhere. | -| true | false | true | empty | User alerts sent to Plaform Alertmanager. Configuration managed by cluster admins. | -| true | false | true | not empty | User alerts sent to Plaform Alertmanager and external Alertmanager(s). Configuration of Platform Alertmanager managed by cluster admins. | -| true | true | false | empty | User alerts evaluated but sent nowhere. | -| true | true | true | empty | User alerts sent to Plaform Alertmanager. Configuration managed by application owners. | -| true | true | true | not empty | User alerts sent to Plaform Alertmanager and external Alertmanager(s). Configuration of Plaform Alertmanager managed by application owners. 
| +| User alert destination | User notifications managed by | `enableUserWorkload` | `enableUserAlertmanagerConfig` | `usePlatformAlertmanager` (UWM) | `additionalAlertmanagerConfigs` (UWM) | +|----|----|:--------------------:|:------------------------------:|:------------------------------------:|:-------------------------------------:| +| Nowhere | No-one | true | <any> | false | empty | +| Platform Alertmanager | Cluster admins | true | false | true | empty | +| Platform Alertmanager
External Alertmanager(s) | Cluster admins for the Platform Alertmanager | true | false | true | not empty | +| Platform Alertmanager | Application owners | true | true | true | empty | +| Platform Alertmanager
External Alertmanager(s) | Application owners | true | true | true | not empty | ### Tenancy @@ -382,13 +379,21 @@ Mitigation ### Open Questions -1. Should CMO allow UWM admins to deploy a separate UWM Alertmanager cluster in the `openshift-user-workload-monitoring` namespace if the cluster admins don't want to share the Platform Alertmanager? +1. Should CMO allow UWM admins to deploy a separate UWM Alertmanager cluster if the cluster admins don't want to share the Platform Alertmanager? + +While the UWM admins have the ability to configure external Alertmanager +endpoints where user alerts should be sent, it requires someone to manage the +deployment of this additional Alertmanager. We could add an option in the UWM +config map to enable an Alertmanager instance running in the +`openshift-user-workload-monitoring` namespace. Pros -* More flexibility. +* It provides a better experience for UWM admins: no need to maintain a standalone Alertmanager cluster, less likely to mess up the configuration of `additionalAlertmanagerConfigs`. Cons -* Increased complexity in the CMO codebase. -* Redundancy with the upcoming [Monitoring Stack operator][monitoring-stack-operator]. +* Increased complexity in the CMO codebase and in the UWM configuration options. +* Additional resource overhead (though Alertmanager is usually light on resources). +* Redundancy with the [Monitoring Stack operator][monitoring-stack-operator]. +* More work required for a proper integration in the OCP console. ### Test Plan From 65adefc6c38a09c0fbb7d82cc11452754ad4b826 Mon Sep 17 00:00:00 2001 From: Simon Pasquier Date: Mon, 25 Oct 2021 11:45:15 +0200 Subject: [PATCH 05/17] Rename file --- .../{multi-tenant-alert-routing.md => multi-tenant-alerting.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename enhancements/monitoring/{multi-tenant-alert-routing.md => multi-tenant-alerting.md} (100%) diff --git a/enhancements/monitoring/multi-tenant-alert-routing.md b/enhancements/monitoring/multi-tenant-alerting.md similarity index 100% rename from enhancements/monitoring/multi-tenant-alert-routing.md rename to enhancements/monitoring/multi-tenant-alerting.md From d5ade6e7ef918453d68eaca31285a9b1a46a6ed5 Mon Sep 17 00:00:00 2001 From: Simon Pasquier Date: Mon, 25 Oct 2021 15:43:14 +0200 Subject: [PATCH 06/17] Add section about external labels for platform alerts --- .../monitoring/multi-tenant-alerting.md | 53 +++++++++++++++---- 1 file changed, 43 insertions(+), 10 deletions(-) diff --git a/enhancements/monitoring/multi-tenant-alerting.md b/enhancements/monitoring/multi-tenant-alerting.md index cd6a276081..a88ab0b206 100644 --- a/enhancements/monitoring/multi-tenant-alerting.md +++ b/enhancements/monitoring/multi-tenant-alerting.md @@ -126,8 +126,8 @@ on their behalf. #### Story 5 -As an OpenShift cluster admin, I don't want AlertmanagerConfig resources -defined by application owners to interfere with the routing of platform alerts. +As an OpenShift cluster admin, I want to distinguish between platform and user +alerts so that my Alertmanager configuration can reliably handle all platform alerts. #### Story 6 @@ -141,7 +141,7 @@ As an OpenShift cluster admin, I don't want to support AlertmanagerConfig resources for application owners so that the configuration of the Platform Alertmanager cluster is completely under my control. 
-### Story 8 +#### Story 8 As a UWM admin, I don't want to send user alerts to the Platform Alertmanager cluster because these alerts are managed by an external system (off-cluster Alertmanager for @@ -254,7 +254,38 @@ Summary of the different combinations: | Platform Alertmanager
External Alertmanager(s) | Application owners | true | true | true | not empty | -### Tenancy +#### Distinction between platform and user alerts + +It is important that platform alerts can be clearly distinguished from user +alerts because cluster admins want to ensure that: +1. all alerts originating from platform components are dispatched to at least one default receiver which is owned by the admin team. +2. they aren't notified about any user alert. + +To this effect, CMO configures the Platform Prometheus instances with a new +external label: `openshift_io_alert_source="platform"`. + +The Alertmanager configuration can leverage the label's value to make the +correct decision in the alert routing tree. For instance, the following +configuration sends all user alerts which haven't been processed by a previous +entry to an empty receiver. + +```yaml +route: + receiver: default-platform-receiver + routes: + - ... + - matchers: ['openshift_io_alert_source!="platform"'] + receiver: default-user-receiver +receivers: +- name: default-platform-receiver + ... +- name: default-user-receiver +``` + +Note that a similar use case was already reported in [BZ 1933239][bz-1933239]. + + +#### Tenancy By design, all alerts coming from UWM have a `namespace` label equal to the `PrometheusRule` resource's namespace. The Prometheus operator relies on this @@ -269,12 +300,13 @@ Below is how the operator renders an AlertmanagerConfig resource in the final Al ```yaml ... route: -# The next item is generated from an AlertmanagerConfig resource named alertmanagerconfig1 in namespace foo. -- matchers: ['namespace="foo"'] - receiver: foo-alertmanagerconfig1-my-receiver - continue: true routes: - - ... + # The next item is generated from an AlertmanagerConfig resource named alertmanagerconfig1 in namespace foo. + - matchers: ['namespace="foo"'] + receiver: foo-alertmanagerconfig1-my-receiver + continue: true + routes: + - ... inhibit_rules: # The next item is generated from an AlertmanagerConfig resource named alertmanagerconfig1 in namespace foo. - source_matchers: ['namespace="foo"', ...] @@ -286,7 +318,7 @@ receivers: ... ``` -### RBAC +#### RBAC The cluster monitoring operator ships a new cluster role `alertmanager-config-edit` so that cluster admins can bind it with a @@ -459,3 +491,4 @@ model since cluster admins can decide to not reconcile user-defined [alertmanagerconfig-crd]: https://prometheus-operator.dev/docs/operator/api/#alertmanagerconfig [unsupported-resources]: https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html#support-considerations_configuring-the-monitoring-stack [monitoring-stack-operator]: https://github.com/openshift/enhancements/pull/866 +[bz-1933239]: https://bugzilla.redhat.com/show_bug.cgi?id=1933239 From 0fb5b735bbed0eff94b89848e287849d5d6e169b Mon Sep 17 00:00:00 2001 From: Simon Pasquier Date: Fri, 26 Nov 2021 11:39:34 +0100 Subject: [PATCH 07/17] Draft a first version of UWM Alertmanager deployment --- .../monitoring/multi-tenant-alerting.md | 125 ++++++++++++------ 1 file changed, 86 insertions(+), 39 deletions(-) diff --git a/enhancements/monitoring/multi-tenant-alerting.md b/enhancements/monitoring/multi-tenant-alerting.md index a88ab0b206..fe5627b224 100644 --- a/enhancements/monitoring/multi-tenant-alerting.md +++ b/enhancements/monitoring/multi-tenant-alerting.md @@ -151,6 +151,14 @@ instance). The `AlertmanagerConfig` CRD is exposed by the `coreos.monitoring.com/v1alpha1` API group. 
+#### Deployment models + +##### Leveraging the Platform Alertmanager + +In this model, no additional Alertmanager is deployed and the user alerts are +forwarded to the existing Platform Alertmanager. This is matching the current +model. + The `Alertmanager` custom resource defines 2 LabelSelector fields (`alertmanagerConfigSelector` and `alertmanagerConfigNamespaceSelector`) to select which `AlertmanagerConfig` resources should be reconciled by the @@ -158,7 +166,7 @@ Prometheus operator and from which namespace(s). Before this proposal, the Plaform Alertmanager resource defines the 2 selectors as null, meaning that it doesn't select any `AlertmanagerConfig` resources. -We propose to add a new field `enableUserAlertmanagerConfig` to the +We propose to add a new boolean field `enableUserAlertmanagerConfig` to the `openshift-montoring/cluster-monitoring-config` configmap. If `enableUserAlertmanagerConfig` is missing, the default value is false. @@ -175,7 +183,8 @@ data: ``` When `enableUserAlertmanagerConfig` is true, the cluster monitoring operator -configures the Platform Alertmanager as follows. +configures the Platform Alertmanager to reconcile `AlertmanagerConfig` +resources from user namespaces as follows. ```yaml apiVersion: monitoring.coreos.com/v1 @@ -200,12 +209,13 @@ spec: To be consistent with what exists already for service/pod monitors and rules, the Prometheus operator doesn't reconcile `AlertmanagerConfig` resources from -namespaces with the `openshift.io/user-monitoring: "false"` label. It allows +namespaces with the `openshift.io/user-monitoring: "false"` label. It allows application owners to opt out completely from UWM in case they deploy and run their own monitoring infrastructure (for instance with the [Monitoring Stack operator][monitoring-stack-operator]). -In addition, the cluster admins can exclude specific user namespace(s) from UWM with the new `excludeUserNamespaces` field. +In addition, the cluster admins can exclude specific user namespace(s) from UWM +with the new `excludeUserNamespaces` field. ```yaml apiVersion: v1 @@ -220,11 +230,19 @@ data: excludeUserNamespaces: [foo,bar] ``` -The UWM admins can also define that UWM alerts shouldn't be forwarded to the -Platform Alertmanager. With this capability and the existing -`additionalAlertmanagerConfigs`, it is possible to externalize the alert -routing and notifications to an external Alertmanager instance when the cluster -admins don't want to share the Plaform Alertmanager for instance. +##### Dedicated UWM Alertmanager + +In some environments where cluster admins and UWM admins are different personas +(e.g. OSD), it might not be acceptable for cluster admins to let users +configure the Platform Alertmanager because: +* User configurations may break the Alertmanager configuration. +* Processing of user alerts may slow down the alert notification pipeline. +* Cluster admins don't want to deal with delivery errors for user notifications. + +In this case, UWM admins have the possibility to deploy a dedicated +Alertmanager. The configuration options will be equivalent to the options +exposed for the Platform Alertmanager and exposed under the `alertmanager` key +in the UWM configmap. ```yaml apiVersion: v1 @@ -234,24 +252,43 @@ metadata: namespace: openshift-user-workload-monitoring data: config.yaml: |- - thanosRuler: - usePlatformAlertmanager: false - prometheus: - usePlatformAlertmanager: false - additionalAlertmanagerConfigs: [...] 
+ alertmanager: + enabled: true + logLevel: info + nodeSelector: {...} + tolerations: [...] + resources: {...} + volumeClaimTemplate: {...} + prometheus: {} + thanosRuler: {} ``` -When this option is chosen, the OCP console can't be used to manage silences for user alerts. +The UWM Alertmanager will be automatically configured to reconcile +`AlertmanagerConfig` resources from all user namespaces (just like for UWM +service/pod monitors and rules). Again namespaces with the +`openshift.io/user-monitoring: false` label will be excluded. -Summary of the different combinations: +When the UWM Alertmanager is enabled: +* The Platform Alertmanager will be configured to not reconcile + `AlertmanagerConfig` resources from user + namespaces. +* The UWM Prometheus and Thanos Ruler will send alerts to + the UWM Alertmanager only. -| User alert destination | User notifications managed by | `enableUserWorkload` | `enableUserAlertmanagerConfig` | `usePlatformAlertmanager` (UWM) | `additionalAlertmanagerConfigs` (UWM) | +The UWM admins are responsible for provisioning the root configuration of the +UWM Alertmanager in the +`openshift-user-workload-monitoring/alertmanager-user-workload` secret. + + +##### Summary + +| User alert destination | User notifications managed by | `enableUserWorkload` | `enableUserAlertmanagerConfig` | `alertmanager` (UWM) | `additionalAlertmanagerConfigs` (UWM) | |----|----|:--------------------:|:------------------------------:|:------------------------------------:|:-------------------------------------:| | Nowhere | No-one | true | <any> | false | empty | -| Platform Alertmanager | Cluster admins | true | false | true | empty | -| Platform Alertmanager
External Alertmanager(s) | Cluster admins for the Platform Alertmanager | true | false | true | not empty | -| Platform Alertmanager | Application owners | true | true | true | empty | -| Platform Alertmanager
External Alertmanager(s) | Application owners | true | true | true | not empty | +| Platform Alertmanager | Cluster admins | true | false | empty | empty | +| Platform Alertmanager
External Alertmanager(s) | Cluster admins for the Platform Alertmanager | true | false | empty | not empty | +| UWM Alertmanager | Application owners | true | true | not empty | empty | +| UWM Alertmanager
External Alertmanager(s) | Application owners | true | true | not empty | not empty | #### Distinction between platform and user alerts @@ -259,7 +296,7 @@ Summary of the different combinations: It is important that platform alerts can be clearly distinguished from user alerts because cluster admins want to ensure that: 1. all alerts originating from platform components are dispatched to at least one default receiver which is owned by the admin team. -2. they aren't notified about any user alert. +2. they aren't notified about any user alert and focus on platform alerts. To this effect, CMO configures the Platform Prometheus instances with a new external label: `openshift_io_alert_source="platform"`. @@ -321,9 +358,8 @@ receivers: #### RBAC The cluster monitoring operator ships a new cluster role -`alertmanager-config-edit` so that cluster admins can bind it with a -`RoleBinding` to grant permissions to users or groups on `AlertmanagerConfig` -custom resources within a given namespace. +`alertmanager-config-edit` that grants all actions on `AlertmanagerConfig` +custom resources. ```yaml apiVersion: rbac.authorization.k8s.io/v1 @@ -339,6 +375,15 @@ rules: - '*' ``` +Cluster admins can bind the cluster role with a `RoleBinding` to grant +permissions to users or groups on `AlertmanagerConfig` custom resources within +a given namespace. + +``` +oc -n adm policy add-role-to-user \ + alertmanager-config-edit --role-namespace +``` + This role complements the existing `monitoring-edit`, `monitoring-rules-edit` and `monitoring-rules-view` roles. #### Resource impacts @@ -411,21 +456,10 @@ Mitigation ### Open Questions -1. Should CMO allow UWM admins to deploy a separate UWM Alertmanager cluster if the cluster admins don't want to share the Platform Alertmanager? - -While the UWM admins have the ability to configure external Alertmanager -endpoints where user alerts should be sent, it requires someone to manage the -deployment of this additional Alertmanager. We could add an option in the UWM -config map to enable an Alertmanager instance running in the -`openshift-user-workload-monitoring` namespace. +1. How can the console support the UWM Alertmanager? -Pros -* It provides a better experience for UWM admins: no need to maintain a standalone Alertmanager cluster, less likely to mess up the configuration of `additionalAlertmanagerConfigs`. -Cons -* Increased complexity in the CMO codebase and in the UWM configuration options. -* Additional resource overhead (though Alertmanager is usually light on resources). -* Redundancy with the [Monitoring Stack operator][monitoring-stack-operator]. -* More work required for a proper integration in the OCP console. +Right now the console backend manages the user-defined silences via the +Platform Alertmanager API. It would need to be aware of the deployment model. ### Test Plan @@ -480,11 +514,24 @@ N/A ## Alternatives +### Status-quo + An alternative is to keep the current status-quo and rely on cluster admins to configure alert routing for their users. This proposal doesn't forbid this model since cluster admins can decide to not reconcile user-defined `AlertmanagerConfig` resources within the Platform Alertmanager. +### Don't support UWM Alertmanager + +We could decide that CMO doesn't offer the ability to deploy the UWM +Alertmanager. In this case the responsibility of deploying an additional +Alertmanager is delegated to the cluster admins which would leverage +`additionalAlertmanagerConfigs` to point user alerts to this instance. 
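+
+As a rough sketch (the endpoint below is a placeholder and the exact schema is
+the one already defined for `additionalAlertmanagerConfigs` in the UWM
+configuration), such a setup might look like:
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: user-workload-monitoring-config
+  namespace: openshift-user-workload-monitoring
+data:
+  config.yaml: |-
+    thanosRuler:
+      additionalAlertmanagerConfigs:
+      - scheme: https
+        apiVersion: v2
+        staticConfigs:
+        - alertmanager.example.com:9095
+```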
+ +The downsides are +* Degraded user experience and overhead on the users. +* The additional setup wouldn't be supported by Red Hat. + [user-workload-monitoring-enhancement]: https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/user-workload-monitoring.md [uwm-docs]: https://docs.openshift.com/container-platform/4.8/monitoring/enabling-monitoring-for-user-defined-projects.html [prometheus-operator]: https://prometheus-operator.dev/ From 74fcfd3de1e49e926a74475ea3aa5c9e99a3751e Mon Sep 17 00:00:00 2001 From: Simon Pasquier Date: Fri, 26 Nov 2021 14:30:05 +0100 Subject: [PATCH 08/17] Refine UWM Alertmanager design Signed-off-by: Simon Pasquier --- .../monitoring/multi-tenant-alerting.md | 74 +++++++++++-------- 1 file changed, 44 insertions(+), 30 deletions(-) diff --git a/enhancements/monitoring/multi-tenant-alerting.md b/enhancements/monitoring/multi-tenant-alerting.md index fe5627b224..a223261f9f 100644 --- a/enhancements/monitoring/multi-tenant-alerting.md +++ b/enhancements/monitoring/multi-tenant-alerting.md @@ -115,7 +115,7 @@ at the same time. As an application owner, I want to know if my AlertmanagerConfig custom resource has been reconciled on the target Alertmanager so that I am confident -that I will receive alert notifications. +that I receive alert notifications. #### Story 4 @@ -179,7 +179,8 @@ metadata: data: config.yaml: |- enableUserWorkload: true - enableUserAlertmanagerConfig: true + alertmanager: + enableUserAlertmanagerConfig: true ``` When `enableUserAlertmanagerConfig` is true, the cluster monitoring operator @@ -226,7 +227,8 @@ metadata: data: config.yaml: |- enableUserWorkload: true - enableUserAlertmanagerConfig: true + alertmanager: + enableUserAlertmanagerConfig: true excludeUserNamespaces: [foo,bar] ``` @@ -234,14 +236,17 @@ data: In some environments where cluster admins and UWM admins are different personas (e.g. OSD), it might not be acceptable for cluster admins to let users -configure the Platform Alertmanager because: +configure the Platform Alertmanager with `AlertmanagerConfig` resources because: * User configurations may break the Alertmanager configuration. * Processing of user alerts may slow down the alert notification pipeline. * Cluster admins don't want to deal with delivery errors for user notifications. +At the same time, application owners want to configure their alert +notifications without requesting external intervention. + In this case, UWM admins have the possibility to deploy a dedicated -Alertmanager. The configuration options will be equivalent to the options -exposed for the Platform Alertmanager and exposed under the `alertmanager` key +Alertmanager. The configuration options are to the options +exposed for the Platform Alertmanager and live under the `alertmanager` key in the UWM configmap. ```yaml @@ -254,6 +259,7 @@ data: config.yaml: |- alertmanager: enabled: true + enableUserAlertmanagerConfig: true logLevel: info nodeSelector: {...} tolerations: [...] @@ -263,16 +269,16 @@ data: thanosRuler: {} ``` -The UWM Alertmanager will be automatically configured to reconcile -`AlertmanagerConfig` resources from all user namespaces (just like for UWM -service/pod monitors and rules). Again namespaces with the -`openshift.io/user-monitoring: false` label will be excluded. +When `enableUserAlertmanagerConfig` is true, the UWM Alertmanager is +automatically configured to reconcile `AlertmanagerConfig` resources from all +user namespaces (just like for UWM service/pod monitors and rules). 
Again +namespaces with the `openshift.io/user-monitoring: false` label are +excluded. -When the UWM Alertmanager is enabled: -* The Platform Alertmanager will be configured to not reconcile - `AlertmanagerConfig` resources from user - namespaces. -* The UWM Prometheus and Thanos Ruler will send alerts to +When the UWM Alertmanager is enabled: +* The Platform Alertmanager is configured to not reconcile + `AlertmanagerConfig` resources from user namespaces. +* The UWM Prometheus and Thanos Ruler send alerts to the UWM Alertmanager only. The UWM admins are responsible for provisioning the root configuration of the @@ -280,15 +286,15 @@ UWM Alertmanager in the `openshift-user-workload-monitoring/alertmanager-user-workload` secret. -##### Summary - -| User alert destination | User notifications managed by | `enableUserWorkload` | `enableUserAlertmanagerConfig` | `alertmanager` (UWM) | `additionalAlertmanagerConfigs` (UWM) | -|----|----|:--------------------:|:------------------------------:|:------------------------------------:|:-------------------------------------:| -| Nowhere | No-one | true | <any> | false | empty | -| Platform Alertmanager | Cluster admins | true | false | empty | empty | -| Platform Alertmanager
External Alertmanager(s) | Cluster admins for the Platform Alertmanager | true | false | empty | not empty | -| UWM Alertmanager | Application owners | true | true | not empty | empty | -| UWM Alertmanager
External Alertmanager(s) | Application owners | true | true | not empty | not empty | +| User alert destination | User notifications managed by | `enableUserAlertmanagerConfig` | `alertmanager` (UWM) | `additionalAlertmanagerConfigs` (UWM) | +|------------------------|-------------------------------|:------------------------------:|:--------------------:|:-------------------------------------:| +| Platform Alertmanager | Cluster admins | false | empty | empty | +| Platform Alertmanager
External Alertmanager(s) | Cluster admins | false | empty | not empty | +| Platform Alertmanager | Application owners | true | empty | empty | +| UWM Alertmanager | UWM admins | <any> | {enabled: true, enableUserAlertmanagerConfig: false} | empty | +| UWM Alertmanager | Application owners | <any> | {enabled: true, enableUserAlertmanagerConfig: true} | empty | +| UWM Alertmanager
External Alertmanager(s) | UWM admins | <any> | {enabled: true, enableUserAlertmanagerConfig: false} | not empty | +| UWM Alertmanager
External Alertmanager(s) | Application owners | <any> | {enabled: true, enableUserAlertmanagerConfig: true} | not empty | #### Distinction between platform and user alerts @@ -379,7 +385,7 @@ Cluster admins can bind the cluster role with a `RoleBinding` to grant permissions to users or groups on `AlertmanagerConfig` custom resources within a given namespace. -``` +```bash oc -n adm policy add-role-to-user \ alertmanager-config-edit --role-namespace ``` @@ -456,10 +462,17 @@ Mitigation ### Open Questions -1. How can the console support the UWM Alertmanager? +1. How can the Dev Console support the UWM Alertmanager? -Right now the console backend manages the user-defined silences via the -Platform Alertmanager API. It would need to be aware of the deployment model. +Users are able to silence alerts from the Dev Console and the console backend +assumes that the API is served by the +`alertmanager-main.openshift-monitoring.svc` service. To support the UWM +Alertmanager configuration, CMO should provide to the console operator the name +of the Alertmanager service managing user alerts (either +`alertmanager-main.openshift-monitoring.svc` or +`alertmanager.openshift-user-workload-monitoring.svc`). Based on the presence +of the `openshift_io_alert_source` label, the console backend can decide which +Alertmanager service should be queried. ### Test Plan @@ -478,12 +491,13 @@ N/A #### Tech Preview -> GA -- The `AlertmanagerConfig` CRD is exposed as `v1` API. +- The `AlertmanagerConfig` CRD is exposed as `v1beta1` API. - More testing (upgrade, downgrade, scale) - Sufficient time for feedback including signals from telemetry about the customer adoption (e.g. number of `AlertmanagerConfig` resources across the fleet). - Counter-measures to avoid service degradation of the Platform Alertmanager. +- Option to deploy UWM Alertmanager - Conduct load testing -- Console integration? +- Console integration #### Removing a deprecated feature From abeadb93861a1fec131d476fbcf89779e7682065 Mon Sep 17 00:00:00 2001 From: Simon Pasquier Date: Fri, 26 Nov 2021 15:08:16 +0100 Subject: [PATCH 09/17] Add new required sections Signed-off-by: Simon Pasquier --- .../monitoring/multi-tenant-alerting.md | 24 ++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-) diff --git a/enhancements/monitoring/multi-tenant-alerting.md b/enhancements/monitoring/multi-tenant-alerting.md index a223261f9f..eb69fc7927 100644 --- a/enhancements/monitoring/multi-tenant-alerting.md +++ b/enhancements/monitoring/multi-tenant-alerting.md @@ -147,9 +147,16 @@ As a UWM admin, I don't want to send user alerts to the Platform Alertmanager cluster because these alerts are managed by an external system (off-cluster Alertmanager for instance). -### Implementation Details/Notes/Constraints +### API Extensions + +This enhancement proposal leverages the `AlertmanagerConfig` CRD which is +exposed by the `coreos.monitoring.com/v1alpha1` API group and defined by the +upstream Prometheus operator. -The `AlertmanagerConfig` CRD is exposed by the `coreos.monitoring.com/v1alpha1` API group. +The cluster monitoring operator deploys a `ValidatingWebhookConfiguration` +resource to validate the `AlertmanagerConfig` resources. + +### Implementation Details/Notes/Constraints #### Deployment models @@ -245,7 +252,7 @@ At the same time, application owners want to configure their alert notifications without requesting external intervention. In this case, UWM admins have the possibility to deploy a dedicated -Alertmanager. 
The configuration options are to the options +Alertmanager. The configuration options are equivalent to the options exposed for the Platform Alertmanager and live under the `alertmanager` key in the UWM configmap. @@ -517,6 +524,17 @@ of the monitoring components. N/A +### Operational Aspects of API Extensions + +#### Failure Modes + +The validating webhook is configured with `failurePolicy: Ignore` to not block +creations and updates when the operator is down. + +#### Support Procedures + +N/A + ## Implementation History Major milestones in the life cycle of a proposal should be tracked in `Implementation From 6913f54fddaccf06ec0095e46c61f180027e1523 Mon Sep 17 00:00:00 2001 From: Simon Pasquier Date: Mon, 6 Dec 2021 12:20:58 +0100 Subject: [PATCH 10/17] Add details about validating webhook deployment and justifications Signed-off-by: Simon Pasquier --- .../monitoring/multi-tenant-alerting.md | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/enhancements/monitoring/multi-tenant-alerting.md b/enhancements/monitoring/multi-tenant-alerting.md index eb69fc7927..390aad7e84 100644 --- a/enhancements/monitoring/multi-tenant-alerting.md +++ b/enhancements/monitoring/multi-tenant-alerting.md @@ -154,7 +154,14 @@ exposed by the `coreos.monitoring.com/v1alpha1` API group and defined by the upstream Prometheus operator. The cluster monitoring operator deploys a `ValidatingWebhookConfiguration` -resource to validate the `AlertmanagerConfig` resources. +resource to check the validity of `AlertmanagerConfig` resources for things +that can't be enforced by the OpenAPI specification. In particular, the +AlertmanagerConfig's `Route` struct is a recursive type which isn't supported +right now by [controller-tools][controller-tools-issue]). + +The validating webhook points to +the prometheus operator's service in the `openshift-user-workload-monitoring` +namespace (path: `/admission-alertmanagerconfigs/validate`). ### Implementation Details/Notes/Constraints @@ -528,8 +535,11 @@ N/A #### Failure Modes -The validating webhook is configured with `failurePolicy: Ignore` to not block -creations and updates when the operator is down. +The validating webhook is configured with `failurePolicy: Fail`. Currently the +validating webhook service is backed by a single prometheus-operator pod so +there is a risk that users can't create/update AlertmanagerConfig resources +when the pod isn't ready. We will address this limitation upstream by allowing +the deployment of a highly-available webhook service ([issue][ha-webhook-service-issue]). 
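+
+For reference, a sketch of the webhook registration discussed here is shown
+below; the object and webhook names are illustrative, while the namespace, path
+and failure policy follow the description above:
+
+```yaml
+apiVersion: admissionregistration.k8s.io/v1
+kind: ValidatingWebhookConfiguration
+metadata:
+  name: alertmanagerconfigs-admission   # hypothetical name
+webhooks:
+- name: alertmanagerconfigs.monitoring.coreos.com
+  failurePolicy: Fail
+  rules:
+  - apiGroups: ["monitoring.coreos.com"]
+    apiVersions: ["v1alpha1"]
+    operations: ["CREATE", "UPDATE"]
+    resources: ["alertmanagerconfigs"]
+  clientConfig:
+    service:
+      namespace: openshift-user-workload-monitoring
+      name: prometheus-operator
+      path: /admission-alertmanagerconfigs/validate
+  admissionReviewVersions: ["v1"]
+  sideEffects: None
+```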
#### Support Procedures @@ -571,3 +581,5 @@ The downsides are [unsupported-resources]: https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html#support-considerations_configuring-the-monitoring-stack [monitoring-stack-operator]: https://github.com/openshift/enhancements/pull/866 [bz-1933239]: https://bugzilla.redhat.com/show_bug.cgi?id=1933239 +[controller-tools-issue]: https://github.com/kubernetes-sigs/controller-tools/issues/477 +[ha-webhook-service-issue]: https://github.com/prometheus-operator/prometheus-operator/issues/4437 From f848992099e91452ef80f35d99a7b743dc9bb7c8 Mon Sep 17 00:00:00 2001 From: Simon Pasquier Date: Mon, 6 Dec 2021 12:31:04 +0100 Subject: [PATCH 11/17] Fix configuration for sourcing the platform alerts --- enhancements/monitoring/multi-tenant-alerting.md | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/enhancements/monitoring/multi-tenant-alerting.md b/enhancements/monitoring/multi-tenant-alerting.md index 390aad7e84..85418748d9 100644 --- a/enhancements/monitoring/multi-tenant-alerting.md +++ b/enhancements/monitoring/multi-tenant-alerting.md @@ -318,8 +318,17 @@ alerts because cluster admins want to ensure that: 1. all alerts originating from platform components are dispatched to at least one default receiver which is owned by the admin team. 2. they aren't notified about any user alert and focus on platform alerts. -To this effect, CMO configures the Platform Prometheus instances with a new -external label: `openshift_io_alert_source="platform"`. +To this effect, CMO configures the Platform Prometheus instances with an alert +relabeling configuration adding the `openshift_io_alert_source="platform"` +label: + +```yaml +alerting: + relabel_configs: + - target_label: openshift_io_alert_source + action: replace + replacement: platform +``` The Alertmanager configuration can leverage the label's value to make the correct decision in the alert routing tree. For instance, the following From 30217666a92fc94df5ab9c799ad884f025c441ae Mon Sep 17 00:00:00 2001 From: Simon Pasquier Date: Wed, 22 Dec 2021 15:33:22 +0100 Subject: [PATCH 12/17] Rewrite mitigations section Distinguish between Platform and UWM Alertmanager deployments. --- .../monitoring/multi-tenant-alerting.md | 59 +++++++++++-------- 1 file changed, 33 insertions(+), 26 deletions(-) diff --git a/enhancements/monitoring/multi-tenant-alerting.md b/enhancements/monitoring/multi-tenant-alerting.md index 85418748d9..9bc0eff528 100644 --- a/enhancements/monitoring/multi-tenant-alerting.md +++ b/enhancements/monitoring/multi-tenant-alerting.md @@ -428,35 +428,37 @@ resources and measure the impact on notification delivery. ### Risks and Mitigations -#### Disruption of the platform Alertmanager +#### Disruption of Alertmanager Even though the Prometheus operator prevents it as much as it can, it may be possible for users to create an `AlertmanagerConfig` resource that triggers the Prometheus operator to generate an invalid Alertmanager configuration, leading -to a potential outage of the Platform Alertmanager cluster. +to a potential outage of the Alertmanager cluster. Mitigations +* If cluster admins have configured an external notification provider coupled with the always firing `Watchdog` alert, they should receive an out-of-band notification about the alerting pipeline being broken. * The `AlertmanagerBadConfig` alert fires when Alertmanager can't reload its configuration. 
-* Cluster admins can turn off the support for `AlertmanagerConfig` globally so that the Platform Alertmanager cluster can process platform alerts again and the cluster admins have time to identify the "rogue" `AlertmanagerConfig` resource(s).
* Cluster admins can exclude specific user namespaces (once the "rogue" `AlertmanagerConfig` resource(s) have been identified) to restore UWM functionality for good citizens.
+* When alerts are sent to the Platform Alertmanager, cluster admins can turn off the support for `AlertmanagerConfig` in the CMO configmap so that the Platform Alertmanager cluster can process platform alerts again and the cluster admins have time to identify the "rogue" `AlertmanagerConfig` resource(s).

#### Misconfiguration of receivers

Users may provide bad credentials for the receivers, the system receiving the
-notifications might be unreachable or the system might be unable to process the requests. These
-situations would trigger the `AlertmanagerFailedToSendAlerts` and/or
-`AlertmanagerClusterFailedToSendAlerts` alerts. The cluster admins have to act
-on upon the alerts and understand where the problem comes from.
+notifications might be unreachable, or the system might be unable to process
+the requests. These situations would trigger the
+`AlertmanagerFailedToSendAlerts` and/or `AlertmanagerClusterFailedToSendAlerts`
+alerts. The cluster admins have to act on upon the alerts and understand where
+the problem comes from.

Mitigations
* Detailed runbook for the `AlertmanagerFailedToSendAlerts` and `AlertmanagerClusterFailedToSendAlerts` alerts.
-* Ability to use a separate Alertmanager cluster to avoid messing up with the platform Alertmanager cluster.
+* Ability to use a separate Alertmanager cluster to avoid messing up with the Platform Alertmanager cluster.

#### Non-optimal Alertmanager settings

Users may use non-optimal settings for their alert notifications (such as reevaluation of alert groups at high frequency). This may impede the
-performances of Alertmanager globally because it would consume more CPU. It can
+performance of Alertmanager since it would consume more resources. It can
also trigger notification failures if an external integration limits the number of requests a client IP address can make.
@@ -465,20 +467,23 @@ Mitigation

#### AlertmanagerConfig resources not being reconciled

-An `AlertmanagerConfig` resource might not be valid for various reasons:
-* An alerting route references a receiver which doesn't exist.
+The `AlertmanagerConfig` CRD implements schema validation for things that can
+be modeled with the OpenAPI specification. However, a
+resource might still not be valid for various reasons:
+* An alerting route contains a sub-route that is invalid (the `route` field references itself, which means that it can't be validated at the API level).
* Credentials (such as API keys) are referenced by secrets, the operator doesn't have permissions to read the secret or the reference is incorrect (wrong name or key).

-In such cases, the Prometheus operator discards the resource which isn't
-reconciled in the final Alertmanager configuration.
+In such cases, the Prometheus operator discards the invalid
+`AlertmanagerConfig` resource which isn't reconciled in the final Alertmanager
+configuration.

The operator might also be unable to reconcile the AlertmanagerConfig resources temporarily.

Mitigation
-* The Prometheus operator should expose a validating admission webhook that should prevent invalid configurations.
-* The Prometheus operator implements the `Status` subresource of the `AlertmanagerConfig` CRD to report whether or not the resource is reconciled or not (with a message).
+* The Prometheus operator exposes a validating admission webhook that prevents invalid resources.
+* The Prometheus operator implements the `Status` subresource of the `AlertmanagerConfig` CRD to report whether or not the resource is reconciled (see [upstream issue][status-subresource-issue]).
* Users can validate that alert routing works as expected by generating "fake" alerts triggering the notification system. _Users don't have permissions on the Alertmanager API endpoint so they would have to generate fake alerts from alerting rules themselves. We could also support the ability to craft an alert from the OCP console_.

## Design Details

@@ -491,7 +496,7 @@ Users are able to silence alerts from the Dev Console and the console backend
assumes that the API is served by the
`alertmanager-main.openshift-monitoring.svc` service. To support the UWM
Alertmanager configuration, CMO should provide to the console operator the name
-of the Alertmanager service managing user alerts (either
+of the Alertmanager service managing the user alerts (either
`alertmanager-main.openshift-monitoring.svc` or
`alertmanager.openshift-user-workload-monitoring.svc`). Based on the presence
of the `openshift_io_alert_source` label, the console backend can decide which
@@ -515,12 +520,11 @@ N/A

#### Tech Preview -> GA

- The `AlertmanagerConfig` CRD is exposed as `v1beta1` API.
-- More testing (upgrade, downgrade, scale)
+- More testing (upgrade, downgrade, scale).
- Sufficient time for feedback including signals from telemetry about the customer adoption (e.g. number of `AlertmanagerConfig` resources across the fleet).
- Counter-measures to avoid service degradation of the Platform Alertmanager.
-- Option to deploy UWM Alertmanager
-- Conduct load testing
-- Console integration
+- Option to deploy UWM Alertmanager with Console integration.
+- Conduct load testing.

#### Removing a deprecated feature

@@ -544,11 +548,12 @@ N/A

#### Failure Modes

-The validating webhook is configured with `failurePolicy: Fail`. Currently the
-validating webhook service is backed by a single prometheus-operator pod so
-there is a risk that users can't create/update AlertmanagerConfig resources
-when the pod isn't ready. We will address this limitation upstream by allowing
-the deployment of a highly-available webhook service ([issue][ha-webhook-service-issue]).
+The validating webhook for `AlertmanagerConfig` resources is configured with
+`failurePolicy: Fail`. Currently the validating webhook service is backed by a
+single prometheus-operator pod so there is a risk that users can't
+create/update AlertmanagerConfig resources when the pod isn't ready. We will
+address this limitation upstream by allowing the deployment of a
+highly-available webhook service ([issue][ha-webhook-service-issue]).

#### Support Procedures

@@ -580,7 +585,8 @@ Alertmanager is delegated to the cluster admins which would leverage
`additionalAlertmanagerConfigs` to point user alerts to this instance.

The downsides are
-* Degraded user experience and overhead on the users.
+* Degraded user experience and overhead on the cluster admins.
+* No console integration.
* The additional setup wouldn't be supported by Red Hat.
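To make the alternative above concrete, here is a hedged sketch of how UWM admins could point user alerts to an external Alertmanager through `additionalAlertmanagerConfigs`. The exact location of the key and its sub-fields are assumptions based on the existing UWM configuration options, and the endpoint is a placeholder.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |-
    thanosRuler:
      # key placement and field names are assumptions; the endpoint is a placeholder
      additionalAlertmanagerConfigs:
      - scheme: https
        apiVersion: v2
        staticConfigs:
        - external-alertmanager.example.com:9093
```

In this setup, alert routing and receivers live entirely in the external Alertmanager, which is why the configuration burden shifts to the cluster admins as noted above.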
[user-workload-monitoring-enhancement]: https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/user-workload-monitoring.md @@ -592,3 +598,4 @@ The downsides are [bz-1933239]: https://bugzilla.redhat.com/show_bug.cgi?id=1933239 [controller-tools-issue]: https://github.com/kubernetes-sigs/controller-tools/issues/477 [ha-webhook-service-issue]: https://github.com/prometheus-operator/prometheus-operator/issues/4437 +[status-subresource-issue]: https://github.com/prometheus-operator/prometheus-operator/issues/3335 From b0466099a61af4e8a4d17b986954bf535ea57b86 Mon Sep 17 00:00:00 2001 From: Simon Pasquier Date: Wed, 5 Jan 2022 10:49:25 +0100 Subject: [PATCH 13/17] Update reviewers and approvers --- enhancements/monitoring/multi-tenant-alerting.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/enhancements/monitoring/multi-tenant-alerting.md b/enhancements/monitoring/multi-tenant-alerting.md index 9bc0eff528..d6c63f0f5d 100644 --- a/enhancements/monitoring/multi-tenant-alerting.md +++ b/enhancements/monitoring/multi-tenant-alerting.md @@ -3,9 +3,11 @@ title: multi-tenant-alerting authors: - "@simonpasquier" reviewers: + - "@dofinn" + - "@jeremyeder" - "@openshift/openshift-team-monitoring" approvers: - - TBD + - "@jeremyeder" - "@openshift/openshift-team-monitoring" creation-date: 2021-10-11 last-updated: 2021-10-11 From f370ceb24c3d97cbf6a1823621b5e1dac726eb27 Mon Sep 17 00:00:00 2001 From: Simon Pasquier Date: Wed, 5 Jan 2022 14:08:37 +0100 Subject: [PATCH 14/17] Address @dofinn remarks --- .../monitoring/multi-tenant-alerting.md | 39 ++++++++++--------- 1 file changed, 21 insertions(+), 18 deletions(-) diff --git a/enhancements/monitoring/multi-tenant-alerting.md b/enhancements/monitoring/multi-tenant-alerting.md index d6c63f0f5d..a92ac4bf34 100644 --- a/enhancements/monitoring/multi-tenant-alerting.md +++ b/enhancements/monitoring/multi-tenant-alerting.md @@ -21,10 +21,10 @@ see-also: ## Release Signoff Checklist - [X] Enhancement is `implementable` -- [ ] Design details are appropriately documented from clear requirements -- [ ] Test plan is defined -- [ ] Operational readiness criteria is defined -- [ ] Graduation criteria for dev preview, tech preview, GA +- [X] Design details are appropriately documented from clear requirements +- [X] Test plan is defined +- [X] Operational readiness criteria is defined +- [X] Graduation criteria for dev preview, tech preview, GA - [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) ## Summary @@ -58,7 +58,7 @@ adjust the Alertmanager configuration. To streamline the process and avoid cluster admins being the bottleneck, application owners should be able to configure alert routing and notification -receivers in the Plaform Alertmanager without cluster admin intervention. +receivers by themselves. [AlertmanagerConfig][alertmanagerconfig-crd] CRD fullfills this requirement and is supported by the [Prometheus operator][prometheus-operator] since v0.43.0 @@ -76,7 +76,8 @@ routing scoped to individual namespaces. configuration (similar to what exist for service/pod monitors and rules using the `"openshift.io/user-monitoring: false"` label on the namespace). * Cluster admins should be able to opt-out from supporting `AlertmanagerConfig` -resources from user namespaces. +resources for user namespaces. +* Cluster admins should be able to run a separated Alertmanager for user alerts. ### Non-Goals @@ -86,7 +87,6 @@ resources from user namespaces. 
alerts originating from namespaces with the `openshift.io/cluster-monitoring: "true"` label). * Share alert receivers and routes across tenants. -* Deploy an additional UWM Alertmanager. ## Proposal @@ -195,7 +195,7 @@ metadata: data: config.yaml: |- enableUserWorkload: true - alertmanager: + alertmanagerMain: enableUserAlertmanagerConfig: true ``` @@ -243,7 +243,7 @@ metadata: data: config.yaml: |- enableUserWorkload: true - alertmanager: + alertmanagerMain: enableUserAlertmanagerConfig: true excludeUserNamespaces: [foo,bar] ``` @@ -304,13 +304,13 @@ UWM Alertmanager in the | User alert destination | User notifications managed by | `enableUserAlertmanagerConfig` | `alertmanager` (UWM) | `additionalAlertmanagerConfigs` (UWM) | |------------------------|-------------------------------|:------------------------------:|:--------------------:|:-------------------------------------:| -| Platform Alertmanager | Cluster admins | false | empty | empty | -| Platform Alertmanager
External Alertmanager(s) | Cluster admins | false | empty | not empty | -| Platform Alertmanager | Application owners | true | empty | empty | -| UWM Alertmanager | UWM admins | <any> | {enabled: true, enableUserAlertmanagerConfig: false} | empty | -| UWM Alertmanager | Application owners | <any> | {enabled: true, enableUserAlertmanagerConfig: true} | empty | -| UWM Alertmanager
External Alertmanager(s) | UWM admins | <any> | {enabled: true, enableUserAlertmanagerConfig: false} | not empty | -| UWM Alertmanager
External Alertmanager(s) | Application owners | <any> | {enabled: true, enableUserAlertmanagerConfig: true} | not empty | +| Platform Alertmanager | Cluster admins | `false` | empty | empty | +| Platform Alertmanager
External Alertmanager(s) | Cluster admins | `false` | empty | not empty | +| Platform Alertmanager | Application owners | `true` | empty | empty | +| UWM Alertmanager | UWM admins | <any> | `{enabled: true, enableUserAlertmanagerConfig: false}` | empty | +| UWM Alertmanager | Application owners | <any> | `{enabled: true, enableUserAlertmanagerConfig: true}` | empty | +| UWM Alertmanager
External Alertmanager(s) | UWM admins | <any> | `{enabled: true, enableUserAlertmanagerConfig: false}` | not empty | +| UWM Alertmanager
External Alertmanager(s) | Application owners | <any> | `{enabled: true, enableUserAlertmanagerConfig: true}` | not empty | #### Distinction between platform and user alerts @@ -449,8 +449,11 @@ Users may provide bad credentials for the receivers, the system receiving the notifications might be unreachable, or the system might be unable to process the requests. These situations would trigger the `AlertmanagerFailedToSendAlerts` and/or `AlertmanagerClusterFailedToSendAlerts` -alerts. The cluster admins have to act on upon the alerts and understand where -the problem comes from. +alerts. The cluster admins have to act upon these alerts and understand where +the problem comes from by looking at the Alertmanager logs. + +Because these alerts are evaluated by the Platform Prometheus, they are routed +to the Platform Alertmanager. Mitigations * Detailed runbook for the `AlertmanagerFailedToSendAlerts` and `AlertmanagerClusterFailedToSendAlerts` alerts. From 1551d1ab930b995f49540f8da3a73f82f089cb48 Mon Sep 17 00:00:00 2001 From: Simon Pasquier Date: Wed, 26 Jan 2022 16:55:41 +0100 Subject: [PATCH 15/17] fix RBAC role name --- enhancements/monitoring/multi-tenant-alerting.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/enhancements/monitoring/multi-tenant-alerting.md b/enhancements/monitoring/multi-tenant-alerting.md index a92ac4bf34..1f5a9de2e8 100644 --- a/enhancements/monitoring/multi-tenant-alerting.md +++ b/enhancements/monitoring/multi-tenant-alerting.md @@ -389,14 +389,14 @@ receivers: #### RBAC The cluster monitoring operator ships a new cluster role -`alertmanager-config-edit` that grants all actions on `AlertmanagerConfig` +`alert-routing-edit` that grants all actions on `AlertmanagerConfig` custom resources. ```yaml apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: - name: alertmanager-config-edit + name: alert-routing-edit rules: - apiGroups: - monitoring.coreos.com @@ -412,7 +412,7 @@ a given namespace. ```bash oc -n adm policy add-role-to-user \ - alertmanager-config-edit --role-namespace + alert-routing-edit --role-namespace ``` This role complements the existing `monitoring-edit`, `monitoring-rules-edit` and `monitoring-rules-view` roles. From 78d6ddb0da3d404169ceb72663a238dc7e628d6f Mon Sep 17 00:00:00 2001 From: Simon Pasquier Date: Mon, 7 Mar 2022 17:59:16 +0100 Subject: [PATCH 16/17] Add details about OCP console integration Signed-off-by: Simon Pasquier --- .../monitoring/multi-tenant-alerting.md | 42 ++++++++++++++++++- 1 file changed, 41 insertions(+), 1 deletion(-) diff --git a/enhancements/monitoring/multi-tenant-alerting.md b/enhancements/monitoring/multi-tenant-alerting.md index 1f5a9de2e8..012c9ab195 100644 --- a/enhancements/monitoring/multi-tenant-alerting.md +++ b/enhancements/monitoring/multi-tenant-alerting.md @@ -77,7 +77,8 @@ configuration (similar to what exist for service/pod monitors and rules using th `"openshift.io/user-monitoring: false"` label on the namespace). * Cluster admins should be able to opt-out from supporting `AlertmanagerConfig` resources for user namespaces. -* Cluster admins should be able to run a separated Alertmanager for user alerts. +* Cluster admins should be able to run a separated Alertmanager for user alerts +and the integration with the OCP console should be seamless. ### Non-Goals @@ -386,6 +387,45 @@ receivers: ... ``` +#### Console integration + +The OCP user interface leverages the Alertmanager API to display and manage +alert silences. 
+ +* Admin console + * In the "Observe > Alerting > Silences" page, the user can view and edit all silences. + * In the "Observe > Alerting > Alerts" and "Observe > Alerting > Alert details" pages, the user can create a silence for a pending or firing alert. +* Developer console + * In the "Observe > Alerts" page, the user can view alerts in the selected namespace and create/expire silences for active alerts. + +Before this proposal, the console assumes that there's only one Alertmanager +API available at the `alertmanager-main.openshift-monitoring.svc` service. Now +it needs to deal with the possible existence of UWM Alertmanager. + +When a silence is associated to an active alert, the console already knows +whether it comes from a "platform" or "user" alerting rule (platform rules have +the `prometheus="openshift-monitoring/k8s"` label). Provided that CMO tells the +console operator the name of the Alertmanager API service managing user alerts +(e.g. `alertmanager.openshift-user-workload-monitoring.svc` if UWM +Alertmanager is enabled), the console backend can infer which API service it +needs to request: + +* Admin console + * "Observe > Alerting > Silences" + * When listing the silences, the console backend queries both Alertmanager services, merges the results and adds a field identifying the "origin" of the silence (platform vs. user). + * When creating a silence, the user defines whether the silence is for platform or user alerts. The console backend uses the information to request the right Alertmanager API. + * When editing/expiring a silence, the console frontend knows the origin and can pass the information to the console backend. + * "Observe > Alerting > Alerts" and "Observe > Alerting > Alert details" + * The frontend knows the origin of the rule and can pass the information to the console backend. +* Developer console + * "Observe > Alerts" page + * When editing/expiring a silence, the console frontend knows the origin of the rule and can pass the information to the console backend. + +Implementation-wise, CMO reuses the `monitoring-shared-config` ConfigMap in the +`openshift-managed-config` namespace to communicate to the console operator the +location of platform and user Alertmanager APIs. The operator passes down the +configuration to the console backend. + #### RBAC The cluster monitoring operator ships a new cluster role From f556b83846190df04529ed5d5510866420aea885 Mon Sep 17 00:00:00 2001 From: Simon Pasquier Date: Tue, 15 Mar 2022 14:20:58 +0100 Subject: [PATCH 17/17] Clear up 'Open questions' section --- enhancements/monitoring/multi-tenant-alerting.md | 12 +----------- 1 file changed, 1 insertion(+), 11 deletions(-) diff --git a/enhancements/monitoring/multi-tenant-alerting.md b/enhancements/monitoring/multi-tenant-alerting.md index 012c9ab195..373c27709a 100644 --- a/enhancements/monitoring/multi-tenant-alerting.md +++ b/enhancements/monitoring/multi-tenant-alerting.md @@ -535,17 +535,7 @@ Mitigation ### Open Questions -1. How can the Dev Console support the UWM Alertmanager? - -Users are able to silence alerts from the Dev Console and the console backend -assumes that the API is served by the -`alertmanager-main.openshift-monitoring.svc` service. To support the UWM -Alertmanager configuration, CMO should provide to the console operator the name -of the Alertmanager service managing the user alerts (either -`alertmanager-main.openshift-monitoring.svc` or -`alertmanager.openshift-user-workload-monitoring.svc`). 
Based on the presence -of the `openshift_io_alert_source` label, the console backend can decide which -Alertmanager service should be queried. +None. ### Test Plan