diff --git a/enhancements/monitoring/multi-tenant-alerting.md b/enhancements/monitoring/multi-tenant-alerting.md
index a88ab0b206b..fe5627b224c 100644
--- a/enhancements/monitoring/multi-tenant-alerting.md
+++ b/enhancements/monitoring/multi-tenant-alerting.md
@@ -151,6 +151,14 @@ instance).
The `AlertmanagerConfig` CRD is exposed by the `coreos.monitoring.com/v1alpha1` API group.
+#### Deployment models
+
+##### Leveraging the Platform Alertmanager
+
+In this model, no additional Alertmanager is deployed and the user alerts are
+forwarded to the existing Platform Alertmanager. This is matching the current
+model.
+
The `Alertmanager` custom resource defines 2 LabelSelector fields
(`alertmanagerConfigSelector` and `alertmanagerConfigNamespaceSelector`) to
select which `AlertmanagerConfig` resources should be reconciled by the
@@ -158,7 +166,7 @@ Prometheus operator and from which namespace(s). Before this proposal, the
Plaform Alertmanager resource defines the 2 selectors as null, meaning that it
doesn't select any `AlertmanagerConfig` resources.
-We propose to add a new field `enableUserAlertmanagerConfig` to the
+We propose to add a new boolean field `enableUserAlertmanagerConfig` to the
`openshift-montoring/cluster-monitoring-config` configmap. If
`enableUserAlertmanagerConfig` is missing, the default value is false.
@@ -175,7 +183,8 @@ data:
```
When `enableUserAlertmanagerConfig` is true, the cluster monitoring operator
-configures the Platform Alertmanager as follows.
+configures the Platform Alertmanager to reconcile `AlertmanagerConfig`
+resources from user namespaces as follows.
```yaml
apiVersion: monitoring.coreos.com/v1
@@ -200,12 +209,13 @@ spec:
To be consistent with what exists already for service/pod monitors and rules,
the Prometheus operator doesn't reconcile `AlertmanagerConfig` resources from
-namespaces with the `openshift.io/user-monitoring: "false"` label. It allows
+namespaces with the `openshift.io/user-monitoring: "false"` label. It allows
application owners to opt out completely from UWM in case they deploy and run
their own monitoring infrastructure (for instance with the [Monitoring Stack
operator][monitoring-stack-operator]).
-In addition, the cluster admins can exclude specific user namespace(s) from UWM with the new `excludeUserNamespaces` field.
+In addition, the cluster admins can exclude specific user namespace(s) from UWM
+with the new `excludeUserNamespaces` field.
```yaml
apiVersion: v1
@@ -220,11 +230,19 @@ data:
excludeUserNamespaces: [foo,bar]
```
-The UWM admins can also define that UWM alerts shouldn't be forwarded to the
-Platform Alertmanager. With this capability and the existing
-`additionalAlertmanagerConfigs`, it is possible to externalize the alert
-routing and notifications to an external Alertmanager instance when the cluster
-admins don't want to share the Plaform Alertmanager for instance.
+##### Dedicated UWM Alertmanager
+
+In some environments where cluster admins and UWM admins are different personas
+(e.g. OSD), it might not be acceptable for cluster admins to let users
+configure the Platform Alertmanager because:
+* User configurations may break the Alertmanager configuration.
+* Processing of user alerts may slow down the alert notification pipeline.
+* Cluster admins don't want to deal with delivery errors for user notifications.
+
+In this case, UWM admins have the possibility to deploy a dedicated
+Alertmanager. The configuration options will be equivalent to the options
+exposed for the Platform Alertmanager and exposed under the `alertmanager` key
+in the UWM configmap.
```yaml
apiVersion: v1
@@ -234,24 +252,43 @@ metadata:
namespace: openshift-user-workload-monitoring
data:
config.yaml: |-
- thanosRuler:
- usePlatformAlertmanager: false
- prometheus:
- usePlatformAlertmanager: false
- additionalAlertmanagerConfigs: [...]
+ alertmanager:
+ enabled: true
+ logLevel: info
+ nodeSelector: {...}
+ tolerations: [...]
+ resources: {...}
+ volumeClaimTemplate: {...}
+ prometheus: {}
+ thanosRuler: {}
```
-When this option is chosen, the OCP console can't be used to manage silences for user alerts.
+The UWM Alertmanager will be automatically configured to reconcile
+`AlertmanagerConfig` resources from all user namespaces (just like for UWM
+service/pod monitors and rules). Again namespaces with the
+`openshift.io/user-monitoring: false` label will be excluded.
-Summary of the different combinations:
+When the UWM Alertmanager is enabled:
+* The Platform Alertmanager will be configured to not reconcile
+ `AlertmanagerConfig` resources from user
+ namespaces.
+* The UWM Prometheus and Thanos Ruler will send alerts to
+ the UWM Alertmanager only.
-| User alert destination | User notifications managed by | `enableUserWorkload` | `enableUserAlertmanagerConfig` | `usePlatformAlertmanager` (UWM) | `additionalAlertmanagerConfigs` (UWM) |
+The UWM admins are responsible for provisioning the root configuration of the
+UWM Alertmanager in the
+`openshift-user-workload-monitoring/alertmanager-user-workload` secret.
+
+
+##### Summary
+
+| User alert destination | User notifications managed by | `enableUserWorkload` | `enableUserAlertmanagerConfig` | `alertmanager` (UWM) | `additionalAlertmanagerConfigs` (UWM) |
|----|----|:--------------------:|:------------------------------:|:------------------------------------:|:-------------------------------------:|
| Nowhere | No-one | true | <any> | false | empty |
-| Platform Alertmanager | Cluster admins | true | false | true | empty |
-| Platform Alertmanager
External Alertmanager(s) | Cluster admins for the Platform Alertmanager | true | false | true | not empty |
-| Platform Alertmanager | Application owners | true | true | true | empty |
-| Platform Alertmanager
External Alertmanager(s) | Application owners | true | true | true | not empty |
+| Platform Alertmanager | Cluster admins | true | false | empty | empty |
+| Platform Alertmanager
External Alertmanager(s) | Cluster admins for the Platform Alertmanager | true | false | empty | not empty |
+| UWM Alertmanager | Application owners | true | true | not empty | empty |
+| UWM Alertmanager
External Alertmanager(s) | Application owners | true | true | not empty | not empty |
#### Distinction between platform and user alerts
@@ -259,7 +296,7 @@ Summary of the different combinations:
It is important that platform alerts can be clearly distinguished from user
alerts because cluster admins want to ensure that:
1. all alerts originating from platform components are dispatched to at least one default receiver which is owned by the admin team.
-2. they aren't notified about any user alert.
+2. they aren't notified about any user alert and focus on platform alerts.
To this effect, CMO configures the Platform Prometheus instances with a new
external label: `openshift_io_alert_source="platform"`.
@@ -321,9 +358,8 @@ receivers:
#### RBAC
The cluster monitoring operator ships a new cluster role
-`alertmanager-config-edit` so that cluster admins can bind it with a
-`RoleBinding` to grant permissions to users or groups on `AlertmanagerConfig`
-custom resources within a given namespace.
+`alertmanager-config-edit` that grants all actions on `AlertmanagerConfig`
+custom resources.
```yaml
apiVersion: rbac.authorization.k8s.io/v1
@@ -339,6 +375,15 @@ rules:
- '*'
```
+Cluster admins can bind the cluster role with a `RoleBinding` to grant
+permissions to users or groups on `AlertmanagerConfig` custom resources within
+a given namespace.
+
+```
+oc -n adm policy add-role-to-user \
+ alertmanager-config-edit --role-namespace
+```
+
This role complements the existing `monitoring-edit`, `monitoring-rules-edit` and `monitoring-rules-view` roles.
#### Resource impacts
@@ -411,21 +456,10 @@ Mitigation
### Open Questions
-1. Should CMO allow UWM admins to deploy a separate UWM Alertmanager cluster if the cluster admins don't want to share the Platform Alertmanager?
-
-While the UWM admins have the ability to configure external Alertmanager
-endpoints where user alerts should be sent, it requires someone to manage the
-deployment of this additional Alertmanager. We could add an option in the UWM
-config map to enable an Alertmanager instance running in the
-`openshift-user-workload-monitoring` namespace.
+1. How can the console support the UWM Alertmanager?
-Pros
-* It provides a better experience for UWM admins: no need to maintain a standalone Alertmanager cluster, less likely to mess up the configuration of `additionalAlertmanagerConfigs`.
-Cons
-* Increased complexity in the CMO codebase and in the UWM configuration options.
-* Additional resource overhead (though Alertmanager is usually light on resources).
-* Redundancy with the [Monitoring Stack operator][monitoring-stack-operator].
-* More work required for a proper integration in the OCP console.
+Right now the console backend manages the user-defined silences via the
+Platform Alertmanager API. It would need to be aware of the deployment model.
### Test Plan
@@ -480,11 +514,24 @@ N/A
## Alternatives
+### Status-quo
+
An alternative is to keep the current status-quo and rely on cluster admins to
configure alert routing for their users. This proposal doesn't forbid this
model since cluster admins can decide to not reconcile user-defined
`AlertmanagerConfig` resources within the Platform Alertmanager.
+### Don't support UWM Alertmanager
+
+We could decide that CMO doesn't offer the ability to deploy the UWM
+Alertmanager. In this case the responsibility of deploying an additional
+Alertmanager is delegated to the cluster admins which would leverage
+`additionalAlertmanagerConfigs` to point user alerts to this instance.
+
+The downsides are
+* Degraded user experience and overhead on the users.
+* The additional setup wouldn't be supported by Red Hat.
+
[user-workload-monitoring-enhancement]: https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/user-workload-monitoring.md
[uwm-docs]: https://docs.openshift.com/container-platform/4.8/monitoring/enabling-monitoring-for-user-defined-projects.html
[prometheus-operator]: https://prometheus-operator.dev/