Draft a first version of UWM Alertmanager deployment
simonpasquier committed Nov 26, 2021
1 parent 908921f commit 9f234bb
Showing 1 changed file with 86 additions and 39 deletions: `enhancements/monitoring/multi-tenant-alerting.md`
@@ -151,14 +151,22 @@ instance).

The `AlertmanagerConfig` CRD is exposed by the `monitoring.coreos.com/v1alpha1` API group.
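
For illustration, a minimal `AlertmanagerConfig` resource that an application
owner might create in their namespace could look like this (a sketch: the
namespace, receiver name and webhook URL are hypothetical):

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: example-routing
  namespace: my-app                           # hypothetical user namespace
spec:
  route:
    receiver: team-webhook
    groupBy: ['alertname']
  receivers:
  - name: team-webhook
    webhookConfigs:
    - url: 'https://example.com/alert-hook'   # hypothetical endpoint
```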

#### Deployment models

##### Leveraging the Platform Alertmanager

In this model, no additional Alertmanager is deployed and user alerts are
forwarded to the existing Platform Alertmanager. This matches the current
model.

The `Alertmanager` custom resource defines two `LabelSelector` fields
(`alertmanagerConfigSelector` and `alertmanagerConfigNamespaceSelector`) to
select which `AlertmanagerConfig` resources should be reconciled by the
Prometheus operator and from which namespace(s). Before this proposal, the
Platform Alertmanager resource sets both selectors to null, meaning that it
doesn't select any `AlertmanagerConfig` resources.
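
For reference, an abridged sketch of the relevant part of the Platform
Alertmanager resource before this proposal (the resource name shown is the one
typically deployed by CMO):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main
  namespace: openshift-monitoring
spec:
  # Null selectors: no AlertmanagerConfig resource is reconciled.
  alertmanagerConfigSelector: null
  alertmanagerConfigNamespaceSelector: null
```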

We propose to add a new boolean field `enableUserAlertmanagerConfig` to the
`openshift-monitoring/cluster-monitoring-config` configmap. If
`enableUserAlertmanagerConfig` is missing, it defaults to false.

@@ -175,7 +183,8 @@ data:
```
When `enableUserAlertmanagerConfig` is true, the cluster monitoring operator
configures the Platform Alertmanager to reconcile `AlertmanagerConfig`
resources from user namespaces as follows.

```yaml
apiVersion: monitoring.coreos.com/v1
@@ -200,12 +209,13 @@ spec:

To be consistent with what exists already for service/pod monitors and rules,
the Prometheus operator doesn't reconcile `AlertmanagerConfig` resources from
namespaces with the `openshift.io/user-monitoring: "false"` label. This allows
application owners to opt out of UWM entirely when they deploy and run
their own monitoring infrastructure (for instance with the [Monitoring Stack
operator][monitoring-stack-operator]).
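
For example, a namespace can be opted out by applying the label (a sketch; the
namespace name is hypothetical):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app                              # hypothetical namespace
  labels:
    openshift.io/user-monitoring: "false"
```
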
In addition, the cluster admins can exclude specific user namespace(s) from UWM
with the new `excludeUserNamespaces` field.

```yaml
apiVersion: v1
@@ -220,11 +230,19 @@ data:
excludeUserNamespaces: [foo,bar]
```

##### Dedicated UWM Alertmanager

In some environments where cluster admins and UWM admins are different personas
(e.g. OSD), it might not be acceptable for cluster admins to let users
configure the Platform Alertmanager because:
* User configurations may break the Alertmanager configuration.
* Processing of user alerts may slow down the alert notification pipeline.
* Cluster admins don't want to deal with delivery errors for user notifications.

In this case, UWM admins can deploy a dedicated Alertmanager. The
configuration options are equivalent to those of the Platform Alertmanager and
are exposed under the `alertmanager` key in the UWM configmap.

```yaml
apiVersion: v1
@@ -234,32 +252,51 @@ metadata:
namespace: openshift-user-workload-monitoring
data:
config.yaml: |-
    alertmanager:
      enabled: true
      logLevel: info
      nodeSelector: {...}
      tolerations: [...]
      resources: {...}
      volumeClaimTemplate: {...}
    prometheus: {}
    thanosRuler: {}
```

The UWM Alertmanager will be automatically configured to reconcile
`AlertmanagerConfig` resources from all user namespaces (just like for UWM
service/pod monitors and rules). Again, namespaces with the
`openshift.io/user-monitoring: "false"` label will be excluded.
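
One possible shape of the generated selectors on the UWM Alertmanager resource
(an illustrative sketch only; the exact selectors are produced by the cluster
monitoring operator):

```yaml
spec:
  # Any AlertmanagerConfig in the selected namespaces is reconciled.
  alertmanagerConfigSelector: {}
  # User namespaces are selected unless they carry the opt-out label.
  alertmanagerConfigNamespaceSelector:
    matchExpressions:
    - key: openshift.io/user-monitoring
      operator: NotIn
      values: ["false"]
```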

When the UWM Alertmanager is enabled:
* The Platform Alertmanager will be configured to not reconcile
  `AlertmanagerConfig` resources from user namespaces.
* The UWM Prometheus and Thanos Ruler will send alerts to the UWM Alertmanager
  only.

The UWM admins are responsible for provisioning the root configuration of the
UWM Alertmanager in the
`openshift-user-workload-monitoring/alertmanager-user-workload` secret.
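
A hedged sketch of how that secret could be provisioned (the `alertmanager.yaml`
key name and the empty default receiver are assumptions for illustration):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-user-workload
  namespace: openshift-user-workload-monitoring
stringData:
  # The key name is assumed for illustration.
  alertmanager.yaml: |
    route:
      receiver: default
    receivers:
    - name: default
```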


##### Summary

| User alert destination | User notifications managed by | `enableUserWorkload` | `enableUserAlertmanagerConfig` | `alertmanager` (UWM) | `additionalAlertmanagerConfigs` (UWM) |
|----|----|:--------------------:|:------------------------------:|:------------------------------------:|:-------------------------------------:|
| Platform Alertmanager | Cluster admins | true | false | empty | empty |
| Platform Alertmanager<br/>External Alertmanager(s) | Cluster admins for the Platform Alertmanager | true | false | empty | not empty |
| UWM Alertmanager | Application owners | true | true | not empty | empty |
| UWM Alertmanager<br/>External Alertmanager(s) | Application owners | true | true | not empty | not empty |


#### Distinction between platform and user alerts

It is important that platform alerts can be clearly distinguished from user
alerts because cluster admins want to ensure that:
1. all alerts originating from platform components are dispatched to at least one default receiver owned by the admin team.
2. they aren't notified about any user alert and can focus on platform alerts.

To this effect, CMO configures the Platform Prometheus instances with a new
external label: `openshift_io_alert_source="platform"`.
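
As an illustrative sketch (CMO applies this internally when generating the
Prometheus configuration), the label corresponds to an `externalLabels` entry
on the Platform Prometheus resources:

```yaml
spec:
  externalLabels:
    openshift_io_alert_source: platform
```
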
@@ -321,9 +358,8 @@ receivers:
#### RBAC

The cluster monitoring operator ships a new cluster role
`alertmanager-config-edit` that grants all actions on `AlertmanagerConfig`
custom resources.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
@@ -339,6 +375,15 @@ rules:
- '*'
```

Cluster admins can bind the cluster role with a `RoleBinding` to grant
permissions to users or groups on `AlertmanagerConfig` custom resources within
a given namespace.

```
oc -n <namespace> adm policy add-role-to-user \
alertmanager-config-edit <user> --role-namespace <namespace>
```
This role complements the existing `monitoring-edit`, `monitoring-rules-edit`,
and `monitoring-rules-view` roles.

#### Resource impacts
@@ -411,21 +456,10 @@ Mitigation
### Open Questions

1. How can the console support the UWM Alertmanager?
Right now the console backend manages the user-defined silences via the
Platform Alertmanager API. It would need to be aware of the deployment model.
### Test Plan
@@ -480,11 +514,24 @@ N/A
## Alternatives

### Status-quo

An alternative is to keep the status quo and rely on cluster admins to
configure alert routing for their users. This proposal doesn't forbid this
model since cluster admins can decide not to reconcile user-defined
`AlertmanagerConfig` resources within the Platform Alertmanager.

### Don't support UWM Alertmanager

We could decide that CMO doesn't offer the ability to deploy the UWM
Alertmanager. In this case, the responsibility of deploying an additional
Alertmanager is delegated to the cluster admins, who would leverage
`additionalAlertmanagerConfigs` to point user alerts to this instance (see the
sketch below).

The downsides are:
* Degraded user experience and additional overhead for the users.
* The additional setup wouldn't be supported by Red Hat.
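
A hedged sketch of such an `additionalAlertmanagerConfigs` entry in the UWM
configmap (the endpoint is hypothetical; the field layout follows the existing
UWM configuration schema):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |-
    prometheus:
      additionalAlertmanagerConfigs:
      - scheme: https
        apiVersion: v2
        staticConfigs:
        - external-alertmanager.example.com:9093   # hypothetical endpoint
```
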
[user-workload-monitoring-enhancement]: https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/user-workload-monitoring.md
[uwm-docs]: https://docs.openshift.com/container-platform/4.8/monitoring/enabling-monitoring-for-user-defined-projects.html
[prometheus-operator]: https://prometheus-operator.dev/
