Add runbooks description for prometheus alerts which ingress operator…
miheer committed Mar 19, 2024
1 parent 5dffee3 commit 8c07fdd
Showing 6 changed files with 202 additions and 0 deletions.
29 changes: 29 additions & 0 deletions alerts/cluster-ingress-operator/HAProxyDown.md
@@ -0,0 +1,29 @@
# HAProxyDown

## Meaning

This alert fires when metrics report that HAProxy is down.

## Impact

Access to routes will fail. It may cause a severe outage for critical applications.

## Diagnosis

- Check the router logs:
```sh
oc logs <router pod> -n openshift-ingress
```

- Check the events:
```sh
oc get events -n openshift-ingress
```

- Check the load on the system where the routers are hosted.
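The checks above can be wrapped in small helpers so that you do not need to copy each router pod name by hand. This is a minimal sketch: `router_logs` and `node_load` are hypothetical helper names, and `oc adm top nodes` assumes that cluster metrics are available.

```sh
# Print the last N log lines (default 50) from every pod in openshift-ingress.
# router_logs is a hypothetical helper name, not part of oc.
router_logs() {
  lines="${1:-50}"
  for pod in $(oc get pods -n openshift-ingress -o name); do
    echo "== ${pod}"
    oc logs "${pod}" -n openshift-ingress --tail="${lines}"
  done
}

# Show CPU and memory usage of the nodes (assumes cluster metrics are available).
node_load() {
  oc adm top nodes
}
```

After logging in to the cluster, run `router_logs 100` to see the most recent 100 lines per router pod, then `node_load` to compare usage on the nodes that host them.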

## Mitigation

Use the results of the diagnosis to identify the root cause.
If the issue is configuration-related, fix the HAProxy configuration.
If the issue is load-related, resolve it at the infrastructure level.
26 changes: 26 additions & 0 deletions alerts/cluster-ingress-operator/HAProxyReloadFail.md
@@ -0,0 +1,26 @@
# HAProxyReloadFail

## Meaning

This alert fires when HAProxy fails to reload its configuration, which will
result in the router not picking up recently created or modified routes.

## Impact

The router won't pick up recently created or modified routes. This may cause
an outage for critical applications.

## Diagnosis

Check the router logs:
```sh
oc logs <router pod> -n openshift-ingress
```

Check whether a recent change to the HAProxy configuration, made through the
IngressController CR, caused the issue.
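One way to confirm a configuration problem is to ask HAProxy to validate the generated configuration inside the router pod. The sketch below is an illustration under assumptions: `check_haproxy_config` is a hypothetical helper name, and the default configuration path is an assumption that you should verify on your cluster.

```sh
# Validate the HAProxy configuration inside a router pod.
# check_haproxy_config is a hypothetical helper name.
# Usage: check_haproxy_config <router pod> [config path]
check_haproxy_config() {
  pod="$1"
  # Assumed location of the generated config; verify it on your cluster.
  config="${2:-/var/lib/haproxy/conf/haproxy.config}"
  # "haproxy -c" only checks the configuration file and then exits.
  oc exec "${pod}" -n openshift-ingress -- haproxy -c -f "${config}"
}
```

If the check reports an error, compare the offending section with the `spec` of the IngressController CR to find the setting that produced it.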

## Mitigation

Based on the errors in the router logs, fix the HAProxy configuration through
the IngressController CR.
42 changes: 42 additions & 0 deletions alerts/cluster-ingress-operator/IngressControllerDegraded.md
@@ -0,0 +1,42 @@
# IngressControllerDegraded

## Meaning

This alert fires when the IngressController status is degraded.

## Impact

The routers may not be running in the cluster, which causes an outage for
anyone accessing applications through them.

## Diagnosis

The IngressController may be degraded for one or more reasons.

- Check the ingress operator logs using the following command:
```sh
oc logs <ingress operator pod> -n openshift-ingress-operator
```
- Check the router logs using the following command:
```sh
oc logs <router pod> -n openshift-ingress
```
- Check the YAML of the IngressController and of the operator deployment, and
review the events, to find the reason for the failure:
```sh
oc get ingresscontroller <ingresscontroller name> -n openshift-ingress-operator -o yaml
```

```sh
oc get deployment -n openshift-ingress-operator -o yaml
```

```sh
oc get events -n openshift-ingress-operator
```
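The full YAML can be long, so it may help to print only the status conditions of the IngressController. The following is a minimal sketch: `ingresscontroller_conditions` and `degraded_reason` are hypothetical helper names, and the IngressController name defaults to `default` as an assumption.

```sh
# Print one line per status condition: type, status, reason, message (tab-separated).
# ingresscontroller_conditions is a hypothetical helper name.
# Usage: ingresscontroller_conditions [ingresscontroller name]
ingresscontroller_conditions() {
  name="${1:-default}"
  oc get ingresscontroller "${name}" -n openshift-ingress-operator \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}

# Print only the reason and message of the Degraded condition.
degraded_reason() {
  ingresscontroller_conditions "$1" |
    awk -F'\t' '$1 == "Degraded" { print $3 ": " $4 }'
}
```

Running `degraded_reason` with no argument inspects the `default` IngressController; pass another name to inspect a different one.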

## Mitigation

Fix the issue based on the status reported in the YAML and the errors in the
logs returned by the commands above.
41 changes: 41 additions & 0 deletions alerts/cluster-ingress-operator/IngressControllerUnavailable.md
@@ -0,0 +1,41 @@
# IngressControllerUnavailable

## Meaning

This alert fires when the IngressController is not available.

## Impact

This causes an outage in the environment because the applications cannot
be accessed.

## Diagnosis

The IngressController may be unavailable for one or more reasons.

- Check the ingress operator logs using the following command:
```sh
oc logs <ingress operator pod> -n openshift-ingress-operator
```
- Check the router logs using the following command:
```sh
oc logs <router pod> -n openshift-ingress
```
- Check the YAML of the IngressController and of the operator deployment,
and review the events, to find the reason for the failure:
```sh
oc get ingresscontroller <ingresscontroller name> -n openshift-ingress-operator -o yaml
```

```sh
oc get deployment -n openshift-ingress-operator -o yaml
```

```sh
oc get events -n openshift-ingress-operator
```
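Because an unavailable IngressController usually means that its router pods are not ready, it can also help to compare the available and desired replica counts of the router deployment. This sketch assumes the router deployment is named `router-<ingresscontroller name>` in `openshift-ingress`; `router_replicas` is a hypothetical helper name.

```sh
# Print "<available>/<desired>" replicas for a router deployment.
# router_replicas is a hypothetical helper name; the deployment name
# router-<ingresscontroller name> is an assumption to verify on your cluster.
# Usage: router_replicas [ingresscontroller name]
router_replicas() {
  name="${1:-default}"
  # The available count prints as empty when no replicas are available.
  oc get deployment "router-${name}" -n openshift-ingress \
    -o jsonpath='{.status.availableReplicas}/{.spec.replicas}{"\n"}'
}
```

If the available count is lower than the desired count, `oc get pods -n openshift-ingress -o wide` shows which router pods are not running and on which nodes.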

## Mitigation

Fix the issue based on the status reported in the YAML and the errors
in the logs returned by the commands above.
33 changes: 33 additions & 0 deletions alerts/cluster-ingress-operator/IngressWithoutClassName.md
@@ -0,0 +1,33 @@
# IngressWithoutClassName

## Meaning

This alert fires when an Ingress has had `spec.ingressClassName` unset
for longer than one day.

## Impact

A user may have created an Ingress with a nonempty value for
`spec.ingressClassName` that does not match an OpenShift IngressClass
object, and nevertheless intended OpenShift to expose this Ingress.
The user's intent cannot be determined reliably in such a scenario, but
because OpenShift exposed such an Ingress before this enhancement,
changing this behavior could break existing applications.

For this reason, the enhancement considered modifying the ingress operator
to list all Ingresses and Routes in the cluster and publish a metric
for Routes that were created for Ingresses that OpenShift no longer
manages.

## Diagnosis

- Check for alert messages in the UI.
- Inspect the Ingress object.
- Inspect the Route object and check its status.
- Check the logs of `cluster-openshift-controller-manager-operator`.
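To find the Ingress objects to inspect, you can list every Ingress whose `spec.ingressClassName` is empty. This is a minimal sketch that uses only `oc` and `awk`; `ingresses_without_class` is a hypothetical helper name.

```sh
# List Ingresses with no spec.ingressClassName as <namespace>/<name>.
# ingresses_without_class is a hypothetical helper name.
ingresses_without_class() {
  oc get ingress --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.ingressClassName}{"\n"}{end}' |
    awk -F'\t' '$1 != "" && $3 == "" { print $1 "/" $2 }'
}
```

Each line of output names one Ingress that you can then inspect with `oc get ingress <name> -n <namespace> -o yaml`.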

## Mitigation

Determine why the Route was created by an Ingress that OpenShift
no longer manages.
Delete the Ingress and the Route if they are no longer needed.
31 changes: 31 additions & 0 deletions alerts/cluster-ingress-operator/UnmanagedRoutes.md
@@ -0,0 +1,31 @@
# UnmanagedRoutes

## Meaning

This alert fires when there is a Route owned by an unmanaged Ingress.

## Impact

The ingress-to-route controller does not remove Routes that earlier versions
of OpenShift created for Ingresses that specify `spec.ingressClassName`.
Thus, these Routes will continue to be in effect.
OpenShift does not update such Routes and does not recreate
them if the user deletes them.

If any Routes exist in this state, the alert tells the administrator
that either the Routes need to be deleted or the Ingress needs to be
modified to specify an appropriate IngressClass, so that OpenShift
once again reconciles the Routes.

## Diagnosis

- Check for alert messages in the UI.
- Inspect the Ingress object.
- Inspect the Route object.
- Check the logs of `cluster-openshift-controller-manager-operator`.
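To find candidate Routes, you can list the Routes whose owner reference points to an Ingress. This is a sketch under assumptions: `routes_owned_by_ingress` is a hypothetical helper name, and the output includes every Route owned by an Ingress, not only the unmanaged ones, so check each owning Ingress afterwards.

```sh
# List Routes owned by an Ingress as "<namespace>/<route> <- ingress/<name>".
# routes_owned_by_ingress is a hypothetical helper name.
routes_owned_by_ingress() {
  oc get routes --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.ownerReferences[*].kind}{"\t"}{.metadata.ownerReferences[*].name}{"\n"}{end}' |
    awk -F'\t' '$3 == "Ingress" { print $1 "/" $2 " <- ingress/" $4 }'
}
```

For each line, inspect the owning Ingress and compare its `spec.ingressClassName` with the IngressClass objects returned by `oc get ingressclass`.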

## Mitigation

Specify an appropriate IngressClass in the Ingress object so that
OpenShift once again reconciles the Routes, or delete the Routes if
they are no longer needed.
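Setting the IngressClass can be done with a merge patch on the Ingress. The sketch below is illustrative: `set_ingress_class` is a hypothetical helper name, and the default class name `openshift-default` is an assumption; check `oc get ingressclass` for the names that exist on your cluster.

```sh
# Set spec.ingressClassName on an Ingress with a merge patch.
# set_ingress_class is a hypothetical helper name.
# Usage: set_ingress_class <namespace> <ingress name> [ingress class]
set_ingress_class() {
  namespace="$1"
  ingress="$2"
  # Assumed default class name; confirm it with "oc get ingressclass".
  class="${3:-openshift-default}"
  oc patch ingress "${ingress}" -n "${namespace}" --type=merge \
    -p "{\"spec\":{\"ingressClassName\":\"${class}\"}}"
}
```

For example, `set_ingress_class my-namespace my-ingress` patches the Ingress `my-ingress` in `my-namespace`; both names are placeholders.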
