Add runbooks description for prometheus alerts which ingress operator…

… provides. Ticket: https://issues.redhat.com/browse/OCPBUGS-14057
openshift · Feb 26, 2024 · 3c0956d
1 parent 5dffee3
commit 3c0956d
Showing 6 changed files with 192 additions and 0 deletions.
diff --git a/alerts/cluster-ingress-operator/HAProxyDown.md b/alerts/cluster-ingress-operator/HAProxyDown.md
@@ -0,0 +1,31 @@
+# NodeFilesystemSpaceFillingUp
+
+## Meaning
+
+This alert is based on an extrapolation of the space used in a file system. It
+fires if the current usage is above a certain threshold _and_ the extrapolation
+predicts that the file system will run out of space within a certain time. This
+is a warning-level alert if that time is less than 24h. It is a critical alert
+if that time is less than 4h.
+
+## Impact
+
+A completely full filesystem breaks any process that needs to write to it.
+Even before a filesystem fills up completely, performance usually
+degrades.
+
+## Diagnosis
+
+Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic
+pattern of writing and cleaning up can trick the linear prediction into a false
+alert.
+
+Use the usual OS tools to investigate which directories are the largest and/or
+most recent offenders.
+
+Is this an irregular condition, e.g. a process failing to clean up after
+itself, or is this organic growth?
+
+## Mitigation
+
+<Insert site specific measures, for example to grow a persistent volume.>
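For the Diagnosis step above ("Use the usual OS tools"), here is a minimal sketch that is not part of the committed runbook. The function names `filesystems_over` and `largest_dirs`, the 80% default threshold, and the `/var` default path are hypothetical. It assumes POSIX `df -P` output, a `du` that supports `-d`, and mount points without spaces.

```sh
# Hypothetical helpers; names and defaults are illustrative assumptions.

# Print mount points whose usage is at or above a threshold (default 80%).
filesystems_over() {
  threshold="${1:-80}"
  df -P | awk -v t="$threshold" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)   # strip the percent sign from the Capacity column
      if (use + 0 >= t + 0) print $6, $5
    }'
}

# Print the N largest directories (sizes in 1024-byte units) one level below
# a path, staying on the same filesystem (-x). The first line is normally the
# total for the path itself.
largest_dirs() {
  dir="${1:-/var}"
  count="${2:-10}"
  du -x -k -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}
```

A typical use would be `filesystems_over 90` to find the affected mount point, followed by `largest_dirs <mount point> 5` to narrow down the offending directory.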
diff --git a/alerts/cluster-ingress-operator/HAProxyReloadFail.md b/alerts/cluster-ingress-operator/HAProxyReloadFail.md
@@ -0,0 +1,24 @@
+# HAProxyReloadFail
+
+## Meaning
+
+This alert fires when HAProxy fails to reload its configuration, which will result in the router 
+not picking up recently created or modified routes.
+
+## Impact
+
+The router won't pick up recently created or modified routes. This may cause an outage for critical 
+applications.
+
+## Diagnosis
+
+Check the router logs:
+```sh
+oc logs <router pod> -n openshift-ingress
+```
+
+Check whether a recent change to the HAProxy configuration made through the IngressController CR caused the issue.
+
+## Mitigation
+
+Based on the errors in the logs, fix the HAProxy configuration through the IngressController CR.
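The "Check the router logs" step in `HAProxyReloadFail.md` leaves the pod name as a placeholder. As a hedged sketch, the helper below loops over every pod in the `openshift-ingress` namespace and prints the tail of its logs. The function name `router_logs` and the default of 100 lines are hypothetical, and it assumes that only router pods run in that namespace.

```sh
# Hypothetical helper: print the last N log lines of every pod in the
# openshift-ingress namespace (assumed to contain only router pods).
router_logs() {
  lines="${1:-100}"
  for pod in $(oc get pods -n openshift-ingress -o name); do
    echo "==> $pod"
    oc logs "$pod" -n openshift-ingress --tail="$lines"
  done
}
```

Calling `router_logs 200` prints the last 200 lines per pod, which is usually enough to see the most recent reload attempt.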
diff --git a/alerts/cluster-ingress-operator/IngressControllerDegraded.md b/alerts/cluster-ingress-operator/IngressControllerDegraded.md
@@ -0,0 +1,39 @@
+# IngressControllerDegraded
+
+## Meaning
+
+This alert fires when the IngressController status is degraded.
+
+## Impact
+
+The routers may not be running in the cluster, which can cause an outage for users accessing the applications.
+
+## Diagnosis
+
+The IngressController may be degraded for one or more reasons.
+
+- Check the ingress operator logs using the following command:
+```sh
+oc logs <ingress operator pod> -n openshift-ingress-operator
+```
+- Check the router logs using the following command:
+```sh
+oc logs <router pod> -n openshift-ingress
+```
+- Inspect the IngressController and operator deployment YAML, and recent events, for the reason for the failure:
+```sh
+oc get ingresscontroller <ingresscontroller name> -n openshift-ingress-operator -o yaml
+```
+
+```sh
+oc get deployment -n openshift-ingress-operator -o yaml
+```
+
+```sh
+oc get events -n openshift-ingress
+```
+
+## Mitigation
+
+Fix the issue based on the status conditions in the YAML output and the errors in the logs returned by the
+commands above.
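For `IngressControllerDegraded.md`, reading the full YAML to find the failure reason can be slow. The sketch below, offered as an assumption rather than the operator's recommended procedure, prints only the status conditions as tab-separated columns and then filters for the `Degraded` condition. The function names are hypothetical, and `default` is assumed as the IngressController name when none is given.

```sh
# Hypothetical helpers; "default" is an assumed IngressController name.

# Print type, status, reason and message of every status condition.
ingresscontroller_conditions() {
  name="${1:-default}"
  oc get ingresscontroller "$name" -n openshift-ingress-operator \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}

# Print only the Degraded condition.
degraded_condition() {
  ingresscontroller_conditions "${1:-default}" | awk -F '\t' '$1 == "Degraded"'
}
```

The message column of the `Degraded` line normally names the underlying condition, which indicates whether to look at the operator logs or the router logs next.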
diff --git a/alerts/cluster-ingress-operator/IngressControllerUnavailable.md b/alerts/cluster-ingress-operator/IngressControllerUnavailable.md
@@ -0,0 +1,39 @@
+# IngressControllerUnavailable
+
+## Meaning
+
+This alert fires when the IngressController is not available.
+
+## Impact
+
+This causes an outage because the applications exposed through the IngressController are not reachable.
+
+## Diagnosis
+
+The IngressController may be unavailable for one or more reasons.
+
+- Check the ingress operator logs using the following command:
+```sh
+oc logs <ingress operator pod> -n openshift-ingress-operator
+```
+- Check the router logs using the following command:
+```sh
+oc logs <router pod> -n openshift-ingress
+```
+- Inspect the IngressController and operator deployment YAML, and recent events, for the reason for the failure:
+```sh
+oc get ingresscontroller <ingresscontroller name> -n openshift-ingress-operator -o yaml
+```
+
+```sh
+oc get deployment -n openshift-ingress-operator -o yaml
+```
+
+```sh
+oc get events -n openshift-ingress
+```
+
+## Mitigation
+
+Fix the issue based on the status conditions in the YAML output and the errors in the logs returned by the
+commands above.
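For `IngressControllerUnavailable.md`, the workloads and events in both ingress namespaces are often the quickest way to see why no router is available. The following is a rough sketch under that assumption. The function name `ingress_overview` is hypothetical, and sorting by `.lastTimestamp` is one reasonable choice rather than a requirement.

```sh
# Hypothetical helper: show deployments, pods and time-sorted events for the
# operator namespace and the router namespace.
ingress_overview() {
  for ns in openshift-ingress-operator openshift-ingress; do
    echo "==> $ns"
    oc get deployment,pods -n "$ns"
    oc get events -n "$ns" --sort-by=.lastTimestamp
  done
}
```

Events such as failed scheduling or image pull errors in `openshift-ingress` usually explain missing router replicas more directly than the operator logs do.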
diff --git a/alerts/cluster-ingress-operator/IngressWithoutClassName.md b/alerts/cluster-ingress-operator/IngressWithoutClassName.md
@@ -0,0 +1,31 @@
+# IngressWithoutClassName
+
+## Meaning
+
+This alert fires when an Ingress has had `spec.ingressClassName` unset for longer than one day.
+
+## Impact
+
+It is possible that a user could have created an Ingress with
+some nonempty value for spec.ingressClassName that did not match an
+OpenShift IngressClass object, and nevertheless intended for OpenShift
+to expose this Ingress. It is impossible to determine reliably
+what a user's intent was in such a scenario, but as OpenShift exposed
+such an Ingress before this enhancement, changing this behavior could
+break existing applications.
+
+So, we considered modifying the ingress operator
+to list all Ingresses and Routes in the cluster and publish a metric
+for Routes that were created for Ingresses that OpenShift no longer
+manages.
+
+## Diagnosis
+
+- Check for alert messages on the UI.
+- Inspect the ingress object.
+- Inspect the route object and check its status.
+- Check the logs of `cluster-openshift-controller-manager-operator`.
+
+## Mitigation
+Identify the Routes that were created by an Ingress that OpenShift no longer manages.
+Delete the Ingress and its Route if they are no longer needed.
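To support the "Inspect the ingress object" step in `IngressWithoutClassName.md`, the sketch below lists every Ingress whose `spec.ingressClassName` is unset. The function name `ingresses_without_class` is hypothetical. It relies on `custom-columns` printing `<none>` for a missing field and assumes namespace and Ingress names contain no whitespace.

```sh
# Hypothetical helper: print namespace/name for each Ingress that has no
# spec.ingressClassName. custom-columns prints "<none>" for an unset field.
ingresses_without_class() {
  oc get ingress --all-namespaces --no-headers \
    -o 'custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName' |
    awk '$3 == "<none>" { print $1 "/" $2 }'
}
```

Each line of output is a candidate for either setting a class name or deleting the Ingress, depending on what its owner intended.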
diff --git a/alerts/cluster-ingress-operator/UnmanagedRoutes.md b/alerts/cluster-ingress-operator/UnmanagedRoutes.md
@@ -0,0 +1,28 @@
+# UnmanagedRoutes
+
+## Meaning
+
+This alert fires when there is a Route owned by an unmanaged Ingress.
+
+## Impact
+
+The ingress-to-route controller does not remove Routes that earlier versions of OpenShift created for
+Ingresses that specify `spec.ingressClassName`. Thus, these Routes will continue to
+be in effect. OpenShift does not update such Routes and does not recreate them if the user deletes them.
+
+If any Routes exist in this state, the alert tells the administrator
+that the Routes need to be deleted, or that the Ingress needs to be modified
+to specify an appropriate IngressClass so that OpenShift once again
+reconciles the Routes.
+
+## Diagnosis
+
+- Check for alert messages on the UI.
+- Inspect the ingress object.
+- Inspect the route object.
+- Check the logs of `cluster-openshift-controller-manager-operator`.
+
+## Mitigation
+
+Specify an appropriate IngressClass in the Ingress object so that OpenShift once again reconciles the
+Routes, or delete the Routes if they are no longer needed.
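For `UnmanagedRoutes.md`, the sketch below covers two steps: finding Routes whose first owner reference is an Ingress, and setting `spec.ingressClassName` on an Ingress. Both function names are hypothetical. The listing shows every Ingress-owned Route, not only unmanaged ones, so each owning Ingress still has to be checked by hand. The class name passed to `set_ingress_class` must match an IngressClass that exists in the cluster. The `openshift-default` value in the usage note is an assumption.

```sh
# Hypothetical helpers; names and arguments are illustrative assumptions.

# List Routes whose first ownerReference is an Ingress.
routes_owned_by_ingress() {
  oc get routes --all-namespaces --no-headers \
    -o 'custom-columns=NAMESPACE:.metadata.namespace,ROUTE:.metadata.name,OWNER_KIND:.metadata.ownerReferences[0].kind,OWNER:.metadata.ownerReferences[0].name' |
    awk '$3 == "Ingress" { print $1 "/" $2 " owned by Ingress " $4 }'
}

# Set spec.ingressClassName on an Ingress.
# Arguments: namespace, Ingress name, IngressClass name.
set_ingress_class() {
  oc patch ingress "$2" -n "$1" --type=merge \
    -p "{\"spec\":{\"ingressClassName\":\"$3\"}}"
}
```

For example, `set_ingress_class my-namespace my-ingress openshift-default` would hand the Ingress to that class, provided the class exists. `oc get ingressclass` lists the available names.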