Add runbooks description for prometheus alerts which ingress operator provides. Ticket: https://issues.redhat.com/browse/OCPBUGS-14057
openshift · Mar 19, 2024 · 8c07fdd · 8c07fdd
1 parent 5dffee3
commit 8c07fdd

Showing 6 changed files with 202 additions and 0 deletions.
diff --git a/alerts/cluster-ingress-operator/HAProxyDown.md b/alerts/cluster-ingress-operator/HAProxyDown.md
@@ -0,0 +1,29 @@
+# HAProxyDown
+
+## Meaning
+
+This alert fires when metrics report that HAProxy is down.
+
+## Impact
+
+Access to routes will fail, which may cause a severe outage for critical applications.
+
+## Diagnosis
+
+- Check the router logs:
+```sh
+oc logs <router pod> -n openshift-ingress
+```
+
+- Check the events:
+```sh
+oc get events -n openshift-ingress
+```
+
+- Check the load on the system where the routers are hosted.
+
+## Mitigation
+
+Use the results of the diagnosis to identify the root cause.
+If the issue is configuration-related, fix the HAProxy configuration.
+If the issue is load-related, resolve it at the infrastructure level.
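The Diagnosis steps in `HAProxyDown.md` could be gathered into one helper, sketched below. It assumes the default IngressController, whose router Deployment is named `router-default` with a container named `router`. The function name `haproxy_down_diag` and the `OC` override (useful for a dry run with `OC=echo`) are hypothetical and not part of the runbook.

```sh
#!/bin/sh
# Hypothetical helper: collect the data the HAProxyDown runbook asks for.
# Assumes the default IngressController (Deployment "router-default").
haproxy_down_diag() {
  oc_cmd=${OC:-oc}   # set OC=echo to print the arguments instead of running oc
  ns=openshift-ingress

  # Router pods and the nodes they are scheduled on.
  "$oc_cmd" get pods -n "$ns" -o wide
  # Recent logs from the "router" container of the router Deployment.
  "$oc_cmd" logs deployment/router-default -n "$ns" -c router --tail=100
  # Events in the router namespace, sorted by time.
  "$oc_cmd" get events -n "$ns" --sort-by=.lastTimestamp
  # Node resource usage, to judge the load on the hosts running the routers.
  "$oc_cmd" adm top node
}
```

Calling `haproxy_down_diag` runs the four commands in order; `OC=echo haproxy_down_diag` only prints what would be run.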
diff --git a/alerts/cluster-ingress-operator/HAProxyReloadFail.md b/alerts/cluster-ingress-operator/HAProxyReloadFail.md
@@ -0,0 +1,26 @@
+# HAProxyReloadFail
+
+## Meaning
+
+This alert fires when HAProxy fails to reload its configuration, which will
+result in the router not picking up recently created or modified routes.
+
+## Impact
+
+The router won't pick up recently created or modified routes. This may cause
+an outage for critical applications.
+
+## Diagnosis
+
+Check the router logs:
+```sh
+oc logs <router pod> -n openshift-ingress
+```
+
+Check whether a recent change made to the HAProxy configuration through the
+IngressController CR caused the issue.
+
+## Mitigation
+
+Based on the errors in the logs, fix the HAProxy configuration through the
+IngressController CR.
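For the log check in `HAProxyReloadFail.md`, a small filter can narrow the router logs down to lines that look related to a failed reload. This is only a sketch: the function name `reload_suspects` is hypothetical, and the pattern is a guess rather than a list of messages the router is known to emit.

```sh
#!/bin/sh
# Hypothetical filter: keep log lines that look related to a failed reload.
# The pattern is a guess, not an exhaustive list of router messages.
reload_suspects() {
  grep -i -E 'reload|error|alert'
}

# Usage (assumes the default router Deployment "router-default"):
#   oc logs deployment/router-default -n openshift-ingress -c router | reload_suspects
#   oc get ingresscontroller default -n openshift-ingress-operator -o yaml
```

The second usage line prints the IngressController CR, so that recent configuration changes can be compared against the filtered errors.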
diff --git a/alerts/cluster-ingress-operator/IngressControllerDegraded.md b/alerts/cluster-ingress-operator/IngressControllerDegraded.md
@@ -0,0 +1,42 @@
+# IngressControllerDegraded
+
+## Meaning
+
+This alert fires when the IngressController status is degraded.
+
+## Impact
+
+The routers won't be running in the cluster. This will cause an outage for
+applications that are accessed through the routers.
+
+## Diagnosis
+
+The IngressController may be degraded for one or more reasons.
+
+- Check the ingress operator logs using the following command:
+```sh
+oc logs <ingress operator pod> -n openshift-ingress-operator
+```
+- Check the router logs using the following command:
+```sh
+oc logs <router pod> -n openshift-ingress
+```
+- Check the YAML of the IngressController and of the operator deployment, and
+ the events, to see the reason for the failure:
+```sh
+oc get ingresscontroller <ingresscontroller name> -n openshift-ingress-operator -o yaml
+```
+
+```sh
+oc get deployment -n openshift-ingress-operator -o yaml
+```
+
+```sh
+oc get events
+```
+
+## Mitigation
+
+Fix the issue based on the status reported in the YAML
+output and the errors reported in the logs by the
+commands above.
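Rather than reading the full YAML from `IngressControllerDegraded.md`, a single status condition can be extracted with a JSONPath query. The sketch below assumes the IngressController reports a condition of type `Degraded` under `.status.conditions`. The function name `ic_condition` and the `OC` override are hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: print status, reason and message of one condition.
# $1 = IngressController name (default "default"), $2 = condition type.
ic_condition() {
  oc_cmd=${OC:-oc}   # set OC=echo to print the arguments instead of running oc
  name=${1:-default}
  cond=${2:-Degraded}
  # JSONPath: select the condition by type, then print three of its fields.
  jp='{range .status.conditions[?(@.type=="'"$cond"'")]}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
  "$oc_cmd" get ingresscontroller "$name" -n openshift-ingress-operator -o jsonpath="$jp"
}
```

For example, `ic_condition default Degraded` prints why the default IngressController is degraded, if that condition is present.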
diff --git a/alerts/cluster-ingress-operator/IngressControllerUnavailable.md b/alerts/cluster-ingress-operator/IngressControllerUnavailable.md
@@ -0,0 +1,41 @@
+# IngressControllerUnavailable
+
+## Meaning
+
+This alert fires when the IngressController is not available.
+
+## Impact
+
+This will cause an outage, because applications exposed through the
+IngressController won't be reachable.
+
+## Diagnosis
+
+The IngressController may be unavailable for one or more reasons.
+
+- Check the ingress operator logs using the following command:
+```sh
+oc logs <ingress operator pod> -n openshift-ingress-operator
+```
+- Check the router logs using the following command:
+```sh
+oc logs <router pod> -n openshift-ingress
+```
+- Check the YAML of the IngressController and of the operator deployment,
+ and the events, to see the reason for the failure:
+```sh
+oc get ingresscontroller <ingresscontroller name> -n openshift-ingress-operator -o yaml
+```
+
+```sh
+oc get deployment -n openshift-ingress-operator -o yaml
+```
+
+```sh
+oc get events
+```
+
+## Mitigation
+
+Fix the issue based on the status reported in the YAML output
+and the errors reported in the logs by the commands above.
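The `oc get events` command in `IngressControllerUnavailable.md` names no namespace, so it reads events from the current project only. A minimal sketch that queries both ingress namespaces explicitly is shown below. It assumes those two namespaces are the relevant ones; the function name `ingress_events` and the `OC` override are hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: list events for both ingress namespaces, sorted by time.
ingress_events() {
  oc_cmd=${OC:-oc}   # set OC=echo to print the arguments instead of running oc
  for ns in openshift-ingress-operator openshift-ingress; do
    echo "== events in $ns =="
    "$oc_cmd" get events -n "$ns" --sort-by=.lastTimestamp
  done
}
```

Running `oc get clusteroperator ingress` alongside it shows the overall state that the ingress operator reports.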
diff --git a/alerts/cluster-ingress-operator/IngressWithoutClassName.md b/alerts/cluster-ingress-operator/IngressWithoutClassName.md
@@ -0,0 +1,33 @@
+# IngressWithoutClassName
+
+## Meaning
+
+This alert fires when an Ingress has had `spec.ingressClassName` unset
+for longer than one day.
+
+## Impact
+
+A user could have created an Ingress with some nonempty value for
+`spec.ingressClassName` that did not match an OpenShift IngressClass
+object, and nevertheless intended for OpenShift to expose this Ingress.
+It is impossible to determine reliably what the user's intent was in
+such a scenario, but because earlier versions of OpenShift exposed
+such an Ingress, changing this behavior could break existing
+applications.
+
+So, we considered modifying the ingress operator
+to list all Ingresses and Routes in the cluster and publish a metric
+for Routes that were created for Ingresses that OpenShift no longer
+manages.
+
+## Diagnosis
+
+- Check for alert messages in the UI.
+- Inspect the Ingress object.
+- Inspect the Route object and check its status.
+- Check the logs of `cluster-openshift-controller-manager-operator`.
+
+## Mitigation
+Determine why the Route exists that was created for an Ingress which OpenShift
+no longer manages.
+Delete that Ingress and Route if they are no longer needed.
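To find the Ingresses that `IngressWithoutClassName.md` refers to, one option is to list every Ingress with its class and keep only the rows where the class is missing. The sketch below relies on `custom-columns` output, where `oc` prints `<none>` for an absent field. The function names `list_ingress_classes` and `unset_class_only` and the `OC` override are hypothetical.

```sh
#!/bin/sh
# Hypothetical filter: keep the header row and rows whose CLASS column is "<none>".
unset_class_only() {
  awk 'NR == 1 || $3 == "<none>"'
}

# List every Ingress in the cluster with its spec.ingressClassName.
list_ingress_classes() {
  oc_cmd=${OC:-oc}   # set OC=echo to print the arguments instead of running oc
  "$oc_cmd" get ingress --all-namespaces \
    -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName
}

# Usage: list_ingress_classes | unset_class_only
```

Each remaining row names an Ingress to inspect with `oc get ingress <name> -n <namespace> -o yaml`.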
diff --git a/alerts/cluster-ingress-operator/UnmanagedRoutes.md b/alerts/cluster-ingress-operator/UnmanagedRoutes.md
@@ -0,0 +1,31 @@
+# UnmanagedRoutes
+
+## Meaning
+
+This alert fires when there is a Route owned by an unmanaged Ingress.
+
+## Impact
+
+The ingress-to-route controller does not remove Routes that earlier versions
+of OpenShift created for Ingresses that specify `spec.ingressClassName`.
+Thus, these Routes will continue to be in effect.
+OpenShift does not update such Routes and does not recreate
+them if the user deletes them.
+
+If any Routes exist in this state, the alert lets
+the administrator know that the Routes need to be deleted,
+or the Ingress modified to specify an appropriate IngressClass so
+that OpenShift once again reconciles the Routes.
+
+## Diagnosis
+
+- Check for alert messages in the UI.
+- Inspect the Ingress object.
+- Inspect the Route object.
+- Check the logs of `cluster-openshift-controller-manager-operator`.
+
+## Mitigation
+
+Delete the Routes if they are no longer needed, or specify an appropriate
+IngressClass in the Ingress object so that OpenShift
+once again reconciles the Routes.
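The mitigation in `UnmanagedRoutes.md`, specifying an IngressClass on the Ingress, could be applied with a merge patch, as sketched below. The default class name `openshift-default` is an assumption about the IngressClass that belongs to the default IngressController; check `oc get ingressclass` for the classes that actually exist. The function name `set_ingress_class` and the `OC` override are hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: set spec.ingressClassName on an Ingress.
# $1 = namespace, $2 = Ingress name, $3 = IngressClass (assumed default below).
set_ingress_class() {
  oc_cmd=${OC:-oc}   # set OC=echo to print the arguments instead of running oc
  class=${3:-openshift-default}
  "$oc_cmd" patch ingress "$2" -n "$1" --type=merge \
    -p "{\"spec\":{\"ingressClassName\":\"$class\"}}"
}
```

A dry run such as `OC=echo set_ingress_class my-namespace my-ingress` prints the patch command without changing the cluster; the namespace and Ingress name here are placeholders.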