Add runbooks description for prometheus alerts which ingress operator provides.

Ticket: https://issues.redhat.com/browse/OCPBUGS-14057
openshift · Feb 5, 2024 · 5f8b936 · 5f8b936
1 parent 5dffee3
commit 5f8b936
Showing 6 changed files with 186 additions and 0 deletions.
diff --git a/alerts/cluster-ingress-operator/HAProxyDown.md b/alerts/cluster-ingress-operator/HAProxyDown.md
@@ -0,0 +1,31 @@
+# NodeFilesystemSpaceFillingUp
+
+## Meaning
+
+This alert is based on an extrapolation of the space used in a file system. It
+fires if both the current usage is above a certain threshold _and_ the
+extrapolation predicts that space will run out within a certain time. This is a
+warning-level alert if that time is less than 24h. It's a critical alert if that
+time is less than 4h.
+
+## Impact
+
+A filesystem running completely full is very bad for any process that needs
+to write to the filesystem. But even before a filesystem runs completely
+full, performance usually degrades.
+
+## Diagnosis
+
+Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic
+pattern of writing and cleaning up can trick the linear prediction into a false
+alert.
+
+Use the usual OS tools to investigate which directories are the worst and/or
+most recent offenders.
+
+Is this some irregular condition, e.g. a process failing to clean up after
+itself, or is this organic growth?
+
+## Mitigation
+
+<Insert site specific measures, for example to grow a persistent volume.>
diff --git a/alerts/cluster-ingress-operator/HAProxyReloadFail.md b/alerts/cluster-ingress-operator/HAProxyReloadFail.md
@@ -0,0 +1,31 @@
+# NodeFilesystemSpaceFillingUp
+
+## Meaning
+
+This alert is based on an extrapolation of the space used in a file system. It
+fires if both the current usage is above a certain threshold _and_ the
+extrapolation predicts that space will run out within a certain time. This is a
+warning-level alert if that time is less than 24h. It's a critical alert if that
+time is less than 4h.
+
+## Impact
+
+A filesystem running completely full is very bad for any process that needs
+to write to the filesystem. But even before a filesystem runs completely
+full, performance usually degrades.
+
+## Diagnosis
+
+Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic
+pattern of writing and cleaning up can trick the linear prediction into a false
+alert.
+
+Use the usual OS tools to investigate which directories are the worst and/or
+most recent offenders.
+
+Is this some irregular condition, e.g. a process failing to clean up after
+itself, or is this organic growth?
+
+## Mitigation
+
+<Insert site specific measures, for example to grow a persistent volume.>
diff --git a/alerts/cluster-ingress-operator/IngressControllerDegraded.md b/alerts/cluster-ingress-operator/IngressControllerDegraded.md
@@ -0,0 +1,31 @@
+# NodeFilesystemSpaceFillingUp
+
+## Meaning
+
+This alert is based on an extrapolation of the space used in a file system. It
+fires if both the current usage is above a certain threshold _and_ the
+extrapolation predicts that space will run out within a certain time. This is a
+warning-level alert if that time is less than 24h. It's a critical alert if that
+time is less than 4h.
+
+## Impact
+
+A filesystem running completely full is very bad for any process that needs
+to write to the filesystem. But even before a filesystem runs completely
+full, performance usually degrades.
+
+## Diagnosis
+
+Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic
+pattern of writing and cleaning up can trick the linear prediction into a false
+alert.
+
+Use the usual OS tools to investigate which directories are the worst and/or
+most recent offenders.
+
+Is this some irregular condition, e.g. a process failing to clean up after
+itself, or is this organic growth?
+
+## Mitigation
+
+<Insert site specific measures, for example to grow a persistent volume.>
diff --git a/alerts/cluster-ingress-operator/IngressControllerUnavailable.md b/alerts/cluster-ingress-operator/IngressControllerUnavailable.md
@@ -0,0 +1,31 @@
+# NodeFilesystemSpaceFillingUp
+
+## Meaning
+
+This alert is based on an extrapolation of the space used in a file system. It
+fires if both the current usage is above a certain threshold _and_ the
+extrapolation predicts that space will run out within a certain time. This is a
+warning-level alert if that time is less than 24h. It's a critical alert if that
+time is less than 4h.
+
+## Impact
+
+A filesystem running completely full is very bad for any process that needs
+to write to the filesystem. But even before a filesystem runs completely
+full, performance usually degrades.
+
+## Diagnosis
+
+Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic
+pattern of writing and cleaning up can trick the linear prediction into a false
+alert.
+
+Use the usual OS tools to investigate which directories are the worst and/or
+most recent offenders.
+
+Is this some irregular condition, e.g. a process failing to clean up after
+itself, or is this organic growth?
+
+## Mitigation
+
+<Insert site specific measures, for example to grow a persistent volume.>
diff --git a/alerts/cluster-ingress-operator/IngressWithoutClassName.md b/alerts/cluster-ingress-operator/IngressWithoutClassName.md
@@ -0,0 +1,31 @@
+# NodeFilesystemSpaceFillingUp
+
+## Meaning
+
+This alert is based on an extrapolation of the space used in a file system. It
+fires if both the current usage is above a certain threshold _and_ the
+extrapolation predicts that space will run out within a certain time. This is a
+warning-level alert if that time is less than 24h. It's a critical alert if that
+time is less than 4h.
+
+## Impact
+
+A filesystem running completely full is very bad for any process that needs
+to write to the filesystem. But even before a filesystem runs completely
+full, performance usually degrades.
+
+## Diagnosis
+
+Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic
+pattern of writing and cleaning up can trick the linear prediction into a false
+alert.
+
+Use the usual OS tools to investigate which directories are the worst and/or
+most recent offenders.
+
+Is this some irregular condition, e.g. a process failing to clean up after
+itself, or is this organic growth?
+
+## Mitigation
+
+<Insert site specific measures, for example to grow a persistent volume.>
diff --git a/alerts/cluster-ingress-operator/UnmanagedRoutes.md b/alerts/cluster-ingress-operator/UnmanagedRoutes.md
@@ -0,0 +1,31 @@
+# NodeFilesystemSpaceFillingUp
+
+## Meaning
+
+This alert is based on an extrapolation of the space used in a file system. It
+fires if both the current usage is above a certain threshold _and_ the
+extrapolation predicts that space will run out within a certain time. This is a
+warning-level alert if that time is less than 24h. It's a critical alert if that
+time is less than 4h.
+
+## Impact
+
+A filesystem running completely full is very bad for any process that needs
+to write to the filesystem. But even before a filesystem runs completely
+full, performance usually degrades.
+
+## Diagnosis
+
+Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic
+pattern of writing and cleaning up can trick the linear prediction into a false
+alert.
+
+Use the usual OS tools to investigate which directories are the worst and/or
+most recent offenders.
+
+Is this some irregular condition, e.g. a process failing to clean up after
+itself, or is this organic growth?
+
+## Mitigation
+
+<Insert site specific measures, for example to grow a persistent volume.>
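
The "Meaning" text in the files above describes a two-part condition: current usage above a threshold, and a linear extrapolation that predicts running out of space within 24h (warning) or 4h (critical). The following Python sketch is not part of this commit and is not the actual alert rule. It only illustrates that logic under stated assumptions: the function names `hours_until_full` and `severity`, the sample format, and the `usage_threshold` default of 0.6 are all hypothetical, and a least-squares fit stands in for whatever prediction the real rule uses.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (time in hours, used bytes); hypothetical format


def hours_until_full(samples: Sequence[Sample], capacity: float) -> Optional[float]:
    """Fit a least-squares line through the samples and return the number of
    hours after the last sample at which usage is predicted to reach capacity.
    Returns None if there is too little data or usage is not growing."""
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # flat or shrinking usage never fills the filesystem
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


def severity(samples: Sequence[Sample], capacity: float,
             usage_threshold: float = 0.6) -> Optional[str]:
    """Return 'critical', 'warning', or None. The 0.6 usage threshold is an
    assumed value; the 4h and 24h limits come from the runbook text."""
    if not samples or samples[-1][1] / capacity < usage_threshold:
        return None  # current usage is below the threshold: no alert
    eta = hours_until_full(samples, capacity)
    if eta is None:
        return None
    if eta < 4:
        return "critical"
    if eta < 24:
        return "warning"
    return None
```

Because the fit is linear, a periodic write-and-clean-up pattern sampled only during its rising phase yields a steep slope, which is how the false alerts mentioned under "Diagnosis" can arise.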