Add runbooks description for prometheus alerts which ingress operator…
miheer committed Feb 5, 2024
1 parent 5dffee3 commit 5f8b936
Showing 6 changed files with 186 additions and 0 deletions.
31 changes: 31 additions & 0 deletions alerts/cluster-ingress-operator/HAProxyDown.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
# NodeFilesystemSpaceFillingUp

## Meaning

This alert is based on an extrapolation of the space used in a file system. It
fires if both the current usage is above a certain threshold _and_ the
extrapolation predicts to run out of space in a certain time. This is a
warning-level alert if that time is less than 24h. It's a critical alert if that
time is less than 4h.

## Impact

A filesystem running completely full is obviously very bad for any process in
need to write to the filesystem. But even before a filesystem runs completely
full, performance is usually degrading.

## Diagnosis

Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic
pattern of writing and cleaning up can trick the linear prediction into a false
alert.

Use the usual OS tools to investigate what directories are the worst and/or
recent offenders.

Is this some irregular condition, e.g. a process fails to clean up behind
itself, or is this organic growth?

## Mitigation

<Insert site specific measures, for example to grow a persistent volume.>
31 changes: 31 additions & 0 deletions alerts/cluster-ingress-operator/HAProxyReloadFail.md
@@ -0,0 +1,31 @@
# NodeFilesystemSpaceFillingUp

## Meaning

This alert is based on an extrapolation of the space used in a file system. It
fires if both the current usage is above a certain threshold _and_ the
extrapolation predicts to run out of space in a certain time. This is a
warning-level alert if that time is less than 24h. It's a critical alert if that
time is less than 4h.

## Impact

A filesystem running completely full is obviously very bad for any process in
need to write to the filesystem. But even before a filesystem runs completely
full, performance is usually degrading.

## Diagnosis

Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic
pattern of writing and cleaning up can trick the linear prediction into a false
alert.

Use the usual OS tools to investigate what directories are the worst and/or
recent offenders.

Is this some irregular condition, e.g. a process fails to clean up behind
itself, or is this organic growth?

## Mitigation

<Insert site specific measures, for example to grow a persistent volume.>
31 changes: 31 additions & 0 deletions alerts/cluster-ingress-operator/IngressControllerDegraded.md
@@ -0,0 +1,31 @@
# NodeFilesystemSpaceFillingUp

## Meaning

This alert is based on an extrapolation of the space used in a file system. It
fires if both the current usage is above a certain threshold _and_ the
extrapolation predicts to run out of space in a certain time. This is a
warning-level alert if that time is less than 24h. It's a critical alert if that
time is less than 4h.

## Impact

A filesystem running completely full is obviously very bad for any process in
need to write to the filesystem. But even before a filesystem runs completely
full, performance is usually degrading.

## Diagnosis

Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic
pattern of writing and cleaning up can trick the linear prediction into a false
alert.

Use the usual OS tools to investigate what directories are the worst and/or
recent offenders.

Is this some irregular condition, e.g. a process fails to clean up behind
itself, or is this organic growth?

## Mitigation

<Insert site specific measures, for example to grow a persistent volume.>
31 changes: 31 additions & 0 deletions alerts/cluster-ingress-operator/IngressControllerUnavailable.md
@@ -0,0 +1,31 @@
# NodeFilesystemSpaceFillingUp

## Meaning

This alert is based on an extrapolation of the space used in a file system. It
fires if both the current usage is above a certain threshold _and_ the
extrapolation predicts to run out of space in a certain time. This is a
warning-level alert if that time is less than 24h. It's a critical alert if that
time is less than 4h.

## Impact

A filesystem running completely full is obviously very bad for any process in
need to write to the filesystem. But even before a filesystem runs completely
full, performance is usually degrading.

## Diagnosis

Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic
pattern of writing and cleaning up can trick the linear prediction into a false
alert.

Use the usual OS tools to investigate what directories are the worst and/or
recent offenders.

Is this some irregular condition, e.g. a process fails to clean up behind
itself, or is this organic growth?

## Mitigation

<Insert site specific measures, for example to grow a persistent volume.>
31 changes: 31 additions & 0 deletions alerts/cluster-ingress-operator/IngressWithoutClassName.md
@@ -0,0 +1,31 @@
# NodeFilesystemSpaceFillingUp

## Meaning

This alert is based on an extrapolation of the space used in a file system. It
fires if both the current usage is above a certain threshold _and_ the
extrapolation predicts to run out of space in a certain time. This is a
warning-level alert if that time is less than 24h. It's a critical alert if that
time is less than 4h.

## Impact

A filesystem running completely full is obviously very bad for any process in
need to write to the filesystem. But even before a filesystem runs completely
full, performance is usually degrading.

## Diagnosis

Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic
pattern of writing and cleaning up can trick the linear prediction into a false
alert.

Use the usual OS tools to investigate what directories are the worst and/or
recent offenders.

Is this some irregular condition, e.g. a process fails to clean up behind
itself, or is this organic growth?

## Mitigation

<Insert site specific measures, for example to grow a persistent volume.>
31 changes: 31 additions & 0 deletions alerts/cluster-ingress-operator/UnmanagedRoutes.md
@@ -0,0 +1,31 @@
# NodeFilesystemSpaceFillingUp

## Meaning

This alert is based on an extrapolation of the space used in a file system. It
fires if both the current usage is above a certain threshold _and_ the
extrapolation predicts to run out of space in a certain time. This is a
warning-level alert if that time is less than 24h. It's a critical alert if that
time is less than 4h.

## Impact

A filesystem running completely full is obviously very bad for any process in
need to write to the filesystem. But even before a filesystem runs completely
full, performance is usually degrading.

## Diagnosis

Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic
pattern of writing and cleaning up can trick the linear prediction into a false
alert.

Use the usual OS tools to investigate what directories are the worst and/or
recent offenders.

Is this some irregular condition, e.g. a process fails to clean up behind
itself, or is this organic growth?

## Mitigation

<Insert site specific measures, for example to grow a persistent volume.>
