Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add runbooks description for prometheus alerts which ingress operator provides. Ticket: https://issues.redhat.com/browse/OCPBUGS-14057
Showing 6 changed files with 186 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# NodeFilesystemSpaceFillingUp | ||
|
||
## Meaning | ||
|
||
This alert is based on an extrapolation of the space used in a file system. It | ||
fires if both the current usage is above a certain threshold _and_ the | ||
extrapolation predicts to run out of space in a certain time. This is a | ||
warning-level alert if that time is less than 24h. It's a critical alert if that | ||
time is less than 4h. | ||
|
||
## Impact | ||
|
||
A filesystem running completely full is obviously very bad for any process in | ||
need to write to the filesystem. But even before a filesystem runs completely | ||
full, performance is usually degrading. | ||
|
||
## Diagnosis | ||
|
||
Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic | ||
pattern of writing and cleaning up can trick the linear prediction into a false | ||
alert. | ||
|
||
Use the usual OS tools to investigate what directories are the worst and/or | ||
recent offenders. | ||
|
||
Is this some irregular condition, e.g. a process fails to clean up behind | ||
itself, or is this organic growth? | ||
|
||
## Mitigation | ||
|
||
<Insert site specific measures, for example to grow a persistent volume.> |
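The "Meaning" section of the runbook above describes a two-part condition: usage above a threshold and a linear extrapolation that predicts the filesystem will run out of space within 24h (warning) or 4h (critical). The following is a minimal Python sketch of that logic, not the alert rule itself, which this commit does not show. The least-squares fit, the function names, and the `min_used_fraction` default of 0.6 are assumptions made for illustration; only the 24h and 4h horizons come from the runbook text.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, available bytes)


def predict_linear(samples: Sequence[Sample], horizon_s: float) -> float:
    """Fit a least-squares line through the samples and return the value
    predicted horizon_s seconds after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to extrapolate")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_s)


def classify(
    samples: Sequence[Sample],
    size_bytes: float,
    min_used_fraction: float = 0.6,  # hypothetical threshold, not from the runbook
) -> Optional[str]:
    """Return "critical", "warning", or None for a series of free-space samples."""
    used_fraction = 1.0 - samples[-1][1] / size_bytes
    if used_fraction < min_used_fraction:
        return None  # current usage is below the threshold, so no alert
    if predict_linear(samples, 4 * 3600) < 0:
        return "critical"  # predicted to run out of space within 4h
    if predict_linear(samples, 24 * 3600) < 0:
        return "warning"  # predicted to run out of space within 24h
    return None
```

Because the fit is linear, a sawtooth pattern of writes followed by cleanup can produce a steep slope during the write phase, which is the false-alert case that the "Diagnosis" section warns about.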
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# NodeFilesystemSpaceFillingUp | ||
|
||
## Meaning | ||
|
||
This alert is based on an extrapolation of the space used in a file system. It | ||
fires if both the current usage is above a certain threshold _and_ the | ||
extrapolation predicts to run out of space in a certain time. This is a | ||
warning-level alert if that time is less than 24h. It's a critical alert if that | ||
time is less than 4h. | ||
|
||
## Impact | ||
|
||
A filesystem running completely full is obviously very bad for any process in | ||
need to write to the filesystem. But even before a filesystem runs completely | ||
full, performance is usually degrading. | ||
|
||
## Diagnosis | ||
|
||
Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic | ||
pattern of writing and cleaning up can trick the linear prediction into a false | ||
alert. | ||
|
||
Use the usual OS tools to investigate what directories are the worst and/or | ||
recent offenders. | ||
|
||
Is this some irregular condition, e.g. a process fails to clean up behind | ||
itself, or is this organic growth? | ||
|
||
## Mitigation | ||
|
||
<Insert site specific measures, for example to grow a persistent volume.> |
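The "Diagnosis" section above suggests using the usual OS tools to find the directories that consume the most space. Where a scripted check is preferred, a rough Python equivalent might look like the sketch below. The function names are hypothetical, and the script reports apparent file sizes (`st_size`) rather than allocated blocks, so its numbers can differ from what `du` prints. It also does not stop at mount-point boundaries.

```python
import os
from typing import Dict, List, Tuple


def _tree_size(path: str) -> int:
    """Sum the apparent sizes of all files below path, skipping unreadable ones."""
    total = 0
    # os.walk does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(path, onerror=lambda err: None):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished or is not accessible; ignore it
    return total


def directory_sizes(root: str) -> Dict[str, int]:
    """Return the size in bytes of each immediate subdirectory of root."""
    sizes: Dict[str, int] = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes[entry.path] = _tree_size(entry.path)
    return sizes


def worst_offenders(root: str, limit: int = 10) -> List[Tuple[str, int]]:
    """Return up to limit subdirectories of root, largest first."""
    ranked = sorted(directory_sizes(root).items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]
```

Running `worst_offenders` twice, some minutes apart, and comparing the results gives a crude view of which directories are the recent offenders rather than merely the largest ones.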
31 changes: 31 additions & 0 deletions
alerts/cluster-ingress-operator/IngressControllerDegraded.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# NodeFilesystemSpaceFillingUp | ||
|
||
## Meaning | ||
|
||
This alert is based on an extrapolation of the space used in a file system. It | ||
fires if both the current usage is above a certain threshold _and_ the | ||
extrapolation predicts to run out of space in a certain time. This is a | ||
warning-level alert if that time is less than 24h. It's a critical alert if that | ||
time is less than 4h. | ||
|
||
## Impact | ||
|
||
A filesystem running completely full is obviously very bad for any process in | ||
need to write to the filesystem. But even before a filesystem runs completely | ||
full, performance is usually degrading. | ||
|
||
## Diagnosis | ||
|
||
Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic | ||
pattern of writing and cleaning up can trick the linear prediction into a false | ||
alert. | ||
|
||
Use the usual OS tools to investigate what directories are the worst and/or | ||
recent offenders. | ||
|
||
Is this some irregular condition, e.g. a process fails to clean up behind | ||
itself, or is this organic growth? | ||
|
||
## Mitigation | ||
|
||
<Insert site specific measures, for example to grow a persistent volume.> |
31 changes: 31 additions & 0 deletions
alerts/cluster-ingress-operator/IngressControllerUnavailable.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# NodeFilesystemSpaceFillingUp | ||
|
||
## Meaning | ||
|
||
This alert is based on an extrapolation of the space used in a file system. It | ||
fires if both the current usage is above a certain threshold _and_ the | ||
extrapolation predicts to run out of space in a certain time. This is a | ||
warning-level alert if that time is less than 24h. It's a critical alert if that | ||
time is less than 4h. | ||
|
||
## Impact | ||
|
||
A filesystem running completely full is obviously very bad for any process in | ||
need to write to the filesystem. But even before a filesystem runs completely | ||
full, performance is usually degrading. | ||
|
||
## Diagnosis | ||
|
||
Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic | ||
pattern of writing and cleaning up can trick the linear prediction into a false | ||
alert. | ||
|
||
Use the usual OS tools to investigate what directories are the worst and/or | ||
recent offenders. | ||
|
||
Is this some irregular condition, e.g. a process fails to clean up behind | ||
itself, or is this organic growth? | ||
|
||
## Mitigation | ||
|
||
<Insert site specific measures, for example to grow a persistent volume.> |
31 changes: 31 additions & 0 deletions
alerts/cluster-ingress-operator/IngressWithoutClassName.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# NodeFilesystemSpaceFillingUp | ||
|
||
## Meaning | ||
|
||
This alert is based on an extrapolation of the space used in a file system. It | ||
fires if both the current usage is above a certain threshold _and_ the | ||
extrapolation predicts to run out of space in a certain time. This is a | ||
warning-level alert if that time is less than 24h. It's a critical alert if that | ||
time is less than 4h. | ||
|
||
## Impact | ||
|
||
A filesystem running completely full is obviously very bad for any process in | ||
need to write to the filesystem. But even before a filesystem runs completely | ||
full, performance is usually degrading. | ||
|
||
## Diagnosis | ||
|
||
Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic | ||
pattern of writing and cleaning up can trick the linear prediction into a false | ||
alert. | ||
|
||
Use the usual OS tools to investigate what directories are the worst and/or | ||
recent offenders. | ||
|
||
Is this some irregular condition, e.g. a process fails to clean up behind | ||
itself, or is this organic growth? | ||
|
||
## Mitigation | ||
|
||
<Insert site specific measures, for example to grow a persistent volume.> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# NodeFilesystemSpaceFillingUp | ||
|
||
## Meaning | ||
|
||
This alert is based on an extrapolation of the space used in a file system. It | ||
fires if both the current usage is above a certain threshold _and_ the | ||
extrapolation predicts to run out of space in a certain time. This is a | ||
warning-level alert if that time is less than 24h. It's a critical alert if that | ||
time is less than 4h. | ||
|
||
## Impact | ||
|
||
A filesystem running completely full is obviously very bad for any process in | ||
need to write to the filesystem. But even before a filesystem runs completely | ||
full, performance is usually degrading. | ||
|
||
## Diagnosis | ||
|
||
Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic | ||
pattern of writing and cleaning up can trick the linear prediction into a false | ||
alert. | ||
|
||
Use the usual OS tools to investigate what directories are the worst and/or | ||
recent offenders. | ||
|
||
Is this some irregular condition, e.g. a process fails to clean up behind | ||
itself, or is this organic growth? | ||
|
||
## Mitigation | ||
|
||
<Insert site specific measures, for example to grow a persistent volume.> |