Commit
Add runbooks description for prometheus alerts which ingress operator provides. Ticket: https://issues.redhat.com/browse/OCPBUGS-14057
Showing 6 changed files with 192 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# NodeFilesystemSpaceFillingUp | ||
|
||
## Meaning | ||
|
||
This alert is based on an extrapolation of the space used in a file system. It | ||
fires if both the current usage is above a certain threshold _and_ the | ||
extrapolation predicts that the file system will run out of space within a | ||
certain time. This is a warning-level alert if that time is less than 24h. It's | ||
a critical alert if that time is less than 4h. | ||
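
To make the extrapolation concrete, here is a minimal sketch of a linear prediction from two samples of available space. The `hours_until_full` helper and its arguments are hypothetical and only illustrate the idea. The alert itself is evaluated by Prometheus over a window of samples, not by this script.

```sh
# Hypothetical helper, not part of the alert rule: estimate how many hours
# remain until a filesystem is full, from two samples of available bytes
# taken interval_s seconds apart.
hours_until_full() {
  avail_then=$1   # available bytes at the earlier sample
  avail_now=$2    # available bytes at the later sample
  interval_s=$3   # seconds between the two samples
  awk -v a="$avail_then" -v b="$avail_now" -v t="$interval_s" 'BEGIN {
    rate = (a - b) / t                    # bytes consumed per second
    if (rate <= 0) { print "inf"; exit }  # usage is flat or shrinking
    printf "%.1f\n", b / rate / 3600      # seconds left, converted to hours
  }'
}

# Example: 100000 bytes consumed in one hour, 900000 bytes still available.
hours_until_full 1000000 900000 3600
```

With these sample values the estimate is 9.0 hours, which falls inside the 24h warning window but outside the 4h critical one, provided the usage threshold is also exceeded.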
|
||
## Impact | ||
|
||
A completely full filesystem breaks any process that needs to write to it. | ||
Even before a filesystem runs completely full, performance usually degrades. | ||
|
||
## Diagnosis | ||
|
||
Study the recent trends of filesystem usage on a dashboard. Sometimes a periodic | ||
pattern of writing and cleaning up can trick the linear prediction into a false | ||
alert. | ||
|
||
Use the usual OS tools to investigate what directories are the worst and/or | ||
recent offenders. | ||
|
||
Is this some irregular condition, e.g. a process fails to clean up behind | ||
itself, or is this organic growth? | ||
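
As one possible reading of "the usual OS tools", the sketch below wraps `du` and `find` in two small helpers. The function names and the default path `/var` are assumptions chosen for illustration; point them at whichever mount point the alert reports.

```sh
# List the largest directories under a path (default: /var, an assumption),
# staying on one filesystem (-x) and reporting sizes in KiB (-k).
largest_dirs() {
  du -xk "${1:-/var}" 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# List files modified within the last day, to spot recent offenders.
recent_files() {
  find "${1:-/var}" -xdev -type f -mtime -1 2>/dev/null
}
```

For example, `largest_dirs /var/log 5` prints the five largest directories under `/var/log`, and `recent_files /var/log` shows what was written there in the last 24 hours.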
|
||
## Mitigation | ||
|
||
<Insert site specific measures, for example to grow a persistent volume.> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
# HAProxyReloadFail | ||
|
||
## Meaning | ||
|
||
This alert fires when HAProxy fails to reload its configuration, which will result in the router | ||
not picking up recently created or modified routes. | ||
|
||
## Impact | ||
|
||
The router won't pick up recently created or modified routes. This may cause an outage for critical | ||
applications. | ||
|
||
## Diagnosis | ||
|
||
Check the router logs: | ||
```sh | ||
oc logs <router pod> -n openshift-ingress | ||
``` | ||
|
||
Check whether any configuration recently added to HAProxy through the IngressController CR caused the issue. | ||
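
The log check above can be narrowed to reload-related lines. This is a sketch under assumptions: it targets the default IngressController, whose router deployment is assumed to be named `router-default` with a container named `router`, and the `grep` pattern is only a guess at a useful filter, not a documented log format.

```sh
# Assumptions: deployment "router-default" and container "router" belong to
# the default IngressController. Pass another deployment name if yours differs.
router_reload_errors() {
  deployment=${1:-router-default}
  # Fetch the last hour of router logs and keep only lines mentioning a reload.
  oc logs "deployment/$deployment" -c router -n openshift-ingress --since=1h 2>&1 |
    grep -i "reload"
}
```

Run `router_reload_errors` with no arguments for the default router, or pass the name of another router deployment. Comparing the timestamps of failing reloads with recent edits to the IngressController CR can help identify the change that introduced the problem.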
|
||
## Mitigation | ||
|
||
Fix the HAProxy configuration through the IngressController CR, based on the errors reported in the router logs. |
39 changes: 39 additions & 0 deletions
alerts/cluster-ingress-operator/IngressControllerDegraded.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
# IngressControllerDegraded | ||
|
||
## Meaning | ||
|
||
This alert fires when the IngressController status is degraded. | ||
|
||
## Impact | ||
|
||
The routers won't be running in the cluster. This will cause an outage when accessing the applications. | ||
|
||
## Diagnosis | ||
|
||
The IngressController may be degraded for one or more reasons. | ||
|
||
- Check the ingress operator logs using the following command: | ||
```sh | ||
oc logs <ingress operator pod> -n openshift-ingress-operator | ||
``` | ||
- Check the router logs using the following command: | ||
```sh | ||
oc logs <router pod> -n openshift-ingress | ||
``` | ||
- Check the YAML of the IngressController and of the operator deployment, and the events, to find the reason for the failure: | ||
```sh | ||
oc get ingresscontroller <ingresscontroller name> -n openshift-ingress-operator -o yaml | ||
``` | ||
|
||
```sh | ||
oc get deployment -n openshift-ingress-operator -o yaml | ||
``` | ||
|
||
```sh | ||
oc get events | ||
``` | ||
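
Reading the full YAML can be slow when only the status conditions matter. The following sketch prints the conditions as tab-separated lines and lists events in both ingress namespaces. The function names are hypothetical, and the IngressController name `default` is an assumption; pass a different name if needed.

```sh
# Print every status condition as: type<TAB>status<TAB>reason<TAB>message
ingresscontroller_conditions() {
  name=${1:-default}   # assumption: the default IngressController
  oc get ingresscontroller "$name" -n openshift-ingress-operator \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}

# Keep only the Degraded condition, which carries the reason for this alert.
degraded_condition() {
  ingresscontroller_conditions "$@" | awk -F '\t' '$1 == "Degraded"'
}

# List events in the operator and router namespaces, oldest first.
ingress_events() {
  for ns in openshift-ingress-operator openshift-ingress; do
    oc get events -n "$ns" --sort-by=.lastTimestamp
  done
}
```

The message field of the Degraded condition usually names the underlying conditions that failed, which helps decide whether to look at the operator logs or the router logs first.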
|
||
## Mitigation | ||
|
||
Fix the issue based on the status shown in the YAML output and the errors in the logs from the | ||
commands above. |
39 changes: 39 additions & 0 deletions
alerts/cluster-ingress-operator/IngressControllerUnavailable.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
# IngressControllerUnavailable | ||
|
||
## Meaning | ||
|
||
This alert fires when the IngressController is not available. | ||
|
||
## Impact | ||
|
||
This will cause an outage, because the applications won't be accessible. | ||
|
||
## Diagnosis | ||
|
||
The IngressController may be unavailable for one or more reasons. | ||
|
||
- Check the ingress operator logs using the following command: | ||
```sh | ||
oc logs <ingress operator pod> -n openshift-ingress-operator | ||
``` | ||
- Check the router logs using the following command: | ||
```sh | ||
oc logs <router pod> -n openshift-ingress | ||
``` | ||
- Check the YAML of the IngressController and of the operator deployment, and the events, to find the reason for the failure: | ||
```sh | ||
oc get ingresscontroller <ingresscontroller name> -n openshift-ingress-operator -o yaml | ||
``` | ||
|
||
```sh | ||
oc get deployment -n openshift-ingress-operator -o yaml | ||
``` | ||
|
||
```sh | ||
oc get events | ||
``` | ||
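
For this alert, a quick first check is whether the `Available` condition is reported as `True`, and if not, what state the router pods are in. The sketch below assumes the IngressController is named `default`; the function names are hypothetical.

```sh
# Print the status ("True", "False" or "Unknown") of the Available condition.
ingresscontroller_available() {
  name=${1:-default}   # assumption: the default IngressController
  oc get ingresscontroller "$name" -n openshift-ingress-operator \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
}

# Show the router pods together with the nodes they are scheduled on.
router_pods() {
  oc get pods -n openshift-ingress -o wide
}

# Report availability, and list the router pods when it is not "True".
check_availability() {
  status=$(ingresscontroller_available "$@")
  if [ "$status" = "True" ]; then
    echo "IngressController is available"
  else
    echo "IngressController is not available (status: ${status:-unknown})"
    router_pods
  fi
}
```

Pods that are pending or restarting point to scheduling or startup problems, which `oc describe pod <router pod> -n openshift-ingress` can help explain.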
|
||
## Mitigation | ||
|
||
Fix the issue based on the status shown in the YAML output and the errors in the logs from the | ||
commands above. |
31 changes: 31 additions & 0 deletions
alerts/cluster-ingress-operator/IngressWithoutClassName.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# IngressWithoutClassName | ||
|
||
## Meaning | ||
|
||
This alert fires when there is an Ingress with an unset IngressClassName for longer than one day. | ||
|
||
## Impact | ||
|
||
It is possible that a user created an Ingress with | ||
some nonempty value for `spec.ingressClassName` that did not match an | ||
OpenShift IngressClass object, and nevertheless intended for OpenShift | ||
to expose this Ingress. It is impossible to determine reliably | ||
what a user's intent was in such a scenario, but as OpenShift exposed | ||
such an Ingress before this enhancement, changing this behavior could | ||
break existing applications. | ||
|
||
So, we considered modifying the ingress operator | ||
to list all Ingresses and Routes in the cluster and publish a metric | ||
for Routes that were created for Ingresses that OpenShift no longer | ||
manages. | ||
|
||
## Diagnosis | ||
|
||
- Check for alert messages on the UI. | ||
- Inspect the Ingress object. | ||
- Inspect the Route object and check its status. | ||
- Check the logs of `cluster-openshift-controller-manager-operator`. | ||
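
To find the Ingress objects in question, the following sketch lists every Ingress whose `spec.ingressClassName` is unset. The function name is hypothetical; the approach relies on `custom-columns` printing `<none>` for a missing field.

```sh
# Print "namespace/name" for each Ingress with no spec.ingressClassName.
ingresses_without_class() {
  oc get ingress --all-namespaces --no-headers \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName' |
    awk '$3 == "<none>" { print $1 "/" $2 }'
}
```

Each entry in the output can then be inspected with `oc get ingress <name> -n <namespace> -o yaml`.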
|
||
## Mitigation | ||
Determine why the Route exists that was created for an Ingress which OpenShift no longer manages. | ||
Delete that Ingress and Route if they are no longer needed. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# UnmanagedRoutes | ||
|
||
## Meaning | ||
|
||
This alert fires when there is a Route owned by an unmanaged Ingress. | ||
|
||
## Impact | ||
|
||
The ingress-to-route controller does not remove Routes that earlier versions of OpenShift created for | ||
Ingresses that specify `spec.ingressClassName`. Thus, these Routes will continue to | ||
be in effect. OpenShift does not update such Routes and does not recreate them if the user deletes them. | ||
|
||
If any Routes exist in this state, the alert tells the administrator | ||
that the Routes need to be deleted, or that the Ingress needs to be modified to | ||
specify an appropriate IngressClass so that OpenShift once again | ||
reconciles the Routes. | ||
|
||
## Diagnosis | ||
|
||
- Check for alert messages on the UI. | ||
- Inspect the ingress object. | ||
- Inspect the route object. | ||
- Check the logs of `cluster-openshift-controller-manager-operator` | ||
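
One way to find candidate Routes is to list every Route whose owner reference points at an Ingress. This is a sketch with a hypothetical function name. It shows all Ingress-owned Routes, managed or not, so the owning Ingress still has to be checked for its `spec.ingressClassName`.

```sh
# Print each Route owned by an Ingress as "namespace/route (owner: ingress)".
routes_owned_by_ingress() {
  oc get routes --all-namespaces --no-headers \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,OWNER_KIND:.metadata.ownerReferences[*].kind,OWNER:.metadata.ownerReferences[*].name' |
    awk '$3 == "Ingress" { print $1 "/" $2 " (owner: " $4 ")" }'
}
```

For each result, `oc get ingress <owner> -n <namespace> -o yaml` shows which class, if any, the owning Ingress specifies.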
|
||
## Mitigation | ||
|
||
Delete the Routes if they are no longer needed, or specify an appropriate IngressClass in the Ingress object | ||
so that OpenShift once again reconciles the Routes. |
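
Both options can be sketched as small wrappers around `oc`. The function names are hypothetical, and the default class name `openshift-default` is an assumption about the IngressClass that belongs to the default IngressController; confirm the available classes with `oc get ingressclass` before patching.

```sh
# Set spec.ingressClassName on an Ingress so that OpenShift reconciles its
# Routes again. Arguments: namespace, ingress name, optional class name.
set_ingress_class() {
  namespace=$1
  ingress=$2
  class=${3:-openshift-default}   # assumption: class of the default IngressController
  oc patch ingress "$ingress" -n "$namespace" --type=merge \
    -p "{\"spec\":{\"ingressClassName\":\"$class\"}}"
}

# Delete a Route that is no longer needed. Arguments: namespace, route name.
delete_unneeded_route() {
  oc delete route "$2" -n "$1"
}
```

For example, `set_ingress_class shop web` patches the Ingress `web` in the namespace `shop`. Deleting a Route removes access to the application through that Route, so confirm that nothing depends on it first.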