OCPBUGS-14057: Add runbooks description for prometheus alerts which ingress operator provides #166
base: master

@@ -0,0 +1,29 @@
# HAProxyDown

## Meaning

This alert fires when metrics report that HAProxy is down.

## Impact

Access to routes will fail, which may cause a severe outage for critical applications.

## Diagnosis

- Check the router logs:
```sh
oc logs <router pod> -n openshift-ingress
```

- Check the events:
```sh
oc get events -n openshift-ingress
```

- Check the load on the system where the routers are hosted.
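
One possible way to check the load, as a minimal sketch rather than a prescribed procedure: it assumes the cluster metrics API is available so that `oc adm top` returns data. The function name `router_load` is hypothetical and only groups the commands.

```sh
# Hypothetical helper: show where the router pods run and how loaded they are.
# Assumes the metrics API is available so that `oc adm top` returns data.
router_load() {
    # Router pods and the nodes that host them
    oc get pods -n openshift-ingress -o wide
    # CPU and memory usage of the router pods
    oc adm top pod -n openshift-ingress
    # CPU and memory usage of all nodes; compare with the hosting nodes above
    oc adm top node
}
```

Call `router_load` from a shell that is logged in to the cluster, then compare the node names in the first listing with the node usage in the last one.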

Review comment: There are so many other things they could check. How about describing Prometheus metrics to look at:
- Container Threads: (We should see a fairly consistent value with some fluctuations based on load if healthy)
- Container Processes: (We should see a fairly consistent value with some fluctuations based on load if healthy)

I got this from https://access.redhat.com/solutions/5721381, but I'm not sure we should link to an access article from a runbook.

## Mitigation

Use the diagnosis to identify the cause.
If the issue is configuration-related, fix the HAProxy configuration.
If the issue is load-related, address it at the infrastructure level.

@@ -0,0 +1,26 @@
# HAProxyReloadFail

## Meaning

This alert fires when HAProxy fails to reload its configuration, which will
result in the router not picking up recently created or modified routes.

## Impact

Review comment: Add: Warning only.

The router won't pick up recently created or modified routes. This may cause
an outage for critical applications.

## Diagnosis

Check the router logs:
```sh
oc logs <router pod> -n openshift-ingress
```

Check whether a configuration change recently applied to HAProxy through the
IngressController CR caused the issue.

Review comment on lines +20 to +21: Suggested change

Review comment: Tell them how to access the container and check the haproxy.config for issues.
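
A minimal sketch of what accessing the container and checking the configuration could look like. The helper name `check_haproxy_config` is hypothetical, and the path `/var/lib/haproxy/conf/haproxy.config` is an assumption about where the router image keeps its generated configuration; verify it in your cluster before relying on it.

```sh
# Hypothetical helper: validate the HAProxy configuration inside a router pod.
# Usage: check_haproxy_config <router pod>
# Assumption: the generated configuration lives at the path below.
check_haproxy_config() {
    pod="${1:-}"
    conf="/var/lib/haproxy/conf/haproxy.config"
    if [ -z "$pod" ]; then
        echo "usage: check_haproxy_config <router pod>" >&2
        return 1
    fi
    # `haproxy -c -f <file>` parses the file and reports errors
    # without starting a proxy.
    oc exec -n openshift-ingress "$pod" -- haproxy -c -f "$conf"
}
```

For an interactive look at the same file, `oc rsh -n openshift-ingress <router pod>` opens a shell in the container.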

## Mitigation

Based on the log output, fix the HAProxy configuration through the
IngressController CR.

Review comment on lines +25 to +26: Suggested change

@@ -0,0 +1,42 @@
# IngressControllerDegraded

## Meaning

This alert fires when the IngressController status is degraded.

## Impact

The routers may not be running in the cluster, which can cause an outage when
accessing applications.

## Diagnosis

An IngressController may be degraded for one or more reasons.

Review comment: Check the status of all operators, looking for error messages: `oc get co`

- Check the ingress operator logs using the following command:
```sh
oc logs <ingress operator pod> -n openshift-ingress-operator
```
- Check the router logs using the following command:
```sh
oc logs <router pod> -n openshift-ingress
```
- Check the YAML of the IngressController and of the operator deployment to
see the reason for the failure:

Review comment: Suggested change

```sh
oc get ingresscontroller <ingresscontroller name> -n openshift-ingress-operator -o yaml
```

```sh
oc get deployment -n openshift-ingress-operator -o yaml
```

```sh
oc get events
```
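
Reading the full YAML can be slow during an incident. As a sketch, the status conditions can be pulled out directly with a JSONPath template. The helper name `ingress_conditions` is hypothetical; it defaults to the IngressController named `default` when no name is given.

```sh
# Hypothetical helper: summarize why an IngressController is degraded.
# Usage: ingress_conditions [ingresscontroller name]   (defaults to "default")
ingress_conditions() {
    name="${1:-default}"
    # Overall state of the ingress cluster operator
    oc get clusteroperator ingress
    # One line per condition: type, status, reason, message
    oc get ingresscontroller "$name" -n openshift-ingress-operator \
        -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}
```

The `reason` and `message` columns of the `Degraded` condition usually point at the component to inspect next.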

## Mitigation

Fix the issue based on the status reported in the YAML and the errors in the
logs from the commands above.

@@ -0,0 +1,41 @@
# IngressControllerUnavailable

## Meaning

This alert fires when the IngressController is not available.

## Impact

Applications will be inaccessible, which causes an outage for the
environment.

## Diagnosis

An IngressController may be unavailable for one or more reasons.

- Check the ingress operator logs using the following command:
```sh
oc logs <ingress operator pod> -n openshift-ingress-operator
```
- Check the router logs using the following command:
```sh
oc logs <router pod> -n openshift-ingress
```
- Check the YAML of the IngressController and of the operator deployment to
see the reason for the failure:
```sh
oc get ingresscontroller <ingresscontroller name> -n openshift-ingress-operator -o yaml
```

```sh
oc get deployment -n openshift-ingress-operator -o yaml
```

```sh
oc get events
```

## Mitigation

Fix the issue based on the status reported in the YAML and the errors in the
logs from the commands above.

@@ -0,0 +1,33 @@
# IngressWithoutClassName

## Meaning

This alert fires when there is an Ingress with an unset IngressClassName
for longer than one day.

## Impact

Review comment: Add: Warning only. If this is a valid Ingress resource, it needs to have an

It is possible that a user could have created an Ingress with
some nonempty value for spec.ingressClassName that did not match an
OpenShift IngressClass object, and nevertheless intended for OpenShift
to expose this Ingress. Again, it is impossible to determine reliably
what a user's intent was in such a scenario, but as OpenShift exposed
such an Ingress before this enhancement, changing this behavior could
break existing applications.

Review comment on lines +13 to +16: Don't include design info. Suggested change

So, we considered modifying the ingress operator
to list all Ingresses and Routes in the cluster and publish a metric
for Routes that were created for Ingresses that OpenShift no longer
manages.

Review comment on lines +18 to +21: Please remove this.

## Diagnosis

- Check for alert messages on the UI.
- Inspect the Ingress object.
- Inspect the Route object and check its status.
- Check the logs of `cluster-openshift-controller-manager-operator`.
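
To find the Ingress objects in question, one option is to list every Ingress with its class and keep those where the class is missing. This is a sketch only; the helper name `ingresses_without_class` is hypothetical. It relies on `custom-columns` output printing `<none>` for a field that is not set, and it does not look at the legacy `kubernetes.io/ingress.class` annotation.

```sh
# Hypothetical helper: list Ingresses whose spec.ingressClassName is unset.
# With custom-columns output, a missing field is printed as <none>.
ingresses_without_class() {
    oc get ingress --all-namespaces --no-headers \
        -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName' |
        awk '$3 == "<none>" { print $1 "/" $2 }'
}
```

Each output line has the form `<namespace>/<name>`, which can be passed on to `oc get ingress <name> -n <namespace> -o yaml` for inspection.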

## Mitigation

Determine why the Route was created by an Ingress that OpenShift no longer
manages. Delete that Ingress and Route if they are no longer needed.

@@ -0,0 +1,31 @@
# UnmanagedRoutes

## Meaning

This alert fires when there is a Route owned by an unmanaged Ingress.

## Impact

Review comment: Add: Warning only.

The ingress-to-route controller does not remove Routes that earlier versions
of OpenShift created for Ingresses that specify `spec.ingressClassName`.
Thus, these Routes will continue to be in effect.
OpenShift does not update such Routes and does not recreate
them if the user deletes them.

If any Routes exist in this state, the alert tells the administrator that
the Routes need to be deleted, or that the Ingress needs to be modified to
specify an appropriate IngressClass so that OpenShift once again reconciles
the Routes.

## Diagnosis

Check for alert messages on the UI.

Review comment: Remove this. The alert is already known.

Inspect the ingress object.

Review comment: For what?

Inspect the route object.

Review comment: For what?

Check the logs of `cluster-openshift-controller-manager-operator`
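
As one way to make "inspect the route object" concrete, the sketch below lists Routes whose owner reference points at an Ingress, together with the owning Ingress. The helper name `routes_owned_by_ingress` is hypothetical, and the filter assumes each Route has at most one owner reference.

```sh
# Hypothetical helper: list Routes owned by an Ingress and name the owner.
# Assumes a single owner reference per Route; several owners would be
# printed comma-separated and would not match the filter below.
routes_owned_by_ingress() {
    oc get routes --all-namespaces --no-headers \
        -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,KIND:.metadata.ownerReferences[*].kind,OWNER:.metadata.ownerReferences[*].name' |
        awk '$3 == "Ingress" { print $1 "/" $2 " owned by ingress " $1 "/" $4 }'
}
```

For each owning Ingress reported, check its `spec.ingressClassName` to see whether it refers to an IngressClass that OpenShift manages.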

## Mitigation

Specify an appropriate IngressClass in the Ingress object so that OpenShift
once again reconciles the Routes.

Review comment: I suggest adding this to each of the runbooks that tells them to check the logs.
Set `spec.logging.access.destination.type: Container`.
To turn it off later, set `spec.logging.access: null`.
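
The two settings above could be applied with `oc patch` and a JSON merge patch, roughly as sketched here. The helper names `enable_access_log` and `disable_access_log` are hypothetical; both default to the IngressController named `default`.

```sh
# Hypothetical helpers: toggle access logging on an IngressController.
# Usage: enable_access_log [name]; disable_access_log [name]
# Both default to the IngressController named "default".
enable_access_log() {
    oc patch ingresscontroller "${1:-default}" -n openshift-ingress-operator \
        --type=merge -p '{"spec":{"logging":{"access":{"destination":{"type":"Container"}}}}}'
}

disable_access_log() {
    # In a JSON merge patch, a null value removes the field.
    oc patch ingresscontroller "${1:-default}" -n openshift-ingress-operator \
        --type=merge -p '{"spec":{"logging":{"access":null}}}'
}
```

With the `Container` destination, the access logs are expected to show up in a sidecar container of the router pod (commonly named `logs`, which is an assumption to verify), for example with `oc logs <router pod> -n openshift-ingress -c logs`.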