OCPBUGS-14057: Add runbooks description for prometheus alerts which ingress operator provides #166
base: master
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: miheer.
Needs approval from an approver in each of these files:
@miheer: This pull request references Jira Issue OCPBUGS-14057, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/jira refresh |
@miheer: This pull request references Jira Issue OCPBUGS-14057, which is invalid:
@miheer thanks for tackling this! |
5f8b936 to 3c0956d
3c0956d to 1a22d65
/jira refresh |
@Miciah: This pull request references Jira Issue OCPBUGS-14057, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
6e5bb71 to 8c07fdd
8c07fdd to 88cbf5d
@miheer: all tests passed!
/label qe-approved |
@miheer: This pull request references Jira Issue OCPBUGS-14057, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
/assign |
@miheer the bug only requires runbooks for |
We require runbooks only for critical alerts. A warning alert can certainly have a runbook too if it's helpful to users. If addressing a warning can avoid triggering a critical alert, everybody wins. If there is information missing in the HAProxyDown runbook, though, we should indeed add it.
Is there a way to look at runbook content and determine that it was just a warning alert, rather than a critical alert? |
``` | ||
- Check the load on the system where the routers are hosted.

There are so many other things they could check. How about describing Prometheus metrics to look at:
- Use the console to issue a Prometheus query:
Container threads (if healthy, we should see a fairly consistent value with some fluctuations based on load):
avg(container_threads{namespace='openshift-ingress', container='router'}) by (instance)
Container processes (if healthy, we should see a fairly consistent value with some fluctuations based on load):
avg(container_processes{namespace='openshift-ingress', container='router'}) by (instance)
I got this from https://access.redhat.com/solutions/5721381, but I'm not sure we should link to an access article from a runbook.
Try to fix the configuration of the haproxy via ingress controller CR on the
basis of the output of the logs.

Suggested change. Replace:
Try to fix the configuration of the haproxy via ingress controller CR on the
basis of the output of the logs.
With:
Try to fix the configuration of the haproxy by editing the ingress controller spec.
Check if any recently added configuration in the haproxy config via ingress
controller CR caused the issue.

Suggested change. Replace:
Check if any recently added configuration in the haproxy config via ingress
controller CR caused the issue.
With:
Check if any recently added configuration in the haproxy config via ingress
controller spec caused the issue.
## Diagnosis

- Check the router logs:

I suggest adding this to each of the runbooks that tell them to check the logs.
- Configure access logging:
oc edit -n openshift-ingress-operator ingresscontrollers/default
Set spec.logging.access.destination.type to Container:
spec:
  logging:
    access:
      destination:
        type: Container
To turn it off later, set spec.logging.access to null.
Check if any recently added configuration in the haproxy config via ingress
controller CR caused the issue.

Tell them how to access the container and check the haproxy.config for issues.
oc logs <router pod> -n openshift-ingress
``` | ||
- Check the yaml file of the ingress controller and operator to see the reason
for failure:

Suggested change. Replace:
for failure:
With:
for failure. Look for status.conditions:
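To make "look for status.conditions" concrete, here is a minimal sketch that prints every condition of an ingress controller on its own line. The function name `print_ingress_conditions` is hypothetical. The `openshift-ingress-operator` namespace and the `default` ingress controller are taken from the access-logging comment in this review, and the runbook may prefer a different form.

```shell
# Hypothetical helper (not part of the PR): print type, status, reason and
# message for each condition on an ingress controller. Defaults to "default".
print_ingress_conditions() {
  name="${1:-default}"
  oc -n openshift-ingress-operator get ingresscontroller "$name" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}: {.message}{"\n"}{end}'
}
```

Calling `print_ingress_conditions` with no argument inspects the `default` ingress controller. A line whose type is `Degraded` with status `True` would be the one to read first.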
## Diagnosis

Ingress Controller may be degraded due to one or more reasons.

Check the status of all operators, looking for error messages:
oc get co
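On a cluster with many operators, the `oc get co` output can be narrowed to the ones that need attention. The filter below is only a sketch: `unhealthy_operators` is a hypothetical name, and it assumes the usual column order of `oc get co` (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, ...) with every column populated.

```shell
# Hypothetical filter (not part of the PR): read `oc get co` output on stdin
# and print the names of operators that are not Available=True,
# Progressing=False and Degraded=False.
# Assumes columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
unhealthy_operators() {
  awk 'NR > 1 && ($3 != "True" || $4 != "False" || $5 != "False") { print $1 }'
}
```

Usage would be `oc get co | unhealthy_operators`. If a row has an empty VERSION column the fields shift, so the result should be checked against the full table.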
So, we considered modifying the ingress operator
to list all Ingresses and Routes in the cluster and publish a metric
for Routes that were created for Ingresses that OpenShift no longer
manage.

Please remove this.
to expose this Ingress. Again, it is impossible to determine reliably
what a user's intent was in such a scenario, but as OpenShift exposed
such an Ingress before this enhancement, changing this behavior could
break existing applications.

Don't include design info.
Suggested change. Replace:
to expose this Ingress. Again, it is impossible to determine reliably
what a user's intent was in such a scenario, but as OpenShift exposed
such an Ingress before this enhancement, changing this behavior could
break existing applications.
With:
to expose this Ingress.
for longer than one day.

## Impact

Add:
Warning only.
If this is a valid Ingress resource, it needs to have an ingressClassName to stop the alert. ingressClassName is the name of an IngressClass cluster resource. Otherwise, delete the misconfigured Ingress.
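For the runbook, the remediation suggested here could be sketched roughly as follows. Both function names are hypothetical, and the class name passed in is whatever `oc get ingressclass` reports on the cluster; `openshift-default` in the usage note is an assumed example, not something stated in this PR.

```shell
# Hypothetical helpers (not part of the PR).

# List every Ingress with its ingressClassName; Ingresses without one
# show "<none>" in the CLASS column.
list_ingresses_with_class() {
  oc get ingress --all-namespaces \
    -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName
}

# Set spec.ingressClassName on one Ingress.
# Arguments: <namespace> <ingress name> <IngressClass name>
set_ingress_class() {
  ns="$1"
  name="$2"
  class="$3"
  oc -n "$ns" patch ingress "$name" --type=merge \
    -p "{\"spec\":{\"ingressClassName\":\"$class\"}}"
}
```

For example, `set_ingress_class my-namespace my-ingress openshift-default` would patch a single Ingress, assuming an IngressClass with that name exists. If the Ingress is not meant to be exposed, `oc -n <namespace> delete ingress <name>` removes it instead.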
This alert fires when HAProxy fails to reload its configuration, which will
result in the router not picking up recently created or modified routes.

## Impact

Add:
Warning only.
This alert fires when there is a Route owned by an unmanaged Ingress.

## Impact

Add:
Warning only.
## Diagnosis

Check for alert messages on the UI.
Inspect the ingress object.

For what?
Check for alert messages on the UI.
Inspect the ingress object.
Inspect the route object.

For what?
## Diagnosis

Check for alert messages on the UI.

Remove this. The alert is already known.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle rotten |
Add runbook descriptions for the Prometheus alerts that the ingress operator provides.
Ticket: https://issues.redhat.com/browse/OCPBUGS-14057