OCPBUGS-14057: Add runbooks description for prometheus alerts which ingress operator provides #166

miheer · 2024-02-05T23:46:02Z

Add runbooks description for prometheus alerts which ingress operator provides.

Ticket: https://issues.redhat.com/browse/OCPBUGS-14057

openshift-ci · 2024-02-05T23:46:20Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: miheer
Once this PR has been reviewed and has the lgtm label, please assign nautilux for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2024-02-06T00:04:29Z

@miheer: This pull request references Jira Issue OCPBUGS-14057, which is invalid:

expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Add runbooks description for prometheus alerts which ingress operator provides.

Ticket: https://issues.redhat.com/browse/OCPBUGS-14057

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

miheer · 2024-02-06T00:04:38Z

/jira refresh

openshift-ci-robot · 2024-02-06T00:04:41Z

@miheer: This pull request references Jira Issue OCPBUGS-14057, which is invalid:

expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

simonpasquier · 2024-02-07T09:17:51Z

@miheer thanks for tackling this!
I'd advise opening a separate PR that creates the alerts/cluster-ingress-operator directory with an OWNERS file listing the folks who need to review the ingress runbooks. This way, the team will be able to merge as needed.

Miciah · 2024-03-18T15:21:44Z

/jira refresh

openshift-ci-robot · 2024-03-18T15:21:51Z

@Miciah: This pull request references Jira Issue OCPBUGS-14057, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.16.0) matches configured target version for branch (4.16.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2024-03-19T22:10:56Z

@miheer: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

ShudiLi · 2024-03-27T02:49:16Z

/label qe-approved
thanks

openshift-ci-robot · 2024-03-27T02:49:27Z

@miheer: This pull request references Jira Issue OCPBUGS-14057, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.16.0) matches configured target version for branch (4.16.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Add runbooks description for prometheus alerts which ingress operator provides.

Ticket: https://issues.redhat.com/browse/OCPBUGS-14057

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

candita · 2024-03-27T15:37:57Z

/assign

candita · 2024-04-19T19:19:53Z

@miheer the bug only requires runbooks for critical alerts, and HAProxyDown is the only critical alert we fire in cluster-ingress-operator. In my opinion, it would be better to remove the other runbooks you added here and enhance the HAProxyDown runbook with more useful information.

jan--f · 2024-04-23T08:22:50Z

We require runbooks only for critical alerts. A warning alert can certainly have a runbook too if it's helpful to users. If addressing a warning can avoid triggering a critical alert everybody wins.

If there is information missing in the HAProxyDown runbook though, we should indeed add it.

candita · 2024-05-13T20:28:04Z

> We require runbooks only for critical alerts. A warning alert can certainly have a runbook too if it's helpful to users. If addressing a warning can avoid triggering a critical alert everybody wins.
>
> If there is information missing in the HAProxyDown runbook though, we should indeed add it.

Is there a way to look at runbook content and determine that it was just a warning alert, rather than a critical alert?

candita · 2024-05-14T22:34:10Z

alerts/cluster-ingress-operator/HAProxyDown.md

+```
+
+- Check the load on the system where the routers are hosted.
+


There are so many other things they could check. How about describing prometheus metrics to look at:

Use the console to issue a prometheus query:

Container Threads: (We should see a fairly consistent value with some fluctuations based on load if healthy)
avg(container_threads{namespace='openshift-ingress', container='router'}) by (instance)

Container Processes: (We should see a fairly consistent value with some fluctuations based on load if healthy)
avg(container_processes{namespace='openshift-ingress', container='router'}) by (instance)

I got this from https://access.redhat.com/solutions/5721381, but I'm not sure we should link to an access article from a runbook.

candita · 2024-05-14T22:37:38Z

alerts/cluster-ingress-operator/HAProxyReloadFail.md

+Try to fix the configuration of the haproxy via ingress controller CR on the
+basis of the output of the logs.


Suggested change

-Try to fix the configuration of the haproxy via ingress controller CR on the
-basis of the output of the logs.
+Try to fix the configuration of the haproxy by editing the ingress controller spec.

candita · 2024-05-14T22:38:29Z

alerts/cluster-ingress-operator/HAProxyReloadFail.md

+Check if any recently added configuration in the haproxy config via ingress
+controller CR caused the issue.


Suggested change

-Check if any recently added configuration in the haproxy config via ingress
-controller CR caused the issue.
+Check if any recently added configuration in the haproxy config via ingress
+controller spec caused the issue.

candita · 2024-05-14T22:46:43Z

alerts/cluster-ingress-operator/HAProxyDown.md

+
+## Diagnosis
+
+- Check the router logs:


I suggest adding this to each of the runbooks that tells them to check the logs.

Configure access logging:

oc edit -n openshift-ingress-operator ingresscontrollers/default

Set spec.logging.access.destination.type to Container:

    spec:
      logging:
        access:
          destination:
            type: Container

To turn it off later, set spec.logging.access to null.
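
A possible non-interactive equivalent of the `oc edit` step, sketched as small shell helpers. The function names are hypothetical. The sketch assumes the default IngressController and a `router-default` deployment, and it assumes the access-log sidecar container is named `logs`; verify both on the target cluster before putting this in a runbook.

```shell
# Hypothetical helper names; only standard `oc patch` and `oc logs` are used.

# Enable access logging by setting spec.logging.access.destination.type to Container.
enable_access_logging() {
  oc -n openshift-ingress-operator patch ingresscontroller/default \
    --type=merge \
    -p '{"spec":{"logging":{"access":{"destination":{"type":"Container"}}}}}'
}

# Read the access log. Assumption: the sidecar container is named "logs".
show_access_logs() {
  oc -n openshift-ingress logs deployment/router-default -c logs
}

# Turn access logging off again; a null in a merge patch removes the field.
disable_access_logging() {
  oc -n openshift-ingress-operator patch ingresscontroller/default \
    --type=merge \
    -p '{"spec":{"logging":{"access":null}}}'
}
```

Changing the logging spec likely causes the router pods to be rolled out again, so expect a short delay before `show_access_logs` returns anything.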

candita · 2024-05-14T22:53:31Z

alerts/cluster-ingress-operator/HAProxyReloadFail.md

+
+Check if any recently added configuration in the haproxy config via ingress
+controller CR caused the issue.
+


Tell them how to access the container and check the haproxy.config for issues.
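
One way this could look, as a sketch rather than final runbook text. The helper names are hypothetical, and the configuration path `/var/lib/haproxy/conf/haproxy.config` and the `router-default` deployment name are assumptions about the router image and the default IngressController that should be confirmed before merging.

```shell
# Assumed location of the rendered HAProxy configuration inside the router container.
HAPROXY_CONFIG=/var/lib/haproxy/conf/haproxy.config

# Print the rendered configuration from one pod of the router deployment.
show_haproxy_config() {
  oc -n openshift-ingress exec deployment/router-default -- \
    cat "$HAPROXY_CONFIG"
}

# Ask HAProxy to parse the configuration in check mode (-c), which reports
# syntax errors without reloading the running process.
check_haproxy_config() {
  oc -n openshift-ingress exec deployment/router-default -- \
    haproxy -c -f "$HAPROXY_CONFIG"
}
```

If the check reports an error, the offending line can usually be traced back to a Route or to a recent change in the ingress controller spec.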

candita · 2024-05-14T22:54:37Z

alerts/cluster-ingress-operator/IngressControllerDegraded.md

+oc logs <router pod> -n openshift-ingress
+```
+- Check the yaml file of the ingress controller and operator to see the reason
+ for failure:


Suggested change

-for failure:
+for failure. Look for status.conditions:
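
To make the status.conditions check concrete, the runbook could show something like the following. This is a sketch with hypothetical helper names; it assumes the default IngressController and the `ingress` ClusterOperator, and the jsonpath template is just one way to flatten the conditions into readable lines.

```shell
# jsonpath template: one line per condition with type, status, reason, and message.
CONDITIONS_TEMPLATE='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'

# Conditions reported on the IngressController itself.
show_ingresscontroller_conditions() {
  oc -n openshift-ingress-operator get ingresscontroller/default \
    -o "jsonpath=$CONDITIONS_TEMPLATE"
}

# Conditions reported by the ingress operator as a ClusterOperator.
show_ingress_operator_conditions() {
  oc get clusteroperator/ingress -o "jsonpath=$CONDITIONS_TEMPLATE"
}
```

A condition such as Degraded with status True, together with its reason and message, is usually the quickest pointer to the underlying failure.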

candita · 2024-05-14T22:55:28Z

alerts/cluster-ingress-operator/IngressControllerDegraded.md

+## Diagnosis
+
+Ingress Controller may be degraded due to one or more reasons.
+


Check the status of all operators, looking for error messages:

oc get co

candita · 2024-05-14T22:57:01Z

alerts/cluster-ingress-operator/IngressWithoutClassName.md

+So, we considered modifying the ingress operator
+to list all Ingresses and Routes in the cluster and publish a metric
+for Routes that were created for Ingresses that OpenShift no longer
+manage.


Please remove this.

candita · 2024-05-14T22:58:01Z

alerts/cluster-ingress-operator/IngressWithoutClassName.md

+to expose this Ingress. Again, it is impossible to determine reliably
+what a user's intent was in such a scenario, but as OpenShift exposed
+such an Ingress before this enhancement, changing this behavior could
+break existing applications.


Don't include design info.

Suggested change

-to expose this Ingress. Again, it is impossible to determine reliably
-what a user's intent was in such a scenario, but as OpenShift exposed
-such an Ingress before this enhancement, changing this behavior could
-break existing applications.
+to expose this Ingress.

candita · 2024-05-14T23:04:38Z

alerts/cluster-ingress-operator/IngressWithoutClassName.md

+for longer than one day.
+
+## Impact
+


Add:

Warning only.

If this is a valid Ingress resource, it needs spec.ingressClassName set to stop the alert. ingressClassName is the name of a cluster-scoped IngressClass resource. Otherwise, delete the misconfigured Ingress.
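
A sketch of how the runbook could help users find and fix the affected Ingresses. The helper names are hypothetical. The listing relies on `oc get` printing `<none>` in a custom column when the field is unset, and the IngressClass name passed to the second helper is something the user has to choose.

```shell
# List every Ingress that has no spec.ingressClassName, as namespace/name.
list_ingresses_without_class() {
  oc get ingress --all-namespaces --no-headers \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName' \
    | awk '$3 == "<none>" { print $1 "/" $2 }'
}

# Set spec.ingressClassName on one Ingress.
# Arguments: $1 = namespace, $2 = Ingress name, $3 = IngressClass name.
set_ingress_class() {
  oc -n "$1" patch ingress "$2" --type=merge \
    -p "{\"spec\":{\"ingressClassName\":\"$3\"}}"
}
```

For example, `set_ingress_class my-namespace my-ingress openshift-default` would attach the Ingress to the class of the default IngressController, assuming that class exists on the cluster; `oc get ingressclass` shows the available names.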

candita · 2024-05-14T23:06:24Z

alerts/cluster-ingress-operator/HAProxyReloadFail.md

+This alert fires when HAProxy fails to reload its configuration, which will
+result in the router not picking up recently created or modified routes.
+
+## Impact


Add:

Warning only.

candita · 2024-05-14T23:07:06Z

alerts/cluster-ingress-operator/UnmanagedRoutes.md

+
+This alert fires when there is a Route owned by an unmanaged Ingress.
+
+## Impact


Add:

Warning only.

candita · 2024-05-14T23:08:32Z

alerts/cluster-ingress-operator/UnmanagedRoutes.md

+## Diagnosis
+
+Check for alert messages on the UI.
+Inspect the ingress object.


candita · 2024-05-14T23:08:45Z

alerts/cluster-ingress-operator/UnmanagedRoutes.md

+
+Check for alert messages on the UI.
+Inspect the ingress object.
+Inspect the route object.


candita · 2024-05-14T23:09:20Z

alerts/cluster-ingress-operator/UnmanagedRoutes.md

+
+## Diagnosis
+
+Check for alert messages on the UI.


Remove this. The alert is already known.

openshift-bot · 2024-08-13T09:00:23Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2024-09-13T00:30:38Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Feb 5, 2024

openshift-ci bot requested review from jan--f and NautiluX February 5, 2024 23:46

miheer changed the title ~~WIP: Add runbooks description for prometheus alerts which ingress operator provides~~ OCPBUGS-14057: WIP: Add runbooks description for prometheus alerts which ingress operator provides Feb 6, 2024

openshift-ci bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Feb 6, 2024

jan--f mentioned this pull request Feb 6, 2024

OCPBUGS-14057: Removes HAProxyDown critical alert exception. openshift/origin#28575

Open

miheer force-pushed the alert-rules-runbook branch from 5f8b936 to 3c0956d Compare February 26, 2024 23:57

miheer force-pushed the alert-rules-runbook branch from 3c0956d to 1a22d65 Compare March 6, 2024 05:00

openshift-ci-robot added the jira/valid-bug label (indicates that a referenced Jira bug is valid for the branch this PR is targeting) and removed the jira/invalid-bug label (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on Mar 18, 2024

openshift-ci bot requested a review from ShudiLi March 18, 2024 15:22

miheer changed the title ~~OCPBUGS-14057: WIP: Add runbooks description for prometheus alerts which ingress operator provides~~ OCPBUGS-14057: Add runbooks description for prometheus alerts which ingress operator provides Mar 19, 2024

miheer force-pushed the alert-rules-runbook branch 3 times, most recently from 6e5bb71 to 8c07fdd Compare March 19, 2024 21:57

Add runbooks description for prometheus alerts which ingress operator…

88cbf5d

… provides. Ticket: https://issues.redhat.com/browse/OCPBUGS-14057

miheer force-pushed the alert-rules-runbook branch from 8c07fdd to 88cbf5d Compare March 19, 2024 22:07

openshift-ci bot added the qe-approved label (signifies that QE has signed off on this PR) on Mar 27, 2024

openshift-ci bot assigned candita Mar 27, 2024

candita reviewed May 14, 2024

alerts/cluster-ingress-operator/UnmanagedRoutes.md

## Diagnosis

Check for alert messages on the UI.

Inspect the ingress object.

candita May 14, 2024

For what?

candita mentioned this pull request May 14, 2024

OCPBUGS-14057: Add runbook urls for prometheus alerts. openshift/cluster-ingress-operator#1024

Open

openshift-ci bot added the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) on Aug 13, 2024

openshift-ci bot added the lifecycle/rotten label (denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) on Sep 13, 2024
