
OCPBUGS-14057: Add runbooks description for prometheus alerts which ingress operator provides #166

Open
wants to merge 1 commit into
base: master

@miheer commented Feb 5, 2024

Add runbooks description for prometheus alerts which ingress operator provides.

Ticket: https://issues.redhat.com/browse/OCPBUGS-14057

@openshift-ci bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Feb 5, 2024
Contributor

openshift-ci bot commented Feb 5, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: miheer
Once this PR has been reviewed and has the lgtm label, please assign nautilux for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@miheer changed the title from "WIP: Add runbooks description for prometheus alerts which ingress operator provides" to "OCPBUGS-14057: WIP: Add runbooks description for prometheus alerts which ingress operator provides" Feb 6, 2024
@openshift-ci bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Feb 6, 2024
@openshift-ci-robot added labels Feb 6, 2024:
  • jira/severity-low: Referenced Jira bug's severity is low for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • jira/invalid-bug: Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.

@openshift-ci-robot

@miheer: This pull request references Jira Issue OCPBUGS-14057, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Add runbooks description for prometheus alerts which ingress operator provides.

Ticket: https://issues.redhat.com/browse/OCPBUGS-14057

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@miheer
Author

miheer commented Feb 6, 2024

/jira refresh

@openshift-ci-robot

@miheer: This pull request references Jira Issue OCPBUGS-14057, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh


@simonpasquier
Contributor

@miheer thanks for tackling this!
I'd advise opening a separate PR that creates the alerts/cluster-ingress-operator directory and an OWNERS file listing the people who need to review the ingress runbooks. That way, the team will be able to merge as needed.

@Miciah

Miciah commented Mar 18, 2024

/jira refresh

@openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) and removed the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) Mar 18, 2024

@openshift-ci-robot

@Miciah: This pull request references Jira Issue OCPBUGS-14057, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi

In response to this:

/jira refresh


@openshift-ci bot requested a review from ShudiLi March 18, 2024 15:22
@miheer changed the title from "OCPBUGS-14057: WIP: Add runbooks description for prometheus alerts which ingress operator provides" to "OCPBUGS-14057: Add runbooks description for prometheus alerts which ingress operator provides" Mar 19, 2024
@miheer force-pushed the alert-rules-runbook branch 3 times, most recently from 6e5bb71 to 8c07fdd March 19, 2024 21:57
Contributor

openshift-ci bot commented Mar 19, 2024

@miheer: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ShudiLi
Member

ShudiLi commented Mar 27, 2024

/label qe-approved
thanks

@openshift-ci bot added the qe-approved label (Signifies that QE has signed off on this PR) Mar 27, 2024

@openshift-ci-robot

@miheer: This pull request references Jira Issue OCPBUGS-14057, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Add runbooks description for prometheus alerts which ingress operator provides.

Ticket: https://issues.redhat.com/browse/OCPBUGS-14057


@candita

candita commented Mar 27, 2024

/assign

@candita

candita commented Apr 19, 2024

@miheer the bug only requires runbooks for critical alerts, and HAProxyDown is the only critical alert we fire in cluster-ingress-operator. In my opinion, it would be better to remove the other runbooks you added here and enhance the HAProxyDown runbook with more useful information.

@jan--f
Contributor

jan--f commented Apr 23, 2024

We require runbooks only for critical alerts. A warning alert can certainly have a runbook too if it's helpful to users. If addressing a warning can avoid triggering a critical alert, everybody wins.

If there is information missing in the HAProxyDown runbook though, we should indeed add it.

@candita

candita commented May 13, 2024

> We require runbooks only for critical alerts. A warning alert can certainly have a runbook too if it's helpful to users. If addressing a warning can avoid triggering a critical alert, everybody wins.
>
> If there is information missing in the HAProxyDown runbook though, we should indeed add it.

Is there a way to look at runbook content and determine that it was just a warning alert, rather than a critical alert?

```

- Check the load on the system where the routers are hosted.

@candita commented May 14, 2024

There are many other things they could check. How about describing Prometheus metrics to look at:

  • Use the console to issue a Prometheus query:

Container threads (if healthy, the value should stay fairly consistent, with some fluctuation based on load):
avg(container_threads{namespace='openshift-ingress', container='router'}) by (instance)

Container processes (if healthy, the value should stay fairly consistent, with some fluctuation based on load):
avg(container_processes{namespace='openshift-ingress', container='router'}) by (instance)

I got this from https://access.redhat.com/solutions/5721381, but I'm not sure we should link to an access article from a runbook.

Comment on lines +25 to +26
Try to fix the configuration of the haproxy via ingress controller CR on the
basis of the output of the logs.

Suggested change
- Try to fix the configuration of the haproxy via ingress controller CR on the
- basis of the output of the logs.
+ Try to fix the configuration of the haproxy by editing the ingress controller spec.

Comment on lines +20 to +21
Check if any recently added configuration in the haproxy config via ingress
controller CR caused the issue.

Suggested change
- Check if any recently added configuration in the haproxy config via ingress
- controller CR caused the issue.
+ Check if any recently added configuration in the haproxy config via ingress
+ controller spec caused the issue.


## Diagnosis

- Check the router logs:

@candita commented May 14, 2024

I suggest adding this to each of the runbooks that tell readers to check the logs.

  • Configure access logging:
oc edit -n openshift-ingress-operator ingresscontrollers/default

Set spec.logging.access.destination.type to Container:

spec:
  logging:
    access:
      destination:
        type: Container

To turn it off later, set spec.logging.access: null


Check if any recently added configuration in the haproxy config via ingress
controller CR caused the issue.

Tell them how to access the container and check the haproxy.config for issues.

oc logs <router pod> -n openshift-ingress
```
- Check the yaml file of the ingress controller and operator to see the reason
for failure:

Suggested change
- for failure:
+ for failure. Look for status.conditions:
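
The suggestion to look at status.conditions could be scripted roughly as follows. This is a minimal sketch and not part of the runbook under review: it assumes the JSON comes from a command such as `oc get ingresscontroller default -n openshift-ingress-operator -o json`, and the helper name `unhealthy_conditions` and the sample data are hypothetical. It flags only the `Available` and `Degraded` condition types, because the meaning of True and False differs between other condition types.

```python
import json

# Condition states treated as unhealthy in this sketch (an assumption;
# extend the mapping if other condition types matter in your cluster).
UNHEALTHY = {"Available": "False", "Degraded": "True"}


def unhealthy_conditions(resource, unhealthy=UNHEALTHY):
    """Return (type, reason, message) for each condition in an unhealthy state."""
    conditions = resource.get("status", {}).get("conditions", [])
    return [
        (c["type"], c.get("reason", ""), c.get("message", ""))
        for c in conditions
        if c.get("type") in unhealthy and unhealthy[c["type"]] == c.get("status")
    ]


# Hypothetical sample standing in for real `oc get ... -o json` output.
SAMPLE = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "Available", "status": "True"},
      {"type": "Degraded", "status": "True",
       "reason": "DegradedConditions",
       "message": "One or more other status conditions indicate a degraded state"}
    ]
  }
}
""")

for ctype, reason, message in unhealthy_conditions(SAMPLE):
    print(f"{ctype}: {reason}: {message}")
```

On a live cluster, the sample would be replaced by the parsed command output, for example by piping the `oc` output into `json.load(sys.stdin)`.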

## Diagnosis

Ingress Controller may be degraded due to one or more reasons.

Check the status of all operators, looking for error messages:

oc get co 
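
On a cluster with many operators, the `oc get co` output can be narrowed to the operators that report a problem. The sketch below is illustrative only: it assumes the input is the List object printed by `oc get co -o json`, the function name `failing_operators` and the sample data are hypothetical, and only `Available=False` and `Degraded=True` are treated as failures.

```python
# Condition (type, status) pairs counted as failures in this sketch (an assumption).
FAILING_STATES = {("Available", "False"), ("Degraded", "True")}


def failing_operators(co_list):
    """Map operator name -> list of 'Type=Status: message' strings for failing conditions."""
    failing = {}
    for item in co_list.get("items", []):
        name = item.get("metadata", {}).get("name", "<unknown>")
        problems = []
        for cond in item.get("status", {}).get("conditions", []):
            ctype, status = cond.get("type"), cond.get("status")
            if (ctype, status) in FAILING_STATES:
                problems.append(f"{ctype}={status}: {cond.get('message', '')}")
        if problems:
            failing[name] = problems
    return failing


# Hypothetical sample standing in for real `oc get co -o json` output.
SAMPLE_LIST = {
    "items": [
        {
            "metadata": {"name": "ingress"},
            "status": {"conditions": [
                {"type": "Available", "status": "True"},
                {"type": "Degraded", "status": "True", "message": "example message"},
            ]},
        },
        {
            "metadata": {"name": "dns"},
            "status": {"conditions": [
                {"type": "Available", "status": "True"},
                {"type": "Degraded", "status": "False"},
            ]},
        },
    ]
}

for operator, problems in failing_operators(SAMPLE_LIST).items():
    for problem in problems:
        print(f"{operator}: {problem}")
```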

Comment on lines +18 to +21
So, we considered modifying the ingress operator
to list all Ingresses and Routes in the cluster and publish a metric
for Routes that were created for Ingresses that OpenShift no longer
manage.

Please remove this.

Comment on lines +13 to +16
to expose this Ingress. Again, it is impossible to determine reliably
what a user's intent was in such a scenario, but as OpenShift exposed
such an Ingress before this enhancement, changing this behavior could
break existing applications.

Don't include design info.

Suggested change
- to expose this Ingress. Again, it is impossible to determine reliably
- what a user's intent was in such a scenario, but as OpenShift exposed
- such an Ingress before this enhancement, changing this behavior could
- break existing applications.
+ to expose this Ingress.

for longer than one day.

## Impact

@candita commented May 14, 2024

Add:

Warning only.

If this is a valid Ingress resource, it needs an ingressClassName to stop the alert. ingressClassName is the name of an IngressClass cluster resource. Otherwise, delete the misconfigured Ingress.
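
To find the Ingress resources this advice applies to, one option is to filter the cluster's Ingress list for objects without a class name. The following is a sketch under stated assumptions: the input is the List object from `oc get ingress --all-namespaces -o json`, the function name `ingresses_without_class` and the sample data are hypothetical, and only spec.ingressClassName is checked.

```python
def ingresses_without_class(ingress_list):
    """Return 'namespace/name' for every Ingress with no spec.ingressClassName."""
    missing = []
    for item in ingress_list.get("items", []):
        if not item.get("spec", {}).get("ingressClassName"):
            meta = item.get("metadata", {})
            missing.append(f"{meta.get('namespace', '')}/{meta.get('name', '')}")
    return missing


# Hypothetical sample standing in for real `oc get ingress -A -o json` output.
SAMPLE_INGRESSES = {
    "items": [
        {"metadata": {"namespace": "shop", "name": "frontend"},
         "spec": {"ingressClassName": "openshift-default"}},
        {"metadata": {"namespace": "shop", "name": "legacy"},
         "spec": {}},
    ]
}

for ref in ingresses_without_class(SAMPLE_INGRESSES):
    print(ref)
```

Each entry returned is a candidate for either setting an ingressClassName or deleting the Ingress, as described above.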

This alert fires when HAProxy fails to reload its configuration, which will
result in the router not picking up recently created or modified routes.

## Impact

Add:

Warning only.


This alert fires when there is a Route owned by an unmanaged Ingress.

## Impact

Add:

Warning only.

## Diagnosis

Check for alert messages on the UI.
Inspect the ingress object.

For what?


Check for alert messages on the UI.
Inspect the ingress object.
Inspect the route object.

For what?


## Diagnosis

Check for alert messages on the UI.

Remove this. The alert is already known.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci bot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) Aug 13, 2024

@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci bot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) Sep 13, 2024
Labels
  • jira/severity-low: Referenced Jira bug's severity is low for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • qe-approved: Signifies that QE has signed off on this PR

8 participants