
Prometheus Alerts Integration #14238

Closed
moolitayer opened this issue Mar 8, 2017 · 17 comments

Comments

moolitayer commented Mar 8, 2017

The purpose of this issue is to document the status & usage of the ManageIQ integration with Prometheus alerts for Kubernetes/OpenShift.

Table of Contents

Description
Status
Setup
Debugging
Alerts

Description

Prometheus is used as an external alerting component and ManageIQ collects alerts from it and attaches them to inventory objects.

It is then possible to view ongoing alerts in ManageIQ's Monitoring screen and manage their life cycle (view data & related objects, assign, acknowledge, comment).

  • Alert definitions are configured in a Prometheus instance running inside a container cluster.
  • ManageIQ picks up the alerts from Prometheus.
  • Operators can view active alerts per provider in the alerts Dashboard screen.
  • After alerts are resolved in the cluster they disappear from the screen.
  • Alerts have metadata: severity, URL ("View SOP"), description, miqTarget.
  • Users can assign & acknowledge alerts in the alerts list screen.

Status

Tech Preview for the Gaprindashvili release

Setup

  1. Add Prometheus to an OpenShift cluster & configure alert definitions:
    The only supported way is to install OpenShift using openshift-ansible, first implemented in:
    Create ansible role for deploying prometheus on openshift openshift/openshift-ansible#4509
    (add the inventory flag: openshift_hosted_prometheus_deploy=true)
    Configure alerts in Prometheus:
$ oc edit configmap -n openshift-metrics prometheus
# Supported annotations:
#   miqTarget: ContainerNode|ExtManagementSystem, defaults to ContainerNode.
#   miqIgnore: "true|false", should ManageIQ pick up this alert, defaults to true.
#   description: A string the screen will show
# Labels:
#   severity: ERROR|WARNING|INFO. defaults to ERROR.

  prometheus.rules: |
    groups:
    - name: example-rules
      interval: 30s # defaults to global interval
      rules:
      # 
      # ------------- Copy below this line -------------
      #
      - alert: "NodeDown"
        expr: up{job="kubernetes-nodes"} == 0
        annotations:
          miqTarget: "ContainerNode"
          severity: "ERROR"
          url: "https://www.example.com/node_down_fixing_instructions"
          description: "Node {{$labels.instance}} is down"
      - alert: "TooManyRequests"
        expr: rate(authenticated_user_requests[2m]) > 12
        annotations:
          miqTarget: "ExtManagementSystem"
          severity: "ERROR"
          url: "https://www.example.com/too_many_requests_fixing_instructions"
          description: "Too many authenticated requests"
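
To validate edited rules before reloading, one can extract them from the configmap and check them locally (a sketch; it assumes promtool from a Prometheus 2.x release is available locally, and the output file name is arbitrary):

oc get configmap prometheus -n openshift-metrics -o go-template --template='{{index .data "prometheus.rules"}}' > prometheus.rules.yml
promtool check rules prometheus.rules.yml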

See some common alerts in the Alerts section below.
Note: to reload the configuration, delete the pod OR send a HUP signal to the Prometheus process.
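
A minimal sketch of both reload options (assuming the pod is named prometheus-0, as in the debugging section below, and that Prometheus runs as PID 1 in its container with a kill binary available):

oc delete pod prometheus-0 -n openshift-metrics                         # pod is recreated with the new config
oc exec prometheus-0 -c prometheus -n openshift-metrics -- kill -HUP 1  # or: reload without restarting the pod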

  2. Go to Control -> Explorer and create one container node alert based on "all datawarehouse alerts" and one provider alert based on "all datawarehouse alerts"
  3. Add each new alert to an alert profile
  4. Assign "Prometheus node Profile" and "Prometheus Provider Profile" to the enterprise
  5. Add a Prometheus alerts endpoint (see the sketch below)
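
For the endpoint in step 5, you will need at least the hostname of the alerts route and a bearer token; a sketch for collecting both, assuming the route is named alerts in openshift-metrics and the management-admin service account is used (as in the debugging steps below):

OPENSHIFT_PROMETHEUS_ALERTS_ROUTE=$(oc get routes -n openshift-metrics -o go-template --template='{{.spec.host}}' alerts)  # endpoint hostname
OPENSHIFT_MANAGEMENT_ADMIN_TOKEN=$(oc sa get-token -n management-infra management-admin)                                   # bearer token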

Debugging

Prometheus Side

  1. Is the prometheus container running?
$ oc get pod prometheus-0 -n openshift-metrics
  2. Is prometheus-alert-buffer in the cluster returning results?
OPENSHIFT_PROMETHEUS_ALERTS_ROUTE=$(oc get routes -n openshift-metrics -o go-template --template='{{.spec.host}}' alerts)
OPENSHIFT_MANAGEMENT_ADMIN_TOKEN=$(oc sa get-token -n management-infra management-admin)
curl  -H "Authorization: Bearer ${OPENSHIFT_MANAGEMENT_ADMIN_TOKEN}" -k https://${OPENSHIFT_PROMETHEUS_ALERTS_ROUTE}/topics/alerts

ManageIQ - Worker Management

  1. Is the event collection worker running?
bundle exec bin/rake evm:status | grep "MonitoringManager::EventCatcher"
  2. If the answer to the previous question is no: is the MonitoringManager's authentication_status_ok? (otherwise event collection will not start) Is the event collection role on?
# replace first with your manager (.find(<id>))
ManageIQ::Providers::Openshift::MonitoringManager.first.authentication_status_ok?
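# the event collection role must also be enabled on the server; a hypothetical check,
# assuming the standard "event" server role required by event catcher workers
MiqServer.my_server.has_active_role?("event")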

ManageIQ - Collection and Alerting logic

  1. Is the log showing event collection adding events to the event queue? Is there an ERROR in evm.log?
less log/container_monitoring.log
  2. Do we have ems_events persisted in the system? Are they translated to alerts?
# rails console
EmsEvent.where(:source=>"DATAWAREHOUSE").count # How many events were recorded in ManageIQ?
MiqAlertStatus.count # How many alerts are there (including resolved, one per incident)?
pp MiqAlertStatus.all # output
pp EmsEvent.where(:source=>"DATAWAREHOUSE").all # output
  3. To restart event collection from scratch (destructive! do not run on production systems; valuable for debugging):
- EmsEvent.where(:source=>"DATAWAREHOUSE").destroy_all
- MiqAlertStatus.destroy_all
- systemctl restart evmserverd # restart evm

Alerts

Here are some common usable alerts:

  prometheus.rules: |
    groups:
    - name: example-rules
      rules:
      - alert: "NodeDown" # alert names must be valid metric names (no spaces)
        expr: up{job="kubernetes-nodes"} == 0
        annotations:
          miqTarget: "ContainerNode"
          url: "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
          description: "{{$labels.instance}} is down"
        labels: 
          severity: "ERROR"
      - alert: "NodeUp" # helpful for testing
        expr: up{job="kubernetes-nodes"} == 1
        annotations:
          miqTarget: "ContainerNode"
          url: "https://www.example.com/fixing_instructions"
          description: "Alerts configured correctly! ContainerNode {{$labels.instance}} is up"
        labels: 
          severity: "ERROR"
      - alert: "BigMemNode" # high memory usage, not currently usable since it's a fixed number
        expr: container_spec_memory_limit_bytes > 1000000000
        annotations:
          url: "https://www.example.com/fixing_instructions"
          description: "Huge node detected"
        labels: 
          severity: "ERROR"
      - alert: "TooManyPods"
        expr: sum(kubelet_running_pod_count) > 20
        annotations:
          url: "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
          description: "Too many pods! Please delete"
        labels: 
          severity: "ERROR"

The pod count alert can be easily triggered using:

oc new-app https://github.com/openshift/ruby-hello-world.git
oc scale dc ruby-hello-world --replicas=10 # trigger (might need to adjust the container number)
oc scale dc ruby-hello-world --replicas=1 # resolve
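
To confirm the expression has crossed its threshold, one can query Prometheus directly (a sketch; it assumes the openshift-ansible deployment also exposes a route named prometheus in openshift-metrics and that the management-admin token is accepted there):

OPENSHIFT_PROMETHEUS_ROUTE=$(oc get routes -n openshift-metrics -o go-template --template='{{.spec.host}}' prometheus)
curl -k -H "Authorization: Bearer ${OPENSHIFT_MANAGEMENT_ADMIN_TOKEN}" "https://${OPENSHIFT_PROMETHEUS_ROUTE}/api/v1/query?query=sum(kubelet_running_pod_count)"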

Triggering the "TooManyRequests" (too many authenticated requests) alert:

while true; do curl -k -s -H "Authorization: Bearer $OPENSHIFT_MANAGEMENT_ADMIN_TOKEN" https://$OPENSHIFT_MASTER_HOST:8443/api/v1/pods &> /dev/null ; done
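
The loop assumes OPENSHIFT_MASTER_HOST points at the master API host and that OPENSHIFT_MANAGEMENT_ADMIN_TOKEN is set as in the debugging section; a sketch for setting both (the hostname is a placeholder):

OPENSHIFT_MASTER_HOST=master.example.com   # replace with your master's hostname
OPENSHIFT_MANAGEMENT_ADMIN_TOKEN=$(oc sa get-token -n management-infra management-admin)
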
@moolitayer

@miq-bot assign moolitayer
@miq-bot add_label providers/containers, enhancement

@moolitayer moolitayer changed the title [WIP] cm-ops alerts feature [WIP] Prometheus alerts feature Sep 26, 2017
@moolitayer

@joelddiez @shalomnaim1 Please review

shalomnaim1 commented Sep 26, 2017

You have a typo in the setup section

The only supported way is to install OpenShift using OpenShift ansible, inplemented in: ...

I believe you meant "implemented", right?

@shalomnaim1

It would be nice if you set a link to openshift/openshift-ansible#4509 so it would be easier to access this reference.

@moolitayer moolitayer changed the title [WIP] Prometheus alerts feature Prometheus alerts feature Oct 1, 2017
@moolitayer moolitayer changed the title Prometheus alerts feature Prometheus Alerts Integration Oct 15, 2017
@moolitayer

cc @joelddiaz

@joelddiaz

It would be helpful to have some text explaining the various 'annotation' fields for the Prometheus alerts. Specifically, listing the acceptable values for miqTarget & severity.

moolitayer commented Oct 17, 2017

@joelddiaz I'm working now on implementing provider targeted alerts, will update docs afterwards

@moolitayer

@joelddiaz @ilackarms updated the document. Note that one should now set up two miq alerts, not one.

@ilackarms

@moolitayer unable to currently set up any alerts

@moolitayer

@moolitayer unable to currently set up any alerts

ManageIQ/manageiq-ui-classic#2714 should help

@moolitayer

@shalomnaim1 can you please paste the two alert definitions you are using in tests?

@shalomnaim1

@moolitayer, in the Debugging section under step 2, you added an example of how to get the currently firing alerts from Prometheus. It seems like you used 2 different variables for accessing the route: the route is saved to PROMETHEUS_ALERTS_ROUTE, but in the curl request you used OPENSHIFT_PROMETHEUS_ALERTS_ROUTE.

pemcg commented Feb 8, 2018

In the examples here you're defining the severity as both an annotation and a label:

        annotations:
          severity: "ERROR"

and

        labels: 
          severity: "ERROR"

@moolitayer Is there a recommendation as to which is preferable?

@miq-bot miq-bot added the stale label Aug 13, 2018

miq-bot commented Aug 13, 2018

This issue has been automatically marked as stale because it has not been updated for at least 6 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions!

cben commented Aug 21, 2018

This is great documentation 👍
Not sure if this is 100% complete or whether there was any more work intended here; anyway, we're no longer improving this, closing.
@miq-bot close

miq-bot commented Aug 21, 2018

@cben unrecognized command 'close', ignoring...

Accepted commands are: add_label, add_reviewer, assign, close_issue, move_issue, remove_label, rm_label, set_milestone

cben commented Aug 21, 2018

@miq-bot close-issue

@miq-bot miq-bot closed this as completed Aug 21, 2018