
Prometheus Alerts Integration #14238

Closed
moolitayer opened this issue Mar 8, 2017 · 17 comments

Comments

moolitayer commented Mar 8, 2017

The purpose of this issue is to document the status & usage of the ManageIQ integration with Prometheus alerts for Kubernetes/OpenShift.

Table of Contents

Description
Status
Setup
Debugging
Alerts

Description

Prometheus is used as an external alerting component and ManageIQ collects alerts from it and attaches them to inventory objects.

It is then possible to view ongoing alerts in ManageIQ's Monitoring screen and manage their life cycle (view data & related objects, assign, acknowledge, comment).

  • Alert definitions are configured in a Prometheus instance running inside a container cluster.
  • ManageIQ picks up the alerts from Prometheus.
  • Operators can view active alerts per provider in the alerts Dashboard screen.
  • After alerts are resolved in the cluster they disappear from the screen.
  • Alerts have metadata: severity, URL ("View SOP"), description, miqTarget.
  • Users can assign & acknowledge alerts in the alerts list screen.

Status

Tech Preview for the Gaprindashvili release

Setup

  1. Add Prometheus to an OpenShift cluster & configure alert definitions:
    The only supported way is to install OpenShift using openshift-ansible, first implemented in:
    Create ansible role for deploying prometheus on openshift openshift/openshift-ansible#4509
    (add the inventory flag: openshift_hosted_prometheus_deploy=true)
    Configure alerts in Prometheus:
$ oc edit configmap -n openshift-metrics prometheus
# Supported annotations:
#   miqTarget: ContainerNode|ExtManagementSystem, defaults to ContainerNode.
#   miqIgnore: "true|false", should ManageIQ pick up this alert, defaults to true.
#   description: A string the screen will show
# Labels:
#   severity: ERROR|WARNING|INFO. defaults to ERROR.

  prometheus.rules: |
    groups:
    - name: example-rules
      interval: 30s # defaults to global interval
      rules:
      # 
      # ------------- Copy below this line -------------
      #
      - alert: "NodeDown"
        expr: up{job="kubernetes-nodes"} == 0
        annotations:
          miqTarget: "ContainerNode"
          severity: "ERROR"
          url: "https://www.example.com/node_down_fixing_instructions"
          description: "Node {{$labels.instance}} is down"
      - alert: "TooManyRequests"
        expr: rate(authenticated_user_requests[2m]) > 12
        annotations:
          miqTarget: "ExtManagementSystem"
          severity: "ERROR"
          url: "https://www.example.com/too_many_requests_fixing_instructions"
          description: "Too many authenticated requests"
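
To validate edited rules before reloading, one can extract them from the configmap and check them locally (a sketch; it assumes promtool from a Prometheus 2.x release is available locally, and the output file name is arbitrary):

oc get configmap prometheus -n openshift-metrics -o go-template --template='{{index .data "prometheus.rules"}}' > prometheus.rules.yml
promtool check rules prometheus.rules.yml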

See some common alerts in the Alerts section below.
Note: to reload the configuration, delete the pod OR send a HUP signal to the Prometheus process.
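
A minimal sketch of both reload options (assuming the pod is named prometheus-0, as in the debugging section below, and that Prometheus runs as PID 1 in its container with a kill binary available):

oc delete pod prometheus-0 -n openshift-metrics                         # pod is recreated with the new config
oc exec prometheus-0 -c prometheus -n openshift-metrics -- kill -HUP 1  # or: reload without restarting the pod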

  2. Go to Control -> Explorer and create one container node alert based on "all datawarehouse alerts" and one provider alert based on "all datawarehouse alerts"
  3. Add each new alert to an alert profile
  4. Assign "Prometheus node Profile" and "Prometheus Provider Profile" to the enterprise
  5. Add a Prometheus alerts endpoint (see the sketch below)
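
For the endpoint in step 5, you will need at least the hostname of the alerts route and a bearer token; a sketch for collecting both, assuming the route is named alerts in openshift-metrics and the management-admin service account is used (as in the debugging steps below):

OPENSHIFT_PROMETHEUS_ALERTS_ROUTE=$(oc get routes -n openshift-metrics -o go-template --template='{{.spec.host}}' alerts)  # endpoint hostname
OPENSHIFT_MANAGEMENT_ADMIN_TOKEN=$(oc sa get-token -n management-infra management-admin)                                   # bearer token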

Debugging

Prometheus Side

  1. Is the prometheus container running?
$ oc get pod prometheus-0 -n openshift-metrics
  2. Is prometheus-alert-buffer in the cluster returning results?
OPENSHIFT_PROMETHEUS_ALERTS_ROUTE=$(oc get routes -n openshift-metrics -o go-template --template='{{.spec.host}}' alerts)
OPENSHIFT_MANAGEMENT_ADMIN_TOKEN=$(oc sa get-token -n management-infra management-admin)
curl  -H "Authorization: Bearer ${OPENSHIFT_MANAGEMENT_ADMIN_TOKEN}" -k https://${OPENSHIFT_PROMETHEUS_ALERTS_ROUTE}/topics/alerts

ManageIQ - Worker Management

  1. Is the event collection worker running?
bundle exec bin/rake evm:status | grep "MonitoringManager::EventCatcher"
  2. If the answer to the previous question is no: is the MonitoringManager's authentication_status_ok? (otherwise event collection will not start) Is the event collection role on?
# replace first with your manager (.find(<id>))
ManageIQ::Providers::Openshift::MonitoringManager.first.authentication_status_ok?
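# the event collection role must also be enabled on the server; a hypothetical check,
# assuming the standard "event" server role required by event catcher workers
MiqServer.my_server.has_active_role?("event")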

ManageIQ - Collection and Alerting logic

  1. Is the log showing event collection adding events to the event queue? Is there an ERROR in evm.log?
less log/container_monitoring.log
  2. Do we have ems_events persisted in the system? Are they translated to alerts?
# rails console
EmsEvent.where(:source=>"DATAWAREHOUSE").count # How many events were recorded in ManageIQ?
MiqAlertStatus.count # How many alerts are there (including resolved, one per incident)?
pp MiqAlertStatus.all # output
pp EmsEvent.where(:source=>"DATAWAREHOUSE").all # output
  3. To restart event collection from scratch (destructive! do not run on production systems; valuable for debugging):
- EmsEvent.where(:source=>"DATAWAREHOUSE").destroy_all
- MiqAlertStatus.destroy_all
- systemctl restart evmserverd # restart evm

Alerts

Here are some common usable alerts:

  prometheus.rules: |
    groups:
    - name: example-rules
      rules:
      - alert: "NodeDown" # alert names must be valid metric names (no spaces)
        expr: up{job="kubernetes-nodes"} == 0
        annotations:
          miqTarget: "ContainerNode"
          url: "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
          description: "{{$labels.instance}} is down"
        labels: 
          severity: "ERROR"
      - alert: "NodeUp" # helpful for testing
        expr: up{job="kubernetes-nodes"} == 1
        annotations:
          miqTarget: "ContainerNode"
          url: "https://www.example.com/fixing_instructions"
          description: "Alerts configured correctly! ContainerNode {{$labels.instance}} is up"
        labels: 
          severity: "ERROR"
      - alert: "BigMemNode" # high memory usage, not currently usable since it's a fixed number
        expr: container_spec_memory_limit_bytes > 1000000000
        annotations:
          url: "https://www.example.com/fixing_instructions"
          description: "Huge node detected"
        labels: 
          severity: "ERROR"
      - alert: "TooManyPods"
        expr: sum(kubelet_running_pod_count) > 20
        annotations:
          url: "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
          description: "Too many pods! Please delete"
        labels: 
          severity: "ERROR"

The pod count alert can be easily triggered using:

oc new-app https://github.com/openshift/ruby-hello-world.git
oc scale dc ruby-hello-world --replicas=10 # trigger (might need to adjust the container number)
oc scale dc ruby-hello-world --replicas=1 # resolve
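
To confirm the expression has crossed its threshold, one can query Prometheus directly (a sketch; it assumes the openshift-ansible deployment also exposes a route named prometheus in openshift-metrics and that the management-admin token is accepted there):

OPENSHIFT_PROMETHEUS_ROUTE=$(oc get routes -n openshift-metrics -o go-template --template='{{.spec.host}}' prometheus)
curl -k -H "Authorization: Bearer ${OPENSHIFT_MANAGEMENT_ADMIN_TOKEN}" "https://${OPENSHIFT_PROMETHEUS_ROUTE}/api/v1/query?query=sum(kubelet_running_pod_count)"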

Triggering the "TooManyRequests" (too many authenticated requests) alert:

while true; do curl -k -s -H "Authorization: Bearer $OPENSHIFT_MANAGEMENT_ADMIN_TOKEN" https://$OPENSHIFT_MASTER_HOST:8443/api/v1/pods &> /dev/null ; done
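
The loop assumes OPENSHIFT_MASTER_HOST points at the master API host and that OPENSHIFT_MANAGEMENT_ADMIN_TOKEN is set as in the debugging section; a sketch for setting both (the hostname is a placeholder):

OPENSHIFT_MASTER_HOST=master.example.com   # replace with your master's hostname
OPENSHIFT_MANAGEMENT_ADMIN_TOKEN=$(oc sa get-token -n management-infra management-admin)
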
@moolitayer

@miq-bot assign moolitayer
@miq-bot add_label providers/containers, enhancement

@moolitayer moolitayer changed the title [WIP] cm-ops alerts feature [WIP] Prometheus alerts feature Sep 26, 2017
@moolitayer

@joelddiez @shalomnaim1 Please review

shalomnaim1 commented Sep 26, 2017

You have a typo in the setup section

The only supported way is to install OpenShift using OpenShift ansible, inplemented in: ...

I believe you meant "implemented", right?

@shalomnaim1

It would be nice if you set a link to openshift/openshift-ansible#4509 so it would be easier to access this reference.

@moolitayer moolitayer changed the title [WIP] Prometheus alerts feature Prometheus alerts feature Oct 1, 2017
@moolitayer moolitayer changed the title Prometheus alerts feature Prometheus Alerts Integration Oct 15, 2017
@moolitayer

cc @joelddiaz

@joelddiaz

It would be helpful to have some text explaining the various 'annotation' fields for the Prometheus alerts. Specifically, listing the acceptable values for miqTarget & severity.

moolitayer commented Oct 17, 2017

@joelddiaz I'm working now on implementing provider targeted alerts, will update docs afterwards

@moolitayer

@joelddiaz @ilackarms updated the document. Note that one should now set up two miq alerts, not one.

@ilackarms

@moolitayer unable to currently set up any alerts

@moolitayer

@moolitayer unable to currently set up any alerts

ManageIQ/manageiq-ui-classic#2714 should help

@moolitayer

@shalomnaim1 can you please paste the two alert definitions you are using in tests?

@shalomnaim1

@moolitayer, in the Debugging section under step 2, you added an example of how to get the currently firing alerts from Prometheus. It seems like you used 2 different variables for accessing the route: the route is saved to PROMETHEUS_ALERTS_ROUTE, but in the curl request you used OPENSHIFT_PROMETHEUS_ALERTS_ROUTE.

pemcg commented Feb 8, 2018

In the examples here you're defining the severity as both an annotation and a label:

        annotations:
          severity: "ERROR"

and

        labels: 
          severity: "ERROR"

@moolitayer Is there a recommendation as to which is preferable?

@miq-bot miq-bot added the stale label Aug 13, 2018

miq-bot commented Aug 13, 2018

This issue has been automatically marked as stale because it has not been updated for at least 6 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions!

cben commented Aug 21, 2018

This is great documentation 👍
Not sure if this is 100% complete or whether there was any more work intended here; anyway, we're no longer improving this, closing.
@miq-bot close

miq-bot commented Aug 21, 2018

@cben unrecognized command 'close', ignoring...

Accepted commands are: add_label, add_reviewer, assign, close_issue, move_issue, remove_label, rm_label, set_milestone

cben commented Aug 21, 2018

@miq-bot close-issue

@miq-bot miq-bot closed this as completed Aug 21, 2018