
Firing -> Resolved -> Firing issue with prometheus + alertmanager #952

@artemlive

Description


Hello!
I have been struggling with one problem ever since I started using Prometheus.
I use blackbox_exporter to monitor whether a domain is reachable over HTTP, and Alertmanager to send notifications to Slack, Telegram and SMS.
Sometimes, when the domain is unreachable over HTTP, Prometheus sends the correct notifications that the server is down, but at some point a message with status "resolved" arrives even though the server is still unreachable, and some time later a message with status "firing" arrives again.
The configuration is given below:
Prometheus (part of config):

global:
  scrape_interval: 10s
  scrape_timeout: 8s
  evaluation_interval: 1m
rule_files:
  - /etc/prometheus/production.rules
scrape_configs:
- job_name: 'myserver.com_n10.myserver.com_ui'
  scrape_interval: 30s
  metrics_path: /probe
  params:
    module: [ myserver.com_n10.myserver.com_ui ]  # Look for a HTTP 200 response.
  static_configs:
  - targets:
    - myserver.com
    labels:
      server: 'myserver.com'
      network: 'n10.myserver.com'
      domain: 'n10.myserver.com'
      type: 'http_ui'
  relabel_configs:
    - source_labels: [__address__]
      regex: (.*)(:80)?
      target_label: __param_target
      replacement: http://${1}/login
    - source_labels: [__param_target]
      regex: (.*)
      target_label: instance
      replacement: ${1}
    - source_labels: []
      regex: .*
      target_label: __address__
      replacement: 127.0.0.1:9115  # Blackbox exporter.
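
(For reference, what this relabeling actually scrapes can be checked through the Prometheus HTTP API, roughly like this; a sketch, with the host and port taken from the generatorURL shown further below, adjust if yours differ:)

# instance is rewritten to the probed URL (http://myserver.com/login) by the relabel_configs above
curl -sG 'http://myserver.com:9090/api/v1/query' \
  --data-urlencode 'query=probe_success{job="myserver.com_n10.myserver.com_ui"}'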

blackbox.yml

myserver.com_n10.myserver.com_ui:
    prober: http
    timeout: 10s
    http:
     valid_status_codes: [ 200, 302, 301]  # Defaults to 2xx
     method: GET
     headers:
       Host: someheader.com
     no_follow_redirects: true
     fail_if_ssl: false
     tls_config:
       insecure_skip_verify: false
     protocol: "tcp"
     preferred_ip_protocol: "ip4"
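
(The blackbox exporter can also be probed directly, bypassing Prometheus, with the same module and target that the relabel_configs produce; a sketch, assuming the exporter listens on 127.0.0.1:9115 as configured above:)

# Run the same probe by hand and inspect probe_success and the other probe_* metrics it returns
curl -s 'http://127.0.0.1:9115/probe?module=myserver.com_n10.myserver.com_ui&target=http://myserver.com/login'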

alertmanager.yml

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'smtp9.myserver.com:25'
  smtp_from: 'alertmanager@myserver.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  resolve_timeout: 1h
route:
  group_by: ['alertname', 'instance', 'type']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5m

route:
 receiver: slack_general
 routes:
  - match:
      severity: sms
    receiver: sms_admins
    continue: true
  - match:
      severity: warning
    receiver: slack_general
    repeat_interval: 1h
    continue: true
  - match:
      severity: warning
    receiver: admins
    repeat_interval: 1h
  - match_re:
      severity: critical|sms
    receiver: slack_general
    continue: true
  - match_re:
      severity: critical|sms
    receiver: admins
receivers:
- name: slack_general
  slack_configs:
  - api_url: https://hooks.slack.com/services/SLACK_URI
    channel: '#notifications'
    send_resolved: true
    title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] '
    text:  '{{ range $val := .Alerts }}
             Alert: <http://myserver.com:9095/#/alerts|{{ index $val.Annotations "description" }}>
             {{ end}}'
- name: admins
  webhook_configs:
  - send_resolved: True
    url: http://127.0.0.1:9088/alert/telegram_id1,telegram_id2
templates:
- '/etc/prometheus/alertmanager/templates/default.tmpl'
#test for rule inhibition, used for free space trigger for now
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance', 'mountpoint']
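
(As a side note, the alerts Alertmanager currently holds, with their startsAt/endsAt, can be listed through its API; a sketch assuming the v1 API and the :9095 address used in the Slack template above:)

# Compare what Alertmanager holds against what Prometheus sends
curl -s 'http://myserver.com:9095/api/v1/alerts'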

alert.rules

ALERT HTTPProblemOnUI
 IF probe_success{job="myserver.com_n10.myserver.com_ui"}  != 1
 FOR 5m
 LABELS { severity = "critical" }
 ANNOTATIONS {
 summary = "HTTP check of myserver.com on server myserver.com",
 description = "HTTP check of myserver.com (UI domain on myserver.com) on server s43.myserver.com",
}
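
(The built-in ALERTS series shows when Prometheus itself considered this alert pending or firing, independent of what Alertmanager received; a sketch using the same host and port as the generatorURL below:)

curl -sG 'http://myserver.com:9090/api/v1/query' \
  --data-urlencode 'query=ALERTS{alertname="HTTPProblemOnUI"}'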

Part of alertmanager journal (alert HTTPProblemOnUI[4f4a71c][active] ):

Aug 17 20:14:38 myserver.com alertmanager[13966]: time="2017-08-17T20:14:38Z" level=debug msg="Received alert" alert=HTTPProblemOnUI[4f4a71c][active] component=dispatcher source="dispatch.go:184"
Aug 17 20:15:08 myserver.com alertmanager[13966]: time="2017-08-17T20:15:08Z" level=debug msg="flushing [HTTPProblemOnUI[ff7e520][active] HTTPProblemOnUI[31bfcf0][active] HTTPProblemOnUI[7eb9f05][active] HTTPProblemOnUI[abcb7d9][active] HTTPProblemOnUI[e3c0fa7][active] HTTPProblemOnUI[c7c16dc][active] HTTPProblemOnUI[4215379][active] HTTPProblemOnUI[c158c7c][active] HTTPProblemOnUI[f5691c0][active] HTTPProblemOnUI[4f4a71c][active] HTTPProblemOnUI[ee6848c][active] HTTPProblemOnUI[8c65933][active] HTTPProblemOnUI[82dcc81][active] HTTPProblemOnUI[7da68da][active]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="HTTPProblemOnUI"} source="dispatch.go:426"
Aug 17 20:15:08 myserver.com alertmanager[13966]: time="2017-08-17T20:15:08Z" level=debug msg="flushing [HTTPProblemOnUI[ff7e520][active] HTTPProblemOnUI[c7c16dc][active] HTTPProblemOnUI[f5691c0][active] HTTPProblemOnUI[8c65933][active] HTTPProblemOnUI[4215379][active] HTTPProblemOnUI[7eb9f05][active] HTTPProblemOnUI[abcb7d9][active] HTTPProblemOnUI[e3c0fa7][active] HTTPProblemOnUI[c158c7c][active] HTTPProblemOnUI[82dcc81][active] HTTPProblemOnUI[4f4a71c][active] HTTPProblemOnUI[ee6848c][active] HTTPProblemOnUI[31bfcf0][active] HTTPProblemOnUI[7da68da][active]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="HTTPProblemOnUI"} source="dispatch.go:426"
Aug 17 20:15:38 myserver.com alertmanager[13966]: time="2017-08-17T20:15:38Z" level=debug msg="Received alert" alert=HTTPProblemOnUI[4f4a71c][active] component=dispatcher source="dispatch.go:184"
Aug 17 20:16:38 myserver.com alertmanager[13966]: time="2017-08-17T20:16:38Z" level=debug msg="Received alert" alert=HTTPProblemOnUI[4f4a71c][active] component=dispatcher source="dispatch.go:184"

Here it went into the "resolved" state (even though the query shows that probe_success was still 0 at that moment; I'll show that below):

...
Aug 17 20:33:41 myserver.com alertmanager[13966]: time="2017-08-17T20:33:41Z" level=debug msg="Received alert" alert=HTTPProblemOnUI[4f4a71c][active] component=dispatcher source="dispatch.go:184"
Aug 17 20:34:38 myserver.com alertmanager[13966]: time="2017-08-17T20:34:38Z" level=debug msg="Received alert" alert=HTTPProblemOnUI[4f4a71c][resolved] component=dispatcher source="dispatch.go:184"
...

Then it became "active" again:

Aug 17 20:48:38 myserver.com alertmanager[13966]: time="2017-08-17T20:48:38Z" level=debug msg="Received alert" alert=HTTPProblemOnUI[4f4a71c][active] component=dispatcher source="dispatch.go:184"

Here is the data from Prometheus:
http://joxi.net/Dr81e9eikPQDLm
http://joxi.net/1A51eJeiK1g1GA

Everything began at:

~]# date -d @1503000569.746
Thu Aug 17 20:09:29 UTC 2017

and ended at:

~]# date -d @1503002909.746
Thu Aug 17 20:48:29 UTC 2017
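
(The raw probe_success samples for this window can be pulled over the range API as well; a sketch using the two timestamps converted above and the 30s scrape interval:)

curl -sG 'http://myserver.com:9090/api/v1/query_range' \
  --data-urlencode 'query=probe_success{job="myserver.com_n10.myserver.com_ui"}' \
  --data-urlencode 'start=1503000569' \
  --data-urlencode 'end=1503002909' \
  --data-urlencode 'step=30'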

Here is the data that Prometheus sent to Alertmanager during this interval:

[{"labels":{"alertname":"HTTPProblemOnUI","domain":"n10.myserver.com","instance":"http://myserver.com/login","job":"myserver.com_n10.myserver.com_ui","network":"n10.myserver.com","server":"myserver.com","severity":"critical","type":"http_ui"},"annotations":{"description":"HTTP check of myserver.com on server myserver.com","summary":"HTTP check of myserver.com (UI domain on myserver.com) on server s43.myserver.com"},"startsAt":"2017-08-17T20:14:38.52Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://myserver.com:9090/graph"}]

Then, for some reason, an alert with "endsAt" set arrived, and accordingly a "resolved" notification was sent:

[{"labels":{"alertname":"HTTPProblemOnUI","domain":"n10.myserver.com","instance":"http://myserver.com/login","job":"myserver.com_n10.myserver.com_ui","network":"n10.myserver.com","server":"myserver.com","severity":"critical","type":"http_ui"},"annotations":{"description":"HTTP check of myserver.com on server myserver.com","summary":"HTTP check of myserver.com (UI domain on myserver.com) on server s43.myserver.com"},"startsAt":"2017-08-17T20:14:38.52Z","endsAt":"2017-08-17T20:34:38.52Z","generatorURL":"http://myserver.com:9090/graph"}]

And then a notification with status "firing" came again:

[{"labels":{"alertname":"HTTPProblemOnUI","domain":"n10.myserver.com","instance":"http://myserver.com/login","job":"myserver.com_n10.myserver.com_ui","network":"n10.myserver.com","server":"myserver.com","severity":"critical","type":"http_ui"},"annotations":{"description":"HTTP check of myserver.com on server myserver.com","summary":"HTTP check of myserver.com (UI domain on myserver.com) on server s43.myserver.com"},"startsAt":"2017-08-17T20:47:38.52Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://myserver.com:9090/graph"}]

If any additional information is necessary, I am ready to provide it.
One nuance: we use many alerts with the same name (for grouping messages), and I have shown only one of them (HTTPProblemOnUI). These problems happen not only with alerts of this group.
Thank you!
P.S. Sorry for my poor English :)
