Description
Hello!
I have not been able to solve one problem since I started using Prometheus.
I use blackbox_exporter to monitor the availability of a domain over HTTP, and Alertmanager to send notifications to Slack, Telegram and SMS.
Sometimes, when the domain is unavailable over HTTP, Prometheus correctly sends notifications that the server is down, but at some point a message with the status "resolved" arrives while the server is still unavailable, and some time later a message with the status "firing" arrives again.
The configuration is given below:
Prometheus (part of config):
global:
  scrape_interval: 10s
  scrape_timeout: 8s
  evaluation_interval: 1m
rule_files:
  - /etc/prometheus/production.rules
scrape_configs:
  - job_name: 'myserver.com_n10.myserver.com_ui'
    scrape_interval: 30s
    metrics_path: /probe
    params:
      module: [ myserver.com_n10.myserver.com_ui ]  # Look for a HTTP 200 response.
    static_configs:
      - targets:
          - myserver.com
        labels:
          server: 'myserver.com'
          network: 'n10.myserver.com'
          domain: 'n10.myserver.com'
          type: 'http_ui'
    relabel_configs:
      - source_labels: [__address__]
        regex: (.*)(:80)?
        target_label: __param_target
        replacement: http://${1}/login
      - source_labels: [__param_target]
        regex: (.*)
        target_label: instance
        replacement: ${1}
      - source_labels: []
        regex: .*
        target_label: __address__
        replacement: 127.0.0.1:9115  # Blackbox exporter.
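For reference, with this relabeling each scrape of this job should end up roughly equivalent to the following request against the blackbox exporter (just an illustrative sketch; the module and target values are taken from the config above):
~]# curl 'http://127.0.0.1:9115/probe?module=myserver.com_n10.myserver.com_ui&target=http://myserver.com/login'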
blackbox.yml
myserver.com_n10.myserver.com_ui:
  prober: http
  timeout: 10s
  http:
    valid_status_codes: [ 200, 302, 301 ]  # Defaults to 2xx
    method: GET
    headers:
      Host: someheader.com
    no_follow_redirects: true
    fail_if_ssl: false
    tls_config:
      insecure_skip_verify: false
    protocol: "tcp"
    preferred_ip_protocol: "ip4"
alertmanager.yml
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'smtp9.myserver.com:25'
  smtp_from: 'alertmanager@myserver.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  resolve_timeout: 1h
route:
  group_by: ['alertname', 'instance', 'type']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5m
  receiver: slack_general
  routes:
    - match:
        severity: sms
      receiver: sms_admins
      continue: true
    - match:
        severity: warning
      receiver: slack_general
      repeat_interval: 1h
      continue: true
    - match:
        severity: warning
      receiver: admins
      repeat_interval: 1h
    - match_re:
        severity: critical|sms
      receiver: slack_general
      continue: true
    - match_re:
        severity: critical|sms
      receiver: admins
receivers:
  - name: slack_general
    slack_configs:
      - api_url: https://hooks.slack.com/services/SLACK_URI
        channel: '#notifications'
        send_resolved: true
        title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] '
        text: '{{ range $val := .Alerts }}
          Alert: <http://myserver.com:9095/#/alerts|{{ index $val.Annotations "description" }}>
          {{ end }}'
  - name: admins
    webhook_configs:
      - send_resolved: True
        url: http://127.0.0.1:9088/alert/telegram_id1,telegram_id2
templates:
  - '/etc/prometheus/alertmanager/templates/default.tmpl'
# test for rule inhibition, used for free space trigger for now
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance', 'mountpoint']
alert.rules
ALERT HTTPProblemOnUI
  IF probe_success{job="myserver.com_n10.myserver.com_ui"} != 1
  FOR 5m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "HTTP check of myserver.com on server myserver.com",
    description = "HTTP check of myserver.com (UI domain on myserver.com) on server s43.myserver.com",
  }
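As a side note, the state transitions of this alert can also be inspected in Prometheus itself via the built-in ALERTS series, with a query like the following (a sketch; the label selector is only an assumption based on the rule above):
ALERTS{alertname="HTTPProblemOnUI", alertstate="firing"}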
Part of the Alertmanager journal (alert HTTPProblemOnUI[4f4a71c][active]):
Aug 17 20:14:38 myserver.com alertmanager[13966]: time="2017-08-17T20:14:38Z" level=debug msg="Received alert" alert=HTTPProblemOnUI[4f4a71c][active] component=dispatcher source="dispatch.go:184"
Aug 17 20:15:08 myserver.com alertmanager[13966]: time="2017-08-17T20:15:08Z" level=debug msg="flushing [HTTPProblemOnUI[ff7e520][active] HTTPProblemOnUI[31bfcf0][active] HTTPProblemOnUI[7eb9f05][active] HTTPProblemOnUI[abcb7d9][active] HTTPProblemOnUI[e3c0fa7][active] HTTPProblemOnUI[c7c16dc][active] HTTPProblemOnUI[4215379][active] HTTPProblemOnUI[c158c7c][active] HTTPProblemOnUI[f5691c0][active] HTTPProblemOnUI[4f4a71c][active] HTTPProblemOnUI[ee6848c][active] HTTPProblemOnUI[8c65933][active] HTTPProblemOnUI[82dcc81][active] HTTPProblemOnUI[7da68da][active]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="HTTPProblemOnUI"} source="dispatch.go:426"
Aug 17 20:15:08 myserver.com alertmanager[13966]: time="2017-08-17T20:15:08Z" level=debug msg="flushing [HTTPProblemOnUI[ff7e520][active] HTTPProblemOnUI[c7c16dc][active] HTTPProblemOnUI[f5691c0][active] HTTPProblemOnUI[8c65933][active] HTTPProblemOnUI[4215379][active] HTTPProblemOnUI[7eb9f05][active] HTTPProblemOnUI[abcb7d9][active] HTTPProblemOnUI[e3c0fa7][active] HTTPProblemOnUI[c158c7c][active] HTTPProblemOnUI[82dcc81][active] HTTPProblemOnUI[4f4a71c][active] HTTPProblemOnUI[ee6848c][active] HTTPProblemOnUI[31bfcf0][active] HTTPProblemOnUI[7da68da][active]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="HTTPProblemOnUI"} source="dispatch.go:426"
Aug 17 20:15:38 myserver.com alertmanager[13966]: time="2017-08-17T20:15:38Z" level=debug msg="Received alert" alert=HTTPProblemOnUI[4f4a71c][active] component=dispatcher source="dispatch.go:184"
Aug 17 20:16:38 myserver.com alertmanager[13966]: time="2017-08-17T20:16:38Z" level=debug msg="Received alert" alert=HTTPProblemOnUI[4f4a71c][active] component=dispatcher source="dispatch.go:184"
Here is where it went into the "resolved" state (even though the query shows that the check was still returning probe_success 0 at this time; I'll show that below):
...
Aug 17 20:33:41 myserver.com alertmanager[13966]: time="2017-08-17T20:33:41Z" level=debug msg="Received alert" alert=HTTPProblemOnUI[4f4a71c][active] component=dispatcher source="dispatch.go:184"
Aug 17 20:34:38 myserver.com alertmanager[13966]: time="2017-08-17T20:34:38Z" level=debug msg="Received alert" alert=HTTPProblemOnUI[4f4a71c][resolved] component=dispatcher source="dispatch.go:184"
...
Then it became "active" again:
Aug 17 20:48:38 myserver.com alertmanager[13966]: time="2017-08-17T20:48:38Z" level=debug msg="Received alert" alert=HTTPProblemOnUI[4f4a71c][active] component=dispatcher source="dispatch.go:184"
Here is the data from Prometheus:
http://joxi.net/Dr81e9eikPQDLm
http://joxi.net/1A51eJeiK1g1GA
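The screenshots essentially correspond to graphing a query like the following over the incident window (a sketch; the exact expression used for the screenshots is my assumption):
probe_success{job="myserver.com_n10.myserver.com_ui"}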
Everything began at:
~]# date -d @1503000569.746
Thu Aug 17 20:09:29 UTC 2017
and ended at:
~]# date -d @1503002909.746
Thu Aug 17 20:48:29 UTC 2017
(an interval of roughly 39 minutes)
Here is the data that Prometheus sent to Alertmanager during this interval:
[{"labels":{"alertname":"HTTPProblemOnUI","domain":"n10.myserver.com","instance":"http://myserver.com/login","job":"myserver.com_n10.myserver.com_ui","network":"n10.myserver.com","server":"myserver.com","severity":"critical","type":"http_ui"},"annotations":{"description":"HTTP check of myserver.com on server myserver.com","summary":"HTTP check of myserver.com (UI domain on myserver.com) on server s43.myserver.com"},"startsAt":"2017-08-17T20:14:38.52Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://myserver.com:9090/graph"}]
Then, for some reason, an alert arrived with "endsAt" set, and accordingly the "resolved" notification was sent:
[{"labels":{"alertname":"HTTPProblemOnUI","domain":"n10.myserver.com","instance":"http://myserver.com/login","job":"myserver.com_n10.myserver.com_ui","network":"n10.myserver.com","server":"myserver.com","severity":"critical","type":"http_ui"},"annotations":{"description":"HTTP check of myserver.com on server myserver.com","summary":"HTTP check of myserver.com (UI domain on myserver.com) on server s43.myserver.com"},"startsAt":"2017-08-17T20:14:38.52Z","endsAt":"2017-08-17T20:34:38.52Z","generatorURL":"http://myserver.com:9090/graph"}]
And then a notification with the status "firing" came again:
[{"labels":{"alertname":"HTTPProblemOnUI","domain":"n10.myserver.com","instance":"http://myserver.com/login","job":"myserver.com_n10.myserver.com_ui","network":"n10.myserver.com","server":"myserver.com","severity":"critical","type":"http_ui"},"annotations":{"description":"HTTP check of myserver.com on server myserver.com","summary":"HTTP check of myserver.com (UI domain on myserver.com) on server s43.myserver.com"},"startsAt":"2017-08-17T20:47:38.52Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://myserver.com:9090/graph"}]
If any additional information is necessary, I am ready to provide it.
One nuance: we use many alerts with an identical name (for grouping of notifications); I have shown only one alert type (HTTPProblemOnUI). These problems occur not only with alerts from this group.
Thank you!
P.S. sorry for my poor English :)