
Notifications for both firing and resolved items doubts #591

Closed
ddewaele opened this issue Jan 3, 2017 · 11 comments · Fixed by #595

Comments

@ddewaele

ddewaele commented Jan 3, 2017

Hi,

When an alert group is active, the Alertmanager sends out notifications periodically (per the group_wait / group_interval values) whenever new alerts start firing for that group, but also when alerts in that group get resolved.

AFAIK the fact that it also sends separate notifications when alerts get resolved in an active group is not related to the global send_resolved flag, but I was wondering if there is a way to disable these notifications. (We're currently receiving lots of emails because of it.)

I've written down the following scenario and listed some questions where I don't understand the behaviour. I've tried to make it as clear and simple as possible, using only 1 target, 1 metric (2 label values), 1 alert definition, and the following configuration:

ALERT metric1_down IF metric1 == 0 FOR 1m
scrape_interval: 5s
group_wait: 30s
group_interval: 5m
repeat_interval: 24h

Every block below indicates a change in the alerting (new alert fired / alert getting resolved). I've included the timestamps of the actions that have happened and listed down the areas where I'm still missing something. I started with a clean prometheus / alertmanager with no previous alerts.

I've written down my assumptions and my questions.

metric1{v=a} == 0 FOR 1min

  • 06:24:30 : metric1_down{v=a} pending
  • 06:25:38 : metric1_down{v=a} firing (due to the alert specifying FOR 1min)
  • 06:26:12 : receive notification : firing(metric1_down{v=a})

Assumption : the notification is sent at 06:25:38 + the group_wait of 30s (≈ 06:26:08)

metric1{v=b} == 0 FOR 1min

  • 06:27:30 : metric1_down{v=b} pending
  • 06:28:40 : metric1_down{v=b} firing
  • 06:31:12 : receive notification : firing(metric1_down{v=a} and metric1_down{v=b})

Assumption : Alertmanager now waits group_interval: 5m after the previous email was sent. Does not use group_wait

metric1{v=a} == 1

  • 06:31:50 : metric1_down{v=a} alert disappears
  • 06:36:12 : receive notification : firing(metric1_down{v=b}) and resolved(metric1_down{v=a})

Assumption: it again waits group_interval: 5m after the previous email was sent
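
To make the timing concrete, here is a minimal sketch of the arithmetic I'm assuming (just the values from the configuration above: first flush after group_wait, then a flush every group_interval while the group keeps changing):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Values taken from the configuration above.
	groupWait := 30 * time.Second
	groupInterval := 5 * time.Minute

	// metric1_down{v=a} started firing at 06:25:38.
	firstFiring := time.Date(2017, 1, 3, 6, 25, 38, 0, time.UTC)

	firstFlush := firstFiring.Add(groupWait)     // ~06:26:08 (observed notification: 06:26:12)
	secondFlush := firstFlush.Add(groupInterval) // ~06:31:08 (observed notification: 06:31:12)
	thirdFlush := secondFlush.Add(groupInterval) // ~06:36:08 (observed notification: 06:36:12)

	fmt.Println(firstFlush.Format("15:04:05"), secondFlush.Format("15:04:05"), thirdFlush.Format("15:04:05"))
}

If that model is right it explains when the first three notifications were sent; it's the resolved entry in the third one that I don't understand (see Q1 below).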

Q1 : Is there a way to suppress this email? I already know metric1_down{v=b} is firing; I don't want a notification that metric1_down{v=a} is resolved.

metric1{v=b} == 1

  • 06:38:50 : metric1_down{v=b} alert disappears
  • 06:41:12 : no slack / email saying everything resolved

Assumption : everything was resolved, no alerts are active anymore, and send_resolved=false, so no email is sent?

metric1{v=a} == 0 FOR 1min

  • 06:43:00 : metric1_down{v=a} pending
  • 06:44:10 : metric1_down{v=a} firing
  • 06:46:25 : receive notification : firing(metric1_down{v=a}) , resolved(metric1_down{v=b})

Q2 : A bit surprised to see resolved(metric1_down{v=b}) in this notification. At 06:41 everything was resolved and the system was OK. Also unclear why it got sent at 06:46:25. Does it take group_interval or group_wait into account here?

metric1{v=a} == 1

  • 06:48:30 : metric1_down{v=a} alert disappears

Assumption : Alertmanager considers the system healthy now, since all issues have been resolved?

metric1{v=b} == 0 FOR 1min

  • 06:56:30 : metric1_down{v=b} pending
  • 06:56:37 : metric1_down{v=b} firing
  • 07:01:30 : receive notification : firing(metric1_down{v=b}), resolved(metric1_down{v=a})
  • 07:11:30 : receive notification : firing(metric1_down{v=b})

Q3: 07:01:30 - Was not expecting the message to contain resolved(metric1_down{v=a}). Same as Q2.

Q4 : 07:11:30 - Not sure why this email is sent; it adds little value since I already know metric1_down{v=b} is firing. Why is it sent 10 minutes after the previous one?

If you've made it this far thanks a lot ! :) Would appreciate some feedback to get a better understanding of it.

@brian-brazil
Contributor

Q1 : Is there a way to suppress this email? I already know metric1_down{v=b} is firing; I don't want a notification that metric1_down{v=a} is resolved.

With send_resolved disabled, there shouldn't have been mention of the resolved alert in any emails.

Does it take into account group_interval or group_wait here ?

It's within 5m, so group_interval still applies.
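
Working that out against the timestamps above (a sketch, assuming the group's flush ticks simply continue every group_interval after the last notification at 06:36:12):

package main

import (
	"fmt"
	"time"
)

func main() {
	lastNotification := time.Date(2017, 1, 3, 6, 36, 12, 0, time.UTC)
	groupInterval := 5 * time.Minute

	// 06:41:12 - everything is resolved at this point, so nothing is sent.
	tick1 := lastNotification.Add(groupInterval)
	// 06:46:12 - metric1_down{v=a} is firing again, so a notification goes out
	// (observed at 06:46:25, presumably delivery latency).
	tick2 := tick1.Add(groupInterval)

	fmt.Println(tick1.Format("15:04:05"), tick2.Format("15:04:05"))
}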

Is your alertmanager or prometheus under heavy load?

@ddewaele
Author

ddewaele commented Jan 3, 2017

No heavy load ... we're using the email and Slack notifiers, and from the docs / code send_resolved should be false.

With the script below and a group_interval of 2m, I'm getting the following mails / Slack messages:

  • 10:23 - firing a1/a2 (30 seconds after a1 was raised, as expected)
  • 10:25 - firing a1/a2/a3/b1/b2/b3 (2mins after first email, as expected due to group_interval)
  • 10:29 - firing b3 / resolved b1,b2,a1,a2,a3 (no idea why this came 4 minutes after the previous mail)
  • 10:32 - firing b3 (no idea why this is sent)

I thought send_resolved was all about sending "resolved-only" messages. In my case the emails always include at least one firing alert (albeit with some duplicates).

curl -XPOST -d'[{"labels":{"alertname":"a1"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
curl -XPOST -d'[{"labels":{"alertname":"a2"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
curl -XPOST -d'[{"labels":{"alertname":"a3"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
curl -XPOST -d'[{"labels":{"alertname":"b1"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
curl -XPOST -d'[{"labels":{"alertname":"b2"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
curl -XPOST -d'[{"labels":{"alertname":"b3"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; echo "done... now looping b3 for keepAlive"
while true; do  curl -XPOST -d'[{"labels":{"alertname":"b3"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date  ; sleep 15; done

@brian-brazil
Contributor

Can you share your full alertmanager config for that?

@ddewaele
Author

ddewaele commented Jan 3, 2017

Sure...

global:
  smtp_smarthost: 'xxx'
  smtp_from: 'xxx'
  smtp_auth_username: 'xxx'
  smtp_auth_password: 'xxx'
  slack_api_url: 'xxxx'
# The root route on which each incoming alert enters.
route:
  # The root route must not have any matchers as it is the entry point for
  # all alerts. It needs to have a receiver configured so alerts that do not
  # match any of the sub-routes are sent to someone.
  receiver: 'team-xxx-mails'

  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['group']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  # (This 2m value was used with the shell script; the earlier scenario was run with group_interval: 5m.)
  group_interval: 2m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 24h

  # All the above attributes are inherited by all child routes and can be
  # overwritten on each.

  # The child route trees.
  routes:
  - match:
      job: Controllers
    receiver: team-xxx-mails

receivers:
- name: 'team-xxx-mails'
  email_configs:
    - to: 'xxx'
      html: '{{ template "xxx.email.default.html" . }}'
      headers:
        subject: '{{ template "xxx.subject" . }}'

    #require_tls: false
  slack_configs:
  - channel: '#notify-xxx-dev'

templates:
 - '/etc/alertmanager/template-vars.tmpl'
 - '/etc/alertmanager/templates/xxx.tmpl'
 - '/etc/alertmanager/templates/yyy.tmpl'

@ddewaele
Author

ddewaele commented Jan 3, 2017

@brian-brazil : if you need me to do some further testing just let me know...

Here's some debug logging output. From this output I got 4 mails where I expected only 2 mails.

The last 2 mails are redundant, as they both say that alert b3 is active (I knew that from the second mail).

time="2017-01-03T15:25:37Z" level=debug msg="Received alert" alert=a1[f00e214][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:25:52Z" level=debug msg="Received alert" alert=a2[f004214][active] component=dispatcher source="dispatch.go:168" 
** MAIL ** time="2017-01-03T15:26:07Z" level=debug msg="flushing [a1[f00e214][active] a2[f004214][active]]" aggrGroup=cbf29ce484222325 source="dispatch.go:405" 
time="2017-01-03T15:26:07Z" level=debug msg="Received alert" alert=a3[f008214][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:26:22Z" level=debug msg="Received alert" alert=b1[e75cf84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:26:37Z" level=debug msg="Received alert" alert=b2[e75af84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:26:52Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:26:52Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:27:07Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:27:22Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:27:37Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:27:52Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
** MAIL ** time="2017-01-03T15:28:07Z" level=debug msg="flushing [a2[f004214][resolved] a3[f008214][active] b1[e75cf84][active] b2[e75af84][active] b3[e756f84][active] a1[f00e214][resolved]]" aggrGroup=cbf29ce484222325 source="dispatch.go:405" 
time="2017-01-03T15:28:07Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:28:22Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:28:37Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:28:52Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:29:07Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:29:22Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:29:37Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:29:52Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
** MAIL ** time="2017-01-03T15:30:07Z" level=debug msg="flushing [b1[e75cf84][resolved] b2[e75af84][resolved] b3[e756f84][active]]" aggrGroup=cbf29ce484222325 source="dispatch.go:405" 
time="2017-01-03T15:30:07Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:30:22Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:30:37Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:30:52Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:31:07Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:31:22Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:31:37Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:31:52Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
** MAIL ** time="2017-01-03T15:32:07Z" level=debug msg="flushing [b3[e756f84][active]]" aggrGroup=cbf29ce484222325 source="dispatch.go:405" 
time="2017-01-03T15:32:07Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:32:22Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 

Our operators are only interested in the first 2 emails (as they then know that they need to investigate a1, a2, a3, b1, b2, b3). The third and fourth emails are noise to them.

So according to you, with send_resolved: false the alertmanager should only be sending out the first 2 mails?

@brian-brazil
Contributor

Is this only happening with email, and are you sure the config is what the alertmanager has loaded?

@ddewaele
Author

ddewaele commented Jan 3, 2017

  • Using Slack, email and webhooks. For every email there is also a Slack message and a webhook call, so I don't think it's related to the receivers.
  • I've verified that the correct config is loaded via the alertmanager status page.
  • Tried it with both 0.5.0 and 0.5.1 (using Docker).

Can you reproduce it using the curl commands above ?

@ddewaele
Author

ddewaele commented Jan 4, 2017

@brian-brazil : Could this be the problem ?

https://github.com/prometheus/alertmanager/blob/master/notify/impl.go#L82

// Notify implements the Notifier interface.
func (i *Integration) Notify(ctx context.Context, alerts ...*types.Alert) (bool, error) {
	var res []*types.Alert

	// Resolved alerts have to be filtered only at this point, because they need
	// to end up unfiltered in the SetNotifiesStage.
	if i.conf.SendResolved() {
		res = alerts
	} else {
		for _, a := range alerts {
			if a.Status() != model.AlertResolved {
				res = append(res, a)
			}
		}
	}
	if len(res) == 0 {
		return false, nil
	}

	return i.notifier.Notify(ctx, alerts...)
}

First time I'm looking at Go code, but shouldn't it be passing along the filtered alerts (res) instead of the original alerts (alerts) to the notify function?
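
For what it's worth, a sketch of the change being suggested (only the final statement of the function above differs; this is just the idea, not an actual patch):

	if len(res) == 0 {
		return false, nil
	}

	// Hand the filtered slice to the underlying notifier rather than the
	// unfiltered one, so resolved alerts are actually dropped when
	// send_resolved is false.
	return i.notifier.Notify(ctx, res...)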

Will do some testing with it tomorrow now that I know how to compile and run Go code :)

@brian-brazil
Contributor

Great, that looks to be it. Didn't notice that when I was perusing the code.

@ddewaele
Author

ddewaele commented Jan 4, 2017 via email

@brancz
Member

brancz commented Jan 4, 2017

Thanks for the detailed report and discussion, I'll prepare the fix and some tests around this shortly!
