
Notifications for both firing and resolved items doubts #591

Closed
ddewaele opened this issue Jan 3, 2017 · 11 comments · Fixed by #595

Comments

@ddewaele

ddewaele commented Jan 3, 2017

Hi,

When an alert group is active, the Alertmanager sends out notifications periodically (per the group_wait / group_interval values) whenever new alerts start firing for that group, but also when alerts in that group get resolved.

AFAIK the fact that it also sends separate notifications when alerts get resolved in an active group is not related to the global send_resolved flag, but I was wondering if there is a way to disable these notifications. (We're currently receiving lots of emails because of it.)

I've written down the following scenario and listed some questions where I don't understand the behaviour. I've tried to make it as clear and simple as possible, using only 1 target, 1 metric (2 label values), 1 alert definition, and the following configuration:

ALERT metric1_down IF metric1 == 0 FOR 1m
scrape_interval: 5s
group_wait: 30s
group_interval: 5m
repeat_interval: 24h

Every block below indicates a change in the alerting (new alert fired / alert getting resolved). I've included the timestamps of the actions that have happened and listed down the areas where I'm still missing something. I started with a clean prometheus / alertmanager with no previous alerts.

I've written down my assumptions and my questions.

metric1{v=a} == 0 FOR 1min

  • 06:24:30 : metric1_down{v=a} pending
  • 06:25:38 : metric1_down{v=a} firing (due to the alert specifying FOR 1min)
  • 06:26:12 : receive notification : firing(metric1_down{v=a})

Assumption : the notification is sent at 06:25:38 + the group_wait of 30s (≈ 06:26:08)

metric1{v=b} == 0 FOR 1min

  • 06:27:30 : metric1_down{v=b} pending
  • 06:28:40 : metric1_down{v=b} firing
  • 06:31:12 : receive notification : firing(metric1_down{v=a} and metric1_down{v=b})

Assumption : Alertmanager now waits group_interval: 5m after the previous email was sent. Does not use group_wait

metric1{v=a} == 1

  • 06:31:50 : metric1_down{v=a} alert disappears
  • 06:36:12 : receive notification : firing(metric1_down{v=b}) and resolved(metric1_down{v=a})

Assumption: it again waits group_interval: 5m after the previous email was sent
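
To make the timing concrete, here is a minimal sketch of the arithmetic I'm assuming (just the values from the configuration above: first flush after group_wait, then a flush every group_interval while the group keeps changing):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Values taken from the configuration above.
	groupWait := 30 * time.Second
	groupInterval := 5 * time.Minute

	// metric1_down{v=a} started firing at 06:25:38.
	firstFiring := time.Date(2017, 1, 3, 6, 25, 38, 0, time.UTC)

	firstFlush := firstFiring.Add(groupWait)     // ~06:26:08 (observed notification: 06:26:12)
	secondFlush := firstFlush.Add(groupInterval) // ~06:31:08 (observed notification: 06:31:12)
	thirdFlush := secondFlush.Add(groupInterval) // ~06:36:08 (observed notification: 06:36:12)

	fmt.Println(firstFlush.Format("15:04:05"), secondFlush.Format("15:04:05"), thirdFlush.Format("15:04:05"))
}

If that model is right it explains when the first three notifications were sent; it's the resolved entry in the third one that I don't understand (see Q1 below).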

Q1 : Is there a way to suppress this email? I already know metric1_down{v=b} is firing; I don't want a notification that metric1_down{v=a} is resolved.

metric1{v=b} == 1

  • 06:38:50 : metric1_down{v=b} alert disappears
  • 06:41:12 : no slack / email saying everything resolved

Assumption : everything was resolved, no alerts are active anymore, and send_resolved=false, so no email is sent?

metric1{v=a} == 0 FOR 1min

  • 06:43:00 : metric1_down{v=a} pending
  • 06:44:10 : metric1_down{v=a} firing
  • 06:46:25 : receive notification : firing(metric1_down{v=a}) , resolved(metric1_down{v=b})

Q2 : A bit surprised to see resolved(metric1_down{v=b}) in this notification. At 06:41 everything was resolved and the system was OK. Also unclear why it got sent at 06:46:25. Does it take group_interval or group_wait into account here?

metric1{v=a} == 1

  • 06:48:30 : metric1_down{v=a} alert disappears

Assumption : Alertmanager considers the system healthy now, since all issues have been resolved?

metric1{v=b} == 0 FOR 1min

  • 06:56:30 : metric1_down{v=b} pending
  • 06:56:37 : metric1_down{v=b} firing
  • 07:01:30 : receive notification : firing(metric1_down{v=b}), resolved(metric1_down{v=a})
  • 07:11:30 : receive notification : firing(metric1_down{v=b})

Q3: 07:01:30 - Was not expecting the message to contain resolved(metric1_down{v=a}). Same as Q2.

Q4 : 07:11:30 - Not sure why this email is sent; it adds little value since I already know metric1_down{v=b} is firing. Why is it sent 10 minutes after the previous one?

If you've made it this far thanks a lot ! :) Would appreciate some feedback to get a better understanding of it.

@brian-brazil
Contributor

Q1 : Is there a way to suppress this email? I already know metric1_down{v=b} is firing; I don't want a notification that metric1_down{v=a} is resolved.

With send_resolved disabled, there shouldn't have been mention of the resolved alert in any emails.

Does it take into account group_interval or group_wait here ?

It's within 5m, so group_interval still applies.
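
Working that out against the timestamps above (a sketch, assuming the group's flush ticks simply continue every group_interval after the last notification at 06:36:12):

package main

import (
	"fmt"
	"time"
)

func main() {
	lastNotification := time.Date(2017, 1, 3, 6, 36, 12, 0, time.UTC)
	groupInterval := 5 * time.Minute

	// 06:41:12 - everything is resolved at this point, so nothing is sent.
	tick1 := lastNotification.Add(groupInterval)
	// 06:46:12 - metric1_down{v=a} is firing again, so a notification goes out
	// (observed at 06:46:25, presumably delivery latency).
	tick2 := tick1.Add(groupInterval)

	fmt.Println(tick1.Format("15:04:05"), tick2.Format("15:04:05"))
}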

Is your alertmanager or prometheus under heavy load?

@ddewaele
Author

ddewaele commented Jan 3, 2017

No heavy load ... we're using the email and Slack notifiers, and from the docs / code send_resolved should be false.

With the script below and a group_interval of 2m, I'm getting the following mails / Slack messages:

  • 10:23 - firing a1/a2 (30 seconds after a1 was raised, as expected)
  • 10:25 - firing a1/a2/a3/b1/b2/b3 (2mins after first email, as expected due to group_interval)
  • 10:29 - firing b3 / resolved b1,b2,a1,a2,a3 (no idea why this came 4 minutes after the previous mail)
  • 10:32 - firing b3 (no idea why this is sent)

I thought send_resolved was all about sending "resolved-only" messages. In my case the emails always include at least one firing alert (albeit with some duplicates).

curl -XPOST -d'[{"labels":{"alertname":"a1"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
curl -XPOST -d'[{"labels":{"alertname":"a2"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
curl -XPOST -d'[{"labels":{"alertname":"a3"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
curl -XPOST -d'[{"labels":{"alertname":"b1"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
curl -XPOST -d'[{"labels":{"alertname":"b2"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
curl -XPOST -d'[{"labels":{"alertname":"b3"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; echo "done... now looping b3 for keepAlive"
while true; do  curl -XPOST -d'[{"labels":{"alertname":"b3"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date  ; sleep 15; done

@brian-brazil
Contributor

Can you share your full alertmanager config for that?

@ddewaele
Author

ddewaele commented Jan 3, 2017

Sure...

global:
  smtp_smarthost: 'xxx'
  smtp_from: 'xxx'
  smtp_auth_username: 'xxx'
  smtp_auth_password: 'xxx'
  slack_api_url: 'xxxx'
# The root route on which each incoming alert enters.
route:
  # The root route must not have any matchers as it is the entry point for
  # all alerts. It needs to have a receiver configured so alerts that do not
  # match any of the sub-routes are sent to someone.
  receiver: 'team-xxx-mails'

  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['group']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  # (This 2m value was used with the shell script; the earlier scenario was run with group_interval: 5m.)
  group_interval: 2m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 24h

  # All the above attributes are inherited by all child routes and can be
  # overwritten on each.

  # The child route trees.
  routes:
  - match:
      job: Controllers
    receiver: team-xxx-mails

receivers:
- name: 'team-xxx-mails'
  email_configs:
    - to: 'xxx'
      html: '{{ template "xxx.email.default.html" . }}'
      headers:
        subject: '{{ template "xxx.subject" . }}'

    #require_tls: false
  slack_configs:
  - channel: '#notify-xxx-dev'

templates:
 - '/etc/alertmanager/template-vars.tmpl'
 - '/etc/alertmanager/templates/xxx.tmpl'
 - '/etc/alertmanager/templates/yyy.tmpl'

@ddewaele
Author

ddewaele commented Jan 3, 2017

@brian-brazil : if you need me to do some further testing just let me know...

Here's some debug logging output. From this output I got 4 mails where I expected only 2 mails.

The last 2 mails are redundant, as they both say that alert b3 is active (I knew that from the second mail).

time="2017-01-03T15:25:37Z" level=debug msg="Received alert" alert=a1[f00e214][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:25:52Z" level=debug msg="Received alert" alert=a2[f004214][active] component=dispatcher source="dispatch.go:168" 
** MAIL ** time="2017-01-03T15:26:07Z" level=debug msg="flushing [a1[f00e214][active] a2[f004214][active]]" aggrGroup=cbf29ce484222325 source="dispatch.go:405" 
time="2017-01-03T15:26:07Z" level=debug msg="Received alert" alert=a3[f008214][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:26:22Z" level=debug msg="Received alert" alert=b1[e75cf84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:26:37Z" level=debug msg="Received alert" alert=b2[e75af84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:26:52Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:26:52Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:27:07Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:27:22Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:27:37Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:27:52Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
** MAIL ** time="2017-01-03T15:28:07Z" level=debug msg="flushing [a2[f004214][resolved] a3[f008214][active] b1[e75cf84][active] b2[e75af84][active] b3[e756f84][active] a1[f00e214][resolved]]" aggrGroup=cbf29ce484222325 source="dispatch.go:405" 
time="2017-01-03T15:28:07Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:28:22Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:28:37Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:28:52Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:29:07Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:29:22Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:29:37Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:29:52Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
** MAIL ** time="2017-01-03T15:30:07Z" level=debug msg="flushing [b1[e75cf84][resolved] b2[e75af84][resolved] b3[e756f84][active]]" aggrGroup=cbf29ce484222325 source="dispatch.go:405" 
time="2017-01-03T15:30:07Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:30:22Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:30:37Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:30:52Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:31:07Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:31:22Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:31:37Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:31:52Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
** MAIL ** time="2017-01-03T15:32:07Z" level=debug msg="flushing [b3[e756f84][active]]" aggrGroup=cbf29ce484222325 source="dispatch.go:405" 
time="2017-01-03T15:32:07Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 
time="2017-01-03T15:32:22Z" level=debug msg="Received alert" alert=b3[e756f84][active] component=dispatcher source="dispatch.go:168" 

Our operators are only interested in the first 2 emails (as they then know that they need to investigate a1, a2, a3, b1, b2, b3). The third and fourth emails are noise to them.

So according to you, with send_resolved: false the alertmanager should only be sending out the first 2 mails?

@brian-brazil
Contributor

Is this only happening with email, and are you sure the config is what the alertmanager has loaded?

@ddewaele
Author

ddewaele commented Jan 3, 2017

  • Using Slack, email and webhooks. For every email there is also a Slack message and a webhook call, so I don't think it's related to the receivers.
  • I've verified that the correct config is loaded via the alertmanager status page.
  • Tried it with both 0.5.0 and 0.5.1 (using Docker).

Can you reproduce it using the curl commands above ?

@ddewaele
Author

ddewaele commented Jan 4, 2017

@brian-brazil : Could this be the problem ?

https://github.com/prometheus/alertmanager/blob/master/notify/impl.go#L82

// Notify implements the Notifier interface.
func (i *Integration) Notify(ctx context.Context, alerts ...*types.Alert) (bool, error) {
	var res []*types.Alert

	// Resolved alerts have to be filtered only at this point, because they need
	// to end up unfiltered in the SetNotifiesStage.
	if i.conf.SendResolved() {
		res = alerts
	} else {
		for _, a := range alerts {
			if a.Status() != model.AlertResolved {
				res = append(res, a)
			}
		}
	}
	if len(res) == 0 {
		return false, nil
	}

	return i.notifier.Notify(ctx, alerts...)
}

First time I'm looking at Go code, but shouldn't it be passing along the filtered alerts (res) instead of the original alerts (alerts) to the notify function?
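
For what it's worth, a sketch of the change being suggested (only the final statement of the function above differs; this is just the idea, not an actual patch):

	if len(res) == 0 {
		return false, nil
	}

	// Hand the filtered slice to the underlying notifier rather than the
	// unfiltered one, so resolved alerts are actually dropped when
	// send_resolved is false.
	return i.notifier.Notify(ctx, res...)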

Will do some testing with it tomorrow now that I know how to compile and run Go code :)

@brian-brazil
Contributor

Great, that looks to be it. Didn't notice that when I was perusing the code.

@ddewaele
Author

ddewaele commented Jan 4, 2017 via email

@brancz
Member

brancz commented Jan 4, 2017

Thanks for the detailed report and discussion, I'll prepare the fix and some tests around this shortly!
