Resolved notification sent only when all alerts are solved #1403

Closed
pintify opened this issue Jun 5, 2018 · 23 comments

@pintify commented Jun 5, 2018

What did you do?

Launched several alerts and resolved some of them.

What did you expect to see?

A resolved notification after some of the alerts were resolved.

What did you see instead? Under which circumstances?

No resolved notification at all, neither after the group_interval nor when the repeat_interval was reached. A resolved notification was only sent once all the alerts had been resolved.

Environment

Running with the official Docker images.

  • System information:

    Linux 3.10.0-693.5.2.el7.x86_64 x86_64

  • Alertmanager version:

    alertmanager, version 0.14.0 (branch: HEAD, revision: 30af4d0)
    build user: root@37b6a49ebba9
    build date: 20180213-08:16:42
    go version: go1.9.2

  • Prometheus version:

    prometheus, version 1.6.1 (branch: master, revision: 4666df502c0e239ed4aa1d80abbbfb54f61b23c3)
    build user: root@7e45fa0366a7
    build date: 20170419-14:32:22
    go version: go1.8.1

  • Alertmanager configuration file:

...
  routes:
  - match:
      ...
    receiver: 'X-team'
    group_by: ['alertname']
    group_wait: 1m
    group_interval: 2m
    repeat_interval: 6h

receivers:
- name: 'X-team'
  webhook_configs:
  - url: 'my-url'
    send_resolved: true
@brian-brazil (Contributor)

Can you please include the timeline of the alerts firing and resolving, and which notifications you saw?

@pintify (Author) commented Jun 5, 2018

Service 1 has been down for a long time, so its notifications are in the repeat_interval regime:

17:45 -> Received notification of service 1 down
18:33 -> Service 2 went down
18:45 -> Received notification of service 1 and 2 down (the delay is due to the scrape interval and the rules, which is acceptable)
19:00 -> Service 2 restored
00:47 -> Received notification of service 1 down
...
9:59 -> Received notification of service 1 down
10:38 -> Service 2 went down
10:48 -> Received notification of service 1 and 2 down
10:55 -> Service 2 restored
11:34 -> Service 3 went down
11:45 -> Received notification of service 1 and 3 down
12:00 -> Service 3 restored
17:47 -> Received notification of service 1 down

None of the notifications received contained resolved information.

@brian-brazil (Contributor)

No resolved information at all seems odd. Are you sure your alerts are hitting the route you think they are?

@pintify (Author) commented Jun 5, 2018

Sorry, I forgot to show a case I tested right before changing the version:

14:15 -> Alert firing for service 1
14:17 -> Received notification of service 1 down
15:25 -> Alert firing for service 2
15:26 -> Received notification of service 1 and 2 down
15:32 -> Alert solved for service 2
15:52 -> Alert solved for service 1
15:53 -> Received notification of service 1 solved

No notification of the service 2 resolution was ever received!

Regarding your question: there is only one receiver connected to this route, and in any case all of them have the send_resolved flag set.

@roidelapluie (Member)

We are very annoyed by this too. This was introduced by #1205.

@brian-brazil (Contributor)

Sounds like that broke the case when send_resolved was set, as the stated behaviour is correct when it isn't set.

@roidelapluie (Member)

@brian-brazil I think we should revert #1205, because we want to know when part of the alert group is resolved.

@brian-brazil (Contributor)

That'd be fixing one bug by introducing another, and generally you want to reduce alert noise rather than increase it. I suspect this will require a more involved fix.

@roidelapluie (Member)

The problem with #1205 is that you cannot tell the current state by looking at the notifications.

@roidelapluie (Member)

Why do you think that sending resolved alerts is a bug?

@brian-brazil (Contributor)

I personally believe that resolved notifications are of little to negative value (and seem to cause quite a lot of bugs), but that's not the question here.

The behaviour as described in #1205 is correct when send_resolved is not set. The goal of notifications is to tell you about new things that have broken.
It sounds like this inadvertently broke send_resolved, which should notify on every group_interval where the set of firing alerts has changed.
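
To make that concrete, here is a minimal sketch in Go of the decision being discussed. All names (needsNotification, groupState) are hypothetical and this is not the Alertmanager implementation; it only illustrates "notify when the set of firing alerts has changed, and, with send_resolved, also when the set of resolved alerts has changed or repeat_interval has elapsed":

// Hypothetical sketch only, not the Alertmanager implementation.
package notifysketch

import "time"

// groupState records what was last notified for one alert group
// (all names here are made up for illustration).
type groupState struct {
    firing   map[string]struct{} // fingerprints of alerts sent as firing
    resolved map[string]struct{} // fingerprints of alerts sent as resolved
    sentAt   time.Time           // when the last notification went out
}

// needsNotification decides, at a group_interval tick, whether the group
// should be notified again.
func needsNotification(prev *groupState, firing, resolved map[string]struct{},
    sendResolved bool, repeatInterval time.Duration, now time.Time) bool {
    if prev == nil {
        return len(firing) > 0 // nothing has been sent for this group yet
    }
    // An alert that is firing now but was not in the last notification
    // triggers a new notification, regardless of send_resolved.
    for fp := range firing {
        if _, ok := prev.firing[fp]; !ok {
            return true
        }
    }
    if sendResolved {
        // With send_resolved, an alert that has resolved since the last
        // notification (e.g. service 2 recovering while service 1 keeps
        // firing) also triggers one.
        for fp := range resolved {
            if _, ok := prev.resolved[fp]; !ok {
                return true
            }
        }
    }
    // Otherwise only re-notify once repeat_interval has elapsed.
    return now.Sub(prev.sentAt) >= repeatInterval
}

In terms of this sketch, the behaviour reported above looks as if the resolved comparison is skipped, so a group that only lost alerts does not re-notify before repeat_interval.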

@roidelapluie (Member) commented Jun 5, 2018

@brian-brazil #1205: if there is more than 5 minutes between the resolution of one alert and the resolution of the group, the resolved notification will not be sent for that sub-alert. That is the actual bug.

The goal of #1205 is that, even with send_resolved, you only get the resolution of sub-alerts when there are: 1. new firing alerts, or 2. newly resolved alerts. That is what was expected in #1205.

@brian-brazil (Contributor) commented Jun 5, 2018

@roidelapluie to clarify, you're only reporting a bug from #1205 when send_resolved is set?

@roidelapluie (Member)

I am not the reporter of this bug, but yes, this bug only occurs when send_resolved is set.

@simonpasquier (Member)

I'm unclear about the expected behavior. When I submitted #1205, my understanding was that another notification should be sent only if the group contains new firing alerts, irrespective of send_resolved.

Taking this sequence of events as an example:

11:45 -> Received notification of service 1 and 3 down
12:00 -> Restored service 3
17:47 -> Received notification of service 1 down

The expectation is still that the notification is sent at 17:47 (on repeat interval) but it should contain the firing alert for service 1 + the resolved alert for service 3. Correct?

@pintify (Author) commented Jun 5, 2018

I'm not sure if it is the intended behaviour, but what I expect (and what worked fine prior to #1205) is:

11:45 -> Received notification of service 1 and 3 down
12:00 -> Restored service 3
12:02 -> Received notification of service 1 down, service 3 solved
18:02 -> Received notification of service 1 down

@roidelapluie (Member)

@simonpasquier I would expect

11:45 -> Received notification of service 1 and 3 down
12:00 -> Restored service 3
12:02 -> Received notification of service 1 down, service 3 solved
18:02 -> Received notification of service 1 down

Why?

Because otherwise, when would we notify if an alert in a group is resolved and then starts firing again?

11:45 -> Received notification of service 1 and 3 down
12:00 -> Restored service 3
12:02 -> Received notification of service 1 down, service 3 solved
14:02 -> Received notification of service 1 and 3 down
22:02 -> Received notification of service 1 and 3 solved

@brian-brazil (Contributor)

Pintify's #1403 (comment) describes the expected semantics (though the exact time of that 12:02 notification may vary; I'd expect it at either 12:00 or 12:05 currently).

@roidelapluie (Member)

Okay, now I get it! I understand the rationale behind #1205.

@simonpasquier (Member)

Just to be sure we're all on the same page, the semantics are:

  • send_resolved: true

11:45 -> Received notification of service 1 and 3 down
12:00 -> Restored service 3
~12:02 -> Received notification of service 1 down, service 3 solved
~18:02 -> Received notification of service 1 down

  • send_resolved: false

11:45 -> Received notification of service 1 and 3 down
12:00 -> Restored service 3
~17:45 -> Received notification of service 1 down
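
For completeness, a tiny self-contained Go illustration of those two cases. The notify helper and its inputs are hypothetical and simplified; "repeatElapsed" stands in for "repeat_interval has passed since the last notification":

// Hypothetical illustration of the summarized semantics; not Alertmanager code.
package main

import "fmt"

// notify mirrors the simplified rule: send when a new alert is firing, when
// (with send_resolved) an alert has newly resolved, or when repeat_interval
// has elapsed since the last notification.
func notify(newFiring, newResolved, sendResolved, repeatElapsed bool) bool {
    return newFiring || (sendResolved && newResolved) || repeatElapsed
}

func main() {
    // 11:45 - services 1 and 3 start firing: first notification either way.
    fmt.Println(notify(true, false, true, false))  // true  (send_resolved: true)
    fmt.Println(notify(true, false, false, false)) // true  (send_resolved: false)

    // ~12:02 - service 3 has resolved, service 1 still firing: only the
    // send_resolved receiver gets "service 1 down, service 3 solved".
    fmt.Println(notify(false, true, true, false))  // true
    fmt.Println(notify(false, true, false, false)) // false

    // ~17:45 / ~18:02 - repeat_interval reached: both receivers are
    // notified again that service 1 is still down.
    fmt.Println(notify(false, false, true, true))  // true
    fmt.Println(notify(false, false, false, true)) // true
}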

@roidelapluie (Member)

yes

@simonpasquier (Member)

Working on the fix.

@gfliker-emx

Hi,
Trying here because this matches my problem exactly.
My issue is not with the logic agreed above, which makes sense to me.
What I'm experiencing is that the desired logic agreed in this thread only works when going through an email receiver.
Thanks

This is my version and config:

Version: 0.19.0

route:
  group_by: ['alertname']
  receiver: team-devops-mails

  # The child route trees.
  routes:
  # This route performs a regular expression match on alert labels to
  # catch alerts that are related to a list of services.
  - match_re:
      severity: ^(critical|major)$
    continue: true
    receiver: errors-slack
  - match_re:
      severity: ^(critical|major)$
    receiver: team-devops-mails

inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  # Apply inhibition if the alertname is the same.
  equal: ['alertname']

receivers:
- name: 'team-devops-mails'
  email_configs:
  - to: '###@###.com'
    send_resolved: true
- name: errors-slack
  slack_configs:
  - api_url: 'https://hooks.slack.com/s###################'
    username: '#############'
    channel: '#alerts'
    send_resolved: true
    title: |-
      [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}
    text: >-
      Alert details:

      {{ range .Alerts -}}
      Alert: {{ .Annotations.title }}{{ if .Labels.severity }} - {{ .Labels.severity }}{{ end }}
      Description: {{ .Annotations.description }}

      Details:
      {{ range .Labels.SortedPairs }} • {{ .Name }}: {{ .Value }}
      {{ end }}
      {{ end }}
