Resolved notification sent only when all alerts are solved #1403

Closed
pintify opened this issue Jun 5, 2018 · 23 comments

@pintify commented Jun 5, 2018

What did you do?

Launched several alerts and resolved some of them.

What did you expect to see?

A resolved notification after some of the alerts were resolved.

What did you see instead? Under which circumstances?

No resolved notification at all, neither after the group_interval nor when the repeat_interval was reached. A resolved notification was only sent once all the alerts had been resolved.

Environment

Running with the official Docker images.

  • System information:

    Linux 3.10.0-693.5.2.el7.x86_64 x86_64

  • Alertmanager version:

    alertmanager, version 0.14.0 (branch: HEAD, revision: 30af4d0)
    build user: root@37b6a49ebba9
    build date: 20180213-08:16:42
    go version: go1.9.2

  • Prometheus version:

    prometheus, version 1.6.1 (branch: master, revision: 4666df502c0e239ed4aa1d80abbbfb54f61b23c3)
    build user: root@7e45fa0366a7
    build date: 20170419-14:32:22
    go version: go1.8.1

  • Alertmanager configuration file:

...
  routes:
  - match:
      ...
    receiver: 'X-team'
    group_by: ['alertname']
    group_wait: 1m
    group_interval: 2m
    repeat_interval: 6h

receivers:
- name: 'X-team'
  webhook_configs:
  - url: 'my-url'
    send_resolved: true
@brian-brazil (Contributor)

Can you please include the timeline of the alerts firing and resolving, and which notifications you saw?

@pintify (Author) commented Jun 5, 2018

Service 1 has been down for a long time, so its notifications are in the repeat_interval regime:

17:45 -> Received notification of service 1 down
18:33 -> Service 2 went down
18:45 -> Received notification of service 1 and 2 down (the delay is due to the scrape interval and the rules, which is acceptable)
19:00 -> Service 2 restored
00:47 -> Received notification of service 1 down
...
9:59 -> Received notification of service 1 down
10:38 -> Service 2 went down
10:48 -> Received notification of service 1 and 2 down
10:55 -> Service 2 restored
11:34 -> Service 3 went down
11:45 -> Received notification of service 1 and 3 down
12:00 -> Service 3 restored
17:47 -> Received notification of service 1 down

None of the notifications received contained resolved information.

@brian-brazil (Contributor)

No resolved information at all seems odd. Are you sure your alerts are hitting the route you think they are?

@pintify (Author) commented Jun 5, 2018

Sorry, I forgot to show a case I tested right before changing the version:

14:15 -> Alert firing for service 1
14:17 -> Received notification of service 1 down
15:25 -> Alert firing for service 2
15:26 -> Received notification of service 1 and 2 down
15:32 -> Alert solved for service 2
15:52 -> Alert solved for service 1
15:53 -> Received notification of service 1 solved

No notification of the service 2 resolution was ever received!

Regarding your question: there is only one receiver connected to this route, and in any case all of them have the send_resolved flag set.

@roidelapluie (Member)

We are very annoyed by this too. This was introduced by #1205.

@brian-brazil (Contributor)

Sounds like that broke the case when send_resolved was set, as the stated behaviour is correct when it isn't set.

@roidelapluie (Member)

@brian-brazil I think we should revert #1205, because we want to know when part of the alert group is resolved.

@brian-brazil (Contributor)

That'd be fixing one bug by introducing another, and generally you want to reduce alert noise rather than increase it. I suspect this will require a more involved fix.

@roidelapluie (Member)

The problem with #1205 is that you cannot tell the current state by looking at the notifications.

@roidelapluie (Member)

Why do you think that sending resolved alerts is a bug?

@brian-brazil (Contributor)

I personally believe that resolved notifications are of little to negative value (and seem to cause quite a lot of bugs), but that's not the question here.

The behaviour as described in #1205 is correct when send_resolved is not set. The goal of notifications is to tell you about new things that have broken.
It sounds like this inadvertently broke send_resolved, which should notify on every group_interval where the set of firing alerts has changed.
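
To make that concrete, here is a minimal sketch in Go of the decision being discussed. All names (needsNotification, groupState) are hypothetical and this is not the Alertmanager implementation; it only illustrates "notify when the set of firing alerts has changed, and, with send_resolved, also when the set of resolved alerts has changed or repeat_interval has elapsed":

// Hypothetical sketch only, not the Alertmanager implementation.
package notifysketch

import "time"

// groupState records what was last notified for one alert group
// (all names here are made up for illustration).
type groupState struct {
    firing   map[string]struct{} // fingerprints of alerts sent as firing
    resolved map[string]struct{} // fingerprints of alerts sent as resolved
    sentAt   time.Time           // when the last notification went out
}

// needsNotification decides, at a group_interval tick, whether the group
// should be notified again.
func needsNotification(prev *groupState, firing, resolved map[string]struct{},
    sendResolved bool, repeatInterval time.Duration, now time.Time) bool {
    if prev == nil {
        return len(firing) > 0 // nothing has been sent for this group yet
    }
    // An alert that is firing now but was not in the last notification
    // triggers a new notification, regardless of send_resolved.
    for fp := range firing {
        if _, ok := prev.firing[fp]; !ok {
            return true
        }
    }
    if sendResolved {
        // With send_resolved, an alert that has resolved since the last
        // notification (e.g. service 2 recovering while service 1 keeps
        // firing) also triggers one.
        for fp := range resolved {
            if _, ok := prev.resolved[fp]; !ok {
                return true
            }
        }
    }
    // Otherwise only re-notify once repeat_interval has elapsed.
    return now.Sub(prev.sentAt) >= repeatInterval
}

In terms of this sketch, the behaviour reported above looks as if the resolved comparison is skipped, so a group that only lost alerts does not re-notify before repeat_interval.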

@roidelapluie (Member) commented Jun 5, 2018

@brian-brazil #1205: if there is more than 5 minutes between the resolution of one alert and the resolution of the group, the resolved notification will not be sent for that sub-alert. That is the actual bug.

The goal of #1205 is that, even with send_resolved, you only get the resolution of sub-alerts when there are: 1. new firing alerts, or 2. newly resolved alerts. That is what was expected in #1205.

@brian-brazil (Contributor) commented Jun 5, 2018

@roidelapluie to clarify, you're only reporting a bug from #1205 when send_resolved is set?

@roidelapluie (Member)

I am not the reporter of this bug, but yes, this bug only occurs when send_resolved is set.

@simonpasquier (Member)

I'm unclear about the expected behavior. When I submitted #1205, my understanding was that another notification should be sent only if the group contains new firing alerts, irrespective of send_resolved.

Taking this sequence of events as an example:

11:45 -> Received notification of service 1 and 3 down
12:00 -> Restored service 3
17:47 -> Received notification of service 1 down

The expectation is still that the notification is sent at 17:47 (on repeat interval) but it should contain the firing alert for service 1 + the resolved alert for service 3. Correct?

@pintify (Author) commented Jun 5, 2018

I'm not sure if it is the intended behaviour, but what I expect (and what worked fine prior to #1205) is:

11:45 -> Received notification of service 1 and 3 down
12:00 -> Restored service 3
12:02 -> Received notification of service 1 down, service 3 solved
18:02 -> Received notification of service 1 down

@roidelapluie (Member)

@simonpasquier I would expect

11:45 -> Received notification of service 1 and 3 down
12:00 -> Restored service 3
12:02 -> Received notification of service 1 down, service 3 solved
18:02 -> Received notification of service 1 down

Why?

Because otherwise, when would we notify if an alert in a group is resolved and then starts firing again?

11:45 -> Received notification of service 1 and 3 down
12:00 -> Restored service 3
12:02 -> Received notification of service 1 down, service 3 solved
14:02 -> Received notification of service 1 and 3 down
22:02 -> Received notification of service 1 and 3 solved

@brian-brazil (Contributor)

Pintify's #1403 (comment) describes the expected semantics (though the exact time of that 12:02 notification may vary; I'd expect it at either 12:00 or 12:05 currently).

@roidelapluie (Member)

Okay, now I get it! I understand the rationale behind #1205.

@simonpasquier (Member)

Just to be sure we're all on the same page, the semantics are:

  • send_resolved: true

11:45 -> Received notification of service 1 and 3 down
12:00 -> Restored service 3
~12:02 -> Received notification of service 1 down, service 3 solved
~18:02 -> Received notification of service 1 down

  • send_resolved: false

11:45 -> Received notification of service 1 and 3 down
12:00 -> Restored service 3
~17:45 -> Received notification of service 1 down
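
For completeness, a tiny self-contained Go illustration of those two cases. The notify helper and its inputs are hypothetical and simplified; "repeatElapsed" stands in for "repeat_interval has passed since the last notification":

// Hypothetical illustration of the summarized semantics; not Alertmanager code.
package main

import "fmt"

// notify mirrors the simplified rule: send when a new alert is firing, when
// (with send_resolved) an alert has newly resolved, or when repeat_interval
// has elapsed since the last notification.
func notify(newFiring, newResolved, sendResolved, repeatElapsed bool) bool {
    return newFiring || (sendResolved && newResolved) || repeatElapsed
}

func main() {
    // 11:45 - services 1 and 3 start firing: first notification either way.
    fmt.Println(notify(true, false, true, false))  // true  (send_resolved: true)
    fmt.Println(notify(true, false, false, false)) // true  (send_resolved: false)

    // ~12:02 - service 3 has resolved, service 1 still firing: only the
    // send_resolved receiver gets "service 1 down, service 3 solved".
    fmt.Println(notify(false, true, true, false))  // true
    fmt.Println(notify(false, true, false, false)) // false

    // ~17:45 / ~18:02 - repeat_interval reached: both receivers are
    // notified again that service 1 is still down.
    fmt.Println(notify(false, false, true, true))  // true
    fmt.Println(notify(false, false, false, true)) // true
}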

@roidelapluie (Member)

yes

@simonpasquier (Member)

Working on the fix.

@gfliker-emx

Hi,
Trying here because this matches my problem exactly.
My issue is not with the logic agreed above, which makes sense to me.
What I'm experiencing is that the desired logic agreed in this thread only works when going through an email receiver.
Thanks

This is my version and config:

Version: 0.19.0

route:
  group_by: ['alertname']
  receiver: team-devops-mails

  # The child route trees.
  routes:
  # This route performs a regular expression match on alert labels to
  # catch alerts that are related to a list of services.
  - match_re:
      severity: ^(critical|major)$
    continue: true
    receiver: errors-slack
  - match_re:
      severity: ^(critical|major)$
    receiver: team-devops-mails

inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  # Apply inhibition if the alertname is the same.
  equal: ['alertname']

receivers:
- name: 'team-devops-mails'
  email_configs:
  - to: '###@###.com'
    send_resolved: true
- name: errors-slack
  slack_configs:
  - api_url: 'https://hooks.slack.com/s###################'
    username: '#############'
    channel: '#alerts'
    send_resolved: true
    title: |-
      [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}
    text: >-
      Alert details:

      {{ range .Alerts -}}
      Alert: {{ .Annotations.title }}{{ if .Labels.severity }} - {{ .Labels.severity }}{{ end }}
      Description: {{ .Annotations.description }}

      Details:
      {{ range .Labels.SortedPairs }} • {{ .Name }}: {{ .Value }}
      {{ end }}
      {{ end }}
