Resolved notification sent only when all alerts are solved #1403
Comments
Can you please include the timeline of alerts firing and resolving, and what notifications you saw?
Service 1 has been down for a long time, so its notifications are in the repeat_interval regime: 17:45 -> Received notification of service 1 down. None of the notifications received contained resolved information.
No resolved information at all seems odd. Are you sure your alerts are hitting the route you think they are?
Sorry, I forgot to show a case I tested right before changing the version: 14:15 -> Alert firing for service 1. No notification of service 2's resolution was ever received! About your question, there is only one receiver connected to this route, and in any case all of them have the send_resolved flag.
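For reference, send_resolved is configured per receiver integration in the Alertmanager configuration file. A minimal sketch of such a receiver, with a hypothetical name and webhook URL:

```yaml
receivers:
  - name: 'team-webhook'                      # hypothetical receiver name
    webhook_configs:
      - url: 'http://example.internal/hook'   # hypothetical endpoint
        send_resolved: true                   # also notify when alerts in the group resolve
```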
We are very annoyed by this too. This was introduced by #1205.
Sounds like that broke the case when send_resolved was set, as the stated behaviour is correct when it isn't set.
@brian-brazil I think we should revert #1205 because we want to know when part of the alert group is resolved.
That'd be fixing one bug by introducing another, and generally you want to reduce alert noise rather than increase it. I suspect this will require a more involved fix.
The problem with #1205 is that you do not know the current state by looking at the notifications.
Why do you think that sending resolved alerts is a bug?
I personally believe that resolved notifications are of little to negative value (and seem to cause quite a lot of bugs), but that's not the question here. The behaviour as described in #1205 is correct when send_resolved is not set. The goal of notifications is to tell you about new things that have broken.
@brian-brazil #1025: If there is more than 5 minutes between the resolution of one alert and the resolution of the group, the resolved notification will not be sent for that sub-alert. That is the actual bug. The goal of #1205 is that, even with send_resolved, you only get the resolution of sub-alerts when there are: 1. new firing alerts, or 2. new resolved alerts. That is what was expected in #1205.
@roidelapluie to clarify, you're only reporting a bug from #1205 when send_resolved is set?
I am not the reporter of this bug, but yes, this bug only occurs when send_resolved is set.
I'm unclear about the expected behavior. When I submitted #1205, my understanding was that another notification should be sent only if the group contains new firing alerts, irrespective of send_resolved. Taking this sequence of events for example:
The expectation is still that the notification is sent at 17:47.
I'm not sure if it is the intended behaviour, but what I expect (and prior to #1205 it worked fine) is: 11:45 -> Received notification of services 1 and 3 down
@simonpasquier I would expect: 11:45 -> Received notification of services 1 and 3 down. Why? Because otherwise, when do we warn if an alert in a group is resolved and then fires again? 11:45 -> Received notification of services 1 and 3 down.
Pintify's #1403 (comment) describes the expected semantics (though the exact time of that 12:02 notification may vary; I'd expect it at either 12:00 or 12:05 currently).
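For context, when a grouped notification goes out is governed by the route's timing options, which is why the 12:00 vs. 12:05 distinction above depends on group_interval. A minimal sketch using the documented defaults (the receiver name and grouping label are hypothetical, not the reporter's values):

```yaml
route:
  receiver: 'team-webhook'   # hypothetical receiver name
  group_by: ['service']      # hypothetical grouping label
  group_wait: 30s            # wait before sending the first notification for a new group
  group_interval: 5m         # wait before notifying about changes (new firing or resolved alerts) in a group
  repeat_interval: 4h        # wait before re-sending a notification that has not changed
```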
OKAY, now I get it! I understand the why of #1205.
Just to be sure we're all on the same page, the semantics are:
11:45 -> Received notification of services 1 and 3 down
11:45 -> Received notification of services 1 and 3 down
yes
Working on the fix.
Hi, this is my version and config. The route has child route trees; one child route performs a regular expression match on alert labels to catch alerts that are related to a list of services, and the config also defines inhibit_rules and receivers.
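The full configuration was not captured above. A minimal sketch of what such a child route might look like, with hypothetical service names and receiver names:

```yaml
route:
  receiver: 'default'                              # hypothetical fallback receiver
  routes:
    # Child route: a regular expression match on the service label to catch
    # alerts related to a list of services.
    - match_re:
        service: '^(service1|service2|service3)$'  # hypothetical service list
      receiver: 'team-webhook'                     # hypothetical receiver
```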
What did you do?
Launched several alerts and resolved part of them.
What did you expect to see?
A resolved notification after part of the alerts were resolved.
What did you see instead? Under which circumstances?
No resolved notification at all, neither after the group_interval nor when the repeat_interval was reached. The resolution was only notified when all the alerts were resolved.
Environment
Running with the official Docker images.
System information:
Linux 3.10.0-693.5.2.el7.x86_64 x86_64
Alertmanager version:
alertmanager, version 0.14.0 (branch: HEAD, revision: 30af4d0)
build user: root@37b6a49ebba9
build date: 20180213-08:16:42
go version: go1.9.2
Prometheus version:
prometheus, version 1.6.1 (branch: master, revision: 4666df502c0e239ed4aa1d80abbbfb54f61b23c3)
build user: root@7e45fa0366a7
build date: 20170419-14:32:22
go version: go1.8.1
Alertmanager configuration file: