Description
What did you do?
I had two alerts firing for a long time and configured an inhibition rule so that one of the alerts inhibits the other.
What did you expect to see?
Alertmanager does not send a notification for the inhibited alert when it is restarted or reloaded.
What did you see instead? Under which circumstances?
After the restart/reload, Alertmanager sent a notification for the alert that should have been inhibited. It sent the notification immediately after receiving the alert from Prometheus.
Root cause
We dug into the issue and found that this `if` is what causes it.
If the alert started firing longer ago than the `group_wait`, the `if` makes Alertmanager send the notification right away.
For inhibition to work, however, Alertmanager has to wait at least the `group_wait` after startup before sending any notifications, so that it can know whether any alert that should inhibit this one is firing.
Reasoning
We tried to find the reasoning behind it, but the code was written by @fabxc 7 years ago as part of a large change, with no obvious explanation for this particular check. The only purpose I can come up with is to avoid waiting the `group_wait` for an alert that might have started firing while Alertmanager was unavailable, so that it is sent as soon as possible.
Possible fixes
- Change the `alert.StartsAt` in the `if` to `alert.UpdatedAt`: This might help, but I am not sure it would fix the problem in 100% of cases
- Drop the `if` completely, since it makes the inhibitor "flaky" during reloads/restarts
- Make the inhibitor state persistent and gossiped (or somehow loaded from nflog), so that on startup Alertmanager would already know which alerts are firing for the purpose of inhibition
(We are currently running the fix where the `if` is dropped in production; everything is working fine and the inhibition flakiness is gone.)
I'd be happy to send a PR with a fix if we agree that this is a bug that should be fixed and on the best way to fix it.