Conversation
|
The goal of the Alertmanager is to reduce noise for users, not to continuously spam them with every individual alert. If you want this, it's likely you want to integrate with Prometheus directly rather than with the Alertmanager. |
|
Well, use cases are different, and sometimes such functionality is needed |
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
3b8a63b to
fe5d1f9
Compare
|
also it may help in situations like #1587 |
|
Actually @brian-brazil this is a useful feature. At first I also thought that group_by: [] would do this, but it does the opposite. I ended up making a long list of all possible labels for grouping. |
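For illustration, a minimal sketch of that workaround, assuming a hypothetical receiver name and an arbitrary label list: every label that might distinguish two alerts has to be enumerated by hand, and anything missed is still collapsed into one group.
route:
  receiver: team-pager
  # listing every potentially distinguishing label by hand; any label
  # missing from this list still causes alerts to be merged together
  group_by: ['alertname', 'instance', 'job', 'severity', 'device', 'mountpoint']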
|
Yeah, Alertmanager is a useful bridge between Prometheus and a bunch of other things. "Just talk to Prometheus directly" is throwing away the usefulness of the glue that is Alertmanager. Sure, this may not fit certain ideas of best practices, but not all environments are the same and not everyone has the same requirements with alert grouping. This seems like a simple enough feature to add, why not do so if it solves some people's problems? |
|
From my POV, one big reason for not adding it is that it makes the Alertmanager configuration even more difficult to get right and understand. Also, I'm not convinced that it helps for #1587. As for not grouping some alerts, I would rather add a special label to my alerting rules that is based on all the labels necessary to uniquely identify an alert instance:
- alert: MyAlert
  expr: my_metric == 0
  for: 1m
  labels:
    unique_key: "{{ $labels.foo }}/{{ $labels.bar }}/and/many/more"
And then use this label in group_by. |
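A minimal sketch of how such a unique_key label could then be referenced on the Alertmanager side; the receiver name is hypothetical and this is only one possible placement in the routing tree.
route:
  receiver: team-pager
  # grouping on a label that is unique per alert instance means each
  # alert effectively ends up in its own group and notifies on its own
  group_by: ['unique_key']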
|
It helps for #1587 because it guarantees that all alerts page individually, by never grouping anything. This is a plausible config for smaller shops, where even if everything is down the page storm is manageable. Honestly, I don't see how mucking up all alert rules with an extra custom label is better than this one-shot config... at that point you might as well add all possible unique labels to group_by, it's less work than messing with each alert rule individually. |
|
This is absolutely required where you want to feed individual alerts through to PD / OpsGenie etc. It's sort of useless at the moment when they are grouped together, from a reporting perspective and for actually knowing what to respond to. But I would still want to retain grouping for informal Slack notifications etc. |
|
That example is a different situation, and it's not clear what support is actually lacking from an OpsGenie / PagerDuty perspective when it's the Alertmanager firing the alerts. |
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
4b5a6df to
ace028a
Compare
|
Okay, this is not a limitation of anything in particular; it is more a limitation of HA in the Alertmanager. Here is my use case: we have a webhook receiver that can only handle one alert at a time. That's it. We tried to implement this in the webhook receiver (treating each alert individually)... but then it is too slow and the second Alertmanager sends the alerts too! Also, when alert 5 fails, we need to redo the first 4 alerts; we cannot fail atomically. This can be avoided by this pull request, and it is more a way to deal with Alertmanager HA. |
|
That sounds like an issue with the receiver; even with this you'd still need to serialise the incoming notifications. I'd suggest a queue. |
|
We use queues for alerts too, but in another use case which is out of scope here. The idea here is: I expect group_by: [] to *not* group alerts. Instead, it creates one massive group. We have a receiver that, for example, posts to Twitter. We are limited in length, so we must send the alerts one by one. If we do that in the receiver, it is possible that out of 10 alerts, 1 fails, but then we need to reply 500 to the Alertmanager so that the complete group is re-sent. Or it will take a lot of time and the second Alertmanager will also fire the alerts. For those two reasons this feature makes sense. |
|
The Alertmanager is fundamentally an at-least-once delivery system. I don't think we should be adding (contentious) features to support webhooks that aren't prepared to deal with that. |
From my POV, it does the right thing: grouping on an empty set of label names means that you want to put all alerts in the same bucket. If you really want it, you could generate a label that identifies every alert uniquely:
- alert: SomeAlert
  expr: my_metric == 0
  for: 1m
  labels:
    key: "{{ range $k, $v := $labels }}{{ $k }}={{ $v }}{{end}}" |
|
Duplicating configuration in potentially thousands of alerts doesn't seem like a good way to go, especially compared with a simple one-line configuration, which is helpful in many use cases. |
|
I still fail to see how a simple toggle to disable existing behavior is so controversial, and why everyone keeps proposing horrible workarounds that replace a single line in the Alertmanager config with changes to every single alerting rule in Prometheus. I get how this may not be something fitting certain ideas of how things should work or how external services should behave, but we've covered several use cases already where this simple feature would be useful to people. Clearly there is value here.

Why the aversion to 1:1 alert routing? Alertmanager isn't just a grouper, it does more things, like silencing and converting alerting protocols. Why tie those things to mandatory grouping? This isn't some horrible subversion of what Alertmanager does or how it should behave in a stack. It's basically just trivial syntactic sugar for a behavior that people can already achieve in an ugly, non-ergonomic way. |
We generally do not add features that can already be handled via configuration management.
The purpose of the Alertmanager is to increase the signal-to-noise ratio of notifications; grouping and throttling are innate parts of that. Allowing this would be a subversion of the goals of the Alertmanager, as it is one of several features that users request for the purpose of spamming themselves, due to being used to less powerful systems (without the notion of labels), rather than following https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/preview. |
|
I don't see how asking all of our customers to add a certain label to every single alert config (which looks horrible), or alternatively having to pre-process their configs, is a better idea than a simple configuration option. |
But not only these. There are also http_config, pagerduty_config and webhook_config, which means that those are also responsibilities of the Alertmanager. |
The goal of software should not be to enforce a particular philosophy on its users. The goal of software should be to empower users to choose the configuration that suits them best. It's fine to nudge users towards not hurting themselves with documentation, but it's not productive to actively alienate those users with legitimate use cases for such features. While there is obviously no requirement that software implement every possible feature that every possible user might use, refusing to implement a simple "turn off this behavior I don't need or want" toggle on ideological grounds is, in my opinion, just poor form. You're basically saying "I know better than everyone else, so you can either do things my way or write your own code".

You know full well I was also an SRE at Google, and I agree with most of Rob's essay. But not everyone is Google. I've seen SRE teams at Google in monitoring hell with 10 pages per day. I've seen teams at Google that never got paged. There are many different scenarios that are very different from Google scale.

In the extreme, small single-homed systems are often assumed to be able to stay up for years on end without paging. There is no point in optimizing alerting/monitoring rules for cases like that, because you don't even have the data to know what you need to monitor! E.g. symptom-based alerting is fine and dandy when you have a large distributed system with many frontends where you can afford to lose some with zero impact, but for a little single-homed service, damn right I care about finding out that it's running out of RAM before it actually goes down. There is no one-size-fits-all monitoring philosophy. |
|
The Pushgateway is an official component. This is a recognition that Prometheus sometimes needs to make exceptions to its principles. I would see this as an exception needed for some well-defined use cases. |
|
One more thing: one of our receivers actually deals with grouping on its side. And I think that, because we cannot add notifiers to the core of the Alertmanager, as a trade-off the core project could accept this kind of feature. It should be the Alertmanager's role to be flexible towards advanced users too. |
|
Unfortunately, Do you agree with |
|
Too bad… However, |
|
@beorn7 raises a good point. |
|
Good point, +1 on |
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
92cb59c to
603898b
Compare
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
beorn7
left a comment
I'll leave the detailed review to @stuartnelson3 (or whoever is invested in this).
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
|
Suggestion for the text:
|
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
|
Sorry for the long delay, I haven't had the time to look at this. I'll try to get to it early next week. |
stuartnelson3
left a comment
The code looks fine. It does remind me though about issues with the UI and its assumed grouping, which are described here: #868
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
113b8c3 to
cf867cb
Compare
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
mxinden
left a comment
Thanks a bunch for bearing with us @kirillsablin!
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
mxinden
left a comment
Any further comments by others? Otherwise I will merge in the next couple of days.
This PR adds support for the group_by_all parameter.
If it is set to true, all of an alert's labels will be used for grouping.
If each alert has a unique set of labels, this setting will effectively disable aggregation of different alerts.
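Based on that description, a route using the new parameter might look roughly like the sketch below; the receiver name is hypothetical and the exact field placement is an assumption rather than final documentation.
route:
  receiver: one-alert-per-notification
  # with group_by_all enabled, every label of an alert is used as the
  # grouping key, so alerts with distinct label sets are never merged
  group_by_all: true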