
add group_by_all support #1588

Merged
mxinden merged 14 commits into prometheus:master from kirillsablin:group-by-all-param
Nov 29, 2018

Conversation

@kirillsablin
Contributor

This PR adds support for a group_by_all parameter.

If it is true, all of an alert's labels will be used for grouping.
If each alert has a unique set of labels, this setting effectively disables aggregation of different alerts.
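
A minimal sketch of how the proposed parameter might look, assuming it would sit alongside the other grouping options in a route (the receiver name is hypothetical; later in this thread the special '...' value is adopted instead):

  route:
    receiver: team-pager    # hypothetical receiver name
    group_by_all: true      # group on all labels, i.e. one group per distinct label set
    group_wait: 30s
    group_interval: 5m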

@kirillsablin
Contributor Author

@stuartnelson3

@brian-brazil
Contributor

The goal of the Alertmanager is to reduce noise to users, not to continuously spam them with every individual alert. If you want this, it's likely you want to integrate with Prometheus directly rather than the Alertmanager.

@kirillsablin
Contributor Author

Well, use cases are different, and sometimes such functionality is needed.

Kyryl Sablin added 3 commits October 17, 2018 14:18
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
@kirillsablin
Contributor Author

kirillsablin commented Oct 17, 2018

also it may help in situations like #1587

@roidelapluie
Member

roidelapluie commented Oct 17, 2018

Actually @brian-brazil this is a useful feature.

At the beginning I was also thinking that group_by: [] would do this. But it does the opposite. I ended up making a long list of all possible labels for grouping.

@marcan

marcan commented Oct 18, 2018

Yeah, Alertmanager is a useful bridge between Prometheus and a bunch of other things. "Just talk to Prometheus directly" is throwing away the usefulness of the glue that is Alertmanager. Sure, this may not fit certain ideas of best practices, but not all environments are the same and not everyone has the same requirements with alert grouping. This seems like a simple enough feature to add, why not do so if it solves some people's problems?

@simonpasquier
Member

From my POV, one big reason for not adding it is that it makes the AlertManager configuration even more difficult to get right and understand. Also I'm not convinced that it helps for #1587.

As for not grouping some alerts, I would rather add a special label to my alerting rules, based on all the labels necessary to uniquely identify an alert instance:

  - alert: MyAlert
    expr: my_metric == 0
    for: 1m
    labels:
      unique_key: "{{ $labels.foo }}/${{ $labels.bar }}/and/many/more"

And then using group_by: [alertname, job, instance, unique_key] would be enough to have all my alerts ungrouped, without AlertManager having to know about all possible label names.
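
A minimal sketch of the corresponding route under that workaround (the receiver name is hypothetical):

  route:
    receiver: my-receiver    # hypothetical
    group_by: [alertname, job, instance, unique_key]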

@marcan

marcan commented Oct 18, 2018

It helps for #1587 because it guarantees that all alerts page individually, by never grouping anything. This is a plausible config for smaller shops, where even if everything is down the page storm is manageable.

Honestly, I don't see how mucking up all alert rules with an extra custom label is better than this one-shot config... at that point you might as well add all possible unique labels to group_by, it's less work than messing with each alert rule individually.

@pete-leese

This is absolutely required where you want to feed individual alerts through to PD / OpsGenie etc. It's sort of useless at the moment when alerts are grouped together, both from a reporting perspective and for actually knowing what to respond to.

But I would still want to retain grouping for informal Slack notifications etc.

@simonpasquier
Member

As @beorn7 already commented, features shouldn't be added to work around limitations of PD or OpsGenie. I can find at least one past example where a feature had been requested for similar reasons and eventually OpsGenie fixed the limitation on their end. You should try to get those people involved here.

@pete-leese

That example is a different situation, and it's not clear what the lack of support actually is from an OpsGenie / PagerDuty perspective when it's Alertmanager firing the alerts.

Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
@roidelapluie
Member

Okay, this is not a limitation of the external services; it is more a limitation of HA in Alertmanager.

Here is my usecase:

We have a webhook receiver that can only handle one alert at a time. That's it.

We tried to implement this in the webhook receiver (treating each alert individually)... but then it is too slow and the second Alertmanager sends the alerts too! Also, when alert 5 fails, we need to redo the first 4 alerts; we cannot fail atomically.

This can be avoided by this pull request. And it is more a way to deal with Alertmanager HA.

@brian-brazil
Contributor

That sounds like an issue with the receiver; even with this you'd still need to serialise the incoming notifications. I'd suggest a queue.

@roidelapluie
Member

We use queues for alerts too but in another use case which is out of scope here.

The idea here is:

I expect group_by: [] to *not* group alerts. Instead, it creates one massive group.

We have a receiver that, for example, posts to Twitter. We are limited in length, so we must send the alerts one by one.

If we do that in the receiver, it is possible that out of 10 alerts, 1 fails, and we need to reply 500 to the Alertmanager, so the complete group gets re-sent. Or it will take a lot of time and the second Alertmanager will also fire the alerts.

For those two reasons this feature makes sense.

@brian-brazil
Contributor

The Alertmanager is fundamentally an at-least-once delivery system. I don't think we should be adding (contentious) features to support webhooks that aren't prepared to deal with that.

@simonpasquier
Member

I expect group_by: [] to *not* group alerts. Instead, it creates one massive group.

From my POV, it does the right thing: grouping on an empty set of label names means that you want to put all alerts in the same bucket.

If you really want it, you could generate a label that identifies every alert uniquely:

  - alert: SomeAlert
    expr: my_metric == 0
    for: 1m
    labels:
      key: "{{ range $k, $v := $labels }}{{ $k }}={{ $v }}{{end}}"

@kirillsablin
Contributor Author

Duplicating configuration in potentially thousands of alerts doesn't seem like a good way to go, especially compared with a simple one-line configuration, which is helpful in many use cases.

@marcan

marcan commented Oct 23, 2018

I still fail to see how a simple toggle to disable existing behavior is so controversial, and why everyone keeps proposing horrible workarounds that replace a single line in the Alertmanager config with changes to every single alerting rule in Prometheus.

I get how this may not be something fitting certain ideas of how things should work or how external services should behave, but we've covered several use cases already where this simple feature would be useful to people. Clearly there is value here. Why the aversion to 1:1 alert routing? Alertmanager isn't just a grouper, it does more things, like silencing and converting alerting protocols. Why tie those things to mandatory grouping? This isn't some horrible subversion of what Alertmanager does or how it should behave in a stack. It's basically just trivial syntactic sugar for a behavior that people can already achieve in an ugly, non-ergonomic way.

@brian-brazil
Contributor

that replace a single line in the Alertmanager config with changes to every single alerting rule in Prometheus.

We generally do not add features that can already be handled via configuration management.

Alertmanager isn't just a grouper, it does more things, like silencing and converting alerting protocols. Why tie those things to mandatory grouping? This isn't some horrible subversion of what Alertmanager does or how it should behave in a stack.

The purpose of the Alertmanager is to increase the signal to noise ratio of notifications, grouping and throttling are innate parts of that. Allowing this would be a subversion of the goals of the alertmanager, as it is one of several features that users request for the purposes of spamming themselves due to being used to less powerful systems (without the notion of labels), rather than following https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/preview.

@michellevalentino

I don't see how asking all of our customers to add a certain label to every single alert config (which looks horrible), or alternatively having to pre-process their configs, is a better idea than a simple configuration option.

@kirillsablin
Contributor Author

kirillsablin commented Oct 23, 2018

The purpose of the Alertmanager is to increase the signal to noise ratio of notifications, grouping and throttling are innate parts of that.

But not only these. There are also http_config, pagerduty_config, webhooks_config, which means that those are also responsibilities of Alertmanager.

@marcan

marcan commented Oct 23, 2018

The purpose of the Alertmanager is to increase the signal to noise ratio of notifications, grouping and throttling are innate parts of that. Allowing this would be a subversion of the goals of the alertmanager, as it is one of several features that users request for the purposes of spamming themselves due to being used to less powerful systems (without the notion of labels), rather than following My Philosophy on Alerting

The goal of software should not be to enforce a particular philosophy on its users. The goal of software should be to empower users to choose the configuration that suits them best. It's fine to nudge users towards not hurting themselves with documentation, but it's not productive to actively alienate those users with legitimate use cases for such features. While there is obviously no requirement that software implement every possible feature that every possible user might use, refusing to implement a simple "turn off this behavior I don't need or want" toggle on ideological grounds is, in my opinion, just poor form. You're basically saying "I know better than everyone else, so you can either do things my way or write your own code".

You know full well I was also an SRE at Google, and I agree with most of Rob's essay. But not everyone is Google. I've seen SRE teams at Google in monitoring hell with 10 pages per day. I've seen teams at Google that never got paged. There are many different scenarios that are very different from Google scale. In the extreme, small single-homed systems are often assumed to be able to stay up for years on end without paging. There is no point in optimizing alerting/monitoring rules for cases like that, because you don't even have the data to know what you need to monitor! E.g. symptom-based alerting is fine and dandy when you have a large distributed system with many frontends where you can afford to lose some with zero impact, but for a little single-homed service, damn right I care about finding out that it's running out of RAM before it actually goes down. There is no one-size-fits-all monitoring philosophy.

@roidelapluie
Member

The pushgateway is an official exporter. This is a recognition that Prometheus sometimes needs to make exceptions to its principles. I would see this as an exception needed for some well-defined use cases.

@roidelapluie
Member

One more thing: one of our receivers actually deals with grouping on its side. And I think that, because we cannot add notifiers to the core of Alertmanager, as a trade-off the core project could accept this kind of feature. It should be Alertmanager's role to be flexible towards advanced users too.

@kirillsablin
Contributor Author

Unfortunately, according to
http://yaml.org/spec/current.html#id2535812
and
http://yaml.org/spec/current.html#alias/syntax,
an alias starts with *, so it has to be quoted.

Do you agree with group_by: ['*']?
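
For illustration, a sketch of the quoting issue (the unquoted form is shown only as a comment, since YAML rejects it):

  # group_by: [*]     # invalid YAML: an unquoted * starts an alias
  group_by: ['*']     # parses as a list containing the literal string "*"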

@beorn7
Member

beorn7 commented Nov 2, 2018

Too bad…

However, group_by: [...] would be valid YAML. Should we go with the latter?

@beorn7
Member

beorn7 commented Nov 2, 2018

group_by: [...] would also avoid any wrong assumption that we support globbing. (Seeing group_by: ['*'], people might try group_by: ['some_prefix_*'] to group by all labels starting with some_prefix_.)

@stuartnelson3
Contributor

@beorn7 raises a good point. [...] sounds good, too.

@discordianfish
Member

Good point, +1 on [...]

Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
Kyryl Sablin added 2 commits November 2, 2018 15:40
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
Member

@beorn7 beorn7 left a comment


I'll leave the detailed review to @stuartnelson3 (or whoever is invested in this).

Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
@marcan

marcan commented Nov 2, 2018

Suggestion for the text:

To aggregate by all possible labels use '...' as the sole label name. This effectively disables aggregation entirely, passing through all alerts as-is. This is unlikely to be what you want, unless you have a very low alert volume or your upstream notification system performs its own grouping.
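
A minimal route sketch of what that would look like in practice (the receiver name is hypothetical):

  route:
    receiver: team-pager    # hypothetical
    group_by: ['...']       # every distinct label set becomes its own notification group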

Kyryl Sablin added 2 commits November 2, 2018 16:37
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
@stuartnelson3
Contributor

Sorry for the long delay, I haven't had the time to look at this. I'll try to get to it early next week.

Contributor

@stuartnelson3 stuartnelson3 left a comment


The code looks fine. It does remind me though about issues with the UI and its assumed grouping, which are described here: #868

Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
Kyryl Sablin added 2 commits November 20, 2018 10:03
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
Member

@mxinden mxinden left a comment


Thanks a bunch for bearing with us @kirillsablin!

Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
Member

@mxinden mxinden left a comment


Any further comments by others? Otherwise I will merge in the next couple of days.
