Conversation
|
The goal of the Alertmanager is to reduce noise for users, not to continuously spam them with every individual alert. If you want this, it's likely you want to integrate with Prometheus directly rather than with the Alertmanager. |
|
Well, use cases are different, and sometimes such functionality is needed |
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
3b8a63b to
fe5d1f9
Compare
|
also it may help in situations like #1587 |
|
Actually @brian-brazil this is a useful feature. At first I also thought that group_by: [] would do this, but it does the opposite. I ended up making a long list of all possible labels for grouping. |
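For illustration, a minimal sketch of that workaround, assuming a hypothetical receiver name and an arbitrary label list: every label that might distinguish two alerts has to be enumerated by hand, and anything missed is still collapsed into one group.
route:
  receiver: team-pager
  # listing every potentially distinguishing label by hand; any label
  # missing from this list still causes alerts to be merged together
  group_by: ['alertname', 'instance', 'job', 'severity', 'device', 'mountpoint']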
|
Yeah, Alertmanager is a useful bridge between Prometheus and a bunch of other things. "Just talk to Prometheus directly" is throwing away the usefulness of the glue that is Alertmanager. Sure, this may not fit certain ideas of best practices, but not all environments are the same and not everyone has the same requirements with alert grouping. This seems like a simple enough feature to add, why not do so if it solves some people's problems? |
|
From my POV, one big reason for not adding it is that it makes the Alertmanager configuration even more difficult to get right and understand. Also, I'm not convinced that it helps for #1587. As for not grouping some alerts, I would rather add a special label to my alerting rules that is based on all the labels necessary to uniquely identify an alert instance:
- alert: MyAlert
  expr: my_metric == 0
  for: 1m
  labels:
    unique_key: "{{ $labels.foo }}/{{ $labels.bar }}/and/many/more"
And then use this label in group_by. |
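A minimal sketch of how such a unique_key label could then be referenced on the Alertmanager side; the receiver name is hypothetical and this is only one possible placement in the routing tree.
route:
  receiver: team-pager
  # grouping on a label that is unique per alert instance means each
  # alert effectively ends up in its own group and notifies on its own
  group_by: ['unique_key']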
|
It helps for #1587 because it guarantees that all alerts page individually, by never grouping anything. This is a plausible config for smaller shops, where even if everything is down the page storm is manageable. Honestly, I don't see how mucking up all alert rules with an extra custom label is better than this one-shot config... at that point you might as well add all possible unique labels to group_by, it's less work than messing with each alert rule individually. |
|
This is absolutely required where you want to feed individual alerts through to PD / OpsGenie etc. It's sort of useless at the moment when they are grouped together, from a reporting perspective and for actually knowing what to respond to. But I would still want to retain grouping for informal Slack notifications etc. |
|
That example is a different situation, and it's not clear what support is actually lacking from an OpsGenie / PagerDuty perspective when it's the Alertmanager firing the alerts. |
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
4b5a6df to
ace028a
Compare
|
Okay, this is not a limitation of anything in particular; it is more a limitation of HA in the Alertmanager. Here is my use case: we have a webhook receiver that can only handle one alert at a time. That's it. We tried to implement this in the webhook receiver (treating each alert individually)... but then it is too slow and the second Alertmanager sends the alerts too! Also, when alert 5 fails, we need to redo the first 4 alerts; we cannot fail atomically. This can be avoided by this pull request, and it is more a way to deal with Alertmanager HA. |
|
That sounds like an issue with the receiver; even with this you'd still need to serialise the incoming notifications. I'd suggest a queue. |
|
We use queues for alerts too, but in another use case which is out of scope here. The idea here is: I expect group_by: [] to *not* group alerts. Instead, it creates one massive group. We have a receiver that, for example, posts to Twitter. We are limited in length, so we must send the alerts one by one. If we do that in the receiver, it is possible that out of 10 alerts, 1 fails, but then we need to reply 500 to the Alertmanager so that the complete group is re-sent. Or it will take a lot of time and the second Alertmanager will also fire the alerts. For those two reasons this feature makes sense. |
|
The Alertmanager is fundamentally an at-least-once delivery system. I don't think we should be adding (contentious) features to support webhooks that aren't prepared to deal with that. |
From my POV, it does the right thing: grouping on an empty set of label names means that you want to put all alerts in the same bucket. If you really want it, you could generate a label that identifies every alert uniquely:
- alert: SomeAlert
  expr: my_metric == 0
  for: 1m
  labels:
    key: "{{ range $k, $v := $labels }}{{ $k }}={{ $v }}{{end}}" |
|
Duplicating configuration in potentially thousands of alerts doesn't seem like a good way to go, especially compared with a simple one-line configuration, which is helpful in many use cases. |
|
I still fail to see how a simple toggle to disable existing behavior is so controversial, and why everyone keeps proposing horrible workarounds that replace a single line in the Alertmanager config with changes to every single alerting rule in Prometheus. I get how this may not be something fitting certain ideas of how things should work or how external services should behave, but we've covered several use cases already where this simple feature would be useful to people. Clearly there is value here.

Why the aversion to 1:1 alert routing? Alertmanager isn't just a grouper, it does more things, like silencing and converting alerting protocols. Why tie those things to mandatory grouping? This isn't some horrible subversion of what Alertmanager does or how it should behave in a stack. It's basically just trivial syntactic sugar for a behavior that people can already achieve in an ugly, non-ergonomic way. |
We generally do not add features that can already be handled via configuration management.
The purpose of the Alertmanager is to increase the signal-to-noise ratio of notifications; grouping and throttling are innate parts of that. Allowing this would be a subversion of the goals of the Alertmanager, as it is one of several features that users request for the purpose of spamming themselves, due to being used to less powerful systems (without the notion of labels), rather than following https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/preview. |
|
I don't see how asking all of our customers to add a certain label to every single alert config (which looks horrible), or alternatively having to pre-process their configs, is a better idea than a simple configuration option. |
But not only these. There are also http_config, pagerduty_config and webhook_config, which means that those are also responsibilities of the Alertmanager. |
The goal of software should not be to enforce a particular philosophy on its users. The goal of software should be to empower users to choose the configuration that suits them best. It's fine to nudge users towards not hurting themselves with documentation, but it's not productive to actively alienate those users with legitimate use cases for such features. While there is obviously no requirement that software implement every possible feature that every possible user might use, refusing to implement a simple "turn off this behavior I don't need or want" toggle on ideological grounds is, in my opinion, just poor form. You're basically saying "I know better than everyone else, so you can either do things my way or write your own code".

You know full well I was also an SRE at Google, and I agree with most of Rob's essay. But not everyone is Google. I've seen SRE teams at Google in monitoring hell with 10 pages per day. I've seen teams at Google that never got paged. There are many different scenarios that are very different from Google scale.

In the extreme, small single-homed systems are often assumed to be able to stay up for years on end without paging. There is no point in optimizing alerting/monitoring rules for cases like that, because you don't even have the data to know what you need to monitor! E.g. symptom-based alerting is fine and dandy when you have a large distributed system with many frontends where you can afford to lose some with zero impact, but for a little single-homed service, damn right I care about finding out that it's running out of RAM before it actually goes down. There is no one-size-fits-all monitoring philosophy. |
|
The Pushgateway is an official component. This is a recognition that Prometheus sometimes needs to make exceptions to its principles. I would see this as an exception needed for some well-defined use cases. |
|
One more thing: one of our receivers actually deals with grouping on its side. And I think that, because we cannot add notifiers to the core of the Alertmanager, as a trade-off the core project could accept this kind of feature. It should be the Alertmanager's role to be flexible towards advanced users too. |
|
Unfortunately, Do you agree with |
|
Too bad… However, |
|
@beorn7 raises a good point. |
|
Good point, +1 on |
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
92cb59c to
603898b
Compare
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
beorn7
left a comment
I'll leave the detailed review to @stuartnelson3 (or whoever is invested in this).
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
|
Suggestion for the text:
|
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
|
Sorry for the long delay, I haven't had the time to look at this. I'll try to get to it early next week. |
stuartnelson3
left a comment
The code looks fine. It does remind me though about issues with the UI and its assumed grouping, which are described here: #868
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
113b8c3 to
cf867cb
Compare
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
mxinden
left a comment
Thanks a bunch for bearing with us @kirillsablin!
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
mxinden
left a comment
Any further comments by others? Otherwise I will merge in the next couple of days.
This PR adds support for the group_by_all parameter.
If it is set to true, all of an alert's labels will be used for grouping.
If each alert has a unique set of labels, this setting will effectively disable aggregation of different alerts.
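Based on that description, a route using the new parameter might look roughly like the sketch below; the receiver name is hypothetical and the exact field placement is an assumption rather than final documentation.
route:
  receiver: one-alert-per-notification
  # with group_by_all enabled, every label of an alert is used as the
  # grouping key, so alerts with distinct label sets are never merged
  group_by_all: true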