
Add option to customize alias field for opsgenie alerts #1598

Closed
primeroz opened this issue Oct 30, 2018 · 16 comments

@primeroz

OpsGenie deduplicates messages based on the alert alias.

This alias is currently calculated by hashing the "GroupLabels" with sha256 here.

In my scenario I have multiple Alertmanager sources (for multiple Kubernetes clusters) using a single integration on the OpsGenie side, with a common set of Prometheus rules and Alertmanager routing configuration.

Two alerts on two different clusters sharing the same set of GroupLabels produce the same alias sha256, so they are treated as the same alert on OpsGenie and deduplicated, even though they come from different clusters.

My current workaround is to add a static label with the Kubernetes ClusterId to every group_by, but that is not ideal since it is something I have to do for every route I ship to OpsGenie.

Proposal

Would it make sense to add a single configuration field to the OpsGenie receiver configuration that provides a static extra seed for the hashing function that generates this alias?
This way, by leveraging templates when creating the Alertmanager configuration, it would be possible to differentiate the same alert between different Alertmanager sources.
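
A minimal, self-contained Go sketch of the idea (this is not the actual Alertmanager code, which hashes the notification group key): identical group labels always hash to the same alias, while a hypothetical per-receiver seed like the one proposed above would make the alias differ per cluster.

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "sort"
    "strings"
)

// alias loosely mimics how the OpsGenie notifier derives its alias: a sha256
// over a string built from the group labels. The seed parameter is the
// hypothetical per-receiver value proposed in this issue; it does not exist
// in Alertmanager today.
func alias(groupLabels map[string]string, seed string) string {
    keys := make([]string, 0, len(groupLabels))
    for k := range groupLabels {
        keys = append(keys, k)
    }
    sort.Strings(keys)

    var b strings.Builder
    b.WriteString(seed)
    for _, k := range keys {
        fmt.Fprintf(&b, "%s=%q,", k, groupLabels[k])
    }

    sum := sha256.Sum256([]byte(b.String()))
    return hex.EncodeToString(sum[:])
}

func main() {
    groupLabels := map[string]string{
        "alertname": "KubeCPUOvercommit",
        "service":   "node-exporter",
        "severity":  "warning",
    }

    // Without a seed, both Alertmanagers produce the same alias, so
    // OpsGenie deduplicates the two cluster alerts into one.
    fmt.Println(alias(groupLabels, ""), "cluster1")
    fmt.Println(alias(groupLabels, ""), "cluster2")

    // With a per-receiver seed, the aliases differ per cluster.
    fmt.Println(alias(groupLabels, "cluster1"), "cluster1")
    fmt.Println(alias(groupLabels, "cluster2"), "cluster2")
}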

@brian-brazil
Contributor

My current workaround is to add a static label with the Kubernetes ClusterId to every group_by, but that is not ideal since it is something I have to do for every route I ship to OpsGenie.

This is how grouping works in the alertmanager, and what you suggest would not work as the alerts would still be in the same group and thus the same notification. group_by is the right way to handle this.

@primeroz
Author

@brian-brazil I might not have explained myself properly.

My problem is with the same group_by being used by different Alertmanager instances managing alerts on different Kubernetes clusters, and with the way the OpsGenie alias for each alert is calculated.

Essentially I get the same alias for two similar alerts on two different clusters because they share the same group_by rules.

Example scenario

Cluster1 - node01 / node02 / node03  - Alertmanager1
Cluster2 - node04 / node05 / node06  - Alertmanager2 

Example Rule

alert: KubeCPUOvercommit
expr: sum(kube_resourcequota{job="kube-state-metrics",resource="requests.cpu",type="hard"})
  / sum(node:node_num_cpu:sum) > 1.5
for: 5m
labels:
  severity: warning
annotations:
  message: Overcommited CPU resource request quota on Namespaces.
  runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit

Example Alertmanager config (in my case this is matched by the main route)

route:
  group_by: ['service', 'severity', 'alertname']
  group_wait: 30s
  group_interval: 2m
  repeat_interval: 5m
  receiver: opsgenie

The alias generated by this and this is 760bb9fe1a3addca2c90d4accbb55d986c06c12970ee6e2321b6413934178901 from both Kubernetes clusters' Alertmanagers.

This means that on OpsGenie those two alerts, which concern the overcommit of two different clusters and so must certainly be different alerts, are treated as the same alert and deduplicated.

I think that arbitrarily deciding that the alias will always be the sha256 of the GroupLabels for OpsGenie is wrong and does not cover a scenario like the one I am describing, where I have multiple clusters running the exact same rules, the same routes, and the same grouping in Alertmanager.

Adding an option for a static seed in the OpsGenie configuration, so that the sha256 of the alert differs between clusters, would fix that without workarounds.

Adding a "CLUSTER=ID" label to every single group_by (a label that is the same for every single alert Alertmanager will see, since it is pushed as an external_label by Prometheus) is a workaround.

@simonpasquier
Member

Adding a "CLUSTER=ID" label to every single group_by ( a label that is the same for every single alert alertmanager will see since is pushed as an externallabel by prometheus) is a workaround

Using external_labels is the right thing to do here.

@primeroz
Author

primeroz commented Oct 30, 2018

@simonpasquier I am using external_labels (that's exactly what the quoted line says) from Prometheus to Alertmanager. How does that change the alias generated by Alertmanager?

@simonpasquier
Member

I just meant to say that, in any case, you have to define somewhere the key that differentiates your clusters, and what you describe as a workaround is the normal way to go, as long as your "group by" parameter includes this external label (e.g. group_by: ['service', 'severity', 'alertname', 'clusterid']).
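
To make this concrete, here is a rough Go sketch (the label-string format is simplified, not Alertmanager's exact group-key encoding) of why including the external cluster label in group_by yields a distinct OpsGenie alias per cluster:

package main

import (
    "crypto/sha256"
    "fmt"
)

func main() {
    // Without the external label in group_by, both clusters end up with the
    // same group-label set, so the sha256-based alias collides.
    shared := `{alertname="KubeCPUOvercommit",service="node-exporter",severity="warning"}`

    // With group_by including the external "clusterid" label, the label set
    // (and therefore the alias) differs per cluster.
    cluster1 := `{alertname="KubeCPUOvercommit",clusterid="cluster1",service="node-exporter",severity="warning"}`
    cluster2 := `{alertname="KubeCPUOvercommit",clusterid="cluster2",service="node-exporter",severity="warning"}`

    fmt.Printf("%x\n", sha256.Sum256([]byte(shared)))   // identical on both Alertmanagers
    fmt.Printf("%x\n", sha256.Sum256([]byte(cluster1))) // unique to cluster1
    fmt.Printf("%x\n", sha256.Sum256([]byte(cluster2))) // unique to cluster2
}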

@primeroz
Author

Yeah, that is what I am doing, but it seems quite a workaround... Also, and maybe this is something I am doing wrong, I tend to have quite a few sub-routes, around 30 of them, to do different grouping for different rules. In each of them I had to add the clusterid as well... just to change the alias produced by the code for my OpsGenie alert.

I am just wondering if it is correct behaviour to hardcode the way the alias is generated, with no other way to customize it than to fiddle with every group_by in the config.

@pawadski

+1. I am using OpsGenie and control over the Alias field would be appreciated. I understand that Alertmanager is supposed to deduplicate alerts for me, but I would like to see this implemented as it is a critical part of the OpsGenie API.

@jvdspeare

+1 This is currently blocking some key functionality with OpsGenie.

@pawadski

pawadski commented Mar 8, 2019

I have created this simplistic proxy script in Python 3.5 to enable customization of this (on a per-rule level) as a workaround: https://github.com/pawadski/alertmanager-opsgenie-proxy

This way we can, for the time being, customize alias fields (albeit manually): you can define "opsgenie_alias: this_is_my_alias" in the alert rules and get that alias in the alert.

While helpful, I believe it does not answer the seed question the OP had, but hopefully it helps someone as much as it helped me.

@GMartinez-Sisti

GMartinez-Sisti commented Mar 16, 2019

After checking this thread and digging a little further, I managed to fix this with external labels.

These helped a lot in my case:

I added this to the Prometheus config (I'm using a custom prometheus-operator in k8s):

externalLabels:
    cluster: ${CLUSTER_NAME}
    env: ${CLUSTER_ENV}

And in the Alertmanager config I added this:

route:
  group_by:
    - job
    - alertname
    - service
    - env

And now the alerts have the cluster and env in the alias, making them unique by cluster.

Hope it helps.

@primeroz
Author

primeroz commented Mar 3, 2020

I opened this issue and have now hit it again... I believe, though, that the solution provided in #1598 (comment) is good enough.

I propose closing this issue, but maybe the info should be added to the documentation for the OpsGenie receiver, with the example from the comment above, to keep people from hitting the same issue?

@simonpasquier
Member

@primeroz closing for now then. I don't think that the reference documentation is the proper place for such details.

@juliandm

@GMartinez-Sisti You mentioned that alerts have the cluster and env in the alias, but you only added env into the group_by statement. Did you forget to put cluster there?

@GMartinez-Sisti

@GMartinez-Sisti You mentioned that alerts have the cluster and env in the alias, but you only added env into the group_by statement. Did you forget to put cluster there?

If you have multiple clusters per env you should add it to the group_by. Otherwise, if it is a 1:1 relation (env <> cluster), you can skip it. In either case the cluster label will still be available in the event; it just won't be used to group alerts in the former setup.

@dibo1don

dibo1don commented Dec 5, 2023

@GMartinez-Sisti How did you verify that the alias did in fact contain the cluster on the OpsGenie side?

@GMartinez-Sisti

@GMartinez-Sisti How did you verify that the alias did in fact contain the cluster on the OpsGenie side?

Sorry @marwaneldib, but I don't have access to that infra anymore and can't remember the context to reply to this 😅.
