Bucket-Level Alerting #326

qreshi · 2020-12-18T23:27:23Z

The Document-Level Alerting feature enhancement seeks to address the concerns brought up in both #13 and #145 among others. Creating this issue to centralize discussion.

mgiammarco · 2020-12-25T11:12:37Z

Thank you for this thread.
An alerting system that is really useful at work should have these features:

1. Easy and scalable: I should not need to create a new monitor/alert when I start monitoring a new host.
2. Independent alerts for each host/resource.
3. The possibility to choose whether an alert is auto-closed or not.

Consider this (I hope typical) use case:

100 hosts to monitor
each host sends data with several agents (syslog, collectd, and so on) in different formats
if host1 and host2 both have a failed backup, I must get two alerts
if I must monitor average CPU usage on host1 and host2, I need an easy way to group by host and again get two separate alerts
for some alerts I do not want them to return to the normal state automatically. For example, high CPU usage starts at 3am and stops at 5am. When I check at 2pm, I still need to see the alert in a red state.

One piece of software that fulfills the above criteria is InfluxDB. Another is the ElastAlert plugin for Elasticsearch. Please consider this one and possibly integrate it, because it fulfills all of these needs.
Grafana has alerting too, but it completely misses point 2 (independent alerts for each host/resource).
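
A minimal sketch of the behaviour requested above: one alert per host from a single rule, with an option to keep alerts open after the condition clears. The function name `evaluate`, the `auto_close` flag, and the in-memory `open_alerts` set are all hypothetical and only illustrate the request; they are not part of any existing alerting API.

```python
from collections import defaultdict
from statistics import mean


def evaluate(samples, threshold, open_alerts, auto_close=False):
    """Update the set of hosts with an open alert from one batch of samples.

    samples: iterable of (host, cpu_percent) pairs (hypothetical input shape).
    open_alerts: set of host names that currently have an open alert.
    """
    # Group readings by host so that each host is judged independently.
    by_host = defaultdict(list)
    for host, cpu in samples:
        by_host[host].append(cpu)

    for host, values in by_host.items():
        if mean(values) > threshold:
            open_alerts.add(host)        # one alert per offending host
        elif auto_close:
            open_alerts.discard(host)    # only close when explicitly allowed
    return open_alerts
```

With `auto_close=False`, a host that spiked at 3am stays in `open_alerts` until someone closes it by hand, which matches the "still red at 2pm" requirement. New hosts need no new rule because they simply appear as new keys in the grouping.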

verbecee · 2021-01-11T19:03:18Z

Just got off the community forum and wanted to post 3 recommendations for alerting:

For aggregation, there should be flexibility on the groupby field. In our alerting implementation (we are using something besides Open Distro's alerting to accomplish our goals), we had an alert set up that would aggregate on field X. Initially, that field came in as a string, but then it started coming in as an array of strings, so we had to accommodate this.
The aggregation should be able to deal with dirty data. Similar to the example above, this same index started receiving logs with arrays composed of strings and the value null. At least in our implementation, null really screwed up our aggregation and needed to be handled. In our case, we also had to deal with ECS special characters in logs, but that might only be an issue for us because we are interfacing with Elasticsearch.
Suppression - provide context about what the alert is suppressing. Is it a misconfigured server or a malware outbreak in the network?
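
The first two recommendations could be sketched as a normalisation step applied before grouping. This is only an illustration under assumptions: `group_keys` and `count_by_field` are hypothetical helper names, the documents are plain dicts, and dropping null or non-string entries is one possible policy rather than the required one.

```python
from collections import Counter


def group_keys(value):
    """Normalise a group-by field value to a list of string keys."""
    if isinstance(value, str):
        return [value]                   # a single string is one key
    if isinstance(value, (list, tuple)):
        # Keep only string members; nulls and other types are dropped.
        return [v for v in value if isinstance(v, str)]
    return []                            # missing, null, or unexpected type


def count_by_field(docs, field):
    """Count documents per key, tolerating string, array, and null values."""
    counts = Counter()
    for doc in docs:
        counts.update(group_keys(doc.get(field)))
    return counts
```

A document whose field is an array contributes to every key in that array, so a field that changes from a string to an array of strings keeps aggregating without a change to the alert definition.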

rafael-gumiero · 2021-01-19T01:07:51Z

Basically our use case is very similar to the ones listed above.

Generate separate alerts based on a key to be defined (host, device type, etc).
Grouping categorizes alerts of similar nature into a single notification. This is especially useful during larger outages when many systems fail at once and hundreds to thousands of alerts may be firing simultaneously.
Inhibition is a concept of suppressing notifications for certain alerts if certain other alerts are already firing.

Use case breakdown:

100+ hosts;
Metrics are captured via Metricbeat and Filebeat;
It is necessary to generate separate alerts for each host/device or specific key that is outside the desired condition;
Create alerts that are as standardized as possible, to avoid having to create endless separate rules (costly to maintain);
Alerts based on anomaly detection and threshold.
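
The grouping and inhibition concepts described above can be sketched roughly as follows. Everything here is an assumption made for illustration: alerts are plain dicts with a `name` key plus label keys, `inhibit_rules` maps a source alert name to the alert names it suppresses, and `build_notifications` is a hypothetical function, not an existing API. Inhibition is applied globally for brevity, whereas a real system would normally scope it by matching labels.

```python
from collections import defaultdict


def build_notifications(alerts, group_by, inhibit_rules):
    """Return {group value: [alerts]} after applying inhibition."""
    firing = {a["name"] for a in alerts}

    # Inhibition: while a source alert fires, its targets are suppressed.
    suppressed = set()
    for source, targets in inhibit_rules.items():
        if source in firing:
            suppressed |= set(targets)

    # Grouping: the remaining alerts that share a label value are bundled
    # into a single notification.
    groups = defaultdict(list)
    for alert in alerts:
        if alert["name"] not in suppressed:
            groups[alert[group_by]].append(alert)
    return dict(groups)
```

During a large outage this turns hundreds of per-host alerts into one notification per group, and a rule such as `{"DatacenterDown": {"HostDown"}}` hides the per-host noise while the broader alert is firing.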

elfisher · 2021-06-11T14:28:18Z

Reading through the RFC, it looks like the feature being built will enable alert triggering on a per-bucket basis for a given aggregation query. That is, the trigger condition would be evaluated for each bucket value. My take is that a better name for this feature would be "Bucket-Level Alerting". This is because document-level alerting (or, in other systems, event-level alerting) implies that the trigger conditions will be evaluated for each document in a query. A per-document alerting feature definitely has value, but the current RFC is more aggregation focused, so I think it should be named accordingly.
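
To make the distinction concrete, here is a rough sketch of per-bucket evaluation over a terms-aggregation response. The response shape (`aggregations` → aggregation name → `buckets`, each with a `key`) follows the usual Elasticsearch/OpenSearch layout; the function `triggered_buckets`, the aggregation names, and the use of a Python callable as the trigger condition are assumptions for illustration, not how the RFC specifies triggers.

```python
def triggered_buckets(response, agg_name, condition):
    """Return the keys of all buckets for which the condition holds."""
    buckets = response["aggregations"][agg_name]["buckets"]
    # The condition runs once per bucket, not once per query result
    # and not once per document.
    return [bucket["key"] for bucket in buckets if condition(bucket)]


# Example response with one bucket per host (values are made up).
response = {
    "aggregations": {
        "by_host": {
            "buckets": [
                {"key": "host1", "doc_count": 12, "avg_cpu": {"value": 93.0}},
                {"key": "host2", "doc_count": 15, "avg_cpu": {"value": 41.5}},
            ]
        }
    }
}

hot_hosts = triggered_buckets(
    response, "by_host", lambda bucket: bucket["avg_cpu"]["value"] > 90
)
```

Each returned key would map to its own alert, which is the per-bucket behaviour; a document-level feature would instead run the condition against every matching document.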

qreshi · 2021-08-20T20:46:32Z

> Reading through the RFC, it looks like the feature being built will enable alert triggering on a per-bucket basis for a given aggregation query. That is, the trigger condition would be evaluated for each bucket value. My take is that a better name for this feature would be "Bucket-Level Alerting". This is because document-level alerting (or, in other systems, event-level alerting) implies that the trigger conditions will be evaluated for each document in a query. A per-document alerting feature definitely has value, but the current RFC is more aggregation focused, so I think it should be named accordingly.

Agreed, I'm updating this issue name to Bucket-Level Alerting. The OpenSearch Alerting issue will also be updated as that is where the feature itself is being tracked now.

qreshi · 2022-02-18T14:58:55Z

Closing this issue. Please refer to opensearch-project/alerting#86

qreshi added the enhancement (New feature or request) label Dec 18, 2020

qreshi assigned skkosuri-amzn and qreshi and unassigned skkosuri-amzn Dec 18, 2020

This was referenced Dec 18, 2020

Alerting per documents/events #13

Closed

Triggers generate unique alerts based on the value of a specified key #145

Closed

adityaj1107 mentioned this issue Jun 2, 2021

Bucket-Level Alerting opensearch-project/alerting#86

Closed

qreshi changed the title ~~Document-Level Alerting~~ Bucket-Level Alerting Aug 20, 2021

qreshi closed this as completed Feb 18, 2022
