This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Bucket-Level Alerting #326

Closed
qreshi opened this issue Dec 18, 2020 · 6 comments
Labels: enhancement (New feature or request)

Comments

@qreshi
Contributor

qreshi commented Dec 18, 2020

The Document-Level Alerting feature enhancement seeks to address the concerns brought up in both #13 and #145 among others. Creating this issue to centralize discussion.

@mgiammarco

mgiammarco commented Dec 25, 2020

Thank you for this thread.
To be really useful at work, an alerting system should have these features:

  1. Easy and scalable: I do not need to create a new monitor/alert when I monitor a new host.
  2. Independent alerts for each host/resource.
  3. The option to choose whether an alert is closed automatically or not.

Consider this (I hope typical) use case:

  • 100 hosts to monitor
  • each host sends data with several agents (syslog, collectd, and so on) in different formats
  • if host1 and host2 both have a failed backup, I must get two alerts
  • if I have host1 and host2 and must monitor average CPU usage, I need an easy way to group by host and again send two separate alerts
  • for some alerts I do not want them to return to the normal state automatically. For example, high CPU usage starts at 3am and stops at 5am; when I check at 2pm I need to see the alert still in the red state.

One piece of software that meets the criteria above is InfluxDB. Another is the ElastAlert plugin for Elasticsearch. Please consider it and possibly integrate it, because it covers all of these needs.
Grafana has alerting too, but it completely misses point 2.
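
The 3am/5am example above can be sketched in a few lines of Python. This is only an illustration of the requested behavior, not how any alerting plugin implements it: `AlertStore`, `auto_close`, `update`, and `acknowledge` are hypothetical names, and the readings are made-up values.

```python
from dataclasses import dataclass, field


@dataclass
class AlertStore:
    """Tracks one alert per host. All names here are hypothetical."""
    auto_close: bool = True
    active: dict = field(default_factory=dict)  # host -> last breaching value

    def update(self, readings: dict, threshold: float) -> None:
        for host, value in readings.items():
            if value > threshold:
                self.active[host] = value      # open or refresh this host's alert
            elif self.auto_close:
                self.active.pop(host, None)    # condition cleared: close automatically
            # otherwise the alert stays open until someone acknowledges it

    def acknowledge(self, host: str) -> None:
        self.active.pop(host, None)


sticky = AlertStore(auto_close=False)
sticky.update({"host1": 95.0, "host2": 40.0}, threshold=90.0)  # 3am: host1 is busy
sticky.update({"host1": 20.0, "host2": 35.0}, threshold=90.0)  # 5am: load is back to normal
print(sorted(sticky.active))  # host1 is still listed as alerting
```

With `auto_close=True` the same two updates would leave no active alerts, which is the behavior the comment wants to be optional.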

@verbecee

Just got off the community forum and wanted to post 3 recommendations for alerting:

  1. For aggregation, there should be flexibility on the group-by field. In our alerting implementation (we are using something besides Open Distro's alerting to accomplish our goals), we had an alert set up that would aggregate on field X. Initially, that field came in as a string, but then started coming in as an array of strings. So we had to accommodate this.
  2. The aggregation should be able to deal with dirty data. Similar to the example above, this same index started receiving logs with arrays composed of strings and the value null. At least in our implementation, null really screwed up our aggregation and needed to be handled. In our case we also had to deal with ECS special characters in logs, but that might only be an issue for us because we are interfacing with Elasticsearch.
  3. Suppression: provide context about what the alert is suppressing. Is it a misconfigured server or a malware outbreak in the network?
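
As a rough sketch of points 1 and 2, a client-side aggregation can normalize the group-by field before counting. This assumes documents are plain Python dicts; `group_keys`, `count_by_field`, and the `tag` field are hypothetical names, and it does not reflect how Elasticsearch itself handles these values.

```python
from collections import Counter


def group_keys(value):
    """Return the clean group-by keys for one field value.

    Accepts a string, a list of strings, or None, and drops None and
    empty strings so they never become a bucket.
    """
    if value is None:
        return []
    items = value if isinstance(value, (list, tuple)) else [value]
    return [str(item) for item in items if item is not None and item != ""]


def count_by_field(docs, field_name):
    """Count documents per key, tolerating missing and mixed-type fields."""
    counts = Counter()
    for doc in docs:
        for key in group_keys(doc.get(field_name)):
            counts[key] += 1
    return counts


docs = [
    {"tag": "web"},           # field arrives as a string
    {"tag": ["web", "db"]},   # later it arrives as an array of strings
    {"tag": ["db", None]},    # arrays can contain null
    {"tag": None},
    {},                       # field missing entirely
]
counts = count_by_field(docs, "tag")
print(dict(counts))
```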

@rafael-gumiero

Basically our use case is very similar to the ones listed above.

  1. Generate separate alerts based on a key to be defined (host, device type, etc).
  2. Grouping categorizes alerts of similar nature into a single notification. This is especially useful during larger outages when many systems fail at once and hundreds to thousands of alerts may be firing simultaneously.
  3. Inhibition is a concept of suppressing notifications for certain alerts if certain other alerts are already firing.

Use case breakdown:

  • 100+ hosts;
  • Metrics being captured via: metricbeat and filebeat;
  • It is necessary to generate separate alerts for each host/device or specific key that is out of the desired condition;
  • Standardize alerts as much as possible to avoid having to create endless separate rules (costly to maintain);
  • Alerts based on anomaly detection and threshold.
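
The grouping and inhibition ideas from the numbered list above could look roughly like this. It is a minimal sketch under stated assumptions: alerts are dicts with hypothetical `name` and `host` keys, and an inhibit rule only applies between alerts on the same host. The function name and rule format are invented for illustration.

```python
from collections import defaultdict


def notifications(alerts, inhibit_rules):
    """Group firing alerts by name and drop the ones an inhibit rule silences.

    alerts: list of dicts with hypothetical keys "name" and "host".
    inhibit_rules: dict mapping a source alert name to the set of alert
    names it silences on the same host.
    """
    firing = {(a["name"], a["host"]) for a in alerts}
    grouped = defaultdict(list)
    for alert in alerts:
        inhibited = any(
            (source, alert["host"]) in firing and alert["name"] in targets
            for source, targets in inhibit_rules.items()
        )
        if not inhibited:
            grouped[alert["name"]].append(alert["host"])
    # One notification per alert name, listing every affected host.
    return {name: sorted(hosts) for name, hosts in grouped.items()}


alerts = [
    {"name": "host_down", "host": "host1"},
    {"name": "high_cpu", "host": "host1"},   # silenced: host1 is already down
    {"name": "high_cpu", "host": "host2"},
    {"name": "high_cpu", "host": "host3"},
]
result = notifications(alerts, {"host_down": {"high_cpu"}})
print(result)
```

Four firing alerts collapse into two notifications here, one per alert name, which is the effect grouping is meant to have during a large outage.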

@elfisher
Contributor

Reading through the RFC, it looks like the feature being built will enable alert triggering on a per-bucket basis for a given aggregation query. That is, the trigger condition would be evaluated for each bucket value. My take is that a better name for this feature would be "Bucket-Level Alerting". This is because document-level alerting (or, in other systems, event-level alerting) implies that the trigger conditions will be evaluated for each document in a query. A per-document alerting feature definitely has value, but the current RFC is more aggregation focused, so I think it should be named accordingly.
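
To make the per-bucket idea concrete, here is a minimal sketch that evaluates one condition against every bucket of a terms-style aggregation response. The response shape follows the usual `aggregations` / `buckets` / `key` layout, but the aggregation name `by_host`, the metric name `avg_cpu`, the values, and the `evaluate_buckets` helper are all assumptions made for illustration. This is not the trigger API of the alerting plugin.

```python
def evaluate_buckets(response, agg_name, metric_name, condition):
    """Return the key of every bucket whose metric satisfies the condition."""
    buckets = response["aggregations"][agg_name]["buckets"]
    triggered = []
    for bucket in buckets:
        value = bucket[metric_name]["value"]
        # An empty bucket has no metric value, so it cannot trigger.
        if value is not None and condition(value):
            triggered.append(bucket["key"])
    return triggered


# Hypothetical response for "average CPU per host".
response = {
    "aggregations": {
        "by_host": {
            "buckets": [
                {"key": "host1", "doc_count": 120, "avg_cpu": {"value": 93.5}},
                {"key": "host2", "doc_count": 118, "avg_cpu": {"value": 41.0}},
                {"key": "host3", "doc_count": 0, "avg_cpu": {"value": None}},
            ]
        }
    }
}
triggered = evaluate_buckets(response, "by_host", "avg_cpu", lambda v: v > 90)
print(triggered)
```

Each returned key would then become its own alert, in contrast to a query-level trigger that fires once for the whole result.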

@qreshi
Contributor Author

qreshi commented Aug 20, 2021

> Reading through the RFC, it looks like the feature being built will enable alert triggering on a per-bucket basis for a given aggregation query. That is, the trigger condition would be evaluated for each bucket value. My take is that a better name for this feature would be "Bucket-Level Alerting". This is because document-level alerting (or, in other systems, event-level alerting) implies that the trigger conditions will be evaluated for each document in a query. A per-document alerting feature definitely has value, but the current RFC is more aggregation focused, so I think it should be named accordingly.

Agreed, I'm updating this issue name to Bucket-Level Alerting. The OpenSearch Alerting issue will also be updated as that is where the feature itself is being tracked now.

@qreshi qreshi changed the title from "Document-Level Alerting" to "Bucket-Level Alerting" Aug 20, 2021
@qreshi
Contributor Author

qreshi commented Feb 18, 2022

Closing this issue. Please refer to opensearch-project/alerting#86

@qreshi qreshi closed this as completed Feb 18, 2022