Feature Request: Time-Averaged Threshold for Alerts #93

Closed
RedwindA opened this issue Aug 7, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@RedwindA

RedwindA commented Aug 7, 2024

Description:
Currently, beszel only supports instant thresholds for server monitoring and alerting. This can lead to false alarms triggered by normal, short-term operations such as file compression that may temporarily spike resource usage.

Feature request:
Implement a new threshold type that triggers alerts based on the average value of a monitored metric over a specified time period, rather than instantaneous values.

Proposed functionality:

  1. Allow users to set a time period (e.g., 5 minutes, 1 hour) for averaging the monitored value.
  2. Calculate the average value of the metric over the specified time period.
  3. Trigger an alert only if this average value exceeds the set threshold.

Example use case:

  • Metric: CPU usage
  • Time period: 15 minutes
  • Threshold: 80%

In this scenario, an alert would only be triggered if the average CPU usage over a 15-minute period exceeds 80%, reducing false alarms from short-term spikes.
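A minimal sketch of the proposed behavior, assuming one sample per minute. The class name `AveragedThreshold` and its fields are hypothetical and are not taken from beszel's code:

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AveragedThreshold:
    """Alert when the mean of the last `window` samples exceeds `threshold`."""

    threshold: float  # percent, e.g. 80.0
    window: int       # number of one-minute samples, e.g. 15
    samples: deque = field(init=False)

    def __post_init__(self):
        # A deque with maxlen discards the oldest sample automatically.
        self.samples = deque(maxlen=self.window)

    def add(self, value: float) -> bool:
        """Record one sample and return True if the alert condition holds."""
        self.samples.append(value)
        if len(self.samples) < self.window:
            return False  # not enough history to fill the window yet
        return sum(self.samples) / self.window > self.threshold


# One spike to 100% among fourteen samples at 20% does not trigger the alert.
cpu = AveragedThreshold(threshold=80.0, window=15)
for usage in [20.0] * 14 + [100.0]:
    fired = cpu.add(usage)
```

With these inputs the 15-minute mean is about 25%, so `fired` stays `False`. Fifteen consecutive samples at 90% would trigger the alert.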

Benefits:

  1. Reduced false alarms from temporary spikes in resource usage
  2. More accurate representation of sustained resource constraints
  3. Improved ability to distinguish between normal operations and actual issues

This feature would significantly enhance beszel's monitoring capabilities and provide more meaningful alerts to users.

@henrygd henrygd added the enhancement New feature or request label Aug 7, 2024
@henrygd
Owner

henrygd commented Aug 7, 2024

I can add options at some point for 10m, 20m, and 2h time periods.

Shouldn't add much overhead since we're already calculating those averages for the 12h, 24h, and 1w charts.

Just a point of clarification - the threshold currently is not instant. It works exactly as you outlined -- time averaged -- but only based on one minute intervals. So you can have short spikes above threshold of under a minute that won't trigger an alert.

That may be what you meant, but wanted to point that out in case anyone was wondering.

@RedwindA
Author

RedwindA commented Aug 7, 2024

Thank you for your explanation! I hope it can be a customizable value instead of hardcoded options, because the AUP (acceptable use policy) varies across IDCs (data centers), and so does the allowed duration of full load.

@ghost

ghost commented Aug 12, 2024

This would go a long way toward improving the alerting features. I would love to see this implemented.

Would it be possible to have multiple alert triggers for each metric? This would make it even more customisable.

@henrygd
Owner

henrygd commented Aug 12, 2024

Maybe a better implementation would be to add another slider allowing you to choose any number of minutes from 1m to 60m?

This would be slightly more intensive as we'd need to query, loop, and decode json for previous 1m records.

But we'd only need to do that if the alert hasn't been triggered and the current 1m record is above threshold, or the alert is triggered and the current record is below threshold.

Most of the time you'll be below threshold and without a triggered alert, so that operation wouldn't need to run.
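The condition described above can be sketched roughly as follows. This is an illustration only, not beszel's implementation: the function names and the `load_history` callback (assumed to return the most recent one-minute values, including the current one) are hypothetical.

```python
from typing import Callable, Sequence


def needs_history_check(triggered: bool, current: float, threshold: float) -> bool:
    """Return True only when the alert state could change."""
    if not triggered and current > threshold:
        return True  # might need to trigger
    if triggered and current < threshold:
        return True  # might need to resolve
    return False


def update_alert(
    triggered: bool,
    current: float,
    threshold: float,
    minutes: int,
    load_history: Callable[[int], Sequence[float]],
) -> bool:
    """Return the new triggered state, querying history only when needed."""
    if not needs_history_check(triggered, current, threshold):
        return triggered  # common case: skip the expensive query
    history = load_history(minutes)  # hypothetical lookup of 1m records
    if len(history) < minutes:
        return triggered  # not enough records to average yet
    average = sum(history) / len(history)
    return average > threshold
```

In the common case (below threshold, no active alert) `load_history` is never called, which is the saving the comment describes.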

Seems like that may be the way to go.

@henrygd
Owner

henrygd commented Oct 16, 2024

Added in 0.6.0.

Please update and let me know if you run into any issues with it.

@Matthias-vdE

How to dismiss an active alert? I currently have an alert for one of my servers:

image

I'm fine with the disk being 50% full, but even after raising the threshold to 80%, the alert stays:

image

I assume I have to wait for another 10 minutes to pass? Disabling the alert and re-enabling it made it go away.
(Great feature by the way! It seriously reduces the alerts from small CPU spikes when doing updates).

@henrygd
Owner

henrygd commented Oct 17, 2024

@Matthias-vdE It should clear on the next system update, but I'll change it so the alert gets set to inactive if you update the time or threshold.
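As a rough sketch of the planned change (the `Alert` class and `update_settings` method are hypothetical names, not beszel's API), editing the threshold or time period could reset the active state like this:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    threshold: float         # percent
    minutes: int             # averaging period
    triggered: bool = False  # True while the alert is active

    def update_settings(
        self, threshold: Optional[float] = None, minutes: Optional[int] = None
    ) -> None:
        """Apply new settings and mark the alert inactive if anything changed."""
        changed = False
        if threshold is not None and threshold != self.threshold:
            self.threshold = threshold
            changed = True
        if minutes is not None and minutes != self.minutes:
            self.minutes = minutes
            changed = True
        if changed:
            # The old state was computed against the old settings.
            self.triggered = False


disk = Alert(threshold=50.0, minutes=10, triggered=True)
disk.update_settings(threshold=80.0)  # disk.triggered is now False
```

The alert would then be re-evaluated against the new settings on the next system update.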

@henrygd henrygd closed this as completed Oct 30, 2024
3 participants