Solution 1 is not viable: there is still the potential for gaps in the timeline for suite monitors, which run their journeys in serial, so a long-running journey can significantly delay the journeys that follow it.
Problem
Browser monitors are sufficiently different from lightweight checks, leading to unintended bugs in the alerting framework.
History
We have received a handful of SDHs for browser monitor alerts. Some issues include:
Fixes
Fixes have gone in to improve the experience, including:
Outstanding Issues
1. Gaps in monitor.timespan leading to flapping alerts

Browser monitors are more likely to run longer than the scheduled interval. When this happens, it can create gaps in the monitor.timespan value for individual checks. Gaps in the timeline can cause unintentional flapping of triggered and resolved alert states.
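The mechanics of such a gap can be sketched with hypothetical timings (an illustration only, not Heartbeat's actual implementation):

```python
from datetime import datetime, timedelta

# Hypothetical example: a monitor scheduled every minute whose browser
# journey takes three minutes, so the second check starts late.
interval = timedelta(minutes=1)
check_starts = [
    datetime(2021, 11, 1, 12, 0),  # first check, on schedule
    datetime(2021, 11, 1, 12, 3),  # second check, delayed by the journey
]

# Each check's monitor.timespan covers only one schedule interval.
timespans = [(start, start + interval) for start in check_starts]

# The window between the end of one timespan and the start of the next is
# covered by no document, so an alert querying that window finds nothing
# and may flap to resolved.
gap = timespans[1][0] - timespans[0][1]
print(gap)  # 0:02:00
```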
Potential solutions:
Update the monitor.timespan to factor in the length of the synthetic check: [Heartbeat] Append synthetic journey duration to monitor.timespan for browser monitors beats#29102

Our current alerting rules rely on monitor.timespan, looking back in history to see whether there is a down check within the monitor.timespan range. By increasing monitor.timespan to represent the greater of the time it takes to run the check and the time until the next scheduled check, we prevent gaps in the monitor timeline and can continue using the existing architecture to resolve this issue.

Explicitly look for up monitors

Explicitly looking for up monitors has come up a few times as the most accurate way of telling whether a monitor is resolved. However, this logic adds significant complexity to the existing alerting architecture. A PR was constructed to achieve this goal, but it was later determined that querying by monitor.timespan would be a cleaner, less complex option for resolution. It's also worth noting that we have discussed completely overhauling alerting in the past, which has contributed to the desire to avoid adding complexity to the existing design where possible. Example PR for this solution: [Uptime] [Alerts] update monitor status alerts to persist when documents are not found kibana#100339
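The fix proposed in beats#29102 can be sketched as follows (a simplified illustration, not the actual Heartbeat code; `timespan_end` is a hypothetical helper):

```python
from datetime import datetime, timedelta

def timespan_end(start, journey_duration, schedule_interval):
    # monitor.timespan is extended to cover the greater of the time the
    # check took to run and the time until the next scheduled check, so a
    # long browser journey no longer leaves an uncovered window for the
    # alert's look-back query to fall into.
    return start + max(journey_duration, schedule_interval)

interval = timedelta(minutes=1)

# A 3-minute journey starting at 12:00 now covers through 12:03, meeting
# the delayed follow-up check and closing the former 2-minute gap.
print(timespan_end(datetime(2021, 11, 1, 12, 0), timedelta(minutes=3), interval))
# 2021-11-01 12:03:00

# A fast lightweight check still covers the full schedule interval.
print(timespan_end(datetime(2021, 11, 1, 12, 0), timedelta(seconds=5), interval))
# 2021-11-01 12:01:00
```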