[Discussion] Improve alerting for browser monitors #418

dominiqueclarke · 2021-12-09T13:49:37Z

Problem

Browser monitors are sufficiently different from lightweight checks that they expose bugs in the alerting framework.

History

We have received a handful of SDHs for browser monitor alerts. Some of the issues include:

Browser alert rules triggering before the specified number of down monitors is reached: [Uptime][Alert][Synthetics] Alert is triggered without matching the rule kibana#115928
Browser alerts flapping between the triggered and resolved states: https://github.com/elastic/support-known-issues/issues/980

Fixes

Fixes have gone in to improve the experience, including:

https://github.com/elastic/support-known-issues/issues/980

Outstanding Issues

1. Gaps in `monitor.timespan` leading to flapping alerts

Browser monitors are more likely than lightweight checks to run longer than their scheduled interval. When this happens, it can leave gaps between the monitor.timespan values of individual checks. These gaps in the timeline can cause alerts to flap unintentionally between the triggered and resolved states.
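To make the gap concrete, here is a minimal sketch. It assumes each check's monitor.timespan starts at the check's start time and covers exactly one schedule interval; the helper names `timespan_for` and `find_gaps` are hypothetical and not part of Heartbeat or Kibana.

```python
from datetime import datetime, timedelta

def timespan_for(start, interval):
    """Return a (gte, lt) pair covering one schedule interval from the check start."""
    return (start, start + interval)

def find_gaps(timespans):
    """Return (gap_start, gap_end) pairs that no timespan covers."""
    ordered = sorted(timespans)
    if not ordered:
        return []
    gaps = []
    covered_until = ordered[0][1]
    for start, end in ordered[1:]:
        if start > covered_until:
            gaps.append((covered_until, start))
        covered_until = max(covered_until, end)
    return gaps

interval = timedelta(minutes=1)
first_start = datetime(2021, 12, 9, 12, 0, 0)
# Assumption: the first journey takes 90 seconds, so the next run starts late.
second_start = first_start + timedelta(seconds=90)

timespans = [
    timespan_for(first_start, interval),
    timespan_for(second_start, interval),
]
gaps = find_gaps(timespans)
```

In this example the 30 seconds between 12:01:00 and 12:01:30 are covered by no check, so a rule evaluated in that window would find no down check and could resolve an alert that should stay triggered.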

Potential solutions:

Update the monitor.timespan to factor in the length of the synthetic check: [Heartbeat] Append synthetic journey duration to monitor.timespan for browser monitors beats#29102
Our current alerting rules rely on monitor.timespan, looking back in history to see whether there is a down check within the monitor.timespan range. By extending monitor.timespan to cover the greater of the time it takes to run the check or the time until the next scheduled check, we prevent gaps in the monitor timeline and can continue using the existing architecture to resolve this issue.
Explicitly look for up monitors
Explicitly looking for up monitors has come up a few times as the most accurate way of telling whether a monitor is resolved. However, this logic adds significant complexity to the existing alerting architecture. A PR was constructed to achieve this goal, but it was later determined that querying by monitor.timespan would be a cleaner, less complex option for resolution. It's also important to note that we've had discussions about completely overhauling alerting in the past, which has contributed to the desire to avoid adding complexity to the existing design if possible. Example PR for this solution: [Uptime] [Alerts] update monitor status alerts to persist when documents are not found kibana#100339
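The first option can be sketched as follows. This is an illustration under stated assumptions, not the Heartbeat or Kibana implementation: the `Check` type and the `timespan` and `has_down_check_at` helpers are hypothetical, and the rule is modeled as "a down check exists whose timespan contains the evaluation time".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Check:
    status: str          # "up" or "down"
    start: datetime      # when the check started
    duration: timedelta  # how long the journey actually ran

def timespan(check, interval, extend=False):
    """Return (gte, lt); with extend=True, use the greater of interval and duration."""
    length = max(interval, check.duration) if extend else interval
    return (check.start, check.start + length)

def has_down_check_at(checks, at, interval, extend=False):
    """True if any down check's timespan contains the evaluation time."""
    for check in checks:
        gte, lt = timespan(check, interval, extend)
        if check.status == "down" and gte <= at < lt:
            return True
    return False

interval = timedelta(minutes=1)
down = Check("down", datetime(2021, 12, 9, 12, 0, 0), timedelta(seconds=90))
evaluated_at = datetime(2021, 12, 9, 12, 1, 15)

without_extension = has_down_check_at([down], evaluated_at, interval)
with_extension = has_down_check_at([down], evaluated_at, interval, extend=True)
```

With the one-minute timespan, the rule evaluated at 12:01:15 sees no down check; with the extended timespan, the 90-second journey still covers that moment and the alert stays triggered.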


dominiqueclarke · 2021-12-15T17:20:01Z

From the above potential solutions:

Solution 1: Not viable. Gaps in the timeline are still possible for suite monitors, whose journeys run serially, so a long-running journey can significantly delay the journeys that follow it.
Solution 2: @dominiqueclarke to investigate this solution as part of [Spike] [Uptime] Investigate improving monitor status alert query to better support browser monitors kibana#121330
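A small simulation shows why extending monitor.timespan does not close the gaps for suites. The scheduling model here is an assumption made for illustration (journeys run one after another, and an overdue suite run starts as soon as the previous run finishes); `simulate_suite` and `find_gaps` are hypothetical helpers, and times are plain seconds.

```python
def simulate_suite(durations, interval, runs):
    """Return {journey: [(start, end), ...]} using the extended timespan rule.

    durations maps journey name to run time in seconds. Journeys within a
    suite run serially, in dictionary order.
    """
    timespans = {name: [] for name in durations}
    clock = 0
    for run in range(runs):
        # A run starts on schedule, or immediately if it is already overdue.
        clock = max(clock, run * interval)
        for name, duration in durations.items():
            start = clock
            # Extended timespan: the greater of the interval and the duration.
            timespans[name].append((start, start + max(interval, duration)))
            clock = start + duration
    return timespans

def find_gaps(timespans):
    """Return (gap_start, gap_end) pairs that no timespan covers."""
    ordered = sorted(timespans)
    if not ordered:
        return []
    gaps = []
    covered_until = ordered[0][1]
    for start, end in ordered[1:]:
        if start > covered_until:
            gaps.append((covered_until, start))
        covered_until = max(covered_until, end)
    return gaps

# Assumption: journey "A" takes 150 s, journey "B" takes 10 s, schedule is 60 s.
result = simulate_suite({"A": 150, "B": 10}, interval=60, runs=2)
gaps_for_b = find_gaps(result["B"])
```

Journey B finishes quickly, so its extended timespan is still only one interval long, yet its next run has to wait for A to finish again. In this example B is left uncovered from second 210 to second 310.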

Additional suggestions:

Support a timeout for browser monitors: A timeout shorter than the scheduled monitor interval can prevent gaps in the timeline. @andrewvc to investigate in [Spike] [Heartbeat] Explore implementing timeout for browser monitors beats#29454
Check for an incomplete monitor run by searching for the absence of a heartbeat/summary document before resolving the alert. @dominiqueclarke to investigate in [Spike] [Uptime] Investigate improving monitor status alert query to better support browser monitors kibana#121330
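The second suggestion could look roughly like the sketch below. It is only one possible reading: the `Summary` type, the `next_alert_state` function, and the state names are hypothetical, and the only assumption about the data is that a completed run produces a summary document with up and down counts.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Summary:
    up: int    # number of up results in the completed run
    down: int  # number of down results in the completed run

def next_alert_state(current: str, latest_summary: Optional[Summary]) -> str:
    """Decide the alert state from the most recent run's summary document.

    A missing summary means the run is incomplete (still in progress or
    delayed), so the current state is kept instead of resolving.
    """
    if latest_summary is None:
        return current
    if latest_summary.down > 0:
        return "triggered"
    return "resolved"

still_running = next_alert_state("triggered", None)
recovered = next_alert_state("triggered", Summary(up=1, down=0))
failed_again = next_alert_state("triggered", Summary(up=0, down=1))
```

The point of the design is that a missing document is treated as "unknown" and not as "up", so a long-running journey cannot resolve an alert simply by not having reported yet.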

dominiqueclarke added the discuss and Team:Uptime labels Dec 9, 2021

dominiqueclarke mentioned this issue Dec 15, 2021

[Spike] [Uptime] Investigate improving monitor status alert query to better support browser monitors elastic/kibana#121330

Closed
