[Discussion] Improve alerting for browser monitors #418

Open
dominiqueclarke opened this issue Dec 9, 2021 · 1 comment
Labels: discuss, Team:Uptime

Comments

@dominiqueclarke

Problem

Browser monitors differ enough from lightweight checks that they expose bugs in the alerting framework.

History

We have received a handful of SDHs for browser monitor alerts. Some issues include:

Fixes

Fixes have gone in to improve the experience, including:

Outstanding Issues

1. Gaps in monitor.timespan leading to flapping alerts

Browser monitors are more likely than lightweight checks to run longer than the scheduled interval. When this happens, the next check starts late, which can leave gaps between the monitor.timespan values of consecutive checks. During such a gap no check covers the current time, so an alert can flap between the triggered and resolved states even though the monitor never recovered.
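To make the gap concrete, here is a minimal Python sketch. It assumes monitor.timespan is a half-open range that starts when the check starts and lasts one schedule interval, and that a new check cannot start until the previous one finishes. The `Timespan` class, `find_gaps` helper, and the 60 s / 90 s numbers are hypothetical and only for illustration; they are not Heartbeat code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timespan:
    gte: float  # inclusive start, in seconds
    lt: float   # exclusive end, in seconds


def find_gaps(spans):
    """Return (start, end) ranges that no timespan covers between checks."""
    ordered = sorted(spans, key=lambda s: s.gte)
    if not ordered:
        return []
    gaps = []
    covered_until = ordered[0].lt
    for span in ordered[1:]:
        if span.gte > covered_until:
            gaps.append((covered_until, span.gte))
        covered_until = max(covered_until, span.lt)
    return gaps


# Hypothetical monitor: scheduled every 60 s, but each journey takes 90 s,
# so checks actually start at 0 s, 90 s and 180 s.
interval = 60
starts = [0, 90, 180]
spans = [Timespan(s, s + interval) for s in starts]

gaps = find_gaps(spans)
print(gaps)
```

Under these assumptions each check leaves a 30 s window that no timespan covers, which is where a rule that looks for a down check covering "now" would briefly see nothing.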

Potential solutions:

  • Update the monitor.timespan to factor in the length of the synthetic check: [Heartbeat] Append synthetic journey duration to monitor.timespan for browser monitors beats#29102
    Our current alerting rules rely on monitor.timespan: they look back in history to see whether there is a down check within the monitor.timespan range. By extending monitor.timespan to cover the greater of the time it takes to run the check and the time until the next scheduled check, we prevent gaps in the monitor timeline and can keep using the existing architecture to resolve this issue.
  • Explicitly look for up monitors
    Explicitly looking for up monitors has come up a few times as the most accurate way of telling whether a monitor is resolved. However, this logic adds significant complexity to the existing alerting architecture. A PR was constructed to achieve this goal, but it was later determined that querying by monitor.timespan would be a cleaner, less complex option for resolution. It's also important to note that we've discussed completely overhauling alerting in the past, which has contributed to the desire to avoid adding complexity to the existing design if possible. Example PR for this solution: [Uptime] [Alerts] update monitor status alerts to persist when documents are not found kibana#100339
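
The first option can be sketched roughly as follows. This is an illustrative Python model, not the Heartbeat or Kibana implementation: `timespan`, `has_down_check_at`, and the sample numbers are hypothetical names and values, and the rule is simplified to "is there a down check whose timespan covers the current time".

```python
def timespan(start, schedule_interval, journey_duration=0.0):
    """Half-open range covering the greater of the interval or the run time."""
    end = start + max(schedule_interval, journey_duration)
    return {"gte": start, "lt": end}


def has_down_check_at(checks, now):
    """Simplified rule: is any down check's timespan covering `now`?"""
    return any(
        c["status"] == "down"
        and c["timespan"]["gte"] <= now < c["timespan"]["lt"]
        for c in checks
    )


# Hypothetical monitor: 60 s schedule, 90 s journeys, always down.
interval, duration = 60, 90
starts = [0, 90, 180]

# Assumed current behaviour: timespan lasts one schedule interval.
current = [
    {"status": "down", "timespan": timespan(s, interval)} for s in starts
]
# Proposed behaviour: timespan also accounts for the journey duration.
proposed = [
    {"status": "down", "timespan": timespan(s, interval, duration)}
    for s in starts
]

# At t = 75 s the interval-only timespans leave a gap; the extended ones do not.
print(has_down_check_at(current, 75), has_down_check_at(proposed, 75))
```

In this model the rule itself is unchanged; only the range written with each check grows, which is why this option avoids adding complexity to the alerting side.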
@dominiqueclarke
Author

From the above potential solutions:

Additional suggestions:
