
[Spike] Investigate using alerting and RAC to materialize Synthetics Errors #135138

Closed · Tracked by #135156
dominiqueclarke opened this issue Jun 24, 2022 · 5 comments
Labels: Team:Uptime - DEPRECATED Synthetics & RUM sub-team of Application Observability

Comments

@dominiqueclarke (Contributor) commented Jun 24, 2022

As part of Synthetics 1.0, we're introducing a notion of errors, discrete events that span multiple down checks. An error starts at the first down check for a particular error type, and ends when a down check contains a different error, or when a monitor returns to up.

Errors contain a duration and map back to the number of down checks and the individual check groups that make up the error.
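
To make the lifecycle concrete, here is a minimal sketch of that grouping logic. The `Check` and `SyntheticsError` shapes and the `groupChecksIntoErrors` helper are hypothetical stand-ins for illustration only, not actual Synthetics types.

```ts
// Hypothetical shapes for illustration only; not the actual Synthetics types.
interface Check {
  checkGroup: string; // id of a single test run
  timestamp: string;
  status: 'up' | 'down';
  errorType?: string; // present when status is 'down'
}

interface SyntheticsError {
  errorType: string;
  startedAt: string;
  endedAt?: string; // unset while the error is still active
  checkGroups: string[]; // every down check the error spans
}

// An error starts at the first down check for a given error type and ends when
// a different error type appears or the monitor returns to up.
function groupChecksIntoErrors(checks: Check[]): SyntheticsError[] {
  const errors: SyntheticsError[] = [];
  let active: SyntheticsError | undefined;

  for (const check of checks) {
    if (check.status === 'up') {
      if (active) {
        active.endedAt = check.timestamp;
        active = undefined;
      }
    } else if (!active || active.errorType !== check.errorType) {
      if (active) {
        active.endedAt = check.timestamp;
      }
      active = {
        errorType: check.errorType ?? 'unknown',
        startedAt: check.timestamp,
        checkGroups: [check.checkGroup],
      };
      errors.push(active);
    } else {
      active.checkGroups.push(check.checkGroup);
    }
  }
  return errors;
}
```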

This notion of "Error" fits particularly well with the existing concept of an "alert" within alerting and RAC. Alerts already have a start time, a duration, and a resolution time. RAC documents for the same alert instance are updated in place rather than appended, and RAC lets alert creators specify arbitrary fields on those documents.

For this reason, using alerting and RAC out of the box may solve our Errors requirements. This spike intends to investigate the following topics:

  • Investigate auto-creating one alert for all monitors
    • Investigate creating this alert automatically at plugin start time (likely impossible due to permissions issues)
    • Investigate creating this alert automatically when monitor management is enabled (more reasonable)
  • Investigate reading RAC documents
    • Investigate reading RAC documents from other rules, in order to materialize alerts from "errors"/alerts
    • Investigate reading RAC documents from the UI, to populate views
  • Retention
    • Investigate the retention story for alerting. See if any custom retention logic may be desired for errors.
  • Investigate the data model (a rough field sketch follows after this list)
    • Should contain the count of down checks
    • Should contain the list of check groups that the error spans
    • config.id -> saved object id
    • monitor.id -> custom monitor id
    • Should contain started at, duration, and resolution time (should be provided by RAC)
  • Investigate making errors configurable via advanced settings or otherwise, so the rule can be turned off when problems occur
  • Investigate hiding this alert from the Observability RAC table as an "internal" alert
  • Investigate working with Kibana spaces
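
For the data-model item above, a rough sketch of the fields an error document might carry if materialized via RAC. The lifecycle fields would come from RAC itself, per the list above; all field names here are illustrative, not final.

```ts
// Illustrative field layout for a materialized error document. RAC supplies
// the alert lifecycle (start, duration, resolution); the rest would be custom
// fields written by a hypothetical Synthetics error rule.
interface SyntheticsErrorDocument {
  // Lifecycle fields provided by RAC / alerts-as-data:
  startedAt: string;
  durationUs: number;
  resolvedAt?: string; // unset while the error is still active

  // Custom fields written by the rule:
  'config.id': string;  // saved object id of the monitor configuration
  'monitor.id': string; // custom monitor id
  'error.type': string; // error type this error is scoped to
  downCheckCount: number; // number of down checks the error spans
  checkGroups: string[];  // individual check groups comprising the error
}
```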
@dominiqueclarke dominiqueclarke added the Team:Uptime - DEPRECATED Synthetics & RUM sub-team of Application Observability label Jun 24, 2022
@elasticmachine (Contributor)

Pinging @elastic/uptime (Team:uptime)

@dominiqueclarke (Contributor, Author) commented Jun 29, 2022

TLDR: Alerting is not recommended for Synthetics errors

I've investigated using alerting to materialize error states. While this strategy originally looked promising, there are a few showstopping drawbacks.

Reliability of task manager

The logic for using alerting for errors requires the rule executor, run every 1 minute, to fetch the latest document for each monitor/location combination on each rule execution. If the latest document is down, the executor either (a) creates an alert, hereafter referred to as an error, or (b) persists the error if it is already active. The error is tied to an individual monitor/location/error combination and resolves when either (a) a new error type is encountered or (b) the monitor reports an up status.
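
A simplified sketch of that executor loop, assuming hypothetical helpers (`getLatestCheckPerMonitorLocation`, `getActiveErrorType`, `reportError`, and `resolveError` are stand-ins, not the real alerting APIs):

```ts
// Hypothetical shape of the latest check per monitor/location combination.
interface LatestCheck {
  monitorId: string;
  location: string;
  checkGroup: string;
  status: 'up' | 'down';
  errorType?: string;
}

// Simplified stand-in for the rule executor described above, run every minute.
async function errorRuleExecutor(services: {
  getLatestCheckPerMonitorLocation: () => Promise<LatestCheck[]>;
  getActiveErrorType: (instanceId: string) => string | undefined;
  reportError: (instanceId: string, fields: Record<string, unknown>) => void;
  resolveError: (instanceId: string) => void;
}) {
  const latestChecks = await services.getLatestCheckPerMonitorLocation();

  for (const check of latestChecks) {
    // One error instance per monitor/location combination.
    const instanceId = `${check.monitorId}-${check.location}`;
    const activeErrorType = services.getActiveErrorType(instanceId);

    if (check.status === 'down') {
      if (activeErrorType && activeErrorType !== check.errorType) {
        // A different error type started: resolve the old error before opening a new one.
        services.resolveError(instanceId);
      }
      // Create the error, or persist it if it is already active.
      services.reportError(instanceId, {
        'error.type': check.errorType,
        checkGroup: check.checkGroup,
      });
    } else if (activeErrorType) {
      // Monitor reported up: resolve the active error.
      services.resolveError(instanceId);
    }
  }
}
```

Note that this only ever sees the single latest check per monitor/location, which is exactly why the reliability concern below matters: any delayed or skipped execution can miss a transition entirely.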

This logic requires task manager to run perfectly in order to always catch the most recent test run for every monitor. In reality, task manager delays or pauses can cause documents to be missed, which leads to inaccurate data, such as errors persisting too long because an up check was missed.

Support of the alerting team

Our desire was to use alerting "under the hood". This means we'd want to hide this rule, and any alerts generated from it, from our users. Without hiding it, the error rule could be deleted at any time, and users would start seeing error alerts populated in the Observability alerts table. Unfortunately, the alerting framework currently has no concept of a hidden, system, or internal alert. We could develop this, but this type of use case isn't recommended by the alerting team and isn't likely to be supported or encouraged by them.

After discussing with @andrewvc, we agreed that the best way to have confidence in accurate data is to keep it as close to Heartbeat as possible. We renewed discussion of a Heartbeat-based approach, and @andrewvc will write up issues or spikes to move forward there.

@andrewvc (Contributor)

I've created a new issue to track the new proposed approach here: elastic/beats#32163

I propose we close this issue in favor of that one after discussing it

@dominiqueclarke (Contributor, Author)

I've caught up with Brandon Kobel about why alerting (and Kibana task manager) isn't recommended from their perspective. Here's why:

It's slow

Task-Manager runs in Kibana, and using it to perform ETL will require at least one HTTP request to Elasticsearch to read the data, and one HTTP request to Elasticsearch to write the data. This latency can be rather significant. Additionally, most ETL operations will end up requiring many HTTP requests to and from Elasticsearch.

It doesn't run tasks at precise intervals

Task-Manager will only run 10 tasks at a time per Kibana node. It does so to minimize the impact on other Kibana operations. As a result, tasks are very commonly "delayed"; the system accepts this fact and makes no effort to run tasks at precise intervals. Additionally, when Kibana has downtime, for example due to upgrades, there are large gaps between when tasks run. This makes it very difficult to use for ETL use cases, and task runners must take this into consideration when designing how data is read from and written to Elasticsearch.

There's a high potential for it to affect the health of other Kibana operations

ETL operations generally require many HTTP requests to retrieve data from Elasticsearch, which is then processed in Kibana before executing many HTTP requests to insert the data back into Elasticsearch. Each of these steps has a high likelihood of consuming a large amount of CPU and memory, which can block other HTTP requests and background tasks from completing their operations in a reasonable amount of time. This is further exacerbated by the fact that Kibana is written in Node.js, where all JavaScript executes on a single thread, so it's very hard to write ETL code in a way that doesn't end up blocking that thread.

In particular, this stood out to me

Additionally, when Kibana has downtime, for example due to upgrades, there are large gaps between when tasks run.

To account for this, we could attempt to write logic that performs "gap remediation": keep track of the last check group seen, and query for all the check groups from that point in time to now. However, during that window, multiple errors could have occurred and been resolved. There is no precedent for logic that accounts for "gap remediation" in this way, both creating an alert and auto-resolving it at the same time. We could work through that, but as mentioned, alerting/task manager may not be the right solution to start with, for all the reasons above.
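
To make the idea concrete, here is roughly what the read side of gap remediation might look like, using the Elasticsearch JS client. The index pattern, `monitor.id` filter, and `lastSeenTimestamp` bookkeeping are assumptions for illustration; the harder, unsolved part is replaying multiple open-and-resolve error cycles against the alerting framework afterward.

```ts
import type { Client } from '@elastic/elasticsearch';

// Fetch every check between the last processed check group and now, so the
// rule could replay what it missed while Kibana / task manager was down.
// Index name and timestamp bookkeeping are illustrative only.
async function fetchMissedChecks(
  esClient: Client,
  monitorId: string,
  lastSeenTimestamp: string
) {
  const response = await esClient.search({
    index: 'synthetics-*',
    size: 10000,
    sort: [{ '@timestamp': 'asc' }],
    query: {
      bool: {
        filter: [
          { term: { 'monitor.id': monitorId } },
          { range: { '@timestamp': { gt: lastSeenTimestamp } } },
        ],
      },
    },
  });
  return response.hits.hits;
}
```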

@dominiqueclarke (Contributor, Author)

Closed in favor of elastic/beats#32163
