[Spike] Investigate using alerting and RAC to materialize Synthetics Errors #135138
Comments
Pinging @elastic/uptime (Team:uptime)
TL;DR: Alerting is not recommended for Synthetics errors

I've investigated using alerting to materialize error states. While this strategy originally looked promising, there are a few showstopping drawbacks.

Reliability of task manager

The logic for using alerting for errors required the rule executor, run every 1m, to fetch the last document for each monitor/location combination on each rule execution. If the last document is down, it either a) creates an alert, hereafter referred to as an error, or b) persists the error if one is already active. The error was tied to an individual monitor/location/error combination and would resolve when either a) a new error type was encountered or b) the monitor reported an up status.

This logic requires task manager to run perfectly in order to always catch the most recent test run for every monitor. In reality, task manager delays or pauses can cause documents to be missed, which leads to inaccurate data, such as errors persisting too long because an up check was never seen.

Support of the alerting team

Our desire was to use alerting "under the hood", meaning we'd want to hide this rule and any alerts generated from it from our users. Without hiding it, the error rule could be deleted at any time, and users would start seeing error alerts in the Observability alerts table. Unfortunately, the alerting team does not currently have a concept of a hidden system or internal alert. We could develop this, but this type of use case isn't recommended by the alerting team and isn't likely to be supported or encouraged by them.

After discussing with @andrewvc, we agreed that the best way to have confidence in accurate data is to keep it as close to Heartbeat as possible. We renewed discussion of a Heartbeat-based approach, and @andrewvc will write up issues or spikes to move forward there.
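For reference, a minimal sketch of the executor logic described above, under the assumption of an illustrative `alertsClient` and `fetchLatestPing` helper (these names and shapes are made up for this sketch, not the actual Kibana rule registry API):

```ts
// Illustrative sketch only: Ping, fetchLatestPing, and alertsClient are hypothetical stand-ins.
interface Ping {
  monitorId: string;
  location: string;
  status: 'up' | 'down';
  errorType?: string;
  checkGroup: string;
}

async function runErrorRuleExecutor(
  monitors: Array<{ monitorId: string; location: string }>,
  fetchLatestPing: (monitorId: string, location: string) => Promise<Ping | undefined>,
  alertsClient: {
    getActiveError: (key: string) => { errorType?: string } | undefined;
    create: (key: string, fields: Record<string, unknown>) => void;
    persist: (key: string) => void;
    resolve: (key: string) => void;
  }
) {
  for (const { monitorId, location } of monitors) {
    const ping = await fetchLatestPing(monitorId, location);
    if (!ping) continue;

    const key = `${monitorId}-${location}`;
    const active = alertsClient.getActiveError(key);

    if (ping.status === 'up') {
      // An up check resolves any active error for this monitor/location.
      if (active) alertsClient.resolve(key);
    } else if (active && active.errorType === ping.errorType) {
      // Same error type: keep the existing error alive.
      alertsClient.persist(key);
    } else {
      // New error type (or no active error): resolve the old error and open a new one.
      if (active) alertsClient.resolve(key);
      alertsClient.create(key, { errorType: ping.errorType, firstCheckGroup: ping.checkGroup });
    }
  }
}
```

Note that this only ever looks at the latest document, which is exactly why a missed task run silently drops state transitions.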
I've created a new issue to track the new proposed approach here: elastic/beats#32163. I propose we close this issue in favor of that one after discussing it.
I've caught up with Brandon Kobel about why alerting (and Kibana task manager) isn't recommended from their perspective. Here's why:

It's slow

Task Manager runs in Kibana, and using it to perform ETL requires at least one HTTP request to Elasticsearch to read the data and one HTTP request to Elasticsearch to write the data. This latency can be rather significant, and most ETL operations will end up requiring many HTTP requests to and from Elasticsearch.

It doesn't run tasks at precise intervals

Task Manager will only run 10 tasks at a time per Kibana node. It does so to minimize the impact on other Kibana operations. As a result, tasks are very commonly delayed; the system accepts this fact and makes no effort to run tasks at precise intervals. Additionally, when Kibana has downtime, for example during upgrades, there are large gaps between task runs. This makes it very difficult to use for ETL use cases, and task runners must take this into consideration when designing how data is read from and written to Elasticsearch.

There's a high potential for it to affect the health of other Kibana operations

ETL operations generally require many HTTP requests to retrieve data from Elasticsearch, which is then processed in Kibana before executing many HTTP requests to insert the data back into Elasticsearch. Each of these steps has a high likelihood of consuming a large amount of CPU and memory, which can block other HTTP requests and background tasks from completing in a reasonable amount of time. This is further exacerbated by the fact that Kibana is written in Node.js, where all JavaScript executes on a single thread, so it's very hard to write ETL code in a way that doesn't end up blocking that thread.

In particular, the point about imprecise intervals and gaps between task runs stood out to me.

To account for this, we could attempt to write logic that performs "gap remediation": keep track of the last check group seen and query for all the check groups from that point up to now. However, during that window, multiple errors could have occurred and been resolved. There is no precedent for logic that performs "gap remediation" in this way, both creating an alert and auto-resolving it in the same execution. We could work through that, but as mentioned, alerting/task manager may not be the right solution to start with, for all the reasons above.
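To make the "gap remediation" problem concrete, here is a rough sketch of what replaying missed checks would involve (the document shape and helper names are assumptions for illustration, not existing code). The executor would have to reconstruct every error open/close transition that happened inside the gap, not just the latest state:

```ts
// Illustrative only: Ping and ErrorEvent shapes are hypothetical.
interface Ping {
  status: 'up' | 'down';
  errorType?: string;
  checkGroup: string;
  timestamp: string;
}

interface ErrorEvent {
  errorType: string;
  startedAt: string;
  endedAt?: string;
  checkGroups: string[];
}

// Replay all checks seen since the last processed check group and rebuild the
// error timeline. A single rule execution can both open and close errors here,
// which is exactly the pattern the alerting framework has no precedent for.
function remediateGap(pingsSinceLastSeen: Ping[]): ErrorEvent[] {
  const events: ErrorEvent[] = [];
  let current: ErrorEvent | undefined;

  for (const ping of pingsSinceLastSeen) {
    const isNewError =
      ping.status === 'down' && (!current || current.errorType !== ping.errorType);

    if (ping.status === 'up' || isNewError) {
      // Close the currently open error, if any.
      if (current) {
        current.endedAt = ping.timestamp;
        current = undefined;
      }
    }

    if (isNewError && ping.errorType) {
      current = { errorType: ping.errorType, startedAt: ping.timestamp, checkGroups: [] };
      events.push(current);
    }

    if (current && ping.status === 'down') {
      current.checkGroups.push(ping.checkGroup);
    }
  }

  return events;
}
```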
Closed in favor of elastic/beats#32163.
As part of Synthetics 1.0, we're introducing a notion of errors: discrete events that span multiple down checks. An error starts at the first down check for a particular error type, and ends when a down check contains a different error or when the monitor returns to up.
Errors contain a duration and are mapped back to the number of down checks and the individual check groups the error is composed of.
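To make the definition concrete, here is a small illustrative example (the shapes below are made up for illustration, not the actual Heartbeat/Synthetics document schema):

```ts
// Hypothetical check sequence for one monitor/location, oldest first.
const checks = [
  { checkGroup: 'cg-1', status: 'down', errorType: 'io' },
  { checkGroup: 'cg-2', status: 'down', errorType: 'io' },
  { checkGroup: 'cg-3', status: 'down', errorType: 'tcp' }, // a different error type ends the first error
  { checkGroup: 'cg-4', status: 'up' },                     // returning to up ends the second error
];

// The errors materialized from this sequence would be:
const errors = [
  { errorType: 'io', downChecks: 2, checkGroups: ['cg-1', 'cg-2'] },
  { errorType: 'tcp', downChecks: 1, checkGroups: ['cg-3'] },
];
```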
This notion of "Error" fits particularly well with the existing concept of "alert" within alerting and RAC. Alerts already contain a notion of a start time, a duration, and a resolution time. RAC documents for the same alert instance are edited rather than appended. Also, RAC enables alert creators to specify arbitrary fields on RAC documents.
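As a rough illustration of the overlap, a Synthetics error could in principle be expressed as an alert-as-data document along these lines. The field names below are an assumption based on the common `kibana.alert.*` namespace plus hypothetical Synthetics-specific fields, not a confirmed mapping:

```ts
// Hypothetical shape only: shows how error attributes could map onto an
// alert-as-data document; exact field names would need to be confirmed.
const exampleErrorAsAlert = {
  'kibana.alert.rule.rule_type_id': 'synthetics.internal.errors', // hypothetical hidden rule type
  'kibana.alert.status': 'active',
  'kibana.alert.start': '2022-06-24T10:00:00.000Z',
  'kibana.alert.duration.us': 120_000_000,
  // Arbitrary, rule-type-specific fields that RAC would let us attach:
  'monitor.id': 'my-monitor',
  'observer.geo.name': 'us-east-1',
  'error.type': 'io',
  'synthetics.error.down_check_count': 2,
  'synthetics.error.check_groups': ['cg-1', 'cg-2'],
};
```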
For this reason, using alerting and RAC out of the box may solve our Errors requirements. This spike intends to investigate the following topics: