[RAC][Alerts as Data][TTL: 1w] Managing Bulk Mutable Alert Operations #96368

Open
spong opened this issue Apr 7, 2021 · 6 comments
Labels
discuss Team:Detections and Resp Security Detection Response Team Team:Observability Team label for Observability Team (for things that are handled across all of observability) Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Theme: rac label obsolete

Comments

@spong
Member

spong commented Apr 7, 2021

A dedicated breakout from the [RAC] Alerts as Data Schema Definition issue for discussion around the management of bulk mutable alert operations, including storage, ILM, data tiers, and update operations.

With the storage of workflow fields (status/assignee/etc.) on alert documents, certain constraints around updating these fields come into play with regard to the lifecycle of the alert document itself. For instance, if ILM is configured (currently 90d/50gb, all hot nodes), alerts may no longer be updatable (or may suffer degraded performance for bulk operations) once moved to certain data tiers. E.g. the frozen tier cannot be updated, the cold tier is not normally updated, and the warm tier is intended to be updated only rarely. So if ILM + data tiers are configured, alerts will need to be completely aged out before going into the frozen tier, and may need certain update restrictions once moved to the warm/cold tiers.
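
For reference, a rough sketch of what that ILM setup looks like, written with the `@elastic/elasticsearch` TypeScript client (v8-style API). The policy name and index pattern are hypothetical; only the 90d/50gb figures come from the current config mentioned above:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Illustrative policy: rollover in the hot tier at 50gb or 90d. Adding
// warm/cold/frozen phases here is exactly what would make previously written
// alert documents hard (warm/cold) or impossible (frozen) to update later.
await client.ilm.putLifecycle({
  name: 'alerts-default-ilm-policy', // hypothetical policy name
  policy: {
    phases: {
      hot: {
        actions: {
          rollover: { max_size: '50gb', max_age: '90d' },
        },
      },
    },
  },
});
```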

Additionally, as @dgieselaar raised in the main schema conversation, there may be some caveats with updating data once it has been rolled over and is no longer in the write_index. @benwtrent confirmed in this morning's sync that this should not be an issue (update_by_query is a two-part operation, performing the updates in the index where the document already lives even if it is not the write_index, whereas all new documents are directed to the current `write_index`). That aligns with the behavior we've seen within the Security Solution, but it's definitely something that requires more testing to iron out the intricacies and any potential performance implications.
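
To make that concrete, here's roughly what a bulk status update looks like as an update_by_query, sketched with the `@elastic/elasticsearch` TypeScript client (v8-style API). The alias, field name, and ids are illustrative, not the final schema:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });
const selectedAlertIds = ['alert-id-1', 'alert-id-2']; // ids picked in the UI (illustrative)

// update_by_query is search-then-update: each matching document is updated in
// whichever backing index it currently lives in (even after rollover), while
// brand-new documents keep going to the alias's current write index.
const result = await client.updateByQuery({
  index: '.alerts-security.alerts-default', // alias over all backing indices (illustrative)
  conflicts: 'proceed', // don't abort the whole batch on a single version conflict
  query: { ids: { values: selectedAlertIds } },
  script: {
    source: "ctx._source['kibana.alert.workflow_status'] = params.status",
    params: { status: 'closed' },
  },
  refresh: true,
});

console.log(`updated: ${result.updated}, version conflicts: ${result.version_conflicts}`);
```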

Lastly come any potential restrictions we may need to impose because of the performance implications of the above. Touchpoints where this would present itself are bulk modifications of a large number of alerts (tens of thousands), whether via Select all on all pages plus a bulk action in the Alerts Table, via Close all matching alerts when adding exceptions, or via the API directly.
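
For the Close all matching alerts case in particular, the shape of the operation is something like the sketch below (same assumptions as above; for large batches it's probably worth running as a task via `wait_for_completion: false` rather than holding the HTTP request open):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Hypothetical "close everything matching the new exception" operation. This
// can hit tens of thousands of documents spread across several backing
// indices (and potentially data tiers), which is where the performance and
// race-condition questions below come in.
const task = await client.updateByQuery({
  index: '.alerts-security.alerts-default', // illustrative alias
  conflicts: 'proceed',
  query: {
    bool: {
      filter: [
        { term: { 'host.name': 'excepted-host-01' } },        // the exception's host
        { term: { 'kibana.alert.workflow_status': 'open' } }, // illustrative field
      ],
    },
  },
  script: {
    source: "ctx._source['kibana.alert.workflow_status'] = 'closed'",
  },
  wait_for_completion: false, // returns a task id we can poll instead of blocking
});

console.log(`task: ${task.task}`); // check progress via the tasks API
```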

With that, I think the remaining open questions are:

  • If using ILM, would we want to restrict mutable alerts from aging out to different data tiers?
  • Do alerts have an active shelf-life? i.e. is an alert forever actionable, able to be assigned and have its status updated regardless of its age, or do we want to set a max lifespan after which alerts are aged out and no longer actionable (auto-closed)?
  • What are the performance implications of updating tens of thousands of alerts that are not in the current write_index, or are in other data tiers? Are there any restrictions we need to add to the UI/API to accommodate these implications?
  • What about race conditions around long-running update operations? E.g. a user adds an exception for a specific host and closes 20k+ matching alerts, then another user comes along and modifies those underlying alerts (changes state, adds them to a case, etc.). What is the desired outcome?

Adding a TTL of one week so we can discuss here and hopefully finalize our decision in next week's sync. 🙂

@spong spong added discuss Team:Observability Team label for Observability Team (for things that are handled across all of observability) Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Theme: rac label obsolete labels Apr 7, 2021
@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@tsg
Contributor

tsg commented Apr 7, 2021

Thanks, @spong, for raising these in a ticket. Taking a shot at the questions:

If using ILM, would we want to restrict mutable alerts from aging out to different data tiers?
Do alerts have an active shelf-life? i.e. is an alert forever actionable, able to be assigned and have its status updated regardless of its age, or do we want to set a max lifespan after which alerts are aged out and no longer actionable (auto-closed)?

As mentioned in the call, I'm generally in favour of imposing some sort of TTL on open alerts, so they get automatically closed by default after a given period of time (e.g. 7d or 30d). I don't think there are use cases for keeping alerts open that long, but I could be wrong; I'd like to hear them if there are. The advantage of adding the TTL is that it would potentially remove a class of corner cases for us, for a relatively small price in functionality.
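
To illustrate what I mean, a minimal sketch of such a TTL sweep as a periodic background job (the v8 `@elastic/elasticsearch` client, index pattern, field names, and the 30d default are all illustrative assumptions, not a decided design):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Hypothetical TTL sweep: auto-close anything still open after `ttl`.
// Would presumably run on an interval from a background task.
async function closeExpiredAlerts(ttl = '30d') {
  return client.updateByQuery({
    index: '.alerts-*', // illustrative pattern
    conflicts: 'proceed',
    query: {
      bool: {
        filter: [
          { term: { 'kibana.alert.workflow_status': 'open' } }, // illustrative field
          { range: { '@timestamp': { lte: `now-${ttl}` } } },
        ],
      },
    },
    script: {
      source: "ctx._source['kibana.alert.workflow_status'] = 'closed'",
    },
  });
}
```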

What are the performance implications of updating tens of thousands of alerts that are not in the current write_index, or are in other data tiers? Are there any restrictions we need to add to the UI/API to accommodate these implications?

I think we'll want to allow updating alerts that are not in the latest ILM index, because even if we do the TTL idea from above, we might still want to be able to trigger roll-overs on upgrade. So my suggestion would then be to store the index name together with the ID, so that when we update we know exactly where to find the alert. There might be complications if the index names/structure somehow change under us (e.g. when migrating via snapshot/restore).
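
Roughly, if we keep {index, id} pairs around (e.g. in rule task state), the bulk update can be addressed directly to the backing index that holds each document. A sketch, again assuming the v8 `@elastic/elasticsearch` client and illustrative index/field names:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

interface StoredAlertRef {
  index: string; // concrete backing index, e.g. '.alerts-observability.apm-default-000002'
  id: string;
}

// Targeted bulk update: because we stored the concrete index name with the id,
// rollovers don't matter — each update goes straight to the right backing index.
async function closeAlerts(refs: StoredAlertRef[]) {
  return client.bulk({
    refresh: true,
    operations: refs.flatMap((ref) => [
      { update: { _index: ref.index, _id: ref.id } },
      { doc: { 'kibana.alert.workflow_status': 'closed' } }, // illustrative field
    ]),
  });
}
```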

Alternatively, I think we could do the update with just the ID against the index glob (alerts-observability-*) instead of the write alias. That would mean we let ES do the work of finding the document by ID.
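
Something like the following sketch, where a search against the glob resolves the concrete backing index for the id and the update is then applied there (same assumptions and illustrative names as above):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Let Elasticsearch find the document: search the glob by _id to resolve the
// backing index, then apply a targeted update in that index.
async function closeAlertById(alertId: string) {
  const found = await client.search({
    index: 'alerts-observability-*', // index glob instead of the write alias
    query: { ids: { values: [alertId] } },
    _source: false,
  });

  const hit = found.hits.hits[0];
  if (!hit) return; // alert not found (deleted or aged out)

  await client.update({
    index: hit._index,
    id: alertId,
    doc: { 'kibana.alert.workflow_status': 'closed' }, // illustrative field
  });
}
```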

What about race conditions around long-running update operations? E.g. a user adds an exception for a specific host and closes 20k+ matching alerts, then another user comes along and modifies those underlying alerts (changes state, adds them to a case, etc.). What is the desired outcome?

I'd say this depends on the details of the operations involved. It is possible that in some cases the operation fails due to a conflict, because of the optimistic locking that Elasticsearch does on updates. We need to treat the error handling of conflicts with care, but I think we have enough tools at our disposal to handle these cases, either by retrying or by surfacing the conflict error to the user.
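
For illustration, the retry path could look roughly like this (v8 `@elastic/elasticsearch` client, illustrative field name): re-read the document, update it with if_seq_no / if_primary_term, and retry or surface a 409 conflict.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Optimistic-concurrency update: if someone else modified the alert between
// our read and write, Elasticsearch returns a 409; retry a few times, then
// surface the conflict to the user.
async function updateAlertStatus(index: string, id: string, status: string, retries = 3) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const current = await client.get({ index, id });
    try {
      return await client.update({
        index,
        id,
        doc: { 'kibana.alert.workflow_status': status }, // illustrative field
        if_seq_no: current._seq_no,
        if_primary_term: current._primary_term,
      });
    } catch (err: any) {
      const isConflict = err?.meta?.statusCode === 409;
      if (isConflict && attempt < retries) continue; // someone beat us to it — re-read and retry
      throw err; // exhausted retries (or a different error): surface it
    }
  }
}
```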

@dgieselaar
Member

dgieselaar commented Apr 7, 2021

Some thoughts:

  • Observability is not considering workflow on alerts at the moment, in the sense that we have not made a decision there, so for the sake of this discussion I am going to leave that out of scope for now.

  • What class of corner cases does a TTL solve? And how do we manage the TTL? Do we run a background job that closes alerts? How does the background job deal with (for instance) alerts restored from a snapshot?

  • If we store the index name in the rule task state, what would happen if we for instance reindex alerts in a migration process? Or a rule is imported? Or a rule is moved from one space to another (not sure if that's possible today)? What scenarios should we consider?

  • There are several ways to update a document. You can index it with an explicit id, and it will overwrite the document in Elasticsearch. You can also do a partial update, a scripted update, or a delete by query. The former requires us to know the index + id up front, which might mean that we need to read every alert that we potentially want to update if we can't store it in the task state. The latter options require Elasticsearch to first read the current content of the document and then store it as an update. None of these options will be fast, I think. Do we have somewhat of an understanding of how these (either read + index, or update) compare in terms of performance to, for instance, a "blind" index? (See the sketch after this list.)

  • Is the issue that many rules are doing small bulk requests? Should a client buffer operations from multiple rules and batch those updates? Or is there a requirement from the rule that these operations should be completed before the rule execution ends? I assume the latter would be the case if we want to store the index in the rule task state.
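
To make the comparison in the update-options bullet a bit more concrete, the three styles side by side (a sketch assuming the v8 `@elastic/elasticsearch` client and illustrative index/field names):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });
const index = '.alerts-observability.apm-default-000001'; // illustrative
const id = 'alert-id-1';

// 1) "Blind" index: overwrite the whole document. No read on the ES side, but
//    we must already hold the full, current _source (or accept losing fields).
await client.index({
  index,
  id,
  document: { 'kibana.alert.workflow_status': 'closed' /* ...plus the rest of the doc... */ },
});

// 2) Partial update: Elasticsearch reads the current document, merges in the
//    partial doc, and re-indexes the result.
await client.update({
  index,
  id,
  doc: { 'kibana.alert.workflow_status': 'closed' },
});

// 3) Scripted update: the same read-modify-write on the Elasticsearch side,
//    with the change expressed as a Painless script.
await client.update({
  index,
  id,
  script: {
    source: "ctx._source['kibana.alert.workflow_status'] = params.status",
    params: { status: 'closed' },
  },
});
```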

@dsadosky

dsadosky commented Apr 7, 2021

In my experience, the TTL should be optional and up to the customer for each rule type. There are many cases where alerts will need to stay open for longer than the suggested 7 days/30 days. For complex issues, triage and root cause can sometimes take weeks, and if an alert is tied to a case or an Incident in tools like ServiceNow, we should not be aging out/closing events arbitrarily. With ILM/rolling off data, we may need to think outside the current constraints of ILM. Active alerts (anything open and/or attached to an open case/incident) should stay in HOT so that they can be updated. Anything closed could move to warm, cold, or frozen, and IMO could be moved to different tiers more quickly than ILM would today (status-based versus age/size). In most of my deployments, HOT is the only writable tier; everything else has been marked read-only and force merged.

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022