[RAC][Alerts as Data][TTL: 1w] Managing Bulk Mutable Alert Operations #96368
Pinging @elastic/security-detections-response (Team:Detections and Resp)
Pinging @elastic/security-solution (Team: SecuritySolution)
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
Thanks @spong for raising these in a ticket. Taking a shot at the questions:

As mentioned in the call, I'm generally in favour of imposing some sort of TTL on open alerts, so they get automatically closed by default after a given period of time (e.g. 7d or 30d). I don't think there are use cases for keeping alerts open that long, but I could be wrong; I'd like to hear them if there are any. The advantage of adding the TTL is that it would potentially remove a class of corner cases for us, for a relatively small price in functionality.
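As an illustration only, a minimal sketch of what such a TTL sweep could look like, assuming the newer (v8-style) `@elastic/elasticsearch` JS client, a hypothetical `.alerts-*` index pattern, and a hypothetical `workflow_status` field (none of these are the finalized RAC schema):

```ts
import { Client } from '@elastic/elasticsearch';

// Illustrative assumptions: index pattern, field name, and the 30d window
// are placeholders, not the finalized RAC schema or policy.
const ALERTS_INDEX_PATTERN = '.alerts-*';
const TTL = '30d';

const client = new Client({ node: 'http://localhost:9200' });

// Close any alert that is still "open" but older than the TTL.
// `conflicts: 'proceed'` skips documents updated concurrently instead of
// failing the whole request.
async function closeExpiredAlerts() {
  return client.updateByQuery({
    index: ALERTS_INDEX_PATTERN,
    conflicts: 'proceed',
    query: {
      bool: {
        filter: [
          { term: { workflow_status: 'open' } },
          { range: { '@timestamp': { lte: `now-${TTL}` } } },
        ],
      },
    },
    script: {
      lang: 'painless',
      source: "ctx._source.workflow_status = 'closed'",
    },
  });
}
```

In practice this would presumably run as a periodic background task rather than an ad-hoc call.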
I think we'll want to allow updating alerts that are not in the latest ILM index, because even if we do the TTL idea from above, we might still want to be able to trigger rollovers on upgrade. So my suggestion would then be to store the index name together with the ID, so when we update we know exactly where to find the alert. There might be complications if the index names/structure somehow change under us (e.g. when migrating via snapshot/restore). Alternatively, I think we could do the update with just the ID against the index glob (…)
I'd say this depends on the details of the operations involved. It is possible that in some cases the operation fails due to a conflict, because of the optimistic locking that Elasticsearch does on updates. We need to treat the error handling of conflicts with care, but I think we have enough tools at our disposal to handle these cases, either by retrying or by surfacing the conflict error to the user.
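A rough sketch of how the two suggestions above could combine (storing the concrete backing index alongside the ID, plus conflict retries); again the field names, index names, and v8-style client shape are assumptions for illustration:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Hypothetical reference shape: the concrete backing index is captured at
// search time (e.g. from the `_index` of the hit) and stored next to the `_id`.
interface AlertRef {
  id: string;
  index: string; // e.g. a concrete backing index, not the alias/glob
}

// Update a single alert's workflow status against its exact backing index.
// `retry_on_conflict` lets Elasticsearch retry the optimistic-concurrency
// conflict internally before surfacing an error to the caller.
async function setAlertStatus(ref: AlertRef, status: 'open' | 'closed') {
  return client.update({
    index: ref.index,
    id: ref.id,
    retry_on_conflict: 3,
    doc: { workflow_status: status },
  });
}

// Alternative mentioned above: update by _id against the index glob, letting
// update_by_query locate the document in whichever backing index holds it.
async function setAlertStatusByGlob(id: string, status: 'open' | 'closed') {
  return client.updateByQuery({
    index: '.alerts-*',
    conflicts: 'proceed',
    query: { ids: { values: [id] } },
    script: {
      lang: 'painless',
      source: 'ctx._source.workflow_status = params.status',
      params: { status },
    },
  });
}
```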
Some thoughts:
In my experience, the TTL should be optional and up to the customer for each rule type. There are many cases where alerts will need to stay open for longer than the suggested 7 or 30 days. For complex issues, triage and root cause analysis can sometimes take weeks, and if an alert is tied to a case or an incident in tools like ServiceNow, we should not be aging out or closing events arbitrarily. With ILM and rolling off data, we may need to think outside the current constraints of ILM. Active alerts (anything open and/or attached to an open case/incident) should stay in hot so they can be updated. Anything closed could move to warm, cold, or frozen and, IMO, could be moved to those tiers quicker than ILM would today (status-based rather than age/size-based). In most of my deployments, hot is the only writable tier; everything else has been marked read-only and force merged.
A dedicated breakout from the [RAC] Alerts as Data Schema Definition issue for discussion around the management of bulk mutable alert operations, including storage, ILM, data tiers, and update operations.
With the storage of workflow fields (status/assignee/etc.) on alert documents, certain constraints around updating these fields come into play with regard to the lifecycle of the alert document itself. For instance, if ILM is configured (currently 90d/50gb, all hot nodes), alerts may no longer be updatable (or bulk operations on them may see degraded performance) once moved to certain data tiers. E.g. the frozen tier cannot be updated, the cold tier is not normally updated, and the warm tier is intended to be rarely updated. So if ILM + data tiers are configured, alerts will need to be completely aged out before going into the frozen tier, and perhaps have certain update restrictions once moved to the warm/cold tiers.
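For reference, a sketch of roughly what such a policy could look like if expressed through the JS client. The 90d/50gb hot-phase rollover mirrors the numbers above; the policy name, the warm phase, and the read-only/force-merge actions (echoing the practice described in the earlier comment) are assumptions added to show where update restrictions would start to apply, and the call shape assumes the v8 client's `ilm.putLifecycle`:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Hypothetical policy name; rollover thresholds mirror the 90d / 50gb defaults above.
// Once documents age into warm (and especially cold/frozen), updates become
// discouraged or impossible, which is where the constraints in this issue kick in.
async function putAlertsIlmPolicy() {
  return client.ilm.putLifecycle({
    name: 'alerts-default-ilm-policy',
    policy: {
      phases: {
        hot: {
          actions: {
            rollover: { max_age: '90d', max_size: '50gb' },
          },
        },
        warm: {
          min_age: '90d',
          actions: {
            readonly: {},
            forcemerge: { max_num_segments: 1 },
          },
        },
      },
    },
  });
}
```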
Additionally, as @dgieselaar raised in the main schema conversation, there may be some caveats with updating data once it has been rolled over and is no longer in the `write_index`. @benwtrent confirmed in this morning's sync that this should not be an issue (`update_by_query` being a two-part operation, performing the updates in the existing index even if it is not the `write_index`, whereas all new documents are directed to the current `write_index`), aligning with the behavior we've seen within the Security Solution, but it's definitely something requiring more testing to iron out the intricacies and any potential performance implications.

Lastly, there are the potential restrictions we may need to impose with regard to the performance implications of the above. Touchpoints where this would present itself are the bulk modification of a large number of alerts (tens of thousands), either via `Select all on all pages` and a bulk action in the Alerts Table, the `Close all matching alerts` option when adding exceptions, or via the API directly.
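As a concrete illustration of the API-level shape of such a bulk operation, here is a sketch of closing a large selection of alerts by ID via `update_by_query`. The index pattern, field name, chunk size, and v8-style client shape are assumptions, and real code would need throttling and task management around it:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Close a potentially large set of alerts selected in the UI. For tens of
// thousands of IDs the list is chunked; `conflicts: 'proceed'` and
// `wait_for_completion: false` keep a single conflicting or slow document
// from failing or blocking the whole operation.
async function bulkCloseAlerts(alertIds: string[]) {
  const CHUNK_SIZE = 1000; // illustrative; the ids query has practical size limits
  const tasks: string[] = [];

  for (let i = 0; i < alertIds.length; i += CHUNK_SIZE) {
    const chunk = alertIds.slice(i, i + CHUNK_SIZE);
    const response = await client.updateByQuery({
      index: '.alerts-*',
      conflicts: 'proceed',
      wait_for_completion: false, // returns a task ID to poll instead of blocking
      query: { ids: { values: chunk } },
      script: {
        lang: 'painless',
        source: "ctx._source.workflow_status = 'closed'",
      },
    });
    if (response.task != null) tasks.push(String(response.task));
  }
  return tasks; // task IDs to poll via the tasks API
}
```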
With that, I think the remaining open questions are:

- Do we need to support updating alerts once they are no longer in the `write_index`, or once they have moved to other data tiers?
- Are there any restrictions we need to add to the UI/API to accommodate these implications?

Adding a TTL of one week so we can discuss here and hopefully finalize our decision in next week's sync. 🙂