
[Alerting] Expose SO _version to clients #74381

Open
madirey opened this issue Aug 5, 2020 · 7 comments
Labels
• bug: Fixes for quality problems that affect the customer experience
• dependencies: Pull requests that update a dependency file
• enhancement: New value added to drive a business result
• estimate:needs-research: Estimated as too large and requires research to break down into workable issues
• Feature:Alerting/RulesFramework: Issues related to the Alerting Rules Framework
• Feature:Alerting
• NeededFor:Detections and Resp
• Team:ResponseOps: Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@madirey
Contributor

madirey commented Aug 5, 2020

Currently, users of the security_solution plugin can run into a number of race conditions when multiple users attempt to update a rule at the same time. There's no way for us to circumvent these with optimistic concurrency control (OCC), as the _version field is not exposed for alerts via the alerting client.

Here's one area where we could run into trouble:

The solution would be to expose _version in the get and find APIs, and allow it to be passed in on update. Then we can handle 409 conflict scenarios by refreshing and having the user retry the rule update.
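The flow described above can be sketched with a minimal in-memory store. This is a hypothetical illustration of the OCC contract, not the real alerting client API; the names (`RuleStore`, `ConflictError`, an integer `version`) are all assumptions standing in for the saved-object _version.

```typescript
// Hypothetical sketch of OCC over saved objects. An integer version
// stands in for the SO _version; the real client is not shown here.
interface SavedRule {
  id: string;
  version: number;
  attributes: Record<string, unknown>;
}

class ConflictError extends Error {
  readonly statusCode = 409; // mirrors the HTTP 409 conflict scenario
}

class RuleStore {
  private rules = new Map<string, SavedRule>();

  create(id: string, attributes: Record<string, unknown>): SavedRule {
    const rule = { id, version: 1, attributes };
    this.rules.set(id, rule);
    return { ...rule };
  }

  // get() exposes the version so callers can do OCC on a later update.
  get(id: string): SavedRule {
    const rule = this.rules.get(id);
    if (!rule) throw new Error(`rule ${id} not found`);
    return { ...rule, attributes: { ...rule.attributes } };
  }

  // update() rejects with a conflict when the caller's version is stale;
  // omitting the version overwrites unconditionally.
  update(
    id: string,
    attributes: Record<string, unknown>,
    version?: number
  ): SavedRule {
    const current = this.rules.get(id);
    if (!current) throw new Error(`rule ${id} not found`);
    if (version !== undefined && version !== current.version) {
      throw new ConflictError(`version conflict on rule ${id}`);
    }
    const updated = { id, version: current.version + 1, attributes };
    this.rules.set(id, updated);
    return { ...updated };
  }
}
```

On a `ConflictError`, the caller would re-fetch the rule (getting the fresh version) and ask the user to re-apply their change.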

@madirey madirey added the bug, enhancement, Feature:Alerting, and Team:ResponseOps labels Aug 5, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote
Contributor

mikecote commented Aug 6, 2020

Note: the alerting UIs should also take advantage of such an implementation.

pmuellr added a commit to pmuellr/kibana that referenced this issue Sep 8, 2020
During development of elastic#75553, some issues came up with the
optimistic concurrency control (OCC) we were using internally within
the alertsClient, via the `version` option/property of the saved
object. The referenced PR updates new fields in the alert from the
taskManager task after the alertType executor runs. Some alertsClient
methods invoked via user requests also use OCC to update the alert, so
in some cases version conflict errors came up when task manager updated
the alert in the middle of one of those methods. Note: the SIEM
functional test cases stress this REALLY well.

In this PR, we remove OCC from the methods that were using it, namely
`update()`, `updateApiKey()`, `enable()`, `disable()`, and the
`[un]mute[All,Instance]()` methods. Of these methods, OCC is really
only _practically_ needed by `update()`, but even for that we don't
support OCC in the API yet; see issue elastic#74381.

For cases where we know only attributes not contributing to AAD are
being updated, a new function is provided that does a partial update
on just those attributes, making partial updates for those attributes
a bit safer. That will be used by PR elastic#75553.
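The partial-update idea in that commit message can be sketched as a whitelist over attributes that don't contribute to AAD (additional authenticated data). The field names below are illustrative assumptions, not the actual Kibana attribute set.

```typescript
// Hypothetical sketch: restrict a partial update to attributes that do
// not contribute to AAD, so a framework write (e.g. execution status)
// never touches user-controlled, encrypted-relevant fields.
// The field names here are assumptions for illustration.
const NON_AAD_FIELDS = ['executionStatus', 'mutedInstanceIds', 'updatedAt'] as const;
type NonAadField = typeof NON_AAD_FIELDS[number];

type PartialRuleAttributes = Partial<Record<NonAadField, unknown>>;

// Returns only the whitelisted fields, dropping everything else, so the
// resulting partial update is safe to apply without OCC.
function pickNonAadAttributes(
  attributes: Record<string, unknown>
): PartialRuleAttributes {
  const out: PartialRuleAttributes = {};
  for (const field of NON_AAD_FIELDS) {
    if (field in attributes) out[field] = attributes[field];
  }
  return out;
}
```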
@mikecote
Contributor

mikecote commented Sep 9, 2020

Relates to #50260

@pmuellr
Member

pmuellr commented Sep 22, 2020

There was an indirect reference to this issue from a PR today - #77838 (comment)

It occurred to me that we haven't even proposed what we might want to do here. Given some context on what the referenced PR is "fixing", here are some questions:

  • do we support OCC in all the mutating APIs (ie, add an optional version option that maps to SO OCC)? The mutating APIs include update(), enable(), muteAll(), etc, and delete() (ES supports OCC for delete anyway). Or do we support OCC in only a subset of those APIs - eg, update() supports it and none of the other methods do?

  • what does it mean when an API takes an optional OCC version but it's not passed on the call - do we skip OCC in the SO calls entirely and just overwrite? Or do we use it internally like we do today within each API (eg, enable() does a GET at the beginning and uses that version when doing its final UPDATE)? In the latter case, do we retry a few times (like the referenced PR does), or return the conflict exception immediately?

My take is that we just want OCC for the update() call. When no version is passed, we retry like the referenced PR does; when a version is passed, we rethrow the conflict error immediately. The other mutating methods would work like the referenced PR - retry a few times.

My rationale is that update() is the only method that updates user-controlled data - the other mutating methods (except delete()) update framework information - and all of this is stored in the same SO. OCC is super important for user-controlled data, to ensure users making simultaneous updates don't overwrite each other's changes. It's less clear to me that it's useful for framework-y calls - would someone really use OCC with enable()? Even if we supported that, it's going to be painful for callers, who will hit conflicts because of the upcoming background updates to the alert execution status (to be stored in the alert SO). There will be a lot more opportunities for conflicts than just conflicts from other Kibana users.
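The retry/rethrow behavior proposed above can be sketched as a small wrapper. This is a hypothetical helper under the assumptions that a conflict surfaces as a dedicated error type and that one `attempt` performs a fresh get-then-update cycle; none of these names are the real alerting client API.

```typescript
// Sketch of the proposal: if the caller supplied a version, a conflict is
// meaningful to them, so rethrow immediately; if not, the conflict likely
// came from a background framework write, so retry a few times.
// ConflictError and updateWithOcc are illustrative names, not real APIs.
class ConflictError extends Error {
  readonly statusCode = 409;
}

async function updateWithOcc<T>(
  attempt: () => Promise<T>, // performs one get-then-update cycle
  callerVersion: string | undefined,
  maxRetries = 3
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await attempt();
    } catch (e) {
      const isConflict = e instanceof ConflictError;
      // Non-conflict errors, caller-supplied versions, and exhausted
      // retries all propagate; otherwise loop and try again.
      if (!isConflict || callerVersion !== undefined || i >= maxRetries) {
        throw e;
      }
    }
  }
}
```

Each retry re-reads the object, so a conflict caused by a background status write simply resolves on the next attempt, while a user-supplied stale version surfaces as a 409 right away.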

@mikecote
Contributor

Moving from 7.12 - Candidates to 7.x - Candidates.

@mikecote
Contributor

mikecote commented Feb 4, 2021

Moving from 7.x - Candidates to 8.x - Candidates (Backlog) after the latest 7.x planning session.

@gmmorris
Contributor

gmmorris commented Jul 1, 2021

Worth discussing this issue further given this comment by @mikecote.

It sounds like there's a concern that exposing the version will lead to bad UX, because the rule can easily be updated in the background by the system (when executing it), which will likely clash with users updating other parts of the rule SO.

TBH it feels like this is our internal implementation leaking. I think it's worth taking a closer look at the impact of the underlying implementation (updating the status on the SO, thereby changing the version) and seeing whether this approach makes sense longer term... but obviously changing this now is very costly.

@gmmorris gmmorris added the loe:needs-research (This issue requires some research before it can be worked on or estimated) label Jul 14, 2021
@gmmorris gmmorris added the estimate:needs-research (Estimated as too large and requires research to break down into workable issues) label Aug 18, 2021
@gmmorris gmmorris removed the loe:needs-research label Sep 2, 2021
@kobelb kobelb added the needs-team (Issues missing a team label) label Jan 31, 2022
@botelastic botelastic bot removed the needs-team label Jan 31, 2022

7 participants