-
Notifications
You must be signed in to change notification settings - Fork 8.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Alerting] formalize alert status and add status fields to alert saved object #75553
Conversation
ce43e9a
to
079d27c
Compare
206a76b
to
a429e44
Compare
037d3c6
to
dfba7d5
Compare
I'm splitting out the alertClients occ/version changes into a separate PR, since it's not directly related to this code, and something that should be looked at independently - #76830 - this PR is now blocked on that PR |
During development of elastic#75553, some issues came up with the optimistic concurrency control (OCC) we were using internally within the alertsClient, via the `version` option/property of the saved object. The referenced PR updates new fields in the alert from the taskManager task after the alertType executor runs. In some alertsClient methods, OCC is used to update the alert which are requested via user requests. And so in some cases, version conflict errors were coming up when the alert was updated by task manager, in the middle of one of these methods. Note: the SIEM function test cases stress test this REALLY well. In this PR, we remove OCC from methods that were currently using it, namely `update()`, `updateApiKey()`, `enable()`, `disable()`, and the `[un]mute[All,Instance]()` methods. Of these methods, OCC is really only _practically_ needed by `update()`, but even for that we don't support OCC in the API, yet; see: issue elastic#74381 . For cases where we know only attributes not contributing to AAD are being updated, a new function is provided that does a partial update on just those attributes, making partial updates for those attributes a bit safer. That will be used by PR elastic#75553.
dfba7d5
to
17495ad
Compare
51008d7
to
51e20a0
Compare
During development of elastic#75553, some issues came up with the optimistic concurrency control (OCC) we were using internally within the alertsClient, via the `version` option/property of the saved object. The referenced PR updates new fields in the alert from the taskManager task after the alertType executor runs. In some alertsClient methods, OCC is used to update the alert which are requested via user requests. And so in some cases, version conflict errors were coming up when the alert was updated by task manager, in the middle of one of these methods. Note: the SIEM function test cases stress test this REALLY well. In this PR, we wrap all the methods using OCC with a function that will retry them, a short number of times, with a short delay in between. If the original method STILL has a conflict error, it will get thrown after the retry limit. In practice, this eliminated the version conflict calls that were occuring with the SIEM tests, once we started updating the saved object in the executor. For cases where we know only attributes not contributing to AAD are being updated, a new function is provided that does a partial update on just those attributes, making partial updates for those attributes a bit safer. That will be also used by PR elastic#75553. interim commits: - add jest tests for partially_update_alert - add jest tests for retry_if_conflicts - in-progress on jest tests for alertsClient with conflicts - get tests for alerts_client_conflict_retries running - add tests for retry logger messages - resolve some PR comments - fix alertsClient test to use version with muteAll/unmuteAll - change test to reflect new use of create api - fix alertsClient meta updates, add comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a branch based on this PR and local testing has looked great. Thanks for this!!! LGTM
update: na, not flaky - I had pulled in some code that was going to be a merge conflict, but didn't pull enough in; hopefully fixed now in c366d4b original comment: This smells a little flaky, though it's odd it in actions which shouldn't be affected by this change. Guessing it's failing here, for one of the ci errors: kibana/x-pack/test/alerting_api_integration/spaces_only/tests/actions/execute.ts Lines 61 to 73 in f993d2d
This is a test that uses an indexing action and executes it, and then tries to read the doc that should have gotten created - it seems like maybe the task got delayed, and we should wait a bit before checking? |
jenkins retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like as solid a solution as we can achieve, given the limitation we have imposed by SOs, so good work on that front. 👍
I've played around with it, tried to fail it, and it seems to be working as expected.
I'm a little concerned that we're not displaying anything in the UI though.
Shouldn't this PR express, at the very least, what Status the Alert is in on the List/Details pages?
|
||
export interface AlertExecutionStatus { | ||
status: AlertExecutionStatuses; | ||
date: Date; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we rename date
to something that is explicit about what date it is?
lastExecution
? lastUpdate
? etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it does feel kind of "context free", viewed here, but it is a property of the executionStatus
object, so I thought using date
by itself would be fine; ie, you're always be referencing this piece of data as executionStatus.date
. Adding additional context on date
itself seemed like overkill and overly wordy. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, it isn't obvious to me what this date actually is 🤷
I won't block on this, but my feeling is that we're adding to the cognitive load by not being clear what his date actually means.
I'm guessing it means "lastUpdate" of the status, but I'm still not 100% sure and to me that's a reason to clarify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But you know what the status
is? Should we change it to lastExecutionStatus
as well? :-) I think that was my thinking in trying to not add more context here to the prop names, when it feels like it's implied by it's containing property &| type.
Contextually, how people would end up accessing this, would look like the following for the two variants:
alert.executionStatus.date
alert.executionStatus.lastExecutionDate
I don't have really strong feels, just trying to cut down verbosity / ceremony / overkill where not actually needed.
But I just thought of a decent reason to do this - if for some reason we add some other date to this structure later, THEN it will certainly be confusing what the un-prefixed date would be.
So, I think lastExecutionDate
prolly works best for me, since that exactly describes it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current branch is using lastExecutionDate
- thanks for prodding on this Gidi!
@@ -393,6 +393,11 @@ describe('create()', () => { | |||
"createdAt": "2019-02-12T21:01:22.479Z", | |||
"createdBy": "elastic", | |||
"enabled": true, | |||
"executionStatus": Object { | |||
"date": "2019-02-12T21:01:22.479Z", | |||
"error": null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I right that we set error
to null
because of the partial update issue?
I just want to double check I understand, as my instinct would have been to just omit it if there is no error... but obviously, being aware of the partial update issues, I'm assuming it's that. is it worth adding a comment in the code for future devs who might not be aware? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, it's a partial update issue; we need to make sure we remove a previous error if we're doing an update and there is no error this time. So it's typed in the raw alert as "null-able".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a comment in the raw alert definition for this ...
| 'muteAll' | ||
| 'mutedInstanceIds' | ||
| 'actions' | ||
| 'executionStatus' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit.
This list has grown a lot, should we switch it to the inverse (using Pick
instead of Omit
)?
It would mean, by default, new fields are not part of create
... 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're not wrong, but I hate to make changes like this, in a PR like this. It's more of a general clean up thing, and this particular item seems hardly worth an issue by itself. Do we have some general "simple tech debt items" issue we could add this to?
@@ -228,6 +229,11 @@ export class AlertsClient { | |||
params: validatedAlertTypeParams as RawAlert['params'], | |||
muteAll: false, | |||
mutedInstanceIds: [], | |||
executionStatus: { | |||
status: 'unknown', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit. It feels like we should be using enums for these rather than hard coded strings, no?
I get that the wrong string won't pass type checking, but we've used enums for equivalent fields so it feels like we should align.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kinda look at the "set of these strings" type as being the ceremony-free version of string enums. Is there an explicit value in using enums instead? The upside to not using them is not having to maintain the duplication of the enum keys / values, and not having to have access to that enum type in code that needs it. ie, less ceremony :-)
Would the enum type be exposed in the Alert objects themselves, or just an internal detail? I think we could type it in the Alert and RawAlert as enum safely, but not quite sure. Another reason I generally avoid enums, because I always have to go read the chapter on them in the TS handbook :-)
const rawAlertWithoutExecutionStatus: Partial<Omit<RawAlert, 'executionStatus'>> = { | ||
...rawAlert, | ||
}; | ||
delete rawAlertWithoutExecutionStatus.executionStatus; | ||
const executionStatus = alertExecutionStatusFromRaw(rawAlert.executionStatus); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit.
Could we just extract executionStatus
on line 966 like we do with createdAt
, meta
and scheduledTaskId
?
That avoids the need for the delete
and the creation of rawAlertWithoutExecutionStatus
.
We could also rename rawAlert
to be explicit about it not containing the Execution Status if you feel that's important. 🤷
It just feels more in line with the rest of the code there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was sooooo hard - I tried multiple approaches to this, including doing what you suggested. Not happy with that delete
, and I this seemed like the best alternative.
The main problem is the types of the props in the RawAlert and Alert pretty much match up, so that ...rawAlertWithoutExecutionStatus
fills in most of the bits without any typing errors. However, the types of the executionStatus in RawAlert and Alert are different. And because we're dealing with partials coming in and going out, it might not be there. So when doing something like what you suggest, you'll end up with an error TS thinks the type of the returned executionStatus is RawAlertExecutionStatus | AlertExecutionStatus, which clearly isn't kosher.
It's worth a comment I think, will help the next person looking at this save a few minutes when they try to fix it. heh
|
||
export function isAlertSavedObjectNotFoundError(err: Error, alertId: string) { | ||
// if this is an error with a reason, the actual error needs to be extracted | ||
if (isErrorWithReason(err)) { | ||
err = err.error; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit.
This might just be me, but I always find it hard to track the reassignment of function arguments and try to avoid it.
Could we use an intermediary variable? 😬
|
||
after(async () => await objectRemover.removeAll()); | ||
|
||
it('should be "unknown" for newly created alert', async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about unknown
as the status for newly created alerts.
perhaps something that expresses that we know why it doesn't have data yet? The word unknown
suggests we don't know why there is no status... that could cause confusion and concern.
This would also help distinguish between a status of unknown
and a reason of unknown
which are likely to be confused in the future.
Would scheduled
or pending
work better? Something like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
long internal discussion on this one :-). Looks like we'll make it something more concrete than unknown
, eg pending
or similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm "finalizing" on pending
for now, we could still change this before FF next week if needed.
function trues(length: number): boolean[] { | ||
return booleans(length, true); | ||
} | ||
|
||
function booleans(length: number, value: boolean): boolean[] { | ||
return '' | ||
.padStart(length) | ||
.split('') | ||
.map((e) => value); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha, @pmuellr, you know I love how old school you are :)
We now have fill
;)
function trues(length: number): boolean[] { | |
return booleans(length, true); | |
} | |
function booleans(length: number, value: boolean): boolean[] { | |
return '' | |
.padStart(length) | |
.split('') | |
.map((e) => value); | |
} | |
function trues(length: number): boolean[] { | |
return new Array(number).fill(true); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dang, I thought there was something like that now, and couldn't find it!!! I hate writing stuf like this, though it is certainly kinda fun to abuse padStart()
and split()
this way :-) (I used to do a lot of REXX programming where the only primitive data type was a string (but you could math on it), so we found all kind of "interesting" ways to do things with string functions).
async function delay(millis: number): Promise<void> { | ||
await new Promise((resolve) => setTimeout(resolve, millis)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish this was just a built in function, we redefine it in every test suite we have 😂
|
||
export async function buildUp(getService: FtrProviderContext['getService']) { | ||
const spacesService = getService('spaces'); | ||
for (const space of Object.values(Spaces)) { | ||
if (space.id === 'default') continue; | ||
|
||
const { id, name, disabledFeatures } = space; | ||
await spacesService.create({ id, name, disabledFeatures }); | ||
} | ||
} | ||
|
||
export async function tearDown(getService: FtrProviderContext['getService']) { | ||
const esArchiver = getService('esArchiver'); | ||
await esArchiver.unload('empty_kibana'); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused... why is this here?
Didn't we merge this into Main already? 😮
We might need a fresh merge from Main into the PR... 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh dang, was going to comment in the PR on this.
There was a merge conflict w/master in this area, and rather than merge master, I just copied the code in here. I was still rebasing at the time, didn't want to have to merge master, but forgot to add a note about it.
I did actually have to merge master this morning and guess what! It's gone!
@elasticmachine merge upstream |
I was thinking adding something here, but given the current size, and nearness to FF, seemed best not to. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving as @pmuellr has confirmed that adding UI is out of scope for this PR and we've discussed renaming unknown
on initial creation.
Those were the only blockers for me. 👍
commit 8775610 has changes from the PR comments, except for the status |
@elasticmachine merge upstream |
💚 Build SucceededMetrics [docs]distributable file count
page load bundle size
Saved Objects .kibana field count
History
To update your PR or re-run it, just comment with: |
…d object (elastic#75553) resolves elastic#51099 This formalizes the concept of "alert status", in terms of it's execution, with some new fields in the alert saved object and types used with the alert client and http APIs. These fields are read-only from the client point-of-view; they are provided in the alert structures, but are only updated by the alerting framework itself. The values will be updated after each run of the alert type executor. The data is added to the alert as the `executionStatus` field, with the following shape: ```ts interface AlertExecutionStatus { status: 'ok' | 'active' | 'error' | 'pending' | 'unknown'; lastExecutionDate: Date; error?: { reason: 'read' | 'decrypt' | 'execute' | 'unknown'; message: string; }; } ```
…d object (#75553) (#79227) resolves #51099 This formalizes the concept of "alert status", in terms of it's execution, with some new fields in the alert saved object and types used with the alert client and http APIs. These fields are read-only from the client point-of-view; they are provided in the alert structures, but are only updated by the alerting framework itself. The values will be updated after each run of the alert type executor. The data is added to the alert as the `executionStatus` field, with the following shape: ```ts interface AlertExecutionStatus { status: 'ok' | 'active' | 'error' | 'pending' | 'unknown'; lastExecutionDate: Date; error?: { reason: 'read' | 'decrypt' | 'execute' | 'unknown'; message: string; }; } ```
I remembered some additional functional tests that should have been in PR elastic#75553 One is to ensure the error field gets cleared from the saved object, after the error status is updated with a non-error status. I always worry a bit about partial updates. The other is to do some negative find tests with the new fields. The existing tests are all positive, but would return the same results if for some reason the filters were ignored (presumably a bug). The negative tests ensure the filters actually filter things out as well. Also a bit of refactoring / cleanup of the tests.
resolves #51099
Summary
This formalizes the concept of "alert status", in terms of it's execution, with
some new fields in the alert saved object and types used with the alert client
and http APIs.
These fields are read-only from the client point-of-view; they are provided in
the alert structures, but are only updated by the alerting framework itself.
The values will be updated after each run of the alert type executor.
Checklist
Delete any items that are not applicable to this PR.
For maintainers
TODO