Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Alerting] formalize alert status and add status fields to alert saved object #75553

Merged
merged 9 commits into from
Oct 1, 2020

Conversation

pmuellr
Copy link
Member

@pmuellr pmuellr commented Aug 20, 2020

resolves #51099

Summary

This formalizes the concept of "alert status", in terms of it's execution, with
some new fields in the alert saved object and types used with the alert client
and http APIs.

These fields are read-only from the client point-of-view; they are provided in
the alert structures, but are only updated by the alerting framework itself.
The values will be updated after each run of the alert type executor.

Checklist

Delete any items that are not applicable to this PR.

  • Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
  • Documentation was added for features that require explanation or tutorials
  • Unit or functional tests were updated or added to match the most common scenarios
    • test to ensure we haven't broken AAD when the status gets updated in the alert
    • test to ensure that when the owner of an alert's priv's change so they can no longer write to alerts, the status of any alert they created/updated still have their status updated in the alert (assuming there are cases that the alert would still be running "successfully")

For maintainers

TODO

  • jest tests for all status values and status error reason values
  • FT for all status values and status error reason values we can test
  • AAD tests

@pmuellr pmuellr added Feature:Alerting v8.0.0 release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.10.0 labels Aug 20, 2020
@spong spong requested a review from dhurley14 August 21, 2020 15:06
@pmuellr pmuellr force-pushed the alerting/status-in-so branch 3 times, most recently from ce43e9a to 079d27c Compare August 27, 2020 01:14
@pmuellr pmuellr force-pushed the alerting/status-in-so branch 3 times, most recently from 206a76b to a429e44 Compare September 2, 2020 20:31
@pmuellr
Copy link
Member Author

pmuellr commented Sep 8, 2020

I'm splitting out the alertClients occ/version changes into a separate PR, since it's not directly related to this code, and something that should be looked at independently - #76830 - this PR is now blocked on that PR

pmuellr added a commit to pmuellr/kibana that referenced this pull request Sep 8, 2020
During development of elastic#75553,
some issues came up with the optimistic concurrency control (OCC) we
were using internally within the alertsClient, via the `version`
option/property of the saved object.  The referenced PR updates new
fields in the alert from the taskManager task after the alertType
executor runs.  In some alertsClient methods, OCC is used to update
the alert which are requested via user requests.  And so in some
cases, version conflict errors were coming up when the alert was
updated by task manager, in the middle of one of these methods.  Note:
the SIEM function test cases stress test this REALLY well.

In this PR, we remove OCC from methods that were currently using it,
namely `update()`, `updateApiKey()`, `enable()`, `disable()`, and the
`[un]mute[All,Instance]()` methods.  Of these methods, OCC is really
only _practically_ needed by `update()`, but even for that we don't
support OCC in the API, yet; see: issue elastic#74381 .

For cases where we know only attributes not contributing to AAD are
being updated, a new function is provided that does a partial update
on just those attributes, making partial updates for those attributes
a bit safer.  That will be used by PR elastic#75553.
pmuellr added a commit to pmuellr/kibana that referenced this pull request Sep 22, 2020
During development of elastic#75553,
some issues came up with the optimistic concurrency control (OCC) we
were using internally within the alertsClient, via the `version`
option/property of the saved object. The referenced PR updates new
fields in the alert from the taskManager task after the alertType
executor runs. In some alertsClient methods, OCC is used to update
the alert which are requested via user requests. And so in some
cases, version conflict errors were coming up when the alert was
updated by task manager, in the middle of one of these methods. Note:
the SIEM function test cases stress test this REALLY well.

In this PR, we wrap all the methods using OCC with a function that
will retry them, a short number of times, with a short delay in
between. If the original method STILL has a conflict error, it
will get thrown after the retry limit.  In practice, this eliminated
the version conflict calls that were occuring with the SIEM tests,
once we started updating the saved object in the executor.

For cases where we know only attributes not contributing to AAD are
being updated, a new function is provided that does a partial update
on just those attributes, making partial updates for those attributes
a bit safer. That will be also used by PR elastic#75553.

interim commits:

- add jest tests for partially_update_alert
- add jest tests for retry_if_conflicts
- in-progress on jest tests for alertsClient with conflicts
- get tests for alerts_client_conflict_retries running
- add tests for retry logger messages
- resolve some PR comments
- fix alertsClient test to use version with muteAll/unmuteAll
- change test to reflect new use of create api
- fix alertsClient meta updates, add comments
Copy link
Contributor

@dhurley14 dhurley14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a branch based on this PR and local testing has looked great. Thanks for this!!! LGTM

@pmuellr
Copy link
Member Author

pmuellr commented Sep 30, 2020

update: na, not flaky - I had pulled in some code that was going to be a merge conflict, but didn't pull enough in; hopefully fixed now in c366d4b

original comment:


This smells a little flaky, though it's odd it in actions which shouldn't be affected by this change.

Guessing it's failing here, for one of the ci errors:

const reference = `actions-execute-1:${Spaces.space1.id}`;
const response = await supertest
.post(`${getUrlPrefix(Spaces.space1.id)}/api/actions/action/${createdAction.id}/_execute`)
.set('kbn-xsrf', 'foo')
.send({
params: {
reference,
index: ES_TEST_INDEX_NAME,
message: 'Testing 123',
},
});
expect(response.status).to.eql(200);

This is a test that uses an indexing action and executes it, and then tries to read the doc that should have gotten created - it seems like maybe the task got delayed, and we should wait a bit before checking?

@pmuellr
Copy link
Member Author

pmuellr commented Sep 30, 2020

jenkins retest this please

Copy link
Contributor

@YulNaumenko YulNaumenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@gmmorris gmmorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like as solid a solution as we can achieve, given the limitation we have imposed by SOs, so good work on that front. 👍

I've played around with it, tried to fail it, and it seems to be working as expected.

I'm a little concerned that we're not displaying anything in the UI though.
Shouldn't this PR express, at the very least, what Status the Alert is in on the List/Details pages?


export interface AlertExecutionStatus {
status: AlertExecutionStatuses;
date: Date;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we rename date to something that is explicit about what date it is?
lastExecution? lastUpdate? etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it does feel kind of "context free", viewed here, but it is a property of the executionStatus object, so I thought using date by itself would be fine; ie, you're always be referencing this piece of data as executionStatus.date. Adding additional context on date itself seemed like overkill and overly wordy. 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, it isn't obvious to me what this date actually is 🤷
I won't block on this, but my feeling is that we're adding to the cognitive load by not being clear what his date actually means.
I'm guessing it means "lastUpdate" of the status, but I'm still not 100% sure and to me that's a reason to clarify.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you know what the status is? Should we change it to lastExecutionStatus as well? :-) I think that was my thinking in trying to not add more context here to the prop names, when it feels like it's implied by it's containing property &| type.

Contextually, how people would end up accessing this, would look like the following for the two variants:

  • alert.executionStatus.date
  • alert.executionStatus.lastExecutionDate

I don't have really strong feels, just trying to cut down verbosity / ceremony / overkill where not actually needed.

But I just thought of a decent reason to do this - if for some reason we add some other date to this structure later, THEN it will certainly be confusing what the un-prefixed date would be.

So, I think lastExecutionDate prolly works best for me, since that exactly describes it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current branch is using lastExecutionDate - thanks for prodding on this Gidi!

@@ -393,6 +393,11 @@ describe('create()', () => {
"createdAt": "2019-02-12T21:01:22.479Z",
"createdBy": "elastic",
"enabled": true,
"executionStatus": Object {
"date": "2019-02-12T21:01:22.479Z",
"error": null,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I right that we set error to null because of the partial update issue?
I just want to double check I understand, as my instinct would have been to just omit it if there is no error... but obviously, being aware of the partial update issues, I'm assuming it's that. is it worth adding a comment in the code for future devs who might not be aware? 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, it's a partial update issue; we need to make sure we remove a previous error if we're doing an update and there is no error this time. So it's typed in the raw alert as "null-able".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a comment in the raw alert definition for this ...

Comment on lines 122 to +125
| 'muteAll'
| 'mutedInstanceIds'
| 'actions'
| 'executionStatus'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit.
This list has grown a lot, should we switch it to the inverse (using Pick instead of Omit)?
It would mean, by default, new fields are not part of create... 🤷

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're not wrong, but I hate to make changes like this, in a PR like this. It's more of a general clean up thing, and this particular item seems hardly worth an issue by itself. Do we have some general "simple tech debt items" issue we could add this to?

@@ -228,6 +229,11 @@ export class AlertsClient {
params: validatedAlertTypeParams as RawAlert['params'],
muteAll: false,
mutedInstanceIds: [],
executionStatus: {
status: 'unknown',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. It feels like we should be using enums for these rather than hard coded strings, no?
I get that the wrong string won't pass type checking, but we've used enums for equivalent fields so it feels like we should align.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kinda look at the "set of these strings" type as being the ceremony-free version of string enums. Is there an explicit value in using enums instead? The upside to not using them is not having to maintain the duplication of the enum keys / values, and not having to have access to that enum type in code that needs it. ie, less ceremony :-)

Would the enum type be exposed in the Alert objects themselves, or just an internal detail? I think we could type it in the Alert and RawAlert as enum safely, but not quite sure. Another reason I generally avoid enums, because I always have to go read the chapter on them in the TS handbook :-)

Comment on lines 970 to 974
const rawAlertWithoutExecutionStatus: Partial<Omit<RawAlert, 'executionStatus'>> = {
...rawAlert,
};
delete rawAlertWithoutExecutionStatus.executionStatus;
const executionStatus = alertExecutionStatusFromRaw(rawAlert.executionStatus);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit.
Could we just extract executionStatus on line 966 like we do with createdAt, meta and scheduledTaskId?
That avoids the need for the delete and the creation of rawAlertWithoutExecutionStatus.
We could also rename rawAlert to be explicit about it not containing the Execution Status if you feel that's important. 🤷

It just feels more in line with the rest of the code there

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was sooooo hard - I tried multiple approaches to this, including doing what you suggested. Not happy with that delete, and I this seemed like the best alternative.

The main problem is the types of the props in the RawAlert and Alert pretty much match up, so that ...rawAlertWithoutExecutionStatus fills in most of the bits without any typing errors. However, the types of the executionStatus in RawAlert and Alert are different. And because we're dealing with partials coming in and going out, it might not be there. So when doing something like what you suggest, you'll end up with an error TS thinks the type of the returned executionStatus is RawAlertExecutionStatus | AlertExecutionStatus, which clearly isn't kosher.

It's worth a comment I think, will help the next person looking at this save a few minutes when they try to fix it. heh


export function isAlertSavedObjectNotFoundError(err: Error, alertId: string) {
// if this is an error with a reason, the actual error needs to be extracted
if (isErrorWithReason(err)) {
err = err.error;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit.
This might just be me, but I always find it hard to track the reassignment of function arguments and try to avoid it.
Could we use an intermediary variable? 😬


after(async () => await objectRemover.removeAll());

it('should be "unknown" for newly created alert', async () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about unknown as the status for newly created alerts.
perhaps something that expresses that we know why it doesn't have data yet? The word unknown suggests we don't know why there is no status... that could cause confusion and concern.
This would also help distinguish between a status of unknown and a reason of unknown which are likely to be confused in the future.

Would scheduled or pending work better? Something like that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

long internal discussion on this one :-). Looks like we'll make it something more concrete than unknown, eg pending or similar.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm "finalizing" on pending for now, we could still change this before FF next week if needed.

Comment on lines 327 to 336
function trues(length: number): boolean[] {
return booleans(length, true);
}

function booleans(length: number, value: boolean): boolean[] {
return ''
.padStart(length)
.split('')
.map((e) => value);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, @pmuellr, you know I love how old school you are :)
We now have fill ;)

Suggested change
function trues(length: number): boolean[] {
return booleans(length, true);
}
function booleans(length: number, value: boolean): boolean[] {
return ''
.padStart(length)
.split('')
.map((e) => value);
}
function trues(length: number): boolean[] {
return new Array(number).fill(true);
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dang, I thought there was something like that now, and couldn't find it!!! I hate writing stuf like this, though it is certainly kinda fun to abuse padStart() and split() this way :-) (I used to do a lot of REXX programming where the only primitive data type was a string (but you could math on it), so we found all kind of "interesting" ways to do things with string functions).

Comment on lines +338 to +340
async function delay(millis: number): Promise<void> {
await new Promise((resolve) => setTimeout(resolve, millis));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish this was just a built in function, we redefine it in every test suite we have 😂

Comment on lines 19 to 33

export async function buildUp(getService: FtrProviderContext['getService']) {
const spacesService = getService('spaces');
for (const space of Object.values(Spaces)) {
if (space.id === 'default') continue;

const { id, name, disabledFeatures } = space;
await spacesService.create({ id, name, disabledFeatures });
}
}

export async function tearDown(getService: FtrProviderContext['getService']) {
const esArchiver = getService('esArchiver');
await esArchiver.unload('empty_kibana');
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused... why is this here?
Didn't we merge this into Main already? 😮
We might need a fresh merge from Main into the PR... 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh dang, was going to comment in the PR on this.

There was a merge conflict w/master in this area, and rather than merge master, I just copied the code in here. I was still rebasing at the time, didn't want to have to merge master, but forgot to add a note about it.

I did actually have to merge master this morning and guess what! It's gone!

@pmuellr
Copy link
Member Author

pmuellr commented Oct 1, 2020

@elasticmachine merge upstream

@pmuellr
Copy link
Member Author

pmuellr commented Oct 1, 2020

I'm a little concerned that we're not displaying anything in the UI though.

I was thinking adding something here, but given the current size, and nearness to FF, seemed best not to.

Copy link
Contributor

@gmmorris gmmorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving as @pmuellr has confirmed that adding UI is out of scope for this PR and we've discussed renaming unknown on initial creation.

Those were the only blockers for me. 👍

@pmuellr
Copy link
Member Author

pmuellr commented Oct 1, 2020

commit 8775610 has changes from the PR comments, except for the status unknown change, which is still waiting (I did an early stand-alone commit for that before this last one, to see if it caught all the occurances). Will likely make one more commit to the name we decide on, then merge ...

@pmuellr
Copy link
Member Author

pmuellr commented Oct 1, 2020

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

distributable file count

id before after diff
default 45823 45825 +2

page load bundle size

id before after diff
alerts 89.3KB 89.8KB +469.0B

Saved Objects .kibana field count

id before after diff
alert 24 30 +6

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@pmuellr pmuellr merged commit 117b577 into elastic:master Oct 1, 2020
pmuellr added a commit to pmuellr/kibana that referenced this pull request Oct 1, 2020
…d object (elastic#75553)

resolves elastic#51099

This formalizes the concept of "alert status", in terms of it's execution, with
some new fields in the alert saved object and types used with the alert client
and http APIs.

These fields are read-only from the client point-of-view; they are provided in
the alert structures, but are only updated by the alerting framework itself.
The values will be updated after each run of the alert type executor.

The data is added to the alert as the `executionStatus` field, with the
following shape:

```ts
interface AlertExecutionStatus {
  status: 'ok' | 'active' | 'error' | 'pending' | 'unknown';
  lastExecutionDate: Date;
  error?: {
    reason: 'read' | 'decrypt' | 'execute' | 'unknown';
    message: string;
  };
}
```
pmuellr added a commit that referenced this pull request Oct 2, 2020
…d object (#75553) (#79227)

resolves #51099

This formalizes the concept of "alert status", in terms of it's execution, with
some new fields in the alert saved object and types used with the alert client
and http APIs.

These fields are read-only from the client point-of-view; they are provided in
the alert structures, but are only updated by the alerting framework itself.
The values will be updated after each run of the alert type executor.

The data is added to the alert as the `executionStatus` field, with the
following shape:

```ts
interface AlertExecutionStatus {
  status: 'ok' | 'active' | 'error' | 'pending' | 'unknown';
  lastExecutionDate: Date;
  error?: {
    reason: 'read' | 'decrypt' | 'execute' | 'unknown';
    message: string;
  };
}
```
pmuellr added a commit to pmuellr/kibana that referenced this pull request Oct 2, 2020
I remembered some additional functional tests that should have been in
PR elastic#75553

One is to ensure the error field gets cleared from the saved object, after
the error status is updated with a non-error status.  I always worry a bit
about partial updates.

The other is to do some negative find tests with the new fields.  The existing
tests are all positive, but would return the same results if for some reason
the filters were ignored (presumably a bug).  The negative tests ensure the
filters actually filter things out as well.

Also a bit of refactoring / cleanup of the tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backported Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.10.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Alert statuses
6 participants