
Update alert documents when the write index changes #110788

Merged
merged 5 commits into elastic:master from 110519_update_documents on Sep 3, 2021

Conversation

mgiota
Contributor

@mgiota mgiota commented Sep 1, 2021

Fixes #110519

📝 Summary

The proposed solution in the above ticket, removing require_alias: true, didn't work because the mappings wouldn't be installed correctly. With the current logic, resources are installed only when the bulk operation fails: https://github.com/elastic/kibana/blob/master/x-pack/plugins/rule_registry/server/rule_data_client/rule_data_client.ts#L139. If the flag is missing and we index some new data, the write succeeds against an auto-created index, so the resources aren't installed and the alerts table doesn't render any data.
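To illustrate that constraint, here is a minimal TypeScript sketch of the install-on-failure pattern (hypothetical and heavily simplified, not the actual rule_data_client code; bulkWithInstall, doBulk, and installResources are illustrative names): because installation only happens when the bulk request reports errors, a write that silently succeeds against an auto-created index never triggers it.

```typescript
// Hypothetical sketch of the install-on-failure pattern referenced above.
// With require_alias: true, a bulk write through a missing alias fails,
// which is what triggers resource installation; without it, Elasticsearch
// would auto-create a plain index and installation would never run.
type BulkResponse = { errors: boolean };

async function bulkWithInstall(
  doBulk: () => Promise<BulkResponse>,
  installResources: () => Promise<void>
): Promise<BulkResponse> {
  const firstAttempt = await doBulk();
  if (!firstAttempt.errors) {
    return firstAttempt; // happy path: alias and mappings already in place
  }
  // The write failed (e.g. the alias doesn't exist yet): install index
  // templates/mappings, then retry the bulk operation once.
  await installResources();
  return doBulk();
}
```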

So the fix was to:

  • keep the require_alias: true
  • and disable it only when updating a document
await ruleDataClient.getWriter().bulk({
  body: allEventsToIndex.flatMap(({ event, indexName }) => [
    indexName
      ? { index: { _id: event[ALERT_UUID]!, _index: indexName, require_alias: false } }
      : { index: { _id: event[ALERT_UUID]! } },
    event,
  ]),
});
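The same pattern in a self-contained form (makeBulkBody and EventToIndex are illustrative names, not from the PR): events that already live in a concrete backing index get _index pinned and require_alias disabled, so updates keep landing in that index after a rollover, while new events are written through the alias.

```typescript
// Illustrative helper mirroring the bulk body above: pairs of
// { index: ... } action metadata followed by the document itself.
interface EventToIndex {
  uuid: string; // stands in for event[ALERT_UUID]
  indexName?: string; // concrete backing index, if the alert already exists
  doc: object;
}

function makeBulkBody(events: EventToIndex[]): object[] {
  return events.flatMap(({ uuid, indexName, doc }) => [
    indexName
      ? // Ongoing alert: write straight to its backing index, bypassing
        // the require_alias check so the write survives a rollover.
        { index: { _id: uuid, _index: indexName, require_alias: false } }
      : // New alert: write through the alias, with require_alias still
        // enforced at the request level.
        { index: { _id: uuid } },
    doc,
  ]);
}
```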

How to test

For both scenarios, here's the query you can run in Dev Tools to inspect the alerts that are being indexed:

GET .alerts-observability*/_search
{
  "fields": [
    "*"
  ],
  "_source": false,
  "sort": [
    {
      "kibana.alert.start": {
        "order": "desc" 
      }
    }
  ]
}

Scenario 1 (an ongoing alert gets written in old index after rollover)

  • Create a new rule and generate some data that should trigger an alert
  • Verify one new alert is written in the correct index .internal.alerts-observability.logs.alerts-default-000001
  • Wait for the next trigger of the alert and verify that alert is updated and no new alert is created
  • In Devtools do a rollover POST .alerts-observability.logs.alerts-default/_rollover
  • Verify you can see two indices GET .alerts-observability.logs.alerts-default
  • Wait for the alert to trigger again
  • Verify the updated alert is still written to the old index .internal.alerts-observability.logs.alerts-default-000001

Scenario 2 (an ongoing alert SHOULD be written in the old index after rollover and after a new rule type was created)

  • Create a new rule and generate some data that should trigger an alert
  • Verify one new alert is written in the correct index .internal.alerts-observability.logs.alerts-default-000001
  • Wait for the next trigger of the alert and verify that alert is updated and no new alert is created
  • In Devtools do a rollover POST .alerts-observability.logs.alerts-default/_rollover
  • Verify you can see two indices GET .alerts-observability.logs.alerts-default
  • Create a new rule and wait for the new alert to trigger
  • Verify the new alert is written in the new index .internal.alerts-observability.logs.alerts-default-000002
  • Verify the old alert is written in the old index .internal.alerts-observability.logs.alerts-default-000001 => We just spotted a bug 🐞 🐛 : the ILM policy deleted the old indices after rollover, so the old alert got indexed in .internal.alerts-observability.logs.alerts-default-000002 instead. See [RAC] Alert ILM policy shouldn't delete old indices after rollover #111029

@mgiota mgiota marked this pull request as draft September 1, 2021 12:15
@mgiota mgiota added labels (Team:Infra Monitoring UI - DEPRECATED, v7.15.0, bug, v7.16.0, v8.0.0, Theme: rac, release_note:skip) Sep 1, 2021
@kibanamachine
Contributor

⏳ Build in-progress, with failures

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

4 similar comments

@mgiota mgiota marked this pull request as ready for review September 2, 2021 12:07
@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@mgiota mgiota self-assigned this Sep 2, 2021
@kibanamachine
Contributor

💛 Build succeeded, but was flaky

Metrics [docs]

✅ unchanged


cc @mgiota

@mgiota
Contributor Author

mgiota commented Sep 2, 2021

I tested the above two scenarios and here are the findings:

Scenario 1
I took a video; as you can see, after rollover the existing alert gets written to the old index as expected.
https://user-images.githubusercontent.com/2852703/131854460-e4aa01fb-0aa2-4e19-879c-ffdca4ae677b.mov

Scenario 2 🐞
When I created a 2nd rule the 2nd alert was written in the correct index .internal.alerts-observability.logs.alerts-default-000002
Screenshot 2021-09-02 at 16 24 31

But the old alert also got written to the new index .internal.alerts-observability.logs.alerts-default-000002, where it shouldn't be.
Screenshot 2021-09-02 at 16 43 00

This is a new bug that we spotted; it happens because the ILM policy deleted the old indices after rollover.

UPDATE
The above scenario is not always reproducible. It turns out the ILM policy deletes old indices, but ES might not evaluate the policy immediately, which can make testing a bit harder.
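A testing tip related to this (not part of the PR, and assuming default cluster settings): ILM evaluates policies on a poll interval that defaults to 10 minutes, so the delete phase can fire noticeably later than the rollover. Temporarily lowering indices.lifecycle.poll_interval in Dev Tools makes the policy's effects show up faster:

```
PUT _cluster/settings
{
  "persistent": {
    "indices.lifecycle.poll_interval": "10s"
  }
}
```

Remember to set it back to null afterwards to restore the default.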

@weltenwort
Member

weltenwort commented Sep 2, 2021

Isn't the whole point of this bugfix to ensure that it's being written to the old index again? Otherwise the duplication I described in the issue would occur, which we are trying to avoid. Maybe I'm misunderstanding the video and the screenshots: which is happening when?

@mgiota
Contributor Author

mgiota commented Sep 2, 2021

@weltenwort I updated my comment above as per our discussion. I hope it is clearer now for other reviewers. And yes, the expected behavior is for old alerts to keep being written to the old index after a rollover.

@mgiota
Contributor Author

mgiota commented Sep 2, 2021

A bug I spotted: the reason field (and likewise the threshold and actual value) is blank when the alert recovers.
Screenshot 2021-09-02 at 19 37 31

@weltenwort
Member

Hm, seems like we have some edge cases to smooth out. I'll take a look asap.

@weltenwort
Member

So after some collaborative investigation we realized this fix uncovered a different problem related to the ILM policy associated with the alerting indices by default. We'll track and resolve that separately. 😌

@mgiota
Contributor Author

mgiota commented Sep 2, 2021

Here's the new ticket: #111029. It should also solve the issue with the empty reason field for recovered alerts that I pasted above.

@Kerry350
Contributor

Kerry350 commented Sep 3, 2021

Functionality looks good 👍

However, just to throw a spanner in the works, I didn't see this behaviour:

We just spotted a bug 🐞 🐛 , where ILM policy deleted the old indices after rollover

In both scenario 1 and 2 I didn't have my old index deleted.

Checking the code now.

Test results

Scenario 1
  • Create a new rule and generate some data that should trigger an alert

Screenshot 2021-09-03 at 12 33 39

  • Verify one new alert is written in the correct index .internal.alerts-observability.logs.alerts-default-000001

Screenshot 2021-09-03 at 12 34 00

  • Wait for the next trigger of the alert and verify that alert is updated and no new alert is created

Screenshot 2021-09-03 at 12 36 17

  • In Devtools do a rollover POST .alerts-observability.logs.alerts-default/_rollover
  • Verify you can see two indices GET .alerts-observability.logs.alerts-default (I used GET _cat/indices/.alerts-observability.logs.alerts-default but same idea)

Screenshot 2021-09-03 at 12 37 23

  • Wait for the alert to trigger again
  • Verify the updated alert is still written to the old index .internal.alerts-observability.logs.alerts-default-000001

Screenshot 2021-09-03 at 12 38 51

Scenario 2
  • Create a new rule and generate some data that should trigger an alert

Screenshot 2021-09-03 at 12 42 37

  • Verify one new alert is written in the correct index .internal.alerts-observability.logs.alerts-default-000001

Screenshot 2021-09-03 at 12 43 01

  • Wait for the next trigger of the alert and verify that alert is updated and no new alert is created

Screenshot 2021-09-03 at 12 43 36

  • In Devtools do a rollover POST .alerts-observability.logs.alerts-default/_rollover
  • Verify you can see two indices GET .alerts-observability.logs.alerts-default

Screenshot 2021-09-03 at 12 44 08

  • Create a new rule and wait for the new alert to trigger

Screenshot 2021-09-03 at 12 52 53

  • Verify the new alert is written in the new index .internal.alerts-observability.logs.alerts-default-000002
  • Verify the old alert is written in the old index .internal.alerts-observability.logs.alerts-default-000001

Screenshot 2021-09-03 at 12 53 57
Screenshot 2021-09-03 at 12 54 50

(Timestamps show the updates continuing after rollover without deleting the old index)

@mgiota
Contributor Author

mgiota commented Sep 3, 2021

@Kerry350 I was also not able to reproduce the bug in Scenario 2. It seems ILM just happened to delete the indices while I was testing. I will update the description.

Thanks for checking it out so thoroughly

@mgiota
Contributor Author

mgiota commented Sep 3, 2021

@Kerry350 I will enable automerge, since I will be off for a couple of hours.

@mgiota mgiota enabled auto-merge (squash) September 3, 2021 12:20
@Kerry350
Contributor

Kerry350 commented Sep 3, 2021

@mgiota

I will enable automerge, since I will be off for a couple of hours.

👍

I'll keep an eye too in case anything goes wrong.

@mgiota mgiota merged commit e2ee263 into elastic:master Sep 3, 2021
Contributor

@Kerry350 Kerry350 left a comment


LGTM 👍

Kerry350 pushed a commit to Kerry350/kibana that referenced this pull request Sep 3, 2021
* first draft(work in progress)

* add back missing await

* disable require_alias flag only when we update

* cleanup
@Kerry350
Contributor

Kerry350 commented Sep 3, 2021

I approved before noticing the auto-backport label was missing. I've created the backports manually.

Kerry350 pushed a commit to Kerry350/kibana that referenced this pull request Sep 3, 2021
* first draft(work in progress)

* add back missing await

* disable require_alias flag only when we update

* cleanup
@weltenwort
Member

weltenwort commented Sep 3, 2021

Thanks for the review! I think auto-backport also works retroactively.

Contributor

@banderror banderror left a comment


LGTM 👍

Comment on lines +289 to +291
indexName
? { index: { _id: event[ALERT_UUID]!, _index: indexName, require_alias: false } }
: { index: { _id: event[ALERT_UUID]! } },
Contributor


This looks good to me. I'd just like to ask whether you thought about using create + update instead of index, and decided to keep index for both operations. It looks like this executor collects the full set of document fields, so it's safer to use index (create or replace the whole doc).

Member


Yes, we considered update but figured that using index gives tighter control over the full document content (e.g. it allows for removal of fields). We might refactor to avoid fetching the full content and use update in the future, but that looked like too much of a change for a late 7.15.0 bug fix.
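As a rough illustration of the trade-off being discussed (hypothetical helper names, not Kibana code): a bulk index action replaces the whole stored document, while an update with a partial doc merges into it, so stale fields would survive an update but not an index.

```typescript
// Simplified models of the two bulk semantics. `applyIndex` mimics the
// `index` action (full replace, absent fields are removed); `applyUpdate`
// mimics a partial-doc `update` (shallow merge, old fields survive).
type Doc = Record<string, unknown>;

function applyIndex(_existing: Doc, next: Doc): Doc {
  return { ...next }; // full replace
}

function applyUpdate(existing: Doc, partial: Doc): Doc {
  return { ...existing, ...partial }; // shallow merge, stale keys survive
}
```

With the executor already collecting every document field, the full replace is the safer choice, since clearing a field just means omitting it from the next version.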

Kerry350 added a commit that referenced this pull request Sep 3, 2021
* first draft(work in progress)

* add back missing await

* disable require_alias flag only when we update

* cleanup

Co-authored-by: mgiota <giota85@gmail.com>
@kibanamachine
Contributor

Looks like this PR has backport PRs but they still haven't been merged. Please merge them ASAP to keep the branches relatively in sync.

@kibanamachine kibanamachine added the backport missing label Sep 7, 2021
Kerry350 added a commit that referenced this pull request Sep 7, 2021
* first draft(work in progress)

* add back missing await

* disable require_alias flag only when we update

* cleanup

Co-authored-by: mgiota <giota85@gmail.com>
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@kibanamachine kibanamachine removed the backport missing label Sep 7, 2021
@mgiota mgiota deleted the 110519_update_documents branch January 4, 2022 10:23

Successfully merging this pull request may close these issues.

[RAC] [Observability] Alert documents are not updated as expected when the write index changes