Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RAC][Rule Registry] Implement versioning and backing indices #109276

Conversation

banderror
Copy link
Contributor

@banderror banderror commented Aug 19, 2021

Ticket: #109293

🚨 This PR is critical for Observability 7.15 🚨

Summary

This PR fixes the indexing implementation in rule_registry. It implements the suggestions for backwards compatibility described in the ticket:

  • changes the naming scheme and introduces the concept of "backing indices", so that names of the concrete ("backing") indices != names of their aliases
  • adds versioning based on the current Kibana version

TODO:

  • Change index naming (implement the concept of backing indices)
  • Include Kibana version into the index template metadata
  • Include Kibana version into the document fields
  • Remove version from IndexOptions (parameters provided by solutions/plugins when initializing alerts-as-data indices)
  • Fix CI

Checklist

@banderror banderror added Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. labels Aug 19, 2021
@banderror banderror self-assigned this Aug 19, 2021
@banderror banderror force-pushed the rule-registry-index-versioning-and-backing-indices branch from cdd14cf to e7a8b2b Compare August 19, 2021 17:27
@banderror banderror added auto-backport Deprecated - use backport:version if exact versions are needed bug Fixes for quality problems that affect the customer experience release_note:skip Skip the PR/issue when compiling release notes Theme: rac label obsolete labels Aug 19, 2021
@banderror banderror force-pushed the rule-registry-index-versioning-and-backing-indices branch from e2fad72 to e8427ec Compare August 19, 2021 23:13
@banderror banderror force-pushed the rule-registry-index-versioning-and-backing-indices branch from a11a013 to 5cf15e9 Compare August 20, 2021 10:47
@banderror banderror marked this pull request as ready for review August 20, 2021 13:56
@banderror banderror requested review from a team as code owners August 20, 2021 13:56
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Copy link
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@banderror banderror requested a review from a team August 20, 2021 13:57
@banderror
Copy link
Contributor Author

Security Solution Cypress Tests (Chrome) seem to be hanging due to issues in Cypress tests themselves, unrelated to this PR. I think this PR is ready and can be reviewed while I'm trying to figure out what's going on with Cypress.

@banderror banderror requested a review from kobelb August 20, 2021 14:00
@banderror banderror added the impact:critical This issue should be addressed immediately due to a critical level of impact on the product. label Aug 20, 2021
Copy link
Contributor

@justinkambic justinkambic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uptime changes LGTM and the existing CI concerns don't seem to overlap.

@banderror banderror force-pushed the rule-registry-index-versioning-and-backing-indices branch from 5b29455 to aefd13e Compare August 25, 2021 14:49
@banderror
Copy link
Contributor Author

@kobelb thank you for the approval and your feedback 👍

I'm not 100% sure that we need to completely remove the upgrade/rollover logic from 7.15, but I share your general concerns about it, and in particular with regards to its "compatibility" with the backwards compatibility ideas ("further investigation is required to determine how this approach will work with the backward compatibility layer that we've proposed adding").

I agree that we should have to think and work on that more, we just synced within the team and we'll schedule a meeting for that.

@kobelb
Copy link
Contributor

kobelb commented Aug 25, 2021

Thanks, @banderror, please feel free to pull me into these meetings.

I'm not 100% sure that we need to completely remove the upgrade/rollover logic from 7.15

Do we have any mitigations put in place to prevent us from accidentally introducing incompatible mappings and preventing the rollover?

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
ruleRegistry 117 118 +1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
apm 4.4MB 4.4MB +50.0B
observability 568.1KB 568.1KB +50.0B
total +100.0B
Unknown metric groups

API count

id before after diff
ruleRegistry 140 141 +1

History

  • 💚 Build #148462 succeeded 5b29455e8a9791d8ad87ac3cf0ab4b890833928c
  • 💚 Build #148048 succeeded 225956971057fafaf1f8c7c8c1c7275a79eb7323
  • 💚 Build #147723 succeeded d6769b88769fb2a4f3a58266d11f161aa3e9e3ad
  • 💚 Build #147517 succeeded 6c14378da24c068bc5cc2f34f2215fa440c436b0
  • 💔 Build #147240 failed 5cf15e9950daceeccf9c48051eb0cc6c4d302263

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @banderror

@banderror
Copy link
Contributor Author

Do we have any mitigations put in place to prevent us from accidentally introducing incompatible mappings and preventing the rollover?

@kobelb that's a difficult question, let me try to answer. Right now we don't have any; if we change our component templates in the code (either one of the common ones like "technical mappings", or one of the solution-specific ones) in an incompatible manner, the logic will do the rollover.

  • This can't happen on upgrade from 7.14.x to 7.15.0 because there won't be any existing RAC indices.
  • If the user upgrades Kibana from 7.15.0 to any further version, RAC indices will exist at this point (in general).
    • If the mappings are incompatible, we will do the rollover, which will create a new concrete index, and new documents with the new version will start being written in the new index. We'll have 2 concrete indices, each containing documents of only a single version which will be equal to the version specified in the index _meta and mappings._meta. To me it seems like a good case, it should work well with the ideas for backwards compatibility (we should be able to apply runtime fields or field aliases to indices corresponding to old versions).
    • If the mappings are compatible, that's the problem. We won't do the rollover, and just update the mappings of the existing concrete index with the new 7.15.0+ mappings. Index _meta and mappings._meta of this concrete index will be updated to the new version. We will start writing new documents to the same index and end up having documents of 2 versions in the same concrete index.

Imho the problem is when we don't do the rollover when we should.

Do you see a case when the problem is in doing the rollover when we shouldn't?

@banderror
Copy link
Contributor Author

I will go ahead and merge this PR. Thanks everyone, that was a great feedback 👍

We will definitely need to address Brandon's concerns in 7.15, I'm just not sure how exactly right now.

  • If we completely remove the upgrade logic, we will need to keep in mind that any changes in the schema won't be reflected in the app (including in dev mode) until the next release, when this upgrade logic will be properly implemented. Patch versions will be a risk (what if we need to adjust the schema in 7.15.1?).
  • If we leave it unchanged, we will introduce a risk of writing documents of 2+ different versions into the same concrete index (which will have a single version in its _meta).
  • Maybe I guess we need to fix it?..

@banderror banderror merged commit a299604 into elastic:master Aug 25, 2021
@banderror banderror deleted the rule-registry-index-versioning-and-backing-indices branch August 25, 2021 17:54
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Aug 25, 2021
…c#109276)

**Ticket:** elastic#109293

🚨 **This PR is critical for Observability 7.15** 🚨

## Summary

This PR fixes the indexing implementation in `rule_registry`. It implements the suggestions for backwards compatibility described in the ticket:

- changes the naming scheme and introduces the concept of "backing indices", so that names of the concrete ("backing") indices != names of their aliases
- adds versioning based on the current Kibana version

TODO:

- [x] Change index naming (implement the concept of backing indices)
- [x] Include Kibana version into the index template metadata
- [x] Include Kibana version into the document fields
- [x] Remove `version` from `IndexOptions` (parameters provided by solutions/plugins when initializing alerts-as-data indices)
- [x] Fix CI

### Checklist

- [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials
- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Aug 25, 2021
…c#109276)

**Ticket:** elastic#109293

🚨 **This PR is critical for Observability 7.15** 🚨

## Summary

This PR fixes the indexing implementation in `rule_registry`. It implements the suggestions for backwards compatibility described in the ticket:

- changes the naming scheme and introduces the concept of "backing indices", so that names of the concrete ("backing") indices != names of their aliases
- adds versioning based on the current Kibana version

TODO:

- [x] Change index naming (implement the concept of backing indices)
- [x] Include Kibana version into the index template metadata
- [x] Include Kibana version into the document fields
- [x] Remove `version` from `IndexOptions` (parameters provided by solutions/plugins when initializing alerts-as-data indices)
- [x] Fix CI

### Checklist

- [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials
- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
@kibanamachine
Copy link
Contributor

💚 Backport successful

Status Branch Result
7.15
7.x

The backport PRs will be merged automatically after passing CI.

kibanamachine added a commit that referenced this pull request Aug 25, 2021
… (#110125)

**Ticket:** #109293

🚨 **This PR is critical for Observability 7.15** 🚨

## Summary

This PR fixes the indexing implementation in `rule_registry`. It implements the suggestions for backwards compatibility described in the ticket:

- changes the naming scheme and introduces the concept of "backing indices", so that names of the concrete ("backing") indices != names of their aliases
- adds versioning based on the current Kibana version

TODO:

- [x] Change index naming (implement the concept of backing indices)
- [x] Include Kibana version into the index template metadata
- [x] Include Kibana version into the document fields
- [x] Remove `version` from `IndexOptions` (parameters provided by solutions/plugins when initializing alerts-as-data indices)
- [x] Fix CI

### Checklist

- [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials
- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

Co-authored-by: Georgii Gorbachev <georgii.gorbachev@elastic.co>
kibanamachine added a commit that referenced this pull request Aug 25, 2021
… (#110126)

**Ticket:** #109293

🚨 **This PR is critical for Observability 7.15** 🚨

## Summary

This PR fixes the indexing implementation in `rule_registry`. It implements the suggestions for backwards compatibility described in the ticket:

- changes the naming scheme and introduces the concept of "backing indices", so that names of the concrete ("backing") indices != names of their aliases
- adds versioning based on the current Kibana version

TODO:

- [x] Change index naming (implement the concept of backing indices)
- [x] Include Kibana version into the index template metadata
- [x] Include Kibana version into the document fields
- [x] Remove `version` from `IndexOptions` (parameters provided by solutions/plugins when initializing alerts-as-data indices)
- [x] Fix CI

### Checklist

- [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials
- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

Co-authored-by: Georgii Gorbachev <georgii.gorbachev@elastic.co>
@banderror
Copy link
Contributor Author

For visibility, here's a summary of the topics and decisions from the meeting with @kobelb about removing index upgrade logic. The bullets are ordered from "most immediate tasks" to "longer term goals":

  • (Now) We'll immediately add a feature flag around the logic that finds existing Alerts as Data indices and upgrades the mappings or rolls the index if the mappings can't be upgraded in place. We'll suggest that all devs enable this feature flag for development work so that their AAD indices will pull in any new changes to the schema intra-release.

  • (Before the next release that makes any mapping changes, possibly as soon as 7.15.1) We'll work on improving the index upgrade logic because upgrading the mappings in place is potentially incompatible with the plan to add field aliases and runtime fields to old indices for backwards compatibility. If the mapping on a single index can change over time, it's hard to define what aliases/runtime fields would need to be added to make it compatible with the new schema.

  • (Immediately after 7.15) We'll add tests that compare the 7.15 shipped schema to the 7.16+ schema in development to ensure that the existing schema can be upgraded properly.

  • (Starting soon, but may take a while) We'll document the use case and desired behavior around index upgrade logic as well as the plans for backwards compatibility in an RFC and coordinate with the Elasticsearch team. We'd like the ES team's input on possible new ES features to support our use case and thoughts on the tradeoff between oversharding due to aggressive rollover vs the risk of updating mappings in place.

  • (When we need it) Implement the logic to apply the aliases/runtime fields to old AAD indices per version. If the schema does not require major changes, this may not be needed for a while.

Other open questions:

  • Should we update the mapping in place if we're only adding a new field? Pro: limits oversharding. Con: divides alerts in the index into old/new, harder to apply some types of backwards compatibility changes
  • Right now the in-place index upgrade only applies new mappings. Can/should we update index settings as well?

Thanks to @marshallmain for putting this summary together.
I will update the parent ticket (#101016) with these decisions and create new tickets.

@weltenwort
Copy link
Member

Thanks for the summary. I wonder if the plan to not update mappings in place is at all compatible with mutating the alert documents. How can we reliably update the documents in old indices if they don't have the most recent mappings.

Maybe we should consider a "hybrid" approach in which an "update" means we write a document to the current write index (which we can assume to have the most recent mapping) and we delete the document from the index it was previously in? That would be almost like a reindex-on-write approach. The main risk I see is that we don't have a transaction to make this operation atomic, so we might have intermittently missing or duplicate alert docs. 🤔 What do you think?

@banderror
Copy link
Contributor Author

I wonder if the plan to not update mappings in place is at all compatible with mutating the alert documents. How can we reliably update the documents in old indices if they don't have the most recent mappings.

@weltenwort I think so, because it's a temporary measure that will be in place between 7.15.0 and most likely 7.15.x. Please check these for more info and let me know if you still think that there can be issues with this feature flag:

@banderror
Copy link
Contributor Author

Maybe we should consider a "hybrid" approach in which an "update" means we write a document to the current write index (which we can assume to have the most recent mapping) and we delete the document from the index it was previously in? That would be almost like a reindex-on-write approach. The main risk I see is that we don't have a transaction to make this operation atomic, so we might have intermittently missing or duplicate alert docs. 🤔 What do you think?

Interesting. Sounds like a reasonable thing to do in Observability because old alerts can be recovered, and it doesn't make a lot of sense to keep them in old indices. And you're saying you need to update the @timestamp, correct? I'm not sure about Security because we never update timestamps, only the status. I need to think about that more.

@marshallmain
Copy link
Contributor

@weltenwort We would have the ability to update mappings in place by applying runtime fields and aliases, but I think generally we wouldn't be applying this update to the write index alone. Instead these updated mappings would be applied as kind of a pseudo-migration, where we want to make all existing indices forwards-compatible with e.g. field renames or reformatting field values. Even if we update the write index mapping in place, we'd still have the issue with older non-write indices not having the most recent mappings. So it may make sense to just update all the existing index mappings in place with backwards compatibility updates and separate that logic from the logic of "making a write index with the correct new mappings".

Even if we do update-on-write, any application logic will still have to handle the possibility of a document having outdated mappings indefinitely so I don't think updating on write would solve our issue. I think in the case where we add a new field mapping, application logic will just have to define what to do in the case where the field is missing - or if the field is absolutely critical for all alerts (even old ones) to have, then we'd have to add it with a runtime field or a proper migration on startup.

@marshallmain
Copy link
Contributor

Also, updating a document will always cause it to be reindexed in its current index so we could allow new regular field mappings to be added to old indices in addition to runtime fields and aliases and that would be a simple way of implementing update-on-write.

@weltenwort
Copy link
Member

Instead these updated mappings would be applied as kind of a pseudo-migration, where we want to make all existing indices forwards-compatible with e.g. field renames or reformatting field values.

Yes, that was my comment during the PR review as well. I agree with your analysis. But the discovery of #110519 got me thinking about other ways to reduce the risk of future update conflicts. Moving the updated documents to the most recent write index would mean we need to make the old indices backwards-compatible only for reading but not for writing. Not sure how much of an advantage it would be, but I wanted to put the idea out there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Deprecated - use backport:version if exact versions are needed bug Fixes for quality problems that affect the customer experience impact:critical This issue should be addressed immediately due to a critical level of impact on the product. release_note:skip Skip the PR/issue when compiling release notes Team:APM All issues that need APM UI Team support Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Uptime - DEPRECATED Synthetics & RUM sub-team of Application Observability Theme: rac label obsolete v7.15.0 v7.16.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.