Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Platform][Alerting] Reducing the cost of change in data migrations #96291

Open
gmmorris opened this issue Apr 6, 2021 · 12 comments
Open

[Platform][Alerting] Reducing the cost of change in data migrations #96291

gmmorris opened this issue Apr 6, 2021 · 12 comments
Labels
discuss Feature:RAC label obsolete Meta Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Theme: rac label obsolete

Comments

@gmmorris
Copy link
Contributor

gmmorris commented Apr 6, 2021

Context

As the RAC initiative progresses, we're making decisions about the shape of the Alerts-as-data indices and various Saved Objects.
These decisions are based on what we know now about the needs of Security, O11y and Stack Rules (including ML and Maps), but I think it's safe to assume we're going to get some thing wrong, and other things are likely to change (as new requirements come in, and we broaden the scope to include other features as part of 8.x).

As thing stand, the cost of change to the shape of data (both in system/hidden indices and SOs) is so high that we tend to think of these as prohibitive. This fear has historically made it hard for the Alerting team to make decisions about the correct generic approach, slowing us down.

I'd like to use this issue to discuss how we can change that.

Key Challenges

Below are the key challenges, making the cost of change higher than we'd like it to be:

  1. There is no OOTB method for migrating system/hidden indices. SIEM currently have a curl-based manual index migration, but there is no system in place to ensure this migration is called correctly as part of an upgrade (as we have with SO migrations). If we need to migrate a system/hidden index, the new should provide a platform level service for this.
  2. As Alerts-as-data grows in use, we expect many customers to accumulate huge data indices full of alerts. It doesn't seem reasonable to expect historical indices to be migrated when this schema changes, but neither can we afford to lock ourselves into a schema that was designed in 7.14 forever. How do we support incremental changes to these schema from minor to minor while still supporting historical data?
  3. There are no cross-type Saved Object migrations. For instance, each rule SavedObject has a taskId field that points at a task SO type. We need to pluck the API keys from the rule, and migrate it onto the task that is referenced by the taskId field. We have no way of doing this.

cc @kobelb @spong @tsg @alexh97 @mfinkle @elastic/kibana-core

@gmmorris gmmorris added discuss Feature:Alerting Feature:RAC label obsolete Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Apr 6, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mfinkle
Copy link

mfinkle commented Apr 6, 2021

I can't comment on much here, but for Point 2, I do think we should think we could leverage Runtime Fields to define as much of the "schema" as possible to mitigate against inevitable Alert schema changes. We could keep the main body of the Alert in a jSON blob or something, and use Runtime Fields to define the specific fields pulling data from the blob. As the schema (in the blob) changes, we'd update the Runtime Field definitions.

For Point 3, we could see where cross-type SO migrations fall on the Kibana Core roadmap (if we've considered them yet)

@gmmorris
Copy link
Contributor Author

gmmorris commented Apr 6, 2021

I can't comment on much here, but for Point 2, I do think we should think we could leverage Runtime Fields to define as much of the "schema" as possible to mitigate against inevitable Alert schema changes. We could keep the main body of the Alert in a jSON blob or something, and use Runtime Fields to define the specific fields pulling data from the blob. As the schema (in the blob) changes, we'd update the Runtime Field definitions.

That's an interesting thought, which hadn't actually crossed my mind.
@spong - How feasible would such an approach be given the kinds of aggregations, and searches you expect to run on the data in alerts-as-data? 🤔

@gmmorris
Copy link
Contributor Author

gmmorris commented Apr 6, 2021

For Point 3, we could see where cross-type SO migrations fall on the Kibana Core roadmap (if we've considered them yet)

🙏
I don't think this is needed until the 8.x timeframe. (am I right, @mikecote ?)

@mikecote
Copy link
Contributor

mikecote commented Apr 7, 2021

For Point 3, we could see where cross-type SO migrations fall on the Kibana Core roadmap (if we've considered them yet)

🙏
I don't think this is needed until the 8.x timeframe. (am I right, @mikecote ?)

Correct, I don't think we'll need this before 8.x.

@spong
Copy link
Member

spong commented Apr 8, 2021

Thank you so much for raising this @gmmorris! This has been something that's been a consistent pain point on the Detections side as we've continued to add more and more features since debuting back in 7.6. It was in preparation for GA (and a full backlog of features itching to change the mapping :) that lead us to create the Detection Alerts Migration API to at least provide some helper to our users in the event we would need to make a migration.

For the most part we've been able to just add the new mappings with the only annoyance being that a privileged user would need to visit the Security App to trigger a rollover for the new mappings to be available, which left us in an interesting place with exposing new features that might not be supported till the right user visited the app. This shouldn't be an issue with RAC though since we're using the kibana_system_user for index mgmt (and a rollover can be triggered on rule-registry start post-upgrade). We did have one incorrectly mapped field, but with where we were at we were able to get by without a migration. @tsg has recommended field aliases to get around similar issues, but we haven't had to go that route just yet.

Either way, a platform level service for performing migrations on upgrade would go a long way in both ensuring the flexibility is there if needed, and putting us in a position to quickly iterate towards the ideal implementation without the need to over-engineer it up front in order to support what we think we may need for the next two years without a breaking change.

That's an interesting thought, which hadn't actually crossed my mind.
@spong - How feasible would such an approach be given the kinds of aggregations, and searches you expect to run on the data in alerts-as-data? 🤔

Frankly, we haven't leveraged too many aggregations on top of alerts just yet (only the Alerts Histogram/Last Alert component), so it hasn't really been an issue. That said, we'll probably start introducing them more and more as we start providing advanced alert triage flows with groupings. Either way, I do like the approach of using runtime fields as the trialing of new fields before they get added to the main schema. I'm working on a runtime field reference rule, so should hopefully know a bit more here soon with regards to what that may look like. (This will be crucial for users to add their own fields to alerts since we won't be supporting composable index templates.)

@pgayvallet
Copy link
Contributor

There are no cross-type Saved Object migrations. For instance, each rule SavedObject has a taskId field that points at a task SO type. We need to pluck the API keys from the rule, and migrate it onto the task that is referenced by the taskId field. We have no way of doing this.

That's a different need than what is specified in #91143, right? #91143 is about migrating type A objects to type B, where what you're asking here is to be able to access arbitrary objects from type B during the migration of objects of type A, right?

@gmmorris
Copy link
Contributor Author

gmmorris commented Apr 8, 2021

There are no cross-type Saved Object migrations. For instance, each rule SavedObject has a taskId field that points at a task SO type. We need to pluck the API keys from the rule, and migrate it onto the task that is referenced by the taskId field. We have no way of doing this.

That's a different need than what is specified in #91143, right? #91143 is about migrating type A objects to type B, where what you're asking here is to be able to access arbitrary objects from type B during the migration of objects of type A, right?

Correct 😬
It's something that only came up in this issue, and AFAIK it's the only instance we've had of this need so far.

@gmmorris
Copy link
Contributor Author

gmmorris commented Apr 8, 2021

Thank you so much for raising this @gmmorris! This has been something that's been a consistent pain point on the Detections side as we've continued to add more and more features since debuting back in 7.6. It was in preparation for GA (and a full backlog of features itching to change the mapping :) that lead us to create the Detection Alerts Migration API to at least provide some helper to our users in the event we would need to make a migration.

Thanks for the context :)

For the most part we've been able to just add the new mappings with the only annoyance being that a privileged user would need to visit the Security App to trigger a rollover for the new mappings to be available,...

Could you expand on this @spong ?
Did you actually migrate the historical data itself? Or just the mappings?

Judging by the docs, I'm guessing it's only the mappings:

While the process is neither destructive nor interferes with existing data, it may be resource-intensive. As such, it is recommended that you plan your migrations accordingly.

I don't think we'd have much of an option of migrating actual data (as the amount of data is going to be huge), but I do wonder if there are changes we will find hard to make i the future if we're limited to only mappings changes. 🤔
I don't have much experience at migrating large systems built on the stack... any thoughts here from past experience?

...which left us in an interesting place with exposing new features that might not be supported till the right user visited the app.

This is exactly why I feel we need to discuss formalising this kind of migration.
Otherwise we basically break alerting until there's user intervention and we have no clear way to express this to the user.

This shouldn't be an issue with RAC though since we're using the kibana_system_user for index mgmt (and a rollover can be triggered on rule-registry start post-upgrade). We did have one incorrectly mapped field, but with where we were at we were able to get by without a migration. @tsg has recommended field aliases to get around similar issues, but we haven't had to go that route just yet.

Either way, a platform level service for performing migrations on upgrade would go a long way in both ensuring the flexibility is there if needed, and putting us in a position to quickly iterate towards the ideal implementation without the need to over-engineer it up front in order to support what we think we may need for the next two years without a breaking change.

++

Any chance you could enumerate specific features you'd expect this platform service to provide?

That's an interesting thought, which hadn't actually crossed my mind.
@spong - How feasible would such an approach be given the kinds of aggregations, and searches you expect to run on the data in alerts-as-data? 🤔

Frankly, we haven't leveraged too many aggregations on top of alerts just yet (only the Alerts Histogram/Last Alert component), so it hasn't really been an issue. That said, we'll probably start introducing them more and more as we start providing advanced alert triage flows with groupings. Either way, I do like the approach of using runtime fields as the trialing of new fields before they get added to the main schema. I'm working on a runtime field reference rule, so should hopefully know a bit more here soon with regards to what that may look like. (This will be crucial for users to add their own fields to alerts since we won't be supporting composable index templates.)

Brilliant, that's a good start.
Is it worth adding an action item on the Alerts-as-data issue to identify parts of the schema that we should/could express as runtime field instead of on the mappings? 🤔
Or would you rather leave that as something to do at a later stage (assuming we can modify mappings after the fact as part of upgrades)?

@rudolf
Copy link
Contributor

rudolf commented Apr 8, 2021

Runtime fields and aliases can be helpful ways to reduce the impact of this problem, but for any domain there's always inevitably something new you learn that requires you to model your data in a completely different way where just changing the mappings isn't enough, you also need to change your documents.

Ultimately we want to help developers shift this complexity to the migration system so that their business logic is free from if/else clauses dealing with all the past modelling mistakes.

To solve (1) and (2) I've been thinking of a slightly different migration algorithm that's eventually consistent in order to give us the performance over large data sets: #96626

For (3) the naive solution is to allow performing an async read for every transformed document, but this would be very inefficient. The more performant solution is what we documented in #34996 which would basically allow the migration to collect some data before it starts and then use that data for transforming every document. This means you need to be able to fit all the data you need in memory but this is probably sufficient for most of the migrations we're doing.

@pgayvallet
Copy link
Contributor

pgayvallet commented Apr 12, 2021

@rudolf should we reopen #34996 then (or is it too soon in the discussion)?

@gmmorris
Copy link
Contributor Author

gmmorris commented Apr 12, 2021

Runtime fields and aliases can be helpful ways to reduce the impact of this problem, but for any domain there's always inevitably something new you learn that requires you to model your data in a completely different way where just changing the mappings isn't enough, you also need to change your documents.

Ultimately we want to help developers shift this complexity to the migration system so that their business logic is free from if/else clauses dealing with all the past modelling mistakes.

To solve (1) and (2) I've been thinking of a slightly different migration algorithm that's eventually consistent in order to give us the performance over large data sets: #96626

I tend to agree with all of that. :)

Regarding #96626: I think it's a great idea. Given the problems we've seen these past few weeks with the growing task and action SOs, this would reduce the likelihood of breaking a customer upgrade.
One thing to note though - this issues is not, predominantly, about SOs, as Alerts-as-data are hidden/system ECS indices, rather than SO... so there's both the aspect of eventually consistent migrations in SOs, and formalizing migrations of freeform / ECS indices.

@gmmorris gmmorris added RFC non-issue Indicates to automation that a pull request should not appear in the release notes Meta Theme: rac label obsolete and removed Feature:Alerting RFC non-issue Indicates to automation that a pull request should not appear in the release notes labels Jul 1, 2021
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss Feature:RAC label obsolete Meta Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Theme: rac label obsolete
Projects
No open projects
Development

No branches or pull requests

8 participants