SO migration v2 requires manual cleanup steps in case of failure to retry the migration #101354

Open
pgayvallet opened this issue Jun 4, 2021 · 7 comments
Labels
discuss project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@pgayvallet
Contributor

pgayvallet commented Jun 4, 2021

Currently, the v2 migration algorithm does not recover from failures: the state after a failure differs from the state before the migration was initiated.

  • We're putting a write block on the source index that is not removed in case of failure
  • We're creating a temp (${indexPrefix}_${kibanaVersion}_reindex_temp) index that is not removed in case of failure
    • when retrying the migration, if the temp index is present, we re-use it without purging it, meaning that it may keep documents that are no longer present in the source index (if some documents were deleted between the migration attempts)
    • note that we're also setting a write block on the temp index when the 'client-side' reindex is performed, which is also not removed in case of a later failure
  • We're creating the target index (${indexPrefix}_${kibanaVersion}_001) and not removing it in case of failure (note: it's very unlikely that the target index is created in case of failure, as the failure is almost guaranteed to occur in a previous stage)

This is very problematic when the migration fails, especially when this occurs because of corrupted saved objects, as it requires additional manual steps from the customer or from support to be able to fix and retry the migration.

For instance, if a migration fails because of a corrupted saved object, manual intervention is required to clean up the state (remove the source index's write block and potentially delete the temp and target indices). Needless to say, this is far from ideal.
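
Concretely, the manual cleanup currently looks something like the sketch below, using the Elasticsearch JS client (v7-style API). The index names are illustrative only; the actual names depend on the configured index prefix and the versions involved.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Illustrative names: the actual ones depend on the configured index prefix
// and the Kibana versions involved in the failed upgrade.
const sourceIndex = '.kibana_7.12.1_001';
const tempIndex = '.kibana_7.13.2_reindex_temp';
const targetIndex = '.kibana_7.13.2_001';

async function cleanupFailedMigration() {
  // 1. Remove the write block that the failed migration left on the source
  //    index, so the previous Kibana version can write to it again.
  await client.indices.putSettings({
    index: sourceIndex,
    body: { 'index.blocks.write': false },
  });

  // 2. Delete the temp and target indices left behind by the failed attempt.
  //    ignore_unavailable avoids an error when the failure happened before
  //    these indices were created.
  await client.indices.delete({
    index: [tempIndex, targetIndex],
    ignore_unavailable: true,
  });
}
```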

Also, if the customer tries to roll back to the previous version after a migration failure, it's still necessary to manually remove the write block on the previous version's index; not doing so results in the previous Kibana index being read-only (I don't think I need to explain the consequences here).

We recently had quite a few support issues because of that, where support and/or the customer had difficulties understanding how to get back to a state where retrying the migration was possible.

We would ideally have an idempotent migration, where a failure simply restores the exact state that existed before the migration was attempted.

In practice, however, due to multi-instance setups this is far from easy, as all instances may be performing the migration in parallel without being aware of each other's progress or status.

From #100631,

I think we need to be careful to only release the write block in scenarios where we know that other nodes are also going to fail the migration. If we release it on any failure, then we can introduce data loss. For example:

  • 3 Kibana nodes: one v7.13 node is still running, two v7.14 nodes start and begin migrations
  • One of the 7.14 nodes fails due to the shard limit being reached, and it releases the write block
  • The other 7.14 node happens to succeed (a shard in another index was deleted) and continues the migration
  • The 7.13 node continues writing data to the 7.13 index (where the .kibana alias points)
  • The 7.14 node finishes the migration, losing the writes from the 7.13 node

Basically, I feel like these problems are caused by an intrinsic flaw in the v2 migration design, where we decided (or rather, were technically forced) to allow multiple instances to run the migration optimistically in parallel instead of electing a single 'master' to perform it.

As each node isn't aware that other nodes may be at different stages of the migration, it's very hard to perform those cleanup steps, because we can't safely assume that the other instances will also fail at the same stage.

To be honest, I'm not sure how this can really be properly handled without changing the core design of the migration system to have a single instance perform the migration (which is something we considered infeasible during the initial discussions of the v2 migration design).

cc @kobelb

@pgayvallet pgayvallet added discuss Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient labels Jun 4, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@joshdover
Contributor

One goal of Saved Object migrations v2 was that the migration should be able to fail at any step and then be reattempted without user intervention. In essence, it should be idempotent, but as you noted it currently is not.

  • We're putting a write block on the source index that is not removed in case of failure

This aspect doesn't actually block the migration from running a second time, but it does create an issue for users who need to delete some corrupted documents. I think improving how we handle corrupted documents (see #100768) is a better solution than changing how this works.

  • We're creating a temp (${indexPrefix}_${kibanaVersion}_reindex_temp) index that is not removed in case of failure

    • when retrying the migration, if the temp index is present, we re-use it without purging it, meaning that it may keep documents that are no longer present in the source index (if some documents were deleted between the migration attempts)

Yep this is likely a problem that could be really frustrating when a corrupted document is encountered later in the process. If the user deletes the document from the original source index and reattempts the migration, I believe it will fail again since that document still exists in the temp index.

I can think of two solutions here, which are essentially the same idea (isolating migration attempts) but implemented two different ways:

  • Add a _001 suffix to the index name and increment it for each node that attempts the migration
    • If creating this index fails, increment the suffix and try again
    • This would result in needing to have enough disk & shard capacity in ES to accommodate every node that attempts a migration
    • Would result in potentially many dangling indices on migration failures, occupying more space & shards in ES. We could delete these all safely once this step has been successfully completed by any node.
  • Generate a UUID for each migration attempt and add it to the documents indexed during this step.
    • This would allow each node / migration attempt to filter only for the documents that it created
    • Avoids needing to create a new index for every attempt, so avoids further oversharding issues, but still has the disk space cost of the option above
    • This option may not actually be viable if we need to read the references field during the client-side migration step, or we'd also need to rewrite all of the IDs in that field.

I think the first option is likely easier and safer to implement, but it does require potentially using more shards in Elasticsearch on recurring failures.
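
Something like this is what I have in mind for the first option. Just a sketch; the helper name and error handling are illustrative, not the actual migration action code.

```ts
import { Client } from '@elastic/elasticsearch';

/**
 * Creates an isolated index for this migration attempt by incrementing the
 * numeric suffix until creation succeeds, so each node/attempt works against
 * its own index and a failed attempt can never pollute a later one.
 */
async function createIsolatedAttemptIndex(
  client: Client,
  baseName: string // e.g. '.kibana_7.14.0_reindex_temp'
): Promise<string> {
  for (let attempt = 1; attempt <= 100; attempt++) {
    const indexName = `${baseName}_${String(attempt).padStart(3, '0')}`;
    try {
      await client.indices.create({ index: indexName });
      return indexName;
    } catch (e: any) {
      // Another (possibly failed) attempt already owns this suffix: try the next one.
      if (e?.body?.error?.type === 'resource_already_exists_exception') {
        continue;
      }
      throw e;
    }
  }
  throw new Error(`Unable to create an isolated index for ${baseName}`);
}
```

As noted above, the cost is that every failed attempt leaves a real index behind until it is cleaned up.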

  • note that we're also setting a write block on the temp index when the 'client-side' reindex is performed, which is also not removed in case of a later failure

I think we could stop using this write block if we went with either option above for isolating each migration attempt / node from the others.

  • We're creating the target index (${indexPrefix}_${kibanaVersion}_001) and not removing it in case of failure (note: it's very unlikely that the target index is created in case of failure, as the failure is almost guaranteed to occur in a previous stage)

Similar to the above, we should probably start incrementing this suffix on each node if the index already exists. Whichever node 'wins' at completing the client-side migration step and moving the .kibana alias shouldn't delete the target index, but the other nodes should be able to safely delete theirs if moving the alias fails (due to it already having been moved).

@pgayvallet
Contributor Author

pgayvallet commented Jun 4, 2021

Note that from the RFC (https://github.com/elastic/kibana/blob/master/rfcs/text/0013_saved_object_migrations.md#40-assumptions-and-tradeoffs), this seems to be the expected behavior of the algorithm.

See also 4.2.1.3 Upgrade and rollback procedure

Now, to quote @kobelb:

that doesn't negate the issue that you've brought up though, if we're making users mad/sad, we should do something

@joshdover
Contributor

I read over the relevant RFC sections again a few times, and I actually don't think this is the expected behavior. Specifically, I think we're failing to meet section 4.2 Automatically retry migrations until they succeed since depending on which step fails, the system cannot currently automatically retry the migration.

Section 4.2.1.3 Upgrade and rollback procedure discusses how to downgrade Kibana, but I don't believe it's intended to be the steps for putting the cluster back into a state that the migration system can handle retrying a failed upgrade. At least that was not my understanding during the design phase of this project.

I believe the algorithm should be able to fail at any step and be able to re-attempt the migration once the root cause has been addressed (e.g. the corrupt SO has been deleted/fixed) without any further manual intervention. This goal isn't explicitly stated in the RFC, but if it is achievable, it seems like the best path forward IMO.

@pgayvallet What do you think of the isolation idea I posted above as a solution to this problem? It does not address the issue where users need to remove the write block in order to delete/modify corrupted SOs but it would greatly simplify the rest of the process.

@pgayvallet
Contributor Author

I read over the relevant RFC sections again a few times, and I actually don't think this is the expected behavior.

I think I have a different understanding of this section of the RFC every time I read it... However, you may be right, given

Idempotent migrations don't require coordination making the algorithm significantly less complex and will never require manual intervention to retry

What do you think of the isolation idea I posted above as a solution to this problem?

To be honest, even if I don't really have better propositions, I'm not thrilled by either one 😅

It does not address the issue where users need to remove the write block in order to delete/modify corrupted SOs

Note that atm, this is the main cause of SDH for 7.13+ migrations, so it's probably what we should focus on.

Add a _001 suffix to the index name and increment it for each node that attempts the migration

Without the possibility of performing atomic operations against ES, my gut feeling is that this would not be good enough. E.g. what if, at the end of the migration, two nodes check at the same time whether the alias has already been updated? I think the whole point of the direction rudolf went was to avoid that kind of edge case.

Also, the number of residual indices customers would need to clean up in case of multiple failures seems like a major con.

Generate a UUID for each migration attempt and add it to the documents indexed during this step

Add it when exactly? During the source->temp 'client-side' reindex? I'm not sure how this would help us skip documents during the temp -> target ES-side reindex. Correct me if I'm wrong, but that would force us to also use a client-side reindex for the second reindex, which seems to have a significant impact in terms of performance.

@joshdover
Contributor

Without the possibility of performing atomic operations against ES, my gut feeling is that this would not be good enough. E.g. what if, at the end of the migration, two nodes check at the same time whether the alias has already been updated? I think the whole point of the direction rudolf went was to avoid that kind of edge case.

Elasticsearch actually does support atomic operations for adding/removing aliases and we already use this functionality today. See the example from the documentation. Essentially, the operation will fail if any of the "actions" are invalid, like asking an alias to be removed from an index that it doesn't currently point to because another node already updated it.
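
For illustration, something like this is the kind of atomic swap I mean (index names are illustrative). Both actions are applied as a single atomic operation, so if another node already moved the alias, the remove action no longer matches and the whole request fails without changing anything.

```ts
import { Client } from '@elastic/elasticsearch';

// Atomically switch the .kibana alias from the old target index to the new
// one. If another node already moved the alias, the `remove` action is
// invalid, the whole request fails, and the alias is left untouched.
async function switchKibanaAlias(client: Client) {
  await client.indices.updateAliases({
    body: {
      actions: [
        { remove: { index: '.kibana_7.13.2_001', alias: '.kibana' } },
        { add: { index: '.kibana_7.14.0_001', alias: '.kibana' } },
      ],
    },
  });
}
```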

Add it when exactly? During the source->temp 'client-side' reindex? I'm not sure how this would help us skip documents during the temp -> target ES-side reindex. Correct me if I'm wrong, but that would force us to also use a client-side reindex for the second reindex, which seems to have a significant impact in terms of performance.

Yeah, good point, it would require that we only do client-side reindexes. I definitely prefer the option of using completely separate indices since it eliminates a lot of the possibility for bugs.

I don't love either option either, but given the coordination-free design we chose I don't see any better alternatives yet. I think there are ways we could smooth over some of the drawbacks of creating separate indices per migration attempt. For instance, we could automatically delete them in many failure scenarios since we wouldn't have to worry about another node writing to the index at the same time.
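
As a very rough illustration of that cleanup idea, assuming the per-attempt indices share a predictable prefix (the pattern and helper below are hypothetical):

```ts
import { Client } from '@elastic/elasticsearch';

/**
 * Deletes dangling per-attempt indices left behind by failed migration
 * attempts. `currentIndex` is the index owned by the attempt that is still
 * running (or has succeeded) and must not be deleted.
 */
async function deleteDanglingAttemptIndices(
  client: Client,
  attemptIndexPattern: string, // hypothetical, e.g. '.kibana_7.14.0_reindex_temp_*'
  currentIndex: string
) {
  // List the indices matching the per-attempt pattern.
  const { body } = await client.cat.indices<Array<{ index: string }>>({
    index: attemptIndexPattern,
    format: 'json',
    h: 'index',
  });

  // Keep everything except the index owned by the current attempt.
  const dangling = body.map((row) => row.index).filter((name) => name !== currentIndex);

  if (dangling.length > 0) {
    await client.indices.delete({ index: dangling });
  }
}
```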

@pgayvallet
Contributor Author

but given the coordination-free design we chose I don't see any better alternatives yet

I don't either. But I agree that the 'totally distinct index per instance' approach may be an acceptable compromise. We probably want to dig a little deeper into the complete workflow and come up with a more detailed design.
