SO migration v2 requires manual cleanup steps in case of failure to retry the migration #101354
Pinging @elastic/kibana-core (Team:Core)
One goal of Saved Object migrations v2 was that the migration should be able to fail at any step and then be reattempted without user intervention. In essence, it should be idempotent, but as you noted it currently is not.
This aspect doesn't actually block the migrations from running a second time, but it does create an issue for users who need to delete some corrupted documents. I think improving how we handle corrupted documents (see #100768) is a better solution than changing how this works.
Yep, this is likely a problem that could be really frustrating when a corrupted document is encountered later in the process. If the user deletes the document from the original source index and reattempts the migration, I believe it will fail again since that document still exists in the temp index. I can think of two solutions here, which are essentially the same idea (isolating migration attempts) but implemented in two different ways:
I think the first option is likely easier and safer to implement, but it does require potentially using more shards in Elasticsearch on recurring failures.
I think we could stop using this write block if we went with either option above for isolating each migration attempt / node from one another.
Similar to above, we should probably start incrementing this suffix on each node if this index already exists. Whichever node 'wins' at completing the client-side migration step and moving the …
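To illustrate the suffix idea, here's a minimal sketch (a hypothetical helper, not the actual migration code, assuming the 7.x `@elastic/elasticsearch` client):

```ts
import { Client } from '@elastic/elasticsearch';

// Hypothetical helper: find the first free `_00N` suffix so that each
// node / migration attempt targets its own index.
async function pickNextTargetIndex(
  client: Client,
  indexPrefix: string, // e.g. '.kibana'
  kibanaVersion: string // e.g. '7.13.0'
): Promise<string> {
  for (let n = 1; ; n++) {
    const candidate = `${indexPrefix}_${kibanaVersion}_${String(n).padStart(3, '0')}`;
    const { body: exists } = await client.indices.exists({ index: candidate });
    if (!exists) {
      // Two nodes can still race to the same suffix; the subsequent index
      // creation would need to treat a "resource already exists" error as
      // "retry with the next suffix".
      return candidate;
    }
  }
}
```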
Note that, from the RFC (https://github.com/elastic/kibana/blob/master/rfcs/text/0013_saved_object_migrations.md#40-assumptions-and-tradeoffs), this seems to be the expected behavior of the algorithm. See also, to quote @kobelb:
I read over the relevant RFC sections again a few times, and I actually don't think this is the expected behavior. Specifically, I think we're failing to meet section 4.2 Automatically retry migrations until they succeed, since depending on which step fails, the system cannot currently retry the migration automatically. Section 4.2.1.3 Upgrade and rollback procedure discusses how to downgrade Kibana, but I don't believe it's intended to describe the steps for putting the cluster back into a state from which the migration system can retry a failed upgrade. At least that was not my understanding during the design phase of this project. I believe the algorithm should be able to fail at any step and be able to re-attempt the migration once the root cause has been addressed (e.g. the corrupt SO has been deleted/fixed) without any further manual intervention. This goal isn't explicitly stated in the RFC, but if it is achievable, it seems like the best path forward IMO.

@pgayvallet What do you think of the isolation idea I posted above as a solution to this problem? It does not address the issue where users need to remove the write block in order to delete/modify corrupted SOs, but it would greatly simplify the rest of the process.
I think I have a different understanding of this section of the RFC every time I read it... However, you may be right, given
To be honest, even though I don't really have better proposals, I'm not thrilled by either one 😅
Note that atm, this is the main cause of SDH for 7.13+ migrations, so it's probably what we should focus on.
Without the possibility to perform atomic operations against ES, my gut feeling is that this would not be good enough. E.g. what if, at the end of the migration, two nodes check at the same time whether the alias has already been updated? I think the whole point of the direction rudolf went was to avoid that kind of edge case. Also, the number of residual indices customers would need to clean up in case of multiple failures seems like a major con.
Add it when, exactly? During the source->temp 'client-side' reindex? I'm not sure how this would help to skip documents during the …
Elasticsearch actually does support atomic operations for adding/removing aliases and we already use this functionality today. See the example from the documentation. Essentially, the operation will fail if any of the "actions" are invalid, like asking an alias to be removed from an index that it doesn't currently point to because another node already updated it.
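For reference, a minimal sketch of such an atomic alias swap (using the 7.x `@elastic/elasticsearch` client; the index and alias names are only illustrative):

```ts
import { Client } from '@elastic/elasticsearch';

// All actions in a single updateAliases call are applied atomically: if the
// `remove` action is invalid (e.g. another node already moved the `.kibana`
// alias off the old index), the whole request fails and nothing is changed.
async function switchAlias(client: Client): Promise<void> {
  await client.indices.updateAliases({
    body: {
      actions: [
        { remove: { index: '.kibana_7.12.0_001', alias: '.kibana' } },
        { add: { index: '.kibana_7.13.0_001', alias: '.kibana' } },
      ],
    },
  });
}
```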
Yeah good point, it would require that we only do client-side reindexes. I definitely prefer the option of using completely separate indices since it eliminates a lot of the possibility for bugs. I don't love either option either, but given the coordination-free design we chose, I don't see any better alternatives yet. I think there are ways we could smooth over some of the drawbacks of creating separate indices per migration attempt. For instance, we could automatically delete them in many failure scenarios, since we wouldn't have to worry about another node writing to the index at the same time.
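A rough sketch of what that automatic cleanup could look like (hypothetical helper, assuming the 7.x `@elastic/elasticsearch` client and that the per-attempt index is only ever written to by the node that created it):

```ts
import { Client } from '@elastic/elasticsearch';

// Hypothetical cleanup for the "separate index per migration attempt" option:
// since no other node writes to this attempt's index, it can simply be
// deleted when the attempt fails.
async function cleanupFailedAttempt(client: Client, attemptIndex: string): Promise<void> {
  try {
    await client.indices.delete({ index: attemptIndex });
  } catch (e: any) {
    // The index may already be gone (e.g. cleanup ran twice); ignore 404s.
    if (e?.meta?.statusCode !== 404) {
      throw e;
    }
  }
}
```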
I don't either. But I agree that the 'totally distinct index per instance' approach may be an acceptable compromise. We probably want to dig a little into the complete workflow and a more detailed design.
Currently, the v2 migration algorithm is not recovering in case of failure: the state after a failure differs from the state before initiating the migration. Specifically, a failed migration can leave behind:

- a `write` block on the source index that is not removed in case of failure
- a `temp` index (`${indexPrefix}_${kibanaVersion}_reindex_temp`) that is not removed in case of failure (and if a `temp` index is present, we re-use it without purging it, meaning that it may keep documents that are no longer present in the source index, if some documents were deleted between the migration attempts)
- a `write lock` on the temp index, added when the 'client-side' reindex is performed, that is also not removed in case of later failure
- the `target` index (`${indexPrefix}_${kibanaVersion}_001`), which is created and not removed in case of failure (note: it's very unlikely that the `target` index is left behind, as the failure is almost guaranteed to occur in a previous stage)

This is very problematic when the migration fails, especially when this occurs because of corrupted saved objects, as it requires additional manual steps from the customer or from support to be able to fix and retry the migration.
For instance, if a migration fails because of a corrupted saved object, manual intervention is required to clean up the state (remove the source index's write lock and, if necessary, delete the `temp` and `target` indices). Needless to say, this is far from ideal.

Also, if the customer tries to roll back to the previous version after a migration failure, it's still necessary to manually remove the `write lock` on the previous version's index, as not doing so results in the previous Kibana index being read-only (I don't think I need to explain the consequences here).

We recently had quite a few support issues because of that, where support and/or the customer had difficulties understanding how to get back to a state where retrying the migration was possible.
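To make those manual steps concrete, the cleanup roughly boils down to the following (a sketch using the 7.x `@elastic/elasticsearch` client; the index names assume a 7.12 -> 7.13 upgrade with the default `.kibana` prefix):

```ts
import { Client } from '@elastic/elasticsearch';

async function manualCleanup(client: Client): Promise<void> {
  // Remove the write block left on the source index by the failed migration,
  // so the previous Kibana version can write to it again after a rollback.
  await client.indices.putSettings({
    index: '.kibana_7.12.0_001',
    body: { index: { blocks: { write: false } } },
  });

  // Delete the leftover temp and target indices so the next attempt starts clean.
  await client.indices.delete({
    index: ['.kibana_7.13.0_reindex_temp', '.kibana_7.13.0_001'],
    ignore_unavailable: true,
  });
}
```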
We would ideally have an idempotent migration, where a failure simply brings us back to the exact state before the migration was attempted.
In practice, however, due to multi-instance setups, this is far from easy, as all instances may be performing the migration in parallel and are not aware of the progression and/or status of the other nodes.
From #100631,
Basically, I feel like these problems are caused by an intrinsic flaw in the v2 migration design, where we decided (or, rather, were technically forced) to allow multiple instances to run the migration optimistically in parallel instead of electing a single 'master' to perform the migration.
As each node isn't aware of other nodes potentially being in different stages of the migration, it's very hard to perform those cleanup steps, because we can't safely assume that other instances will also fail at the same stage.
To be honest, I'm not sure how this can really be properly handled without changing the core design of the migration system to switch to a single instance performing the migration (which is something we considered not doable during the initial discussions of the v2 migration design).
cc @kobelb