Update SO migration documentation: release write block on source index in case of corrupt document failure #100631

pgayvallet · 2021-05-26T06:12:00Z

atm, when a failure occurs during or after the client-side reindex from the source to temp, the write block on the source that was enabled during the SET_SOURCE_WRITE_BLOCK step is not released.

When the failure was caused by corrupted or invalid SO documents, this adds a necessary step during the manual intervention, as the block needs to be manually released before trying to fix or remove the faulty document(s). We need to update our documentation on how to remedy this situation: https://www.elastic.co/guide/en/kibana/master/upgrade-migrations.html#_corrupt_saved_objects

Original issue content:

atm, when a failure occurs during or after the client-side reindex from the source to temp, the write block on the source that was enabled during the SET_SOURCE_WRITE_BLOCK step is not released.

When the failure was caused by corrupted or invalid SO documents, this adds an unnecessary step during the manual intervention, as the block needs to be manually released before trying to fix or remove the faulty document(s).

In case of migration failure, we should ensure that the indices are back to their initial state by removing the write block during the cleanup step.

Note: we are also enabling a write block on:

- the temp index during the temp->target reindex, but as this is meant to be a temporary, 'internal' index, it's probably alright not to handle it
- the 'legacy' index for legacy index migration. Even if it would make sense to also release the block when such a legacy index migration occurs, the impact is far smaller


elasticmachine · 2021-05-26T06:12:01Z

Pinging @elastic/kibana-core (Team:Core)

joshdover · 2021-05-26T12:59:36Z

> In case of migration failure, we should ensure that the indices are back to their initial state by removing the write block during the cleanup step.

I think we need to be careful to only release the write block in scenarios where we know that other nodes are also going to fail the migration. If we release it on any failure, then we can introduce data loss. For example:

1. 3 Kibana nodes: one v7.13 node is still running, two v7.14 nodes start and begin migrations
2. One of the 7.14 nodes fails due to the shard limit being reached, and it releases the write block
3. The other 7.14 node happens to succeed because a shard in another index was deleted, and continues the migration
4. The 7.13 node continues writing data to the 7.13 index (where the .kibana alias points)
5. The 7.14 node finishes the migration, losing the writes from the 7.13 node.

Besides a shard failure, this could also happen when some nodes have different SO types registered than others. The nodes that have all the SO types registered will continue the migration, while the other nodes may give up.

In order to do this safely, we really can only release the write block if we know for sure that all other nodes are also going to fail the migration. I'm not sure there's a scenario where we can guarantee this though?
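To make the race in the three-node example concrete, here is a minimal, purely illustrative model of it. Everything in it is an assumption made for the sketch: the `Index` class, the `run_scenario` function, and the document IDs are hypothetical and are not Kibana or Elasticsearch code. It only shows why a write accepted by the source index after the other node has copied it never reaches the target.

```python
class Index:
    """Toy stand-in for an Elasticsearch index (hypothetical, for illustration only)."""

    def __init__(self, docs=None):
        self.docs = dict(docs or {})
        self.write_blocked = False

    def write(self, doc_id, doc):
        if self.write_blocked:
            raise PermissionError("index has a write block")
        self.docs[doc_id] = doc


def run_scenario(release_block_on_failure):
    # The 7.13 index that the .kibana alias points to.
    source = Index({"dashboard:1": "v1"})
    source.write_blocked = True  # models the SET_SOURCE_WRITE_BLOCK step

    # First 7.14 node fails (e.g. shard limit reached) and may release the block.
    if release_block_on_failure:
        source.write_blocked = False

    # Second 7.14 node succeeds and copies the source into the target.
    target = Index(source.docs)

    # The 7.13 node keeps writing to the source index.
    try:
        source.write("dashboard:2", "written by 7.13")
        write_accepted = True
    except PermissionError:
        write_accepted = False

    # Documents present in the source but missing from the migrated target.
    lost = sorted(set(source.docs) - set(target.docs))
    return write_accepted, lost


print(run_scenario(release_block_on_failure=True))   # (True, ['dashboard:2'])
print(run_scenario(release_block_on_failure=False))  # (False, [])
```

With the block left in place, the 7.13 node's write is rejected rather than silently lost, which is the safer failure mode.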

joshdover · 2021-05-27T12:33:03Z

We discussed this sync during sprint planning yesterday and decided that we indeed cannot safely remove the write block due to issues like the one listed above.

This issue is now scoped to only update the documentation on how to handle the corrupt document case to include a step for removing the write block. These docs can be found here: https://www.elastic.co/guide/en/kibana/master/upgrade-migrations.html#_corrupt_saved_objects
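For reference, a rough sketch of what the manual step could look like. It builds (but does not send) a request to the Elasticsearch update index settings API that sets `index.blocks.write` to `false`. The cluster URL, the index name `.kibana_7.13.0_001`, and the helper name `build_release_write_block_request` are assumptions for illustration; use the source index named in your own migration logs, and add whatever authentication your cluster requires.

```python
import json
import urllib.request
from urllib.parse import quote


def build_release_write_block_request(es_url, index):
    """Build a PUT <index>/_settings request that clears the index write block."""
    body = json.dumps({"index": {"blocks": {"write": False}}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{es_url.rstrip('/')}/{quote(index, safe='')}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical values: adjust the URL and index name to your deployment.
req = build_release_write_block_request("http://localhost:9200", ".kibana_7.13.0_001")
print(req.get_method(), req.full_url)
# PUT http://localhost:9200/.kibana_7.13.0_001/_settings
print(req.data.decode("utf-8"))
# {"index": {"blocks": {"write": false}}}

# To actually send it (authentication not shown):
# urllib.request.urlopen(req)
```

Given the data-loss concern discussed above, this should only be done once no Kibana node is still attempting the migration.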

pgayvallet added the Team:Core and project:ResilientSavedObjectMigrations labels May 26, 2021

mshustov added the triage_needed label May 26, 2021

mshustov mentioned this issue May 26, 2021

Add an explicit CLEANUP step to migrations v2 #100650

Closed

joshdover changed the title ~~SO migration: release write block on source index in case of failure~~ Update SO migration doucmentation: release write block on source index in case of corrupt document failure May 26, 2021

joshdover added documentation and removed triage_needed labels May 27, 2021

joshdover changed the title ~~Update SO migration doucmentation: release write block on source index in case of corrupt document failure~~ Update SO migration documentation: release write block on source index in case of corrupt document failure May 27, 2021

joshdover mentioned this issue May 27, 2021

Minimize downtime while remedying corrupt document migration failures #100768

Open

pgayvallet mentioned this issue Jun 4, 2021

SO migration v2 requires manual cleanup steps in case of failure to retry the migration #101354

Open

lukeelmers self-assigned this Jun 22, 2021

lukeelmers mentioned this issue Jun 22, 2021

[docs][migrations v2] Update SO migration docs to include removal of index write block when handling corrupt SOs. #103014

Merged

lukeelmers closed this as completed in #103014 Jun 24, 2021
