Update SO migration documentation: release write block on source index in case of corrupt document failure #100631

Closed
pgayvallet opened this issue May 26, 2021 · 3 comments · Fixed by #103014
Assignees
Labels
documentation project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@pgayvallet
Contributor

pgayvallet commented May 26, 2021

atm, when a failure occurs during or after the client-side reindex from the source to temp, the write block on the source that was enabled during the SET_SOURCE_WRITE_BLOCK step is not released.

When the failure is caused by corrupted or invalid SO documents, this adds a necessary step to the manual intervention: the block must be released manually before trying to fix or remove the faulty document(s). We need to update our documentation on how to remedy this situation: https://www.elastic.co/guide/en/kibana/master/upgrade-migrations.html#_corrupt_saved_objects


Original issue content:

atm, when a failure occurs during or after the client-side reindex from the source to temp, the write block on the source that was enabled during the SET_SOURCE_WRITE_BLOCK step is not released.

When the failure was caused by corrupted or invalid SO documents, this adds an unnecessary step during the manual intervention, as the block needs to be manually released before trying to fix or remove the faulty document(s).

In case of migration failure, we should ensure that the indices are back to their initial state by removing the write block during the cleanup step.

Note: we also enable a write block on

  • the temp index during the temp->target reindex, but as this is meant to be a temporary, 'internal' index, it's probably fine not to handle it
  • the 'legacy' index for legacy index migration. It could make sense to also release the block when such a legacy index migration occurs, but the impact is far smaller
@pgayvallet pgayvallet added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient labels May 26, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@joshdover
Contributor

joshdover commented May 26, 2021

In case of migration failure, we should ensure that the indices are back to their initial state by removing the write block during the cleanup step.

I think we need to be careful to only release the write block in scenarios where we know that other nodes are also going to fail the migration. If we release it on any failure, then we can introduce data loss. For example:

  1. 3 Kibana nodes: v7.13 still running, two v7.14 nodes start and begin migrations
  2. One of the 7.14 nodes fails because the shard limit was reached, and it releases the write block
  3. The other 7.14 node happens to succeed due to a shard in another index being deleted and continues the migration
  4. 7.13 node continues writing data to 7.13 index (where .kibana alias points to)
  5. 7.14 node finishes the migration, losing the writes from the 7.13 node.

Besides a shard failure, this could also happen when some nodes have different SO types registered than others. The nodes that have all the SO types registered will continue the migration, while the other nodes may give up.

In order to do this safely, we really can only release the write block if we know for sure that all other nodes are also going to fail the migration. I'm not sure there's a scenario where we can guarantee this though?
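The five-step scenario above can be sketched as a toy in-memory model. This is only an illustration of the race, not Kibana's migration code: `ToyIndex` and `simulateUpgrade` are hypothetical names, the document ids are made up, and the model assumes the surviving 7.14 node has already copied the source documents before the 7.13 node writes again.

```typescript
// Toy model only: nothing here talks to Elasticsearch or Kibana.
type Doc = { id: string; writtenBy: string };

class ToyIndex {
  docs: Record<string, Doc> = {};
  writeBlocked = false;

  // Returns false when the write is rejected because of the write block.
  write(doc: Doc): boolean {
    if (this.writeBlocked) return false;
    this.docs[doc.id] = doc;
    return true;
  }
}

// Returns the ids of documents that exist in the source but not in the
// migrated target, i.e. the writes that would be lost.
function simulateUpgrade(releaseBlockOnFailure: boolean): string[] {
  const source = new ToyIndex(); // the 7.13 index behind the .kibana alias
  source.write({ id: "dashboard:1", writtenBy: "7.13" });

  // Step 1: the 7.14 nodes start migrating and block writes on the source.
  source.writeBlocked = true;

  // Step 2: one 7.14 node fails and, in the proposed behavior, releases the block.
  if (releaseBlockOnFailure) source.writeBlocked = false;

  // Step 3: the other 7.14 node carries on and copies the source documents.
  const target = new ToyIndex();
  for (const id of Object.keys(source.docs)) {
    target.write(source.docs[id]);
  }

  // Step 4: the 7.13 node keeps writing to the source index.
  source.write({ id: "dashboard:2", writtenBy: "7.13" });

  // Step 5: the migration finishes; anything missing from the target is lost.
  return Object.keys(source.docs).filter((id) => !(id in target.docs));
}

console.log(simulateUpgrade(true)); // one lost write: dashboard:2
console.log(simulateUpgrade(false)); // no lost writes: the block rejected the write
```

With the block kept in place, the late write from the 7.13 node is rejected instead of being silently dropped, which is the trade-off being weighed here.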

@joshdover joshdover changed the title SO migration: release write block on source index in case of failure Update SO migration doucmentation: release write block on source index in case of corrupt document failure May 26, 2021
@joshdover
Contributor

We discussed this synchronously during sprint planning yesterday and decided that we indeed cannot safely remove the write block automatically, due to issues like the one listed above.

This issue is now scoped to only update the documentation on how to handle the corrupt document case to include a step for removing the write block. These docs can be found here: https://www.elastic.co/guide/en/kibana/master/upgrade-migrations.html#_corrupt_saved_objects
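As a rough sketch of what that manual step could look like: Elasticsearch exposes the write block as the `index.blocks.write` index setting, which can be cleared through the update index settings API (`PUT /<index>/_settings`). The helper below only builds and prints that request; it does not send it. The function name `buildReleaseWriteBlockRequest` and the index name `.kibana_7.13.0_001` are assumptions for illustration; the real source index name should come from the migration logs.

```typescript
// Hypothetical helper: describes the request that clears the write block on
// an index. It builds the request only and performs no network call.
interface EsRequest {
  method: "PUT";
  path: string;
  body: { index: { "blocks.write": boolean } };
}

function buildReleaseWriteBlockRequest(sourceIndex: string): EsRequest {
  if (sourceIndex.trim() === "") {
    throw new Error("sourceIndex must not be empty");
  }
  return {
    method: "PUT",
    // encodeURIComponent keeps unusual index names safe inside the URL path.
    path: `/${encodeURIComponent(sourceIndex)}/_settings`,
    body: { index: { "blocks.write": false } },
  };
}

// Example index name only (an assumption, not taken from this issue).
const request = buildReleaseWriteBlockRequest(".kibana_7.13.0_001");
console.log(`${request.method} ${request.path}`);
console.log(JSON.stringify(request.body, null, 2));
```

Given the data-loss scenario discussed earlier in this thread, this is presumably only safe once no Kibana node can still complete the migration, for example after all nodes of the new version have been stopped.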

@joshdover joshdover changed the title Update SO migration doucmentation: release write block on source index in case of corrupt document failure Update SO migration documentation: release write block on source index in case of corrupt document failure May 27, 2021
@lukeelmers lukeelmers self-assigned this Jun 22, 2021
5 participants