Minimize downtime while remedying corrupt document migration failures #100768

Open
joshdover opened this issue May 27, 2021 · 3 comments
Labels
enhancement New value added to drive a business result project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc triage_needed

Comments

@joshdover
Contributor

joshdover commented May 27, 2021

As noted in #100631, when an upgrade migration fails due to a corrupt document in the index, the source index is left in an unusable state because the write block remains in place. Unfortunately, we don't have a safe way to automatically clean up this write block on failure: other Kibana instances may be able to continue the migration successfully, and removing the write block before they complete could lead to data loss.

What we can do is provide a better experience for admins to handle this situation in order to minimize any downtime they may encounter while addressing the root cause. Possible options:

@joshdover joshdover added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc triage_needed enhancement New value added to drive a business result project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient labels May 27, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@joshdover
Contributor Author

Also of note: in 8.0 we could safely remove the write block, for the corrupt-document case only, since that is when we plan to stop supporting the scenario where Kibana instances are configured with different plugins enabled.

Though, based on the conversation in #100171 (comment), this may already be the case. If so, we could safely remove the write block in 7.x when corrupt objects are detected.
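
For admins clearing the block by hand in the meantime, here is a minimal sketch of the Elasticsearch update-index-settings request that does so. The helper `buildRemoveWriteBlockRequest` and the index name `.kibana_7.13.0_001` are hypothetical and not part of Kibana. Only `PUT /<index>/_settings` with `index.blocks.write` is the actual Elasticsearch API. As discussed above, this is only safe once no other Kibana instance is still running the migration.

```typescript
// Hypothetical helper (not part of Kibana): builds the request an admin could
// send to Elasticsearch to clear the write block on a migration source index.
interface EsRequest {
  method: 'PUT';
  path: string;
  body: { index: { blocks: { write: boolean } } };
}

function buildRemoveWriteBlockRequest(indexName: string): EsRequest {
  if (indexName.length === 0) {
    throw new Error('indexName must not be empty');
  }
  return {
    method: 'PUT',
    // Encode the index name so it is safe to place in a URL path.
    path: `/${encodeURIComponent(indexName)}/_settings`,
    // Setting index.blocks.write to false lifts the write block.
    body: { index: { blocks: { write: false } } },
  };
}

// Assumed index name, for illustration only.
const request = buildRemoveWriteBlockRequest('.kibana_7.13.0_001');
console.log(`${request.method} ${request.path}`);
console.log(JSON.stringify(request.body));
```

The request is built as plain data rather than sent directly, so it can be reviewed, pasted into Dev Tools, or passed to whichever HTTP client is already in use.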

@ppf2
Member

ppf2 commented Aug 3, 2021

Not sure if this will only be addressed in 8.0. It would be nice if we could backport this to 7.x. We had some failed production migrations caused by Kibana leaving write blocks in place after encountering Elasticsearch exceptions (not limited to document corruption during migration). As a result, on Cloud, the upgrade moved Elasticsearch to a later 7.x version while leaving Kibana on an older 7.x minor.

Example:

[.kibana_task_manager] SET_SOURCE_WRITE_BLOCK -> CREATE_REINDEX_TEMP. took: 96ms.

[.kibana_task_manager] [validation_exception]: Validation Failed: 1: this action would add [2] shards, but this cluster currently has [2000]/[2000] maximum normal shards open;

[.kibana_task_manager] migration failed, dumping execution log:
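
To triage logs like the ones above, a rough sketch of a check for whether a migration got past the write-block step before failing. The helper names are hypothetical and not part of Kibana. The line format is taken only from the example above, and it is an assumption that a logged transition out of `SET_SOURCE_WRITE_BLOCK` means the block was applied.

```typescript
// Hypothetical helpers (not part of Kibana). The pattern matches lines shaped like:
// "[.kibana_task_manager] SET_SOURCE_WRITE_BLOCK -> CREATE_REINDEX_TEMP. took: 96ms."
const TRANSITION_PATTERN = /^\[([^\]]+)\] (\w+) -> (\w+)\. took: (\d+)ms\.$/;

interface Transition {
  index: string;
  from: string;
  to: string;
  tookMs: number;
}

// Returns the parsed transition, or null when the line is not a transition line.
function parseTransition(line: string): Transition | null {
  const match = TRANSITION_PATTERN.exec(line.trim());
  if (match === null) {
    return null;
  }
  return {
    index: match[1],
    from: match[2],
    to: match[3],
    tookMs: Number(match[4]),
  };
}

// Assumption: a transition *out of* SET_SOURCE_WRITE_BLOCK implies the block was set.
function writeBlockLikelySet(lines: string[]): boolean {
  return lines.some((line) => {
    const transition = parseTransition(line);
    return transition !== null && transition.from === 'SET_SOURCE_WRITE_BLOCK';
  });
}

const sampleLines = [
  '[.kibana_task_manager] SET_SOURCE_WRITE_BLOCK -> CREATE_REINDEX_TEMP. took: 96ms.',
  '[.kibana_task_manager] migration failed, dumping execution log:',
];
console.log(writeBlockLikelySet(sampleLines));
```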
