Minimize downtime while remedying corrupt document migration failures #100768

joshdover · 2021-05-27T12:45:25Z

As noted in #100631, when an upgrade migration fails due to a corrupt document in the index, the source index is left in an unusable state because the write block remains in place. Unfortunately, we don't have a safe way of automatically cleaning up this write block on failure: other Kibana instances may still be able to complete the migration successfully, and removing the write block before they finish could lead to data loss.

What we can do is provide a better experience for admins to handle this situation in order to minimize any downtime they may encounter while addressing the root cause. Possible options:

Provide a dry run feature to allow admins to easily detect corrupt objects prior to upgrading - Dry run migrations #55404
Provide a CLI for resetting the index state so that old Kibana versions can continue working while the admin investigates the root cause
Provide an interactive migration mode - [Migrations V2] interactive migrations #100685
"Quarantine" corrupt objects - this idea has many problems (such as breaking Kibana in hard-to-anticipate ways) and was previously abandoned in #55406 (Tag invalid objects instead of failing migrations)
Add support for read-time migrations that don't block upgrades
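
As a rough illustration of the CLI option above, the core of such a reset would be clearing the `index.blocks.write` setting on the source index through Elasticsearch's update index settings API. The sketch below only builds the request rather than sending it. The helper name `build_remove_write_block_request` and the index name `.kibana_7.13.0_001` are illustrative assumptions, not part of any existing Kibana tooling. As noted above, this is only safe once no other Kibana instance is still running the migration.

```python
import json
from typing import NamedTuple


class EsRequest(NamedTuple):
    """A plain description of an HTTP request to Elasticsearch."""
    method: str
    path: str
    body: str


def build_remove_write_block_request(index: str) -> EsRequest:
    """Build (but do not send) the request that clears the write block.

    Hypothetical helper: the caller is responsible for confirming that no
    Kibana instance is still migrating before actually sending it.
    """
    if not index or "/" in index:
        raise ValueError(f"invalid index name: {index!r}")
    # PUT /<index>/_settings with index.blocks.write set to false
    body = json.dumps({"index": {"blocks": {"write": False}}})
    return EsRequest("PUT", f"/{index}/_settings", body)


if __name__ == "__main__":
    # The index name here is an assumption for illustration only.
    req = build_remove_write_block_request(".kibana_7.13.0_001")
    print(req.method, req.path)
    print(req.body)
```

Keeping the request construction separate from the network call would let a CLI print the exact request for the admin to review before anything is changed.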

elasticmachine · 2021-05-27T12:45:27Z

Pinging @elastic/kibana-core (Team:Core)

joshdover · 2021-05-27T12:47:20Z

Also of note: in 8.0, when we plan to stop supporting the scenario where Kibana instances are configured with different plugins enabled, we could safely remove the write block, but only in the case of corrupt documents.

Though, based on the conversation in #100171 (comment), this may already be the case. If so, we could safely remove the write block in 7.x when corrupt objects are detected.

ppf2 · 2021-08-03T17:47:39Z

Not sure if this will only be addressed in 8.0. It would be nice if we could backport this to 7.x. We had some failed production migrations due to Kibana leaving write blocks in place when encountering Elasticsearch exceptions (not isolated to document corruption during migration). As a result, on Cloud, Elasticsearch was successfully upgraded to a later 7.x version while Kibana was left on an older 7.x minor.

Example:

[.kibana_task_manager] SET_SOURCE_WRITE_BLOCK -> CREATE_REINDEX_TEMP. took: 96ms.

[.kibana_task_manager] [validation_exception]: Validation Failed: 1: this action would add [2] shards, but this cluster currently has [2000]/[2000] maximum normal shards open;

[.kibana_task_manager] migration failed, dumping execution log:
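
To tell this kind of failure apart from a corrupt-document failure when triaging logs, one could match the shard-limit message and compute how far over the limit the request was. This is a minimal sketch: the pattern is taken from the log line above and may not match other Elasticsearch versions, and `parse_shard_limit_error` and `ShardLimitError` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Pattern derived from the validation_exception line quoted above.
SHARD_LIMIT_RE = re.compile(
    r"this action would add \[(\d+)\] shards, "
    r"but this cluster currently has \[(\d+)\]/\[(\d+)\] maximum normal shards open"
)


class ShardLimitError(NamedTuple):
    requested: int    # shards the failed action wanted to add
    open_shards: int  # shards currently open in the cluster
    max_shards: int   # cluster-wide shard limit

    @property
    def shortfall(self) -> int:
        """Number of shards that must be freed for the action to succeed."""
        return self.open_shards + self.requested - self.max_shards


def parse_shard_limit_error(line: str) -> Optional[ShardLimitError]:
    """Return the parsed numbers, or None if the line is a different error."""
    match = SHARD_LIMIT_RE.search(line)
    if match is None:
        return None
    requested, open_shards, max_shards = (int(g) for g in match.groups())
    return ShardLimitError(requested, open_shards, max_shards)
```

For the log line above, this yields 2 requested shards against a cluster already at 2000 of 2000, so at least 2 shards would need to be freed (or the limit raised) before retrying.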

joshdover mentioned this issue on Jun 4, 2021: SO migration v2 requires manual cleanup steps in case of failure to retry the migration #101354 (Open)
