Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v2 migrations should exit process on corrupt saved object document #91465

Merged
merged 5 commits into from
Feb 22, 2021

Conversation

rudolf
Copy link
Contributor

@rudolf rudolf commented Feb 16, 2021

Summary

Fixes #65612

Previously v2 migrations would show an error and keep on retrying if a corrupt saved object document is encountered instead of failing the migration and exiting Kibana.

This PR fixes it to fail the migration (and exit Kibana process) if a corrupt saved object document is encountered. Corrupt saved object documents are documents that can't be serialized and as a result Kibana cannot migrate these documents. Having unmigrated documents in index can cause all sorts of unexpected issues which are hard to diagnose such as a migration failing because the unmigrated corrupt document is incompatible with the target mappings.

Checklist

Delete any items that are not applicable to this PR.

For maintainers

@rudolf
Copy link
Contributor Author

rudolf commented Feb 19, 2021

@elasticmachine merge upstream

@rudolf rudolf changed the title migrations-v2-corrupts-docs v2 migrations should exit process on corrupt saved object document Feb 19, 2021
@rudolf rudolf added bug Fixes for quality problems that affect the customer experience Feature:Saved Objects project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient labels Feb 19, 2021
@@ -58,12 +58,12 @@ describe('migrateRawDocs', () => {
expect(transform).toHaveBeenNthCalledWith(2, obj2);
});

test('passes invalid docs through untouched and logs error', async () => {
test('throws when encountering a corrupt saved object document', async () => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that v1 migrations will also start failing on corrupt saved objects. When v1 migrations encounter a corrupt saved object it will perform a migration on every restart which can cause data loss in a multi-instance Kibana setup if only one Kibana gets restarted at a time. So although it might block some users from upgrading, deleting the saved object will save them from worse problems.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When v1 migrations encounter a corrupt saved object it will perform a migration on every restart which can cause data loss in a multi-instance Kibana setup if only one Kibana gets restarted at a time.

Isn't this our recommended way to upgrade kibana? restarting one instance at a time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, all kibana's need to be shutdown before upgrading otherwise v1 migrations will cause data loss.

from https://www.elastic.co/guide/en/kibana/current/upgrade.html

Shut down all Kibana instances. Running more than one Kibana version against the same Elasticseach index is unsupported. Upgrading while older Kibana instances are running can cause data loss or upgrade failures.

@rudolf rudolf marked this pull request as ready for review February 19, 2021 09:04
@rudolf rudolf requested a review from a team as a code owner February 19, 2021 09:04
@rudolf rudolf added release_note:skip Skip the PR/issue when compiling release notes release_note:breaking and removed release_note:skip Skip the PR/issue when compiling release notes release_note:breaking labels Feb 19, 2021
@afharo
Copy link
Member

afharo commented Feb 19, 2021

I get it that users will have to delete the doc via the Cloud Console or connecting to the ES instance, right? How does this go with the system indices limitation?

Should we allow Kibana to start, so we allow the users to access the index via the Dev Console? 🤔

@pgayvallet
Copy link
Contributor

Should we allow Kibana to start, so we allow the users to access the index via the Dev Console

We are going to block access to system indices for the dev console in the near future, so I don't think this will be an option.

} else {
logger.error(e);

dumpExecutionLog(logger, logMessagePrefix, executionLog);
if (e.message.startsWith('Unable to migrate the corrupt saved object document')) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: I would use error sub-classing instead of relying on an error message for the handling behavior change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I was being lazy :(

@rudolf
Copy link
Contributor Author

rudolf commented Feb 22, 2021

I get it that users will have to delete the doc via the Cloud Console or connecting to the ES instance, right? How does this go with the system indices limitation?
Should we allow Kibana to start, so we allow the users to access the index via the Dev Console? 🤔

Yeah that's a good question.

The existing index won't be a system index, so before upgrading to Kibana with system indices, a user can delete the corrupt saved object from their "normal" index. However, if they've already started a v2 migration and it failed because of a corrupt saved object then that corrupt saved object will already be inside the system index. So we will either have to provide a way to force a clean migration that deletes existing indices and starts from scratch, or an option to run migrations in so that it will automatically delete corrupt documents. I'm pretty sure in 99.9% of the cases there's nothing valuable inside a corrupt saved object, but just dropping (even if there's an error log) feels risky, but if it's a runtime option users can decide what behaviour they want.

@rudolf rudolf enabled auto-merge (squash) February 22, 2021 13:36
@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@rudolf rudolf merged commit dc475c9 into elastic:master Feb 22, 2021
gmmorris added a commit to gmmorris/kibana that referenced this pull request Feb 22, 2021
* master:
  Ability to filter alerts by string parameters (elastic#92036)
  [APM] Fix for flaky correlations API test (elastic#91673) (elastic#92094)
  [Enterprise Search] Migrate shared role mapping components (elastic#91723)
  [file_upload] move ml Importer classes to file_upload plugin (elastic#91559)
  [Discover] Always show the "hide missing fields" toggle (elastic#91889)
  v2 migrations should exit process on corrupt saved object document (elastic#91465)
  [ML] Data Frame Analytics exploration page: filters improvements (elastic#91748)
  [ML] Data Frame Analytics: Improved error handling for scatterplot matrix. (elastic#91993)
  [coverage] speed up merging results of functional tests (elastic#92111)
  Adds a Reason indicator to the onClose handler in AddAlert and EditAlert (elastic#92149)
rudolf added a commit to rudolf/kibana that referenced this pull request Feb 22, 2021
…lastic#91465)

* Fail migrations if a corrupt saved object is encountered

* Update test description

* Use an error class instead of string matching

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
rudolf added a commit to rudolf/kibana that referenced this pull request Feb 22, 2021
…lastic#91465)

* Fail migrations if a corrupt saved object is encountered

* Update test description

* Use an error class instead of string matching

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
rudolf added a commit that referenced this pull request Feb 22, 2021
…91465) (#92274)

* Fail migrations if a corrupt saved object is encountered

* Update test description

* Use an error class instead of string matching

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
rudolf added a commit that referenced this pull request Feb 22, 2021
…91465) (#92273)

* Fail migrations if a corrupt saved object is encountered

* Update test description

* Use an error class instead of string matching

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@rudolf rudolf deleted the migrations-v2-corrupts-docs branch February 23, 2021 10:51
@rudolf rudolf mentioned this pull request Mar 22, 2021
17 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Fixes for quality problems that affect the customer experience Feature:Saved Objects project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient release_note:skip Skip the PR/issue when compiling release notes v7.12.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Corrupt Saved Object documents can cause a migration every time Kibana starts up
5 participants