Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SO Migration] fix reindex race on multi-instance mode #104516

Merged
merged 8 commits into from
Jul 7, 2021

Conversation

pgayvallet
Copy link
Contributor

@pgayvallet pgayvallet commented Jul 6, 2021

Summary

Fix #99211
Integration test inspired from #100171

When in multi-instance mode, the leading node adds the write-block on the temp index as soon as it finishes the source->temp reindex. This was potentially causing the trailing nodes to fail if they were still performing the same reindex, as the temp index was now write-blocked. If this had theoretically no impact on the migration (as long as the leading node successfully completed the next steps), it was causing the trailing nodes to fail, requiring a restart (and also outputting migration failure logs when the migration was probably correctly completed).

This PR fixes it by having the bulkOverwriteTransformedDocuments action identify cluster_block_exception failures, and returning a specific left value in such case (instead of throwing), to allow the model to directly jump to the next step instead of failing.

Checklist

@pgayvallet pgayvallet added bug Fixes for quality problems that affect the customer experience project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc v7.14.0 release_note:skip Skip the PR/issue when compiling release notes labels Jul 6, 2021
Comment on lines -74 to +80
const errors = (res.body.items ?? []).filter(
(item) => item.index?.error?.type !== 'version_conflict_engine_exception'
);
const errors = (res.body.items ?? [])
.filter((item) => item.index?.error)
.map((item) => item.index!.error!)
.filter(({ type }) => type !== 'version_conflict_engine_exception');
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not even sure how this was working previously with non-error items, as

>>({}).index?.error?.type !== 'version_conflict_engine_exception'
<< true

but In doubt, I added more steps for the sake of readability.

Comment on lines +85 to +89
if (errors.every(isWriteBlockException)) {
return Either.left({
type: 'target_index_had_write_block' as const,
});
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's very likely that if any write_block exception is encountered, all the objects encountered it, but just in case another error was returned, we check that all errors are effectively write block exceptions.

Comment on lines +260 to +264
8. Reindex the source index into the new temporary index using a 'client-side' reindex, by reading batches of documents from the source, migrating them, and indexing them into the temp index.
1. Use `op_type=index` so that multiple instances can perform the reindex in parallel (last node running will override the documents, with no effect as the input data is the same)
2. Ignore `version_conflict_engine_exception` exceptions as they just mean that another node was indexing the same documents
3. If a `target_index_had_write_block` exception is encountered for all document of a batch, assume that another node already completed the temporary index reindex, and jump to the next step
4. If a document transform throws an exception, add the document to a failure list and continue trying to transform all other documents (without writing them to the temp index). If any failures occured, log the complete list of documents that failed to transform, then fail the migration.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We forgot to update the RFC for the client-side reindex. Fixed it, and added the new target_index_had_write_block special case,

Comment on lines +14 to +19
export const isWriteBlockException = ({ type, reason }: EsErrorCause): boolean => {
return (
type === 'cluster_block_exception' &&
reason.match(/index \[.+] blocked by: \[FORBIDDEN\/8\/.+ \(api\)\]/) !== null
);
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on the type of operation, the reason identifying a write block can vary

e.g

index [.kibana_dolly] blocked by: [FORBIDDEN/8/index write (api)]
index [.kibana_dolly] blocked by: [FORBIDDEN/8/moving to block index write (api)]

I extracted the function previously present in wait_for_reindex_task.ts and made it more generic to match any FORBIDDEN/8/*** (api) text.

Comment on lines 191 to 198
).resolves.toMatchInlineSnapshot(`
Object {
"_tag": "Left",
"left": Object {
"type": "target_index_had_write_block",
},
}
`);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not fan of snapshots to test resolved results (especially as the resolved object is small), but this is what is done in the other tests of the file, and I didn't want to fix them all in this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not for all the tests

expect(res.right).toEqual(
expect.objectContaining({
existing_index_with_docs: {
aliases: {},
mappings: expect.anything(),
settings: expect.anything(),
},
})
);
});
});
so I wouldn't use snapshots in this case.

Comment on lines -1499 to +1506
it('rejects if there are errors', async () => {
it('resolves left if there are write_block errors', async () => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to also add a IT test to assert that it still rejects for other errors, but it seems that using an non-existing index surprisingly leads to a timeout (which is handled by catchRetryableEsClientErrors) instead of a more final error, and I couldn't find a way to trigger another kind of error from ES.

If anyone has an idea, I'll take it. Else it's probably fine as it's covered in unit tests anyway.

Comment on lines +184 to +189
it('migrates saved objects normally when multiple Kibana instances are started at the same time', async () => {
const setupContracts = await Promise.all([rootA.setup(), rootB.setup(), rootC.setup()]);

setupContracts.forEach((setup) => setup.savedObjects.registerType(fooType));

await startWithDelay([rootA, rootB, rootC], 0);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added tests with 0, 1, 5, 20sec

@pgayvallet pgayvallet marked this pull request as ready for review July 7, 2021 11:50
@pgayvallet pgayvallet requested a review from a team as a code owner July 7, 2021 11:50
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-core (Team:Core)

};

afterAll(async () => {
await new Promise((resolve) => setTimeout(resolve, 10000));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a mistic delay wandering through all the files 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea... I guess I did as everyone else, wondering what that was for, and then in doubt, still copied it 😄 . Do you know if it's safe to delete those?

Copy link
Contributor

@mshustov mshustov Jul 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, I'd test it and remove it for the tests at once

);
const errors = (res.body.items ?? [])
.filter((item) => item.index?.error)
.map((item) => item.index!.error!)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow! so many ?, ! 😅 I'd write it as

.map((item) => item.index?.error)
.filter(Boolean)
.filter(({ type }) => type !== 'version_conflict_engine_exception');

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, the problem is that TS is stupid with map/filter.

.map((item) => item.index?.error)
.filter(Boolean)

Is not sufficient to have the | undefined part removed from error. The third line complains that ErrorContainer | undefined does not have a type property, which is why I kinda was forced to filter first then force-cast using !

describe('isWriteBlockError', () => {
it('returns true for a `index write` cluster_block_exception', () => {
expect(
isWriteBlockException({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add an integration test instead? Since it's easy to reproduce for an ES instance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already an integration test for the action using the function, but it doesn't hurt to do it for the helper itself.

Comment on lines 191 to 198
).resolves.toMatchInlineSnapshot(`
Object {
"_tag": "Left",
"left": Object {
"type": "target_index_had_write_block",
},
}
`);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not for all the tests

expect(res.right).toEqual(
expect.objectContaining({
existing_index_with_docs: {
aliases: {},
mappings: expect.anything(),
settings: expect.anything(),
},
})
);
});
});
so I wouldn't use snapshots in this case.

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@pgayvallet pgayvallet merged commit d64c3fb into elastic:master Jul 7, 2021
pgayvallet added a commit to pgayvallet/kibana that referenced this pull request Jul 7, 2021
* fix reindex race condition

* fix some IT tests

* fix reindex cause detection

* add integration test

* update RFC

* review comments

* add integration test for isWriteBlockException
# Conflicts:
#	rfcs/text/0013_saved_object_migrations.md
pgayvallet added a commit to pgayvallet/kibana that referenced this pull request Jul 7, 2021
* fix reindex race condition

* fix some IT tests

* fix reindex cause detection

* add integration test

* update RFC

* review comments

* add integration test for isWriteBlockException
# Conflicts:
#	rfcs/text/0013_saved_object_migrations.md
pgayvallet added a commit that referenced this pull request Jul 7, 2021
… (#104761)

* [SO Migration] fix reindex race on multi-instance mode (#104516)

* fix reindex race condition

* fix some IT tests

* fix reindex cause detection

* add integration test

* update RFC

* review comments

* add integration test for isWriteBlockException
# Conflicts:
#	rfcs/text/0013_saved_object_migrations.md

* fix dataset for 7.14
pgayvallet added a commit that referenced this pull request Jul 7, 2021
#104760)

* [SO Migration] fix reindex race on multi-instance mode (#104516)

* fix reindex race condition

* fix some IT tests

* fix reindex cause detection

* add integration test

* update RFC

* review comments

* add integration test for isWriteBlockException
# Conflicts:
#	rfcs/text/0013_saved_object_migrations.md

* fix dataset for 7.15
jloleysens added a commit to jloleysens/kibana that referenced this pull request Jul 8, 2021
…-of-max-results

* 'master' of github.com:elastic/kibana: (36 commits)
  Lower Kibana app bundle limits (elastic#104688)
  [Security Solutions] Fixes bug with the filter query compatibility for transforms (elastic#104559)
  [RAC] Add mapping update logic to RuleDataClient (elastic#102586)
  Fix import workpad (elastic#104722)
  [canvas] Fix Storybook service decorator (elastic#104750)
  [Detection Rules] Add 7.14 rules (elastic#104772)
  [Enterprise Search] Fix beta notification in sidebar (elastic#104763)
  Fix engine routes that are meta engine or non-meta-engine specific (elastic#104757)
  [Fleet] Fix policy revision number getting bumped for no reason (elastic#104696)
  persistable state migrations (elastic#103680)
  [Fleet] Fix add agent in the package policy table (elastic#104749)
  [DOCS] Creates separate doc for security in production (elastic#103973)
  [SO Migration] fix reindex race on multi-instance mode (elastic#104516)
  [Security Solution] Update text in Endpoint Admin pages (elastic#104649)
  [package testing] Decrease timeout to 2 hours (elastic#104668)
  Fix background styling of waterfall chart sidebar tooltip. (elastic#103997)
  [Fleet + Integrations UI] Integrations UI Cleanup (elastic#104641)
  [Fleet] Link to download page of current stack version on Agent install instructions (elastic#104494)
  [Workplace Search] Fix Media Type field preview is unformatted bug (elastic#104684)
  [ML] add marker body (elastic#104672)
  ...

# Conflicts:
#	x-pack/plugins/fleet/public/search_provider.test.ts
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Fixes for quality problems that affect the customer experience project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient release_note:skip Skip the PR/issue when compiling release notes Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc v7.14.0 v7.15.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix race conditions in SO Migrations v2 and add integration test
5 participants