Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve reaction to blob store corruptions #111954

Conversation

DaveCTurner
Copy link
Contributor

Today there are a couple of assertions that can trip if the contents of
a snapshot repostiory are corrupted. It makes sense to assert the
integrity of snapshots in most tests, but we must also (a) protect
against these corruptions in production and (b) allow some tests to
verify the behaviour of the system when the repository is corrupted.

This commit introduces a flag to disable certain assertions, converts
the relevant assertions into production failures too, and introduces a
high-level test to verify that we do detect all relevant corruptions
without tripping any other assertions.

Extracted from #93735 as this change makes sense in its own right.
Relates #52622.

Today there are a couple of assertions that can trip if the contents of
a snapshot repostiory are corrupted. It makes sense to assert the
integrity of snapshots in most tests, but we must also (a) protect
against these corruptions in production and (b) allow some tests to
verify the behaviour of the system when the repository is corrupted.

This commit introduces a flag to disable certain assertions, converts
the relevant assertions into production failures too, and introduces a
high-level test to verify that we do detect all relevant corruptions
without tripping any other assertions.

Extracted from elastic#93735 as this change makes sense in its own right.
Relates elastic#52622.
@DaveCTurner DaveCTurner added >non-issue :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.16.0 labels Aug 18, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Aug 18, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)


try (var ignored = new BlobStoreIndexShardSnapshotsIntegritySuppressor()) {
final var exception = safeAwait(randomFrom(corruptionDetectors));
logger.info(Strings.format("--> corrupted [%s] and caught exception", corruptedFile), exception);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this potentially be considered an example of use randomised testing for coverage?

It seems we've enumerated the ways in which we expect to detect each type of corruption, then we're randomly picking one to execute. Would it not be better to test them all? or does the execution of one have side-effects precluding the others running?

More a question than a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ehh kinda, tho the coverage is extremely low even if we did check all these paths on each test run. The assertions we were previously hitting would only trip if you corrupted a very specific byte in exactly the right way.

Really this is just to verify that a corruption doesn't trip assertions so that we can proceed with #93735 that will introduce a way to catch all these problems at once.

Copy link
Contributor

@nicktindall nicktindall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@DaveCTurner DaveCTurner merged commit 1222496 into elastic:main Aug 19, 2024
15 checks passed
@DaveCTurner DaveCTurner deleted the 2024/08/18/blob-store-corruption-detection branch August 19, 2024 05:35
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Aug 19, 2024
* upstream/main:
  Fail `indexDocs()` on rejection (elastic#111962)
  Move repo analyzer to its own package (elastic#111963)
  Add generated evaluators for DateNanos conversion functions (elastic#111961)
  Clean the last traces from global retention in templates (elastic#111669)
  Fix known issue docs for elastic#111866 (elastic#111956)
  x-pack/plugin/otel: introduce x-pack-otel plugin (elastic#111091)
  Improve reaction to blob store corruptions (elastic#111954)
  Introduce `StreamingXContentResponse` (elastic#111933)
  Revert "Add 8.15.0 known issue for memory locking in Windows (elastic#111949)"
  Test get-snapshots API with missing details (elastic#111903)
  Add 8.15.0 known issue for memory locking in Windows (elastic#111949)

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 27, 2024
Makes these utility methods available to other test suites (to be added
in future PRs).

Relates elastic#111954
DaveCTurner added a commit that referenced this pull request Aug 28, 2024
Makes these utility methods available to other test suites (to be added
in future PRs).

Relates #111954
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024
Today there are a couple of assertions that can trip if the contents of
a snapshot repostiory are corrupted. It makes sense to assert the
integrity of snapshots in most tests, but we must also (a) protect
against these corruptions in production and (b) allow some tests to
verify the behaviour of the system when the repository is corrupted.

This commit introduces a flag to disable certain assertions, converts
the relevant assertions into production failures too, and introduces a
high-level test to verify that we do detect all relevant corruptions
without tripping any other assertions.

Extracted from elastic#93735 as this change makes sense in its own right.
Relates elastic#52622.
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024
Makes these utility methods available to other test suites (to be added
in future PRs).

Relates elastic#111954
davidkyle pushed a commit to davidkyle/elasticsearch that referenced this pull request Sep 5, 2024
Today there are a couple of assertions that can trip if the contents of
a snapshot repostiory are corrupted. It makes sense to assert the
integrity of snapshots in most tests, but we must also (a) protect
against these corruptions in production and (b) allow some tests to
verify the behaviour of the system when the repository is corrupted.

This commit introduces a flag to disable certain assertions, converts
the relevant assertions into production failures too, and introduces a
high-level test to verify that we do detect all relevant corruptions
without tripping any other assertions.

Extracted from elastic#93735 as this change makes sense in its own right.
Relates elastic#52622.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >non-issue Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.16.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants