Improve reaction to blob store corruptions #111954

DaveCTurner · 2024-08-18T16:35:54Z

Today there are a couple of assertions that can trip if the contents of
a snapshot repostiory are corrupted. It makes sense to assert the
integrity of snapshots in most tests, but we must also (a) protect
against these corruptions in production and (b) allow some tests to
verify the behaviour of the system when the repository is corrupted.

This commit introduces a flag to disable certain assertions, converts
the relevant assertions into production failures too, and introduces a
high-level test to verify that we do detect all relevant corruptions
without tripping any other assertions.

Extracted from #93735 as this change makes sense in its own right.
Relates #52622.

Today there are a couple of assertions that can trip if the contents of a snapshot repostiory are corrupted. It makes sense to assert the integrity of snapshots in most tests, but we must also (a) protect against these corruptions in production and (b) allow some tests to verify the behaviour of the system when the repository is corrupted. This commit introduces a flag to disable certain assertions, converts the relevant assertions into production failures too, and introduces a high-level test to verify that we do detect all relevant corruptions without tripping any other assertions. Extracted from elastic#93735 as this change makes sense in its own right. Relates elastic#52622.

elasticsearchmachine · 2024-08-18T16:36:18Z

Pinging @elastic/es-distributed (Team:Distributed)

nicktindall · 2024-08-19T01:30:52Z

...internalClusterTest/java/org/elasticsearch/repositories/blobstore/BlobStoreCorruptionIT.java

+
+        try (var ignored = new BlobStoreIndexShardSnapshotsIntegritySuppressor()) {
+            final var exception = safeAwait(randomFrom(corruptionDetectors));
+            logger.info(Strings.format("--> corrupted [%s] and caught exception", corruptedFile), exception);


Could this potentially be considered an example of use randomised testing for coverage?

It seems we've enumerated the ways in which we expect to detect each type of corruption, then we're randomly picking one to execute. Would it not be better to test them all? or does the execution of one have side-effects precluding the others running?

More a question than a comment.

Ehh kinda, tho the coverage is extremely low even if we did check all these paths on each test run. The assertions we were previously hitting would only trip if you corrupted a very specific byte in exactly the right way.

Really this is just to verify that a corruption doesn't trip assertions so that we can proceed with #93735 that will introduce a way to catch all these problems at once.

nicktindall

LGTM

* upstream/main: Fail `indexDocs()` on rejection (elastic#111962) Move repo analyzer to its own package (elastic#111963) Add generated evaluators for DateNanos conversion functions (elastic#111961) Clean the last traces from global retention in templates (elastic#111669) Fix known issue docs for elastic#111866 (elastic#111956) x-pack/plugin/otel: introduce x-pack-otel plugin (elastic#111091) Improve reaction to blob store corruptions (elastic#111954) Introduce `StreamingXContentResponse` (elastic#111933) Revert "Add 8.15.0 known issue for memory locking in Windows (elastic#111949)" Test get-snapshots API with missing details (elastic#111903) Add 8.15.0 known issue for memory locking in Windows (elastic#111949) # Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java

Makes these utility methods available to other test suites (to be added in future PRs). Relates elastic#111954

Makes these utility methods available to other test suites (to be added in future PRs). Relates #111954

Today there are a couple of assertions that can trip if the contents of a snapshot repostiory are corrupted. It makes sense to assert the integrity of snapshots in most tests, but we must also (a) protect against these corruptions in production and (b) allow some tests to verify the behaviour of the system when the repository is corrupted. This commit introduces a flag to disable certain assertions, converts the relevant assertions into production failures too, and introduces a high-level test to verify that we do detect all relevant corruptions without tripping any other assertions. Extracted from elastic#93735 as this change makes sense in its own right. Relates elastic#52622.

Makes these utility methods available to other test suites (to be added in future PRs). Relates elastic#111954

Today there are a couple of assertions that can trip if the contents of a snapshot repostiory are corrupted. It makes sense to assert the integrity of snapshots in most tests, but we must also (a) protect against these corruptions in production and (b) allow some tests to verify the behaviour of the system when the repository is corrupted. This commit introduces a flag to disable certain assertions, converts the relevant assertions into production failures too, and introduces a high-level test to verify that we do detect all relevant corruptions without tripping any other assertions. Extracted from elastic#93735 as this change makes sense in its own right. Relates elastic#52622.

DaveCTurner added >non-issue :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.16.0 labels Aug 18, 2024

DaveCTurner requested review from nicktindall and ywangd August 18, 2024 16:36

elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Aug 18, 2024

Temporarily fix docs bug

c584908

nicktindall reviewed Aug 19, 2024

View reviewed changes

nicktindall approved these changes Aug 19, 2024

View reviewed changes

DaveCTurner merged commit 1222496 into elastic:main Aug 19, 2024
15 checks passed

DaveCTurner deleted the 2024/08/18/blob-store-corruption-detection branch August 19, 2024 05:35

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 27, 2024

Extract BlobStoreCorruptionUtils

ac2efd5

Makes these utility methods available to other test suites (to be added in future PRs). Relates elastic#111954

DaveCTurner mentioned this pull request Aug 27, 2024

Extract BlobStoreCorruptionUtils #112229

Merged

DaveCTurner added a commit that referenced this pull request Aug 28, 2024

Extract BlobStoreCorruptionUtils (#112229)

d832e6e

Makes these utility methods available to other test suites (to be added in future PRs). Relates #111954

cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024

Extract BlobStoreCorruptionUtils (elastic#112229)

5735e38

Makes these utility methods available to other test suites (to be added in future PRs). Relates elastic#111954

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve reaction to blob store corruptions #111954

Improve reaction to blob store corruptions #111954

DaveCTurner commented Aug 18, 2024

elasticsearchmachine commented Aug 18, 2024

nicktindall Aug 19, 2024

DaveCTurner Aug 19, 2024

nicktindall left a comment

Improve reaction to blob store corruptions #111954

Improve reaction to blob store corruptions #111954

Conversation

DaveCTurner commented Aug 18, 2024

elasticsearchmachine commented Aug 18, 2024

nicktindall Aug 19, 2024

Choose a reason for hiding this comment

DaveCTurner Aug 19, 2024

Choose a reason for hiding this comment

nicktindall left a comment

Choose a reason for hiding this comment