Introduce repository integrity verification API #112348

DaveCTurner · 2024-08-29T12:30:31Z

Adds an API which scans all the metadata (and optionally the raw data)
in a snapshot repository to look for corruptions or other
inconsistencies.

Closes #52622
Closes ES-8560

github-actions · 2024-08-29T12:30:42Z

Documentation preview:

✨ Changed pages

elasticsearchmachine · 2024-08-29T12:30:54Z

Hi @DaveCTurner, I've created a changelog YAML for you.

elasticsearchmachine · 2024-08-29T12:43:24Z

Hi @DaveCTurner, I've updated the changelog YAML for you.

DaveCTurner · 2024-08-29T13:43:32Z

I expect there will be other things we will think of that can be verified here, but I'd rather avoid further scope creep if possible so would prefer to open new issues for follow-up work instead of blocking this PR on them.

elasticsearchmachine · 2024-08-29T14:48:42Z

Pinging @elastic/es-distributed (Team:Distributed)

docs/reference/snapshot-restore/apis/verify-repo-integrity-api.asciidoc

...tories/blobstore/testkit/integrity/TransportRepositoryVerifyIntegrityCoordinationAction.java

nicktindall

LGTM

There's lots about the actual verifications that I don't fully understand (not being familiar enough with snapshot or repository structure), but everything else makes sense to me.

ywangd

I left some comments and questions. Nothing major, please feel free to pick and choose.

Intuitively, I think a nested format grouped by snapshot, then by index, and then by shard feels more useful, e.g. you'd see the full result for one snapshot before analyzing the next one. But it may not work with the streaming API. Or maybe you have different reasons.

I think we could use both an API spec and a REST test for the new API. They can be follow-ups. I also wonder whether we should mark it as experimental to give us some room for future tweaking?

...c/main/java/org/elasticsearch/repositories/blobstore/testkit/integrity/IndexDescription.java

...tories/blobstore/testkit/integrity/TransportRepositoryVerifyIntegrityCoordinationAction.java

...ories/blobstore/testkit/integrity/TransportRepositoryVerifyIntegrityResponseChunkAction.java

...sitories/blobstore/testkit/integrity/TransportRepositoryVerifyIntegrityMasterNodeAction.java

ywangd · 2024-09-06T09:09:27Z

.../org/elasticsearch/repositories/blobstore/testkit/integrity/RepositoryIntegrityVerifier.java

+                    Strings.format(
+                        """
+                            Cannot verify the integrity of all index snapshots because this repository contains too many shard snapshot \
+                            failures: there are [%d] shard snapshot failures but [?%s] is set to [%d]. \


Suggested change

failures: there are [%d] shard snapshot failures but [?%s] is set to [%d]. \

failures: there are [%d] shard snapshot failures but [%s] is set to [%d]. \

I'd rather keep the ? to indicate it's a query param rather than a setting or something else.

I thought it was a typo. I didn't know we have this logging convention for query parameters.
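
The convention discussed in this thread can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: it uses the JDK's `String.format` rather than the Elasticsearch `Strings.format` helper, and the parameter name `max_failed_shard_snapshots` is an assumption made for the example. The leading `?` inside the brackets signals that the named value is a query parameter rather than a setting.

```java
import java.util.Locale;

public class QueryParamMessageSketch {
    // Assumed parameter name, used here only for illustration.
    static final String MAX_FAILURES_PARAM = "max_failed_shard_snapshots";

    // Builds a message in which the "?" prefix marks the name as a query parameter.
    static String tooManyFailuresMessage(int failureCount, int limit) {
        return String.format(
            Locale.ROOT,
            "there are [%d] shard snapshot failures but [?%s] is set to [%d]",
            failureCount,
            MAX_FAILURES_PARAM,
            limit
        );
    }

    public static void main(String[] args) {
        // prints: there are [12] shard snapshot failures but [?max_failed_shard_snapshots] is set to [10]
        System.out.println(tooManyFailuresMessage(12, 10));
    }
}
```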

ywangd · 2024-09-06T09:31:16Z

.../org/elasticsearch/repositories/blobstore/testkit/integrity/RepositoryIntegrityVerifier.java

+                    new RepositoryVerifyIntegrityResponseChunk.Builder(
+                        responseChunkWriter,
+                        RepositoryVerifyIntegrityResponseChunk.Type.INDEX_RESTORABILITY
+                    ).indexRestorability(indexId, totalSnapshotCounter.get(), restorableSnapshotCounter.get()).write(l);


I wonder whether we should include some brief information about the latest restorable snapshot for this index. The situation would be quite different if the latest restorable snapshot is a day old vs a month old.

I know what you mean, but I'm not sure that's all that useful in practice (at least in the situations I've wanted to use this API so far). Let's leave it for now.

.../org/elasticsearch/repositories/blobstore/testkit/integrity/RepositoryIntegrityVerifier.java

ywangd · 2024-09-06T09:57:41Z

.../org/elasticsearch/repositories/blobstore/testkit/integrity/RepositoryIntegrityVerifier.java

+                        }
+                    }
+
+                    blobContentsListeners(indexDescription, shardContainerContents, fileInfo).addListener(


I think we can skip this entirely if requestParams.verifyBlobContents() == false?

No, we track the sizes of the data either way, we just don't verify it in that case. Comment added in 9352efe.

We can keep it as is. I was thinking that we don't need to track the file size when verifyBlobContents == false. IIUC, blobBytesVerified is only used for status report and we don't report it when throttledNanos == 0, i.e. verifyBlobContents == false.
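
The behaviour described in this thread (sizes are tracked either way, contents are only verified on request) might look roughly like the sketch below. The class, field, and method names are hypothetical and do not come from the PR. The real implementation is asynchronous and listener-based, which this sketch does not attempt to reproduce.

```java
import java.util.concurrent.atomic.AtomicLong;

public class BlobSizeTrackingSketch {
    private final boolean verifyBlobContents;
    // Total size of all blobs seen, tracked whether or not contents are verified.
    private final AtomicLong blobBytesSeen = new AtomicLong();
    // Number of blobs whose contents would actually be read and checked.
    private final AtomicLong blobsContentChecked = new AtomicLong();

    public BlobSizeTrackingSketch(boolean verifyBlobContents) {
        this.verifyBlobContents = verifyBlobContents;
    }

    // Called once per blob referenced by a shard snapshot (hypothetical hook).
    public void onBlob(long blobLength) {
        blobBytesSeen.addAndGet(blobLength);
        if (verifyBlobContents) {
            // A real verifier would read the blob and check its checksum here.
            blobsContentChecked.incrementAndGet();
        }
    }

    public long blobBytesSeen() {
        return blobBytesSeen.get();
    }

    public long blobsContentChecked() {
        return blobsContentChecked.get();
    }
}
```

With `verifyBlobContents` set to `false`, the byte counter still advances while the content-check counter stays at zero, which is the distinction the reply draws.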

DaveCTurner · 2024-09-08T15:38:12Z

Thanks @ywangd - most comments addressed inline.

> Intuitively, I think a nested format grouped by snapshot, then by index, and then by shard feels more useful, e.g. you'd see the full result for one snapshot before analyzing the next one. But it may not work with the streaming API. Or maybe you have different reasons.

I know what you mean, but we can't do that without keeping track of unreasonably large amounts of data during the analysis (or re-reading substantial numbers of blobs).

> I think we could use both an API spec and a REST test for the new API. They can be follow-ups. I also wonder whether we should mark it as experimental to give us some room for future tweaking?

Yeah, hard to imagine anyone calling this from a client tbh but I'll add these in a follow-up.

Waiting until Yang's approval

ywangd

LGTM 👍

There might be value in allowing verification of a subset of snapshots, e.g. snapshots from the last 30 days, so that it could be used more frequently for integrity checking. In any case, it does not need to be in this PR.

ywangd · 2024-09-09T06:30:50Z

docs/reference/snapshot-restore/apis/verify-repo-integrity-api.asciidoc

+
+If you suspect the integrity of the contents of one of your snapshot
+repositories, cease all write activity to this repository immediately, set its
+`read_only` option to `true`, and use this API to verify its integrity. Until


Do we want to enforce that read_only is set in the API? I am not sure whether it is worthwhile to report some anomalies after a lengthy check just because there were concurrent writes?

I don't want to enforce that, no. We might want to run this on e.g. a Cloud repo, and getting the right permissions to modify its metadata is hard to do.

ywangd · 2024-09-09T06:37:30Z

...search/repositories/blobstore/testkit/integrity/RepositoryVerifyIntegrityResponseStream.java

+                                            .originalRepositoryGeneration() == repositoryVerifyIntegrityResponse.finalRepositoryGeneration()
+                                                ? "pass"
+                                                : "inconclusive due to concurrent writes"
+                                        : "fail"


It might be helpful to also have some nuances for fail depending on whether final repository generation has changed?
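
One way to picture the suggestion is the sketch below. It assumes the outer condition of the ternary in the hunk (not visible above) tests whether any anomalies were found. The fourth label is hypothetical and only illustrates the reviewer's idea; nothing in this thread indicates it was adopted.

```java
public class VerificationResultSketch {
    // Assumption: anomalyCount == 0 stands in for the outer condition not shown in the hunk.
    static String describeResult(long anomalyCount, long originalGeneration, long finalGeneration) {
        boolean generationUnchanged = originalGeneration == finalGeneration;
        if (anomalyCount == 0) {
            // These two labels match the strings in the hunk above.
            return generationUnchanged ? "pass" : "inconclusive due to concurrent writes";
        }
        // Hypothetical nuance: flag failures that may have been caused by concurrent writes.
        return generationUnchanged ? "fail" : "fail (possibly affected by concurrent writes)";
    }

    public static void main(String[] args) {
        System.out.println(describeResult(0, 5, 5)); // prints: pass
        System.out.println(describeResult(0, 5, 6)); // prints: inconclusive due to concurrent writes
        System.out.println(describeResult(3, 5, 5)); // prints: fail
        System.out.println(describeResult(3, 5, 6)); // prints: fail (possibly affected by concurrent writes)
    }
}
```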

ywangd · 2024-09-09T06:48:18Z

.../org/elasticsearch/repositories/blobstore/testkit/integrity/RepositoryVerifyIntegrityIT.java

+        ).orElseThrow(AssertionError::new);
+    }
+
+    public void testSuccess() throws IOException {


I think we can use a test for cancellation.

ywangd · 2024-09-09T08:13:39Z

.../org/elasticsearch/repositories/blobstore/testkit/integrity/RepositoryIntegrityVerifier.java

+                    // NB this next step doesn't matter for restorability, it is just verifying that the shard gen blob matches the shard
+                    // snapshot blob
+                    verifyShardGenerationConsistency(blobStoreIndexShardSnapshot, shardGenerationConsistencyListener);


Is it worth treating these anomalies differently and not reporting fail for them? It does not have to be part of this PR anyway.

No, it's still an integrity problem, just not one that affects restorability.

Introduce repository integrity verification API

58a75c2

DaveCTurner added the >enhancement, :Distributed Coordination/Snapshot/Restore, Supportability, and v8.16.0 labels Aug 29, 2024

Update docs/changelog/112348.yaml

13ca43a

DaveCTurner mentioned this pull request Aug 29, 2024

Add repository metadata integrity check API #93735

Closed

Update docs/changelog/112348.yaml

a471a6a

DaveCTurner added 2 commits August 29, 2024 14:24

Fix comment

336b24d

Include docs

f914e6b

DaveCTurner marked this pull request as ready for review August 29, 2024 14:48

elasticsearchmachine added the Team:Distributed (Obsolete) label Aug 29, 2024

nicktindall reviewed Aug 30, 2024

docs/reference/snapshot-restore/apis/verify-repo-integrity-api.asciidoc (outdated)

nicktindall reviewed Sep 4, 2024

...tories/blobstore/testkit/integrity/TransportRepositoryVerifyIntegrityCoordinationAction.java (outdated)

nicktindall previously approved these changes Sep 4, 2024

ywangd reviewed Sep 6, 2024

DaveCTurner added 9 commits September 8, 2024 15:39

Merge branch 'main' into 2024/08/29/verify-repo-integrity

8620cc3

Duplicated docs

7915b0c

Include timestamps in log

ec689e4

TODO done

46ab633

Better docs

1a97205

Fix sequence diag

2435c41

Nullable indexMetadataBlob

50fb745

responseBuilder -> responseStream

abccf6d

More rename

7e25b66

DaveCTurner added 7 commits September 8, 2024 16:03

Rename action & comments

bb64a4d

Comment on connection reuse

b276f3e

Visibility

6f905f8

Reorder key

bf131c3

Comments

621d187

comment

9352efe

Comment about undefined shard gen

287a369

ywangd approved these changes Sep 9, 2024

DaveCTurner added 3 commits September 11, 2024 08:40

Merge branch 'main' into 2024/08/29/verify-repo-integrity

2c75062

Fix test

bc9a7ad

Add YAML test

1bc2972

DaveCTurner added the auto-merge-without-approval label Sep 11, 2024

ML test fix

fe44a14

DaveCTurner requested a review from a team as a code owner September 11, 2024 12:15

elasticsearchmachine merged commit f79fb8c into elastic:main Sep 11, 2024
15 checks passed

DaveCTurner deleted the 2024/08/29/verify-repo-integrity branch September 11, 2024 13:18

l-trotta mentioned this pull request Sep 24, 2024

Approximate snapshot verify integrity mapping elastic/elasticsearch-specification#2930

Merged
