Introduce repository integrity verification API #112348
Adds an API which scans all the metadata (and optionally the raw data) in a snapshot repository to look for corruptions or other inconsistencies.
Hi @DaveCTurner, I've created a changelog YAML for you.
Hi @DaveCTurner, I've updated the changelog YAML for you.
I expect there will be other things we will think of that can be verified here, but I'd rather avoid further scope creep if possible, so I would prefer to open new issues for follow-up work instead of blocking this PR on them.
Pinging @elastic/es-distributed (Team:Distributed)
docs/reference/snapshot-restore/apis/verify-repo-integrity-api.asciidoc
...tories/blobstore/testkit/integrity/TransportRepositoryVerifyIntegrityCoordinationAction.java
LGTM
There's lots about the actual verifications that I don't fully understand (not being familiar enough with snapshot or repository structure), but everything else makes sense to me.
I left some comments and questions. Nothing major, please feel free to pick and choose.
Intuitively, I think a nested format grouped by snapshot, then by index, and then by shard feels more useful, e.g. you'd see the full result for one snapshot before analyzing the next one. But it may not work with the streaming API. Or maybe you have different reasons.
I think we could use both an API spec and a REST test for the new API. They can be follow-ups. I also wonder whether we should mark it as experimental to give us some room for future tweaking?
...c/main/java/org/elasticsearch/repositories/blobstore/testkit/integrity/IndexDescription.java
...tories/blobstore/testkit/integrity/TransportRepositoryVerifyIntegrityCoordinationAction.java
...ories/blobstore/testkit/integrity/TransportRepositoryVerifyIntegrityResponseChunkAction.java
...sitories/blobstore/testkit/integrity/TransportRepositoryVerifyIntegrityMasterNodeAction.java
...sitories/blobstore/testkit/integrity/TransportRepositoryVerifyIntegrityMasterNodeAction.java
Strings.format(
    """
        Cannot verify the integrity of all index snapshots because this repository contains too many shard snapshot \
        failures: there are [%d] shard snapshot failures but [?%s] is set to [%d]. \
- failures: there are [%d] shard snapshot failures but [?%s] is set to [%d]. \
+ failures: there are [%d] shard snapshot failures but [%s] is set to [%d]. \
I'd rather keep the `?` to indicate it's a query param rather than a setting or something else.
I thought it was a typo. I didn't know we have this logging convention for query parameters.
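The convention discussed in this thread can be sketched in isolation. This is only an illustration, not the PR's code: it uses plain `String.format` in place of Elasticsearch's `Strings.format`, and the class name, method name, and the parameter name `max_failed_shard_snapshots` are assumptions made for the example. The leading `?` inside the brackets marks the name as a REST query parameter rather than a setting.

```java
import java.util.Locale;

public class QueryParamMessage {
    // Hypothetical query parameter name, used only for illustration.
    static final String MAX_FAILURES_PARAM = "max_failed_shard_snapshots";

    // Builds a message in which the '?' prefix flags the bracketed name
    // as a query parameter rather than a setting.
    static String tooManyFailures(int failureCount, int limit) {
        return String.format(
            Locale.ROOT,
            "there are [%d] shard snapshot failures but [?%s] is set to [%d]",
            failureCount,
            MAX_FAILURES_PARAM,
            limit
        );
    }

    public static void main(String[] args) {
        System.out.println(tooManyFailures(12, 10));
    }
}
```

A reader who sees `[?max_failed_shard_snapshots]` in the message can tell that the limit is adjusted on the request itself, not in the cluster or repository settings.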
new RepositoryVerifyIntegrityResponseChunk.Builder(
    responseChunkWriter,
    RepositoryVerifyIntegrityResponseChunk.Type.INDEX_RESTORABILITY
).indexRestorability(indexId, totalSnapshotCounter.get(), restorableSnapshotCounter.get()).write(l);
I wonder whether we should include some brief information about the latest restorable snapshot for this index. The situation would be quite different if the latest restorable snapshot is from a day ago vs a month ago.
I know what you mean, but I'm not sure that's all that useful in practice (at least in the situations I've wanted to use this API so far). Let's leave it for now.
.../org/elasticsearch/repositories/blobstore/testkit/integrity/RepositoryIntegrityVerifier.java
.../org/elasticsearch/repositories/blobstore/testkit/integrity/RepositoryIntegrityVerifier.java
    }
}

blobContentsListeners(indexDescription, shardContainerContents, fileInfo).addListener(
I think we can skip this entirely if `requestParams.verifyBlobContents() == false`?
No, we track the sizes of the data either way, we just don't verify it in that case. Comment added in 9352efe.
We can keep it as is. I was thinking that we don't need to track the file size when `verifyBlobContents == false`. IIUC, `blobBytesVerified` is only used for the status report and we don't report it when `throttledNanos == 0`, i.e. `verifyBlobContents == false`.
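The behaviour described in this thread (sizes are tracked either way, contents are only read when verification is requested) can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in `RepositoryIntegrityVerifier`: the class `BlobSizeTracker`, its methods, and its counters are hypothetical names, and the actual read-and-checksum step is replaced by a counter.

```java
import java.util.concurrent.atomic.AtomicLong;

public class BlobSizeTracker {
    private final boolean verifyBlobContents;
    private final AtomicLong blobBytesSeen = new AtomicLong();
    private final AtomicLong blobsRead = new AtomicLong();

    BlobSizeTracker(boolean verifyBlobContents) {
        this.verifyBlobContents = verifyBlobContents;
    }

    // Called once per blob listed in the shard container.
    void onBlob(long lengthInBytes) {
        // The size is recorded whether or not the contents are verified.
        blobBytesSeen.addAndGet(lengthInBytes);
        if (verifyBlobContents) {
            // Stand-in for reading the blob and checking its contents.
            blobsRead.incrementAndGet();
        }
    }

    long blobBytesSeen() {
        return blobBytesSeen.get();
    }

    long blobsRead() {
        return blobsRead.get();
    }

    public static void main(String[] args) {
        BlobSizeTracker metadataOnly = new BlobSizeTracker(false);
        metadataOnly.onBlob(100);
        metadataOnly.onBlob(50);
        System.out.println(metadataOnly.blobBytesSeen() + " bytes seen, " + metadataOnly.blobsRead() + " blobs read");
    }
}
```

Keeping a single code path for both modes means the metadata-only run exercises the same listing and bookkeeping logic as the full run; only the expensive read is skipped.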
Thanks @ywangd - most comments addressed inline.
I know what you mean, but we can't do that without keeping track of unreasonably large amounts of data during the analysis (or re-reading substantial numbers of blobs).
Yeah, hard to imagine anyone calling this from a client tbh but I'll add these in a follow-up.
Waiting until Yang's approval
LGTM 👍
There might be value in allowing verification for a subset of snapshots, e.g. snapshots within the last 30 days, so that it can potentially be used more frequently for integrity checking. In any case, it does not need to be in this PR.
If you suspect the integrity of the contents of one of your snapshot
repositories, cease all write activity to this repository immediately, set its
`read_only` option to `true`, and use this API to verify its integrity. Until
Do we want the API to enforce that `read_only` is set? I am not sure whether it is worthwhile to report some anomalies after a lengthy check just because there were concurrent writes.
I don't want to enforce that, no. We might want to run this on e.g. a Cloud repo, and getting the right permissions to modify its metadata is hard to do.
        .originalRepositoryGeneration() == repositoryVerifyIntegrityResponse.finalRepositoryGeneration()
        ? "pass"
        : "inconclusive due to concurrent writes"
    : "fail"
It might be helpful to also have some nuance for `fail`, depending on whether the final repository generation has changed?
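For readers without the full diff, the three outcomes in the fragment above can be sketched as a standalone method. The outer condition is not visible in the excerpt, so this sketch assumes it is "no anomalies were found"; the class name, method name, and `anomalyCount` parameter are hypothetical and are not taken from the PR.

```java
public class IntegrityResult {
    // Assumption: any anomaly yields "fail". A clean scan yields "pass" only
    // if the repository generation did not change while the scan was running.
    static String describe(long anomalyCount, long originalGeneration, long finalGeneration) {
        if (anomalyCount > 0) {
            return "fail";
        }
        return originalGeneration == finalGeneration
            ? "pass"
            : "inconclusive due to concurrent writes";
    }

    public static void main(String[] args) {
        System.out.println(describe(0, 5, 5));
        System.out.println(describe(0, 5, 6));
        System.out.println(describe(3, 5, 6));
    }
}
```

Under this reading, a `fail` result does not distinguish between a repository that was stable during the scan and one that was written to concurrently, which is the nuance the comment above asks about.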
    ).orElseThrow(AssertionError::new);
}

public void testSuccess() throws IOException {
I think we can use a test for cancellation.
// NB this next step doesn't matter for restorability, it is just verifying that the shard gen blob matches the shard
// snapshot blob
verifyShardGenerationConsistency(blobStoreIndexShardSnapshot, shardGenerationConsistencyListener);
Is it worth treating these anomalies differently and not reporting `fail` for them? It does not have to be part of this PR anyway.
No, it's still an integrity problem, just not one that affects restorability.
Adds an API which scans all the metadata (and optionally the raw data) in a snapshot repository to look for corruptions or other inconsistencies. Closes #52622 Closes ES-8560