Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce repository integrity verification API #112348

Merged

Conversation

DaveCTurner
Copy link
Contributor

@DaveCTurner DaveCTurner commented Aug 29, 2024

Adds an API which scans all the metadata (and optionally the raw data)
in a snapshot repository to look for corruptions or other
inconsistencies.

Closes #52622
Closes ES-8560

Adds an API which scans all the metadata (and optionally the raw data)
in a snapshot repository to look for corruptions or other
inconsistencies.
@DaveCTurner DaveCTurner added >enhancement :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. v8.16.0 labels Aug 29, 2024
Copy link
Contributor

Documentation preview:

@elasticsearchmachine
Copy link
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Hi @DaveCTurner, I've updated the changelog YAML for you.

@DaveCTurner
Copy link
Contributor Author

I expect there will be other things we will think of that can be verified here, but I'd rather avoid further scope creep if possible so would prefer to open new issues for follow-up work instead of blocking this PR on them.

@DaveCTurner DaveCTurner marked this pull request as ready for review August 29, 2024 14:48
@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Aug 29, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

nicktindall
nicktindall previously approved these changes Sep 4, 2024
Copy link
Contributor

@nicktindall nicktindall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

There's lots about the actual verifications that I don't fully understand (not being familiar enough with snapshot or repository structure), but everything else makes sense to me.

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments and questions. Nothing major, please feel free to pick and choose.

Intuitively, I think a nesting format groupped by snapshot, then by index and then by shard feels more useful, e.g. you'll see full result of one snapshot before analyzing the next one. But it may not work with the streaming API. Or maybe you have different reasons.

I think we could use both API spec and REST test for the new API. They can be follow-ups. I also wonder whether we should mark it as experimental to give us some room for future tweaking?

Strings.format(
"""
Cannot verify the integrity of all index snapshots because this repository contains too many shard snapshot \
failures: there are [%d] shard snapshot failures but [?%s] is set to [%d]. \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
failures: there are [%d] shard snapshot failures but [?%s] is set to [%d]. \
failures: there are [%d] shard snapshot failures but [%s] is set to [%d]. \

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather keep the ? to indicate it's a query param rather than a setting or something else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was a typo. I didn't know we have this logging convention for query parameters.

new RepositoryVerifyIntegrityResponseChunk.Builder(
responseChunkWriter,
RepositoryVerifyIntegrityResponseChunk.Type.INDEX_RESTORABILITY
).indexRestorability(indexId, totalSnapshotCounter.get(), restorableSnapshotCounter.get()).write(l);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether we should include some brief information about the latest restorable snapshot for this index. The situation would be quite different if the latest restorable snapshot is the a day ago vs a month ago?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know what you mean, but I'm not sure that's all that useful in practice (at least in the situations I've wanted to use this API so far). Let's leave it for now.

}
}

blobContentsListeners(indexDescription, shardContainerContents, fileInfo).addListener(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can skip this entirely if requestParams.verifyBlobContents() == false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we track the sizes of the data either way, we just don't verify it in that case. Comment added in 9352efe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can keep it as is. I was thinking that we don't need to track the file size when verifyBlobContents == false. IIUC, blobBytesVerified is only used for status report and we don't report it when throttledNanos == 0, i.e. verifyBlobContents == false.

@DaveCTurner
Copy link
Contributor Author

Thanks @ywangd - most comments addressed inline.

Intuitively, I think a nesting format groupped by snapshot, then by index and then by shard feels more useful, e.g. you'll see full result of one snapshot before analyzing the next one. But it may not work with the streaming API. Or maybe you have different reasons.

I know what you mean, but we can't do that without keeping track of unreasonably large amounts of data during the analysis (or re-reading substantial numbers of blobs).

I think we could use both API spec and REST test for the new API. They can be follow-ups. I also wonder whether we should mark it as experimental to give us some room for future tweaking?

Yeah, hard to imagine anyone calling this from a client tbh but I'll add these in a follow-up.

@DaveCTurner DaveCTurner dismissed nicktindall’s stale review September 8, 2024 15:38

Waiting until Yang's approval

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

There might be value to allow verification for a subset of snapshots, e.g. snapshots within last 30 days. So it can potentially be used more frequently for integrity checking. In anycase, it does not need to be this PR.


If you suspect the integrity of the contents of one of your snapshot
repositories, cease all write activity to this repository immediately, set its
`read_only` option to `true`, and use this API to verify its integrity. Until
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to enforce ready_only is set in the API? I am not sure whether it is worthwhile to report some anomalies after a length check just because there were concurrent writes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to enforce that, no. We might want to run this on e.g. a Cloud repo, and getting the right permissions to modify its metadata is hard to do.

.originalRepositoryGeneration() == repositoryVerifyIntegrityResponse.finalRepositoryGeneration()
? "pass"
: "inconclusive due to concurrent writes"
: "fail"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be helpful to also have some nuances for fail depending on whether final repository generation has changed?

).orElseThrow(AssertionError::new);
}

public void testSuccess() throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use a test for cancellation.

Strings.format(
"""
Cannot verify the integrity of all index snapshots because this repository contains too many shard snapshot \
failures: there are [%d] shard snapshot failures but [?%s] is set to [%d]. \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was a typo. I didn't know we have this logging convention for query parameters.

}
}

blobContentsListeners(indexDescription, shardContainerContents, fileInfo).addListener(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can keep it as is. I was thinking that we don't need to track the file size when verifyBlobContents == false. IIUC, blobBytesVerified is only used for status report and we don't report it when throttledNanos == 0, i.e. verifyBlobContents == false.

Comment on lines +533 to +535
// NB this next step doesn't matter for restorability, it is just verifying that the shard gen blob matches the shard
// snapshot blob
verifyShardGenerationConsistency(blobStoreIndexShardSnapshot, shardGenerationConsistencyListener);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth to treat these anomalies differently and not reporting fail for them? Does not have to be part of this PR anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's still an integrity problem, just not one that affects restorability.

@DaveCTurner DaveCTurner added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 11, 2024
@DaveCTurner DaveCTurner requested a review from a team as a code owner September 11, 2024 12:15
@elasticsearchmachine elasticsearchmachine merged commit f79fb8c into elastic:main Sep 11, 2024
15 checks passed
@DaveCTurner DaveCTurner deleted the 2024/08/29/verify-repo-integrity branch September 11, 2024 13:18
davidkyle pushed a commit that referenced this pull request Sep 12, 2024
Adds an API which scans all the metadata (and optionally the raw data)
in a snapshot repository to look for corruptions or other
inconsistencies.

Closes #52622 Closes
ES-8560
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.16.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Integrity checks for snapshots
4 participants