Add repository metadata integrity check API #92373
Conversation
(Force-pushed from 9b6c512 to f861f21.)
This looks mostly good to me.
I wonder if we risk a concurrent SLM (or other) deletion of files causing false positives? And perhaps a concurrent snapshot creation could cause the same?
I am not entirely sure whether we could somehow disregard such concurrent changes. I'd be happy to see that as a refinement too, but then we need this to be clear in the documentation.
server/src/main/java/org/elasticsearch/repositories/blobstore/MetadataVerifier.java
    @Override
    protected void masterOperation(Task task, Request request, ClusterState state, ActionListener<Response> listener) throws Exception {
        // TODO add mechanism to block blob deletions while this is running
And yes you are right that this will yield false errors if blobs are deleted out from underneath us. I need advice from @original-brownbear about whether there's a nice way to prevent that. I mean we could make it a stop-the-world thing like repo cleanup but that seems overkill. In principle we should even be able to let snapshotting continue as long as we just held back any finalisation-time deletes. I just worry that this could be a bigger change to the state machine, which IMO isn't worth making any more complex just for this.
Yeah, I also think modifying the state machine for this seems like the wrong choice.
For deletions, I wonder if we could recheck that any missing blobs are still referenced/expected after finding that they are missing? I am not entirely sure of the order, but I think we delete the reference to the file before deleting the file?
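The recheck idea could be sketched roughly as follows. This is only an illustration of the suggestion, not the actual Elasticsearch implementation: `MetadataReader` and `confirmAnomalies` are hypothetical stand-ins, and it assumes the set of referenced blob names can be re-read after the misses were observed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class MissingBlobRecheck {
    // Hypothetical stand-in for re-reading the current repository metadata
    // to obtain the set of blob names it references.
    interface MetadataReader {
        Set<String> currentlyReferencedBlobs();
    }

    // Re-check each missing blob against a fresh read of the metadata. If a
    // concurrent delete removed the reference before removing the blob, the
    // blob will no longer be referenced, and we drop the false positive.
    static List<String> confirmAnomalies(List<String> missingBlobs, MetadataReader reader) {
        Set<String> stillReferenced = reader.currentlyReferencedBlobs();
        List<String> confirmed = new ArrayList<>();
        for (String blob : missingBlobs) {
            if (stillReferenced.contains(blob)) {
                confirmed.add(blob);
            }
        }
        return confirmed;
    }
}
```

Note this only suppresses false positives for the missing-blob case, which is why the next comment points out it does not generalise to all anomaly types.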
The number of potential anomalies is huge, and not all of them relate to missing blobs, so it's not clear that we can go back through the list and drop all the spurious ones.
I've made it so that it checks (and reports) the repo generation at the end of the analysis, so that at least you can see that the repo changed underneath us and therefore that some anomalies may be expected. I think this plus some docs about it will be sufficient for now.
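The generation check amounts to comparing the repository generation observed before and after the analysis. A minimal sketch, assuming the generation is exposed as a plain `long` (the real `MetadataVerifier` works with the repository's metadata, so the names here are illustrative):

```java
import java.util.function.LongSupplier;

class GenerationCheck {
    // Run the (long-running) analysis and report whether the repository
    // generation moved while it ran. If it did, the repo changed underneath
    // us, so some reported anomalies may be expected rather than real.
    static boolean repoChangedDuringAnalysis(LongSupplier repoGeneration, Runnable analysis) {
        long startGeneration = repoGeneration.getAsLong(); // generation at the start
        analysis.run();
        long endGeneration = repoGeneration.getAsLong();   // generation at the end
        return startGeneration != endGeneration;
    }
}
```

The point is not to suppress anomalies but to let the caller interpret them: a changed generation is a signal that the report may contain spurious entries.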
👍 sounds like a pragmatic solution.
(Force-pushed from 1c96e71 to a42310e.)
(Force-pushed from a42310e to f9024eb.)
Having now used this in anger a bit, I have concluded that reporting problems in logs and the REST response is not the right approach. The volume of information just becomes unmanageable on PiB-scale repositories. I think we should write the results to an index, and it's also super-useful to report some summary information, which we could do using the same index.
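The results-to-an-index idea could look something like this sketch: each anomaly becomes one document, and per-type counts are accumulated so a single summary document can be written at the end. All names here (`ResultsIndexWriter`, the `Consumer` standing in for an index request) are hypothetical; the actual implementation landed in the follow-up PR, not here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

class ResultsIndexWriter {
    // Stand-in for issuing an index request for one document.
    private final Consumer<Map<String, Object>> indexDoc;
    // Per-type counts, so the summary avoids flooding logs or the REST response.
    private final Map<String, Integer> countsByType = new HashMap<>();

    ResultsIndexWriter(Consumer<Map<String, Object>> indexDoc) {
        this.indexDoc = indexDoc;
    }

    // Write one anomaly as its own document and bump its type's count.
    void reportAnomaly(String type, String description) {
        countsByType.merge(type, 1, Integer::sum);
        indexDoc.accept(Map.of("kind", "anomaly", "type", type, "description", description));
    }

    // Write a single summary document once the analysis completes.
    void writeSummary() {
        indexDoc.accept(Map.of("kind", "summary", "counts_by_type", countsByType));
    }
}
```

Putting both the per-anomaly documents and the summary in the same index keeps the detail searchable at PiB scale while still giving a compact overview.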
My WIP implementation has drifted sufficiently far from this branch that I think it best to open a new one at #93735 and close this.
Relates #52622