Integrity checks for snapshots #52622
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
This already happens. All files in the snapshot (including snapshot metadata) carry Lucene footers that contain a checksum of the file contents. If any file in a shard is corrupted, we do not silently skip the corrupted data; the restore properly fails for that shard.
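(For readers who want to check a single blob by hand: stock Lucene APIs can verify such a footer. Below is a minimal sketch, assuming a shared-filesystem repository; the class name, the directory argument, and the blob name are placeholders of mine, not anything from Elasticsearch itself.)

```java
import java.nio.file.Path;

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

public class FooterCheck {
    public static void main(String[] args) throws Exception {
        Path dir = Path.of(args[0]); // directory holding the blob, e.g. a shard folder of an "fs" repository
        String blobName = args[1];   // the blob's file name within that directory
        try (Directory directory = FSDirectory.open(dir);
             IndexInput input = directory.openInput(blobName, IOContext.READONCE)) {
            // Recomputes the CRC32 over the whole file and compares it with the
            // checksum stored in the Lucene footer; throws CorruptIndexException
            // on a mismatch.
            long checksum = CodecUtil.checksumEntireFile(input);
            System.out.println(blobName + ": OK (checksum=" + checksum + ")");
        }
    }
}
```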
This one is trickier. We can't really do this for every new snapshot, as it would entail downloading all the files that we are reusing from previous snapshots, wouldn't it? In the real world this could be massive amounts of data and would not be feasible.
Can we detect this and inform the user, at least? This way, they would have a chance/choice to create a new backup, potentially in a different location. In general, what is the recommended way to ensure snapshot integrity then? RAID0 or other methods on the backup drive? Is this something we can document better in the snapshot/restore documentation?
Sorry, I was out for 2 weeks; continuing here:
How would we detect it? The only way to do so is to literally checksum every file in the repository; we can't do this as part of any other operation on the repository.
We don't make any active recommendations here. Obviously, as with any other backup, you need to ensure that the volume you write to is safe. For the cloud-provider-backed blob stores and HDFS (assuming appropriate replication settings) that can probably be assumed.
See above; I'm not sure what we could possibly document here. The fact that whatever volume the snapshots go to must be safe/durable seems to be a given to me. Giving specific instructions on how to set up NFS is way beyond the scope of our docs, I'd say.
Yes, I had been thinking of checksums. But I am unable to say how practical this is.
I am not aware of this discussion. But some check, even manual, is better than none, imo.
I understand that we do not want to make active recommendations. But I think we should at least inform our users that they are expected to ensure the integrity (and possibly point to some common solutions such as RAID).
Let's build the manual integrity check here then :) I think that's really the best we can do. There is no way of doing an automated check that wouldn't significantly increase the cost of the snapshot functionality for cloud-backed repositories.
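As a rough illustration of what such a manual check could look like for a shared-filesystem ("fs") repository, the sketch below walks the repository and verifies the Lucene footer of every blob. This is my own sketch, not an official tool; it assumes every blob carries a footer, which isn't quite true (a few tiny housekeeping blobs, such as index.latest if I recall correctly, are written raw), so a real tool would need to special-case those.

```java
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

public class RepoScan {
    public static void main(String[] args) throws Exception {
        Path repoRoot = Path.of(args[0]); // root directory of the "fs" repository
        try (var files = Files.walk(repoRoot)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                try (Directory dir = FSDirectory.open(file.getParent());
                     IndexInput in = dir.openInput(file.getFileName().toString(), IOContext.READONCE)) {
                    // Throws if the recomputed checksum does not match the footer.
                    CodecUtil.checksumEntireFile(in);
                } catch (Exception e) {
                    // Either genuine corruption or a blob that legitimately has
                    // no footer; a real tool would distinguish the two cases.
                    System.err.println("suspect blob " + file + ": " + e.getMessage());
                }
            });
        }
    }
}
```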
I think I disagree here. Who would ever expect that snapshots will continue to work if the underlying storage medium gets corrupted over time? :) This isn't specific to snapshots: an NFS setup that randomly corrupts data would probably fail a user in the long run no matter what they used it for. I'd take a stable and consistent NFS as a given here, tbh.
I do agree with this, but this raises further questions about how exactly they should do that. Today we don't have a good answer for that -- "avoid storage that might corrupt your data" isn't really helpful, and we don't have a good procedure for taking backups of the repository itself. I opened #54944 for this point.
Pinging @elastic/es-distributed (Team:Distributed)
How does ES check the snapshot integrity? Is there a way to write a script to do it? I can't find it readily in the repo here. |
Today there are a couple of assertions that can trip if the contents of a snapshot repository are corrupted. It makes sense to assert the integrity of snapshots in most tests, but we must also (a) protect against these corruptions in production and (b) allow some tests to verify the behaviour of the system when the repository is corrupted. This commit introduces a flag to disable certain assertions, converts the relevant assertions into production failures too, and introduces a high-level test to verify that we do detect all relevant corruptions without tripping any other assertions. Relates elastic#52622
Today there are a couple of assertions that can trip if the contents of a snapshot repository are corrupted. It makes sense to assert the integrity of snapshots in most tests, but we must also (a) protect against these corruptions in production and (b) allow some tests to verify the behaviour of the system when the repository is corrupted. This commit introduces a flag to disable certain assertions, converts the relevant assertions into production failures too, and introduces a high-level test to verify that we do detect all relevant corruptions without tripping any other assertions. Extracted from #93735 as this change makes sense in its own right. Relates #52622.
Adds an API which scans all the metadata (and optionally the raw data) in a snapshot repository to look for corruptions or other inconsistencies. Closes #52622 Closes ES-8560
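For completeness, here is a hedged usage sketch against the resulting API using the low-level Java REST client. The endpoint path is my best recollection of what this change adds (something like POST /_snapshot/&lt;repository&gt;/_verify_integrity), and my_repository is a placeholder, so treat this as an assumption rather than authoritative documentation.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class VerifyIntegrity {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // "my_repository" is a placeholder for a registered snapshot repository;
            // the endpoint name is assumed from the API described in this thread.
            Request request = new Request("POST", "/_snapshot/my_repository/_verify_integrity");
            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}
```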
As far as I am aware, Elasticsearch does not implement integrity checks for snapshots. If I am not mistaken, Elasticsearch will silently restore snapshots partially if, e.g., the data of one shard within the snapshot has been corrupted.
It would be an invaluable feature for users if Elasticsearch stored integrity hashes (e.g., one per shard) together with each snapshot. These hashes could then be used to:
- verify the integrity of a snapshot when restoring it, rather than silently restoring partial data, and
- verify that existing snapshots in the repository have not been corrupted over time.