Integrity checks for snapshots #52622

Closed

fkelbert opened this issue Feb 21, 2020 · 11 comments · Fixed by #112348
Labels
:Distributed Coordination/Snapshot/Restore, >enhancement, Team:Distributed (Obsolete)

Comments

@fkelbert

As far as I am aware, Elasticsearch does not implement integrity checks for snapshots. If I am not mistaken, Elasticsearch will silently restore snapshots only partially if, e.g., the data of one shard within the snapshot has been corrupted.

It would be an invaluable feature for users if Elasticsearch stored integrity hashes (e.g., one per shard) together with each snapshot; a rough sketch of what such a hash could look like follows below the list. These hashes could then be used to:

  • upon restoring a corrupted snapshot: inform the user that a snapshot has been only partially restored due to a corrupt shard
  • upon taking a new incremental snapshot: take a new snapshot of any parts that might have been corrupted
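A purely hypothetical sketch of the kind of per-shard hash meant above (the class name, method name, and directory layout are made up for illustration and are not taken from the Elasticsearch codebase):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical illustration only: one SHA-256 digest over all blobs of a shard
// directory, which could be stored next to the shard-level snapshot metadata
// and re-checked on restore. Requires Java 17+ for HexFormat and Stream.toList().
public class ShardDigest {

    public static String digestShard(Path shardDir) throws IOException, NoSuchAlgorithmException {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        List<Path> blobs;
        try (Stream<Path> files = Files.list(shardDir)) {
            blobs = files.sorted().toList(); // stable order so the digest is reproducible
        }
        byte[] buffer = new byte[8192];
        for (Path blob : blobs) {
            try (InputStream in = Files.newInputStream(blob)) {
                int read;
                while ((read = in.read(buffer)) != -1) {
                    sha256.update(buffer, 0, read);
                }
            }
        }
        return HexFormat.of().formatHex(sha256.digest());
    }
}
```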
@fkelbert added the :Distributed Coordination/Snapshot/Restore label Feb 21, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@original-brownbear
Member

> upon restoring a corrupted snapshot: inform the user that a snapshot has been only partially restored due to a corrupt shard

This already happens. All files in the snapshot (including the snapshot metadata) carry Lucene footers that contain a checksum of the file contents. We do not silently skip corrupted data: if any file in a shard is corrupted, we fail the restore for that shard.

> upon taking a new incremental snapshot: take a new snapshot of any parts that might have been corrupted

This one is trickier. We can't really do this for every new snapshot, as it would entail downloading all the files that we are reusing from previous snapshots, wouldn't it? In the real world this could be massive amounts of data and would not be feasible.
Also, if a snapshot got corrupted at a given location, I don't think we should quietly reuse that location. What guarantees would we have that the next snapshot written to it wouldn't be corrupted again?
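For reference, a rough sketch of what the footer verification mentioned above amounts to, assuming the blob has been downloaded to a local directory (the real restore path runs the equivalent check over the repository's own streams):

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

// Rough sketch: re-verify the Lucene footer checksum of a single blob that sits in a
// local directory. CodecUtil throws a CorruptIndexException on a mismatch, which is
// the same kind of failure that aborts the restore of the affected shard.
public class FooterCheck {
    public static void main(String[] args) throws IOException {
        try (Directory dir = FSDirectory.open(Paths.get(args[0]));
             IndexInput in = dir.openInput(args[1], IOContext.READONCE)) {
            // Recomputes the checksum over the whole file and compares it with the stored footer.
            long checksum = CodecUtil.checksumEntireFile(in);
            System.out.println(args[1] + ": footer checksum OK (" + checksum + ")");
        }
    }
}
```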

@fkelbert
Author

fkelbert commented Mar 9, 2020

> upon taking a new incremental snapshot: take a new snapshot of any parts that might have been corrupted

> We can't really do this for every new snapshot, as it would entail downloading all the files that we are reusing from previous snapshots, wouldn't it? In the real world this could be massive amounts of data and would not be feasible.

Can we detect this and inform the user, at least? This way, they would have a chance/choice to create a new backup, potentially in a different location.

In general, what is the recommended way to ensure snapshot integrity then? RAID0 or other methods on the backup drive?

Is this something we can document better in the snapshot/restore documentation?

@original-brownbear
Member

Sorry was out for 2 weeks, continuing here:

> Can we detect this and inform the user, at least?

How would we detect it? The only way to do so is to literally checksum every file in the repository, and we can't do that as part of any other operation on the repository.
There has been talk in the past about adding a verification endpoint that would allow you to run this check manually. Is this what you had in mind?

> In general, what is the recommended way to ensure snapshot integrity then? RAID0 or other methods on the backup drive?

We don't make any active recommendations here. Obviously, as with any other backup, you need to ensure that the volume you write to is safe. For the cloud-provider-backed blob stores and HDFS (assuming appropriate replication settings) that can probably be assumed.
For an NFS-based repository the user will have to make sure that the backing file system does not get corrupted. How they do that is (IMO) not something we can document, as there are just far too many possible setups to cover.

> Is this something we can document better in the snapshot/restore documentation?

See above; I'm not sure what we could possibly document here. The fact that whatever volume the snapshots go to must be safe/durable seems like a given to me. Giving specific instructions on how to set up NFS is way beyond the scope of our docs, I'd say.
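For what it's worth, the existing repository verification API only checks that all nodes can reach and write to the repository (the repository name below is a placeholder); it does not checksum any of the blobs that are already in it:

```
POST /_snapshot/my_repository/_verify
```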

@fkelbert
Author

> Sorry was out for 2 weeks, continuing here:
>
> Can we detect this and inform the user, at least?
>
> How would we detect it? The only way to do so is to literally checksum every file in the repository, and we can't do that as part of any other operation on the repository.

Yes, I had been thinking of checksums. But I am unable to say how practical this is.

> There has been talk in the past about adding a verification endpoint that would allow you to run this check manually. Is this what you had in mind?

I am not aware of this discussion. But some check, even manual, is better than none, imo.

> In general, what is the recommended way to ensure snapshot integrity then? RAID0 or other methods on the backup drive?
>
> We don't make any active recommendations here. Obviously, as with any other backup, you need to ensure that the volume you write to is safe. For the cloud-provider-backed blob stores and HDFS (assuming appropriate replication settings) that can probably be assumed.
> For an NFS-based repository the user will have to make sure that the backing file system does not get corrupted. How they do that is (IMO) not something we can document, as there are just far too many possible setups to cover.

I understand that we do not want to make active recommendations. But I think we should at least inform our users that they are expected to ensure the integrity (and possibly point to some common solutions such as RAID).

@original-brownbear
Member

> I am not aware of this discussion. But some check, even manual, is better than none, imo.

Let's build the manual integrity check here then :) I think that's the best we can do really. There is no way of doing an automated check that wouldn't significantly increase the cost of the snapshot functionality for Cloud backed repositories.

> But I think we should at least inform our users that they are expected to ensure the integrity

I think I disagree here. Who would ever expect snapshots to keep working if the underlying storage medium gets corrupted over time? :) This isn't specific to snapshots: whatever a user relies on an NFS setup for, one that randomly corrupts data would probably fail them in the long run. I'd take a stable and consistent NFS as a given here, tbh.

@DaveCTurner
Contributor

> But I think we should at least inform our users that they are expected to ensure the integrity

I do agree with this, but this raises further questions about how exactly they should do that. Today we don't have a good answer for that -- "avoid storage that might corrupt your data" isn't really helpful, and we don't have a good procedure for taking backups of the repository itself. I opened #54944 for this point.

@DaveCTurner
Contributor

@fkelbert I forgot to say that your opinion on #54944 would be helpful to hear, given the context of this discussion.

@rjernst added the Team:Distributed (Obsolete) label May 4, 2020
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Dec 14, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 11, 2023
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 13, 2023
@Matthew-Jenkins

How does ES check the snapshot integrity? Is there a way to write a script to do it? I can't find it readily in the repo here.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 17, 2024
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Aug 18, 2024
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Aug 18, 2024
DaveCTurner added a commit that referenced this issue Aug 19, 2024
Today there are a couple of assertions that can trip if the contents of
a snapshot repository are corrupted. It makes sense to assert the
integrity of snapshots in most tests, but we must also (a) protect
against these corruptions in production and (b) allow some tests to
verify the behaviour of the system when the repository is corrupted.

This commit introduces a flag to disable certain assertions, converts
the relevant assertions into production failures too, and introduces a
high-level test to verify that we do detect all relevant corruptions
without tripping any other assertions.

Extracted from #93735 as this change makes sense in its own right.
Relates #52622.
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Sep 4, 2024
davidkyle pushed a commit to davidkyle/elasticsearch that referenced this issue Sep 5, 2024
elasticsearchmachine pushed a commit that referenced this issue Sep 11, 2024
Adds an API which scans all the metadata (and optionally the raw data)
in a snapshot repository to look for corruptions or other
inconsistencies.

Closes #52622. Closes ES-8560.
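A minimal invocation of the resulting integrity verification API against a registered repository looks roughly like this (`my_repository` is a placeholder; see the current snapshot and restore reference docs for the exact endpoint and for the option that also checksums the raw data blobs):

```
POST /_snapshot/my_repository/_verify_integrity
```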
davidkyle pushed a commit that referenced this issue Sep 12, 2024