Integrity checks for snapshots #52622

Closed

fkelbert opened this issue Feb 21, 2020 · 11 comments · Fixed by #112348
Labels
:Distributed Coordination/Snapshot/Restore, >enhancement, Team:Distributed (Obsolete)

Comments

@fkelbert

As far as I am aware, Elasticsearch does not implement integrity checks for snapshots. If I am not mistaken, Elasticsearch will silently restore snapshots only partially if, e.g., the data of one shard within the snapshot has been corrupted.

It would be an invaluable feature for users if Elasticsearch stored integrity hashes (e.g., one per shard) together with each snapshot; a rough sketch of what such a hash could look like follows below the list. These hashes could then be used to:

  • upon restoring a corrupted snapshot: inform the user that a snapshot has been only partially restored due to a corrupt shard
  • upon taking a new incremental snapshot: take a new snapshot of any parts that might have been corrupted
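A purely hypothetical sketch of the kind of per-shard hash meant above (the class name, method name, and directory layout are made up for illustration and are not taken from the Elasticsearch codebase):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical illustration only: one SHA-256 digest over all blobs of a shard
// directory, which could be stored next to the shard-level snapshot metadata
// and re-checked on restore. Requires Java 17+ for HexFormat and Stream.toList().
public class ShardDigest {

    public static String digestShard(Path shardDir) throws IOException, NoSuchAlgorithmException {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        List<Path> blobs;
        try (Stream<Path> files = Files.list(shardDir)) {
            blobs = files.sorted().toList(); // stable order so the digest is reproducible
        }
        byte[] buffer = new byte[8192];
        for (Path blob : blobs) {
            try (InputStream in = Files.newInputStream(blob)) {
                int read;
                while ((read = in.read(buffer)) != -1) {
                    sha256.update(buffer, 0, read);
                }
            }
        }
        return HexFormat.of().formatHex(sha256.digest());
    }
}
```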
@fkelbert added the :Distributed Coordination/Snapshot/Restore label Feb 21, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@original-brownbear
Member

> upon restoring a corrupted snapshot: inform the user that a snapshot has been only partially restored due to a corrupt shard

This already happens. All files in the snapshot (including the snapshot metadata) carry Lucene footers that contain a checksum of the file contents. We do not silently skip corrupted data: if any file in a shard is corrupted, we fail the restore for that shard.

> upon taking a new incremental snapshot: take a new snapshot of any parts that might have been corrupted

This one is trickier. We can't really do this for every new snapshot, as it would entail downloading all the files that we are reusing from previous snapshots, wouldn't it? In the real world this could be massive amounts of data and would not be feasible.
Also, if a snapshot got corrupted at a given location, I don't think we should quietly reuse that location. What guarantees would we have that the next snapshot written to it wouldn't be corrupted again?
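For reference, a rough sketch of what the footer verification mentioned above amounts to, assuming the blob has been downloaded to a local directory (the real restore path runs the equivalent check over the repository's own streams):

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

// Rough sketch: re-verify the Lucene footer checksum of a single blob that sits in a
// local directory. CodecUtil throws a CorruptIndexException on a mismatch, which is
// the same kind of failure that aborts the restore of the affected shard.
public class FooterCheck {
    public static void main(String[] args) throws IOException {
        try (Directory dir = FSDirectory.open(Paths.get(args[0]));
             IndexInput in = dir.openInput(args[1], IOContext.READONCE)) {
            // Recomputes the checksum over the whole file and compares it with the stored footer.
            long checksum = CodecUtil.checksumEntireFile(in);
            System.out.println(args[1] + ": footer checksum OK (" + checksum + ")");
        }
    }
}
```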

@fkelbert
Author

fkelbert commented Mar 9, 2020

> upon taking a new incremental snapshot: take a new snapshot of any parts that might have been corrupted

> We can't really do this for every new snapshot, as it would entail downloading all the files that we are reusing from previous snapshots, wouldn't it? In the real world this could be massive amounts of data and would not be feasible.

Can we detect this and inform the user, at least? This way, they would have a chance/choice to create a new backup, potentially in a different location.

In general, what is the recommended way to ensure snapshot integrity then? RAID0 or other methods on the backup drive?

Is this something we can document better in the snapshot/restore documentation?

@original-brownbear
Member

Sorry was out for 2 weeks, continuing here:

> Can we detect this and inform the user, at least?

How would we detect it? The only way to do so is to literally checksum every file in the repository, and we can't do that as part of any other operation on the repository.
There has been talk in the past about adding a verification endpoint that would allow you to run this check manually. Is this what you had in mind?

> In general, what is the recommended way to ensure snapshot integrity then? RAID0 or other methods on the backup drive?

We don't make any active recommendations here. Obviously, as with any other backup, you need to ensure that the volume you write to is safe. For the cloud-provider-backed blob stores and HDFS (assuming appropriate replication settings) that can probably be assumed.
For an NFS-based repository the user will have to make sure that the backing file system does not get corrupted. How they do that is (IMO) not something we can document, as there are just far too many possible setups to cover.

> Is this something we can document better in the snapshot/restore documentation?

See above; I'm not sure what we could possibly document here. The fact that whatever volume the snapshots go to must be safe/durable seems like a given to me. Giving specific instructions on how to set up NFS is way beyond the scope of our docs, I'd say.
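For what it's worth, the existing repository verification API only checks that all nodes can reach and write to the repository (the repository name below is a placeholder); it does not checksum any of the blobs that are already in it:

```
POST /_snapshot/my_repository/_verify
```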

@fkelbert
Author

> Sorry was out for 2 weeks, continuing here:
>
> Can we detect this and inform the user, at least?
>
> How would we detect it? The only way to do so is to literally checksum every file in the repository, and we can't do that as part of any other operation on the repository.

Yes, I had been thinking of checksums. But I am unable to say how practical this is.

> There has been talk in the past about adding a verification endpoint that would allow you to run this check manually. Is this what you had in mind?

I am not aware of this discussion. But some check, even manual, is better than none, imo.

> In general, what is the recommended way to ensure snapshot integrity then? RAID0 or other methods on the backup drive?
>
> We don't make any active recommendations here. Obviously, as with any other backup, you need to ensure that the volume you write to is safe. For the cloud-provider-backed blob stores and HDFS (assuming appropriate replication settings) that can probably be assumed.
> For an NFS-based repository the user will have to make sure that the backing file system does not get corrupted. How they do that is (IMO) not something we can document, as there are just far too many possible setups to cover.

I understand that we do not want to make active recommendations. But I think we should at least inform our users that they are expected to ensure the integrity (and possibly point to some common solutions such as RAID).

@original-brownbear
Member

> I am not aware of this discussion. But some check, even manual, is better than none, imo.

Let's build the manual integrity check here then :) I think that's the best we can do really. There is no way of doing an automated check that wouldn't significantly increase the cost of the snapshot functionality for Cloud backed repositories.

> But I think we should at least inform our users that they are expected to ensure the integrity

I think I disagree here. Who would ever expect snapshots to keep working if the underlying storage medium gets corrupted over time? :) This isn't specific to snapshots: whatever a user relies on an NFS setup for, one that randomly corrupts data would probably fail them in the long run. I'd take a stable and consistent NFS as a given here, tbh.

@DaveCTurner
Contributor

> But I think we should at least inform our users that they are expected to ensure the integrity

I do agree with this, but this raises further questions about how exactly they should do that. Today we don't have a good answer for that -- "avoid storage that might corrupt your data" isn't really helpful, and we don't have a good procedure for taking backups of the repository itself. I opened #54944 for this point.

@DaveCTurner
Contributor

@fkelbert I forgot to say that your opinion on #54944 would be helpful to hear, given the context of this discussion.

@rjernst added the Team:Distributed (Obsolete) label May 4, 2020
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Dec 14, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 11, 2023
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 13, 2023
@Matthew-Jenkins

How does ES check the snapshot integrity? Is there a way to write a script to do it? I can't find it readily in the repo here.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 17, 2024
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Aug 18, 2024
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Aug 18, 2024
DaveCTurner added a commit that referenced this issue Aug 19, 2024
Today there are a couple of assertions that can trip if the contents of
a snapshot repository are corrupted. It makes sense to assert the
integrity of snapshots in most tests, but we must also (a) protect
against these corruptions in production and (b) allow some tests to
verify the behaviour of the system when the repository is corrupted.

This commit introduces a flag to disable certain assertions, converts
the relevant assertions into production failures too, and introduces a
high-level test to verify that we do detect all relevant corruptions
without tripping any other assertions.

Extracted from #93735 as this change makes sense in its own right.
Relates #52622.
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Sep 4, 2024
davidkyle pushed a commit to davidkyle/elasticsearch that referenced this issue Sep 5, 2024
elasticsearchmachine pushed a commit that referenced this issue Sep 11, 2024
Adds an API which scans all the metadata (and optionally the raw data)
in a snapshot repository to look for corruptions or other
inconsistencies.

Closes #52622. Closes ES-8560.
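A minimal invocation of the resulting integrity verification API against a registered repository looks roughly like this (`my_repository` is a placeholder; see the current snapshot and restore reference docs for the exact endpoint and for the option that also checksums the raw data blobs):

```
POST /_snapshot/my_repository/_verify_integrity
```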
davidkyle pushed a commit that referenced this issue Sep 12, 2024