
Better procedure for backup of a snapshot repo #54944

Open
DaveCTurner opened this issue Apr 8, 2020 · 5 comments
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement feedback_needed Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@DaveCTurner
Contributor

DaveCTurner commented Apr 8, 2020

The integrity of the files in a snapshot repository is vitally important. If the contents of any file in the repository are altered after the file is first written, then snapshots may become unrecoverable.

Many snapshot repositories have very good integrity: corruption of files in the major public cloud providers' blob stores is almost unheard-of. However, if you are using your own filesystem for the underlying blob store (e.g. using the shared-filesystem repository or Minio) then it is much more likely you will encounter repository corruption.

The usual mitigation for this is to take backups to separate media, but today we do not really offer a good procedure for taking such backups. Simply backing up the repository contents will not work if snapshots are being taken during the backup, because such backups are not atomic and may capture an inconsistent view of the repository. Some filesystems support atomic snapshots, which should work (but we don't document this); the only other option is to stop all snapshot activity for the duration of the backup.

I propose adding support for temporarily putting repositories into "backup mode" in which it is safe to take a possibly-incremental filesystem-level backup of their contents in any order even while taking more snapshots. When entering backup mode we would capture the current index-N file, maybe by copying it to a new filename, and would then be careful not to delete any blobs that are referenced by this file. The user would back up the repository including this specially-named index-N file, and then explicitly leave backup mode allowing us to process any pending deletes again. If the user's repository became unusable then they could restore it from such a backup and Elasticsearch would be able to revert the repository back to the state it was in when the backup was started.

(We'd need to carefully distinguish the case where the repository was restored from backup vs the case where the repository remained intact and the cluster itself was rebuilt, since in the latter case we should be able to expose later snapshots to the user -- it would not be acceptable to always automatically revert the repository back to the start of "backup mode")
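For illustration, the end-to-end workflow could look roughly like the following. This is only a sketch: the `_backup_mode` endpoint and its shape are hypothetical (nothing like it exists today), the paths are made up, and the copy step stands in for whatever filesystem-level backup tooling the user already has.

```python
# Hypothetical sketch only: the _backup_mode endpoint does not exist today,
# and the repository paths are made up for illustration.
import shutil
from pathlib import Path

import requests

ES = "http://localhost:9200"
REPO = "my_fs_repo"
REPO_PATH = Path("/mnt/snapshots/my_fs_repo")        # shared-filesystem repository
BACKUP_PATH = Path("/mnt/backup-media/my_fs_repo")   # separate backup media

# 1. Enter backup mode: the repository pins the current index-N file and
#    stops deleting any blob that it references.
requests.put(f"{ES}/_snapshot/{REPO}/_backup_mode").raise_for_status()
try:
    # 2. Copy the repository contents to the backup media. Because pending
    #    deletes are held back, the files can be copied in any order, even
    #    while further snapshots are being taken.
    shutil.copytree(REPO_PATH, BACKUP_PATH, dirs_exist_ok=True)
finally:
    # 3. Leave backup mode so that pending deletes can be processed again.
    requests.delete(f"{ES}/_snapshot/{REPO}/_backup_mode").raise_for_status()
```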

@DaveCTurner DaveCTurner added >enhancement :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Apr 8, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@pdubois

pdubois commented Apr 9, 2020

The proposed solution would shield users from performing inconsistent backups on separate media. It would be especially helpful when the backup media are of a different nature, such as tapes, much slower disks, or devices that are not always connected.
However, it might not shield users from other sources of inconsistency, such as the underlying storage for the snapshots randomly becoming inconsistent. I believe that the mechanism put in place should immediately prevent backing up inconsistent data once an inconsistency is detected.
The initially proposed approach would help only if it immediately detected consistency issues on the storage used for the initial snapshots. The idea is to prevent any possibility of backing up inconsistent data (e.g. onto tapes). If that is not the case, and there are issues with the underlying storage where the initial snapshots are made (e.g. the underlying disk degrades progressively, or is unreliable and returns random values when read), then the backup would not be usable. It would be a good thing, if possible, to detect that situation as soon as possible.

I also think there is another scenario that is not covered, where users need and want to reassure themselves. Once bitten, twice shy. Users recently had a case where the storage backing their indices was lost, and they may have been taking snapshots onto the same device, or onto another device that they suspected could also be affected by the same problem (e.g. ransomware). Therefore I believe it is very important, at every step of the backup process (live indices → snapshot on first-level backup → snapshot on secondary-level backup), to detect any inconsistency as early as possible, and never to overwrite data that was initially consistent when the source is not guaranteed to be consistent.
In other words, stable and consistent backup storage at any level of the process should not be taken as a given here, and the new procedure should take that fact into account.

Another use case that is not covered would be a separate tool, like a backup checker, that would help to check integrity without having to restore each index individually (the only workaround I know of). If no powerful enough cluster, or indeed no running cluster at all, is available, this would probably shorten the recovery time, by making you certain that what you will restore is not corrupted once a powerful enough cluster does become available.
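To make that last point concrete, here is a minimal sketch of the kind of standalone check I have in mind: record a digest of every file when the backup is taken, then re-verify the copy later without needing a running cluster. It only checks that the copied bytes are unchanged; it knows nothing about Elasticsearch's repository format, and the paths are made up.

```python
# Minimal sketch, not an Elasticsearch tool: it only verifies that the copied
# files are bit-for-bit unchanged since the manifest was written.
import hashlib
import json
from pathlib import Path


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict) -> list:
    """Return the relative paths whose contents no longer match the manifest."""
    current = build_manifest(root)
    return [path for path, digest in manifest.items() if current.get(path) != digest]


# Write the manifest next to the backup when it is taken...
backup = Path("/mnt/backup-media/my_fs_repo")  # made-up path
Path("manifest.json").write_text(json.dumps(build_manifest(backup)))
# ...and check it again before trusting the backup for a restore.
damaged = verify(backup, json.loads(Path("manifest.json").read_text()))
print(damaged or "backup contents unchanged")
```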

@DaveCTurner
Contributor Author

the underlying support for snapshot randomly not being consistent

@pdubois I think you are confusing detecting a corruption (the subject of #52622) with reacting to discovering that the repository is corrupt (the subject of this proposal). Detecting a corruption is relatively straightforward, albeit very expensive. The question is what to do next once you've discovered your repository is corrupt. How do you think users should react? Do they typically have secondary backups from which they can recover? How are they currently taking these backups?

@rjernst rjernst added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 4, 2020
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 9, 2021
DaveCTurner added a commit that referenced this issue Feb 9, 2021
This commit spells out how important repository reliability is to
searchable snapshots, and also documents a procedure for taking a backup
of a snapshot repository.

Relates #54944
DaveCTurner added a commit that referenced this issue Feb 9, 2021
This commit spells out how important repository reliability is to
searchable snapshots, and also documents a procedure for taking a backup
of a snapshot repository.

Relates #54944
DaveCTurner added a commit that referenced this issue Feb 9, 2021
This commit spells out how important repository reliability is to
searchable snapshots, and also documents a procedure for taking a backup
of a snapshot repository.

Relates #54944
@kunisen
Contributor

kunisen commented Feb 27, 2023

Thanks @DaveCTurner for guiding me from #93972 (comment) to this ticket.
Sorry I had one more concern here.

Assuming we have the following situation:

  • (A) We have frozen searchable snapshot indices stored in one repo
  • (B) We put the backup of the data into another repo
  • Due to some reason, e.g. an operational mistake, we lost (A), and we need to restore data from (B)

Is there a handy way to restore the data?

I vaguely remember that the stored searchable snapshot data relies on the bucket name / repository name.
So if we lose the data from the original object store bucket, then even if we mount a backup repo, it may not work seamlessly.
Sorry if I'm remembering that wrongly without testing... 🙏

If that's the case, then I feel it might be useful to have a way to, for example, forcibly unmount the original searchable snapshot which is already lost (i.e. #85088), and then remount it from the backup repository.

Appreciate your further input about this and thanks! 🙇

@DaveCTurner
Contributor Author

the stored searchable snapshot data relies on the bucket name / repository name.

This is not correct; we use a unique ID stored within the repository.
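In principle, restoring should therefore just be a matter of registering the backup copy as a repository under a new name and mounting the snapshot from it again. A rough client-side sketch (the repository, snapshot, and index names and the path are made up; the calls are the existing repository-registration and searchable-snapshot mount APIs):

```python
# Rough sketch; names and paths are illustrative only.
import requests

ES = "http://localhost:9200"

# Register the backup copy under a new repository name, read-only to be safe.
requests.put(f"{ES}/_snapshot/backup_repo", json={
    "type": "fs",
    "settings": {"location": "/mnt/backup-media/my_fs_repo", "readonly": True},
}).raise_for_status()

# Remount the searchable snapshot index from the backup repository
# (storage=shared_cache for a frozen-tier, partially mounted index).
requests.post(
    f"{ES}/_snapshot/backup_repo/my_snapshot/_mount",
    params={"storage": "shared_cache"},
    json={"index": "my_index"},
).raise_for_status()
```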
