
Better procedure for backup of a snapshot repo #54944

Open
DaveCTurner opened this issue Apr 8, 2020 · 5 comments
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement feedback_needed Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@DaveCTurner
Contributor

DaveCTurner commented Apr 8, 2020

The integrity of the files in a snapshot repository is vitally important. If the contents of any file in the repository are altered after the file is first written, then snapshots may become unrecoverable.

Many snapshot repositories have very good integrity: corruption of files in the major public cloud providers' blob stores is almost unheard-of. However, if you are using your own filesystem for the underlying blob store (e.g. using the shared-filesystem repository or Minio) then it is much more likely you will encounter repository corruption.

The usual mitigation for this is to take backups to separate media, but today we do not really offer a good procedure for taking such backups. Simply backing up the repository contents will not work if snapshots are being taken during the backup, because such backups are not atomic and may capture an inconsistent view of the repository. Some filesystems support atomic snapshots, which should work (but we don't document this); the only other option is to stop all snapshot activity for the duration of the backup.

I propose adding support for temporarily putting repositories into "backup mode" in which it is safe to take a possibly-incremental filesystem-level backup of their contents in any order even while taking more snapshots. When entering backup mode we would capture the current index-N file, maybe by copying it to a new filename, and would then be careful not to delete any blobs that are referenced by this file. The user would back up the repository including this specially-named index-N file, and then explicitly leave backup mode allowing us to process any pending deletes again. If the user's repository became unusable then they could restore it from such a backup and Elasticsearch would be able to revert the repository back to the state it was in when the backup was started.

(We'd need to carefully distinguish the case where the repository was restored from backup vs the case where the repository remained intact and the cluster itself was rebuilt, since in the latter case we should be able to expose later snapshots to the user -- it would not be acceptable to always automatically revert the repository back to the start of "backup mode")
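For illustration, the end-to-end workflow could look roughly like the following. This is only a sketch: the `_backup_mode` endpoint and its shape are hypothetical (nothing like it exists today), the paths are made up, and the copy step stands in for whatever filesystem-level backup tooling the user already has.

```python
# Hypothetical sketch only: the _backup_mode endpoint does not exist today,
# and the repository paths are made up for illustration.
import shutil
from pathlib import Path

import requests

ES = "http://localhost:9200"
REPO = "my_fs_repo"
REPO_PATH = Path("/mnt/snapshots/my_fs_repo")        # shared-filesystem repository
BACKUP_PATH = Path("/mnt/backup-media/my_fs_repo")   # separate backup media

# 1. Enter backup mode: the repository pins the current index-N file and
#    stops deleting any blob that it references.
requests.put(f"{ES}/_snapshot/{REPO}/_backup_mode").raise_for_status()
try:
    # 2. Copy the repository contents to the backup media. Because pending
    #    deletes are held back, the files can be copied in any order, even
    #    while further snapshots are being taken.
    shutil.copytree(REPO_PATH, BACKUP_PATH, dirs_exist_ok=True)
finally:
    # 3. Leave backup mode so that pending deletes can be processed again.
    requests.delete(f"{ES}/_snapshot/{REPO}/_backup_mode").raise_for_status()
```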

@DaveCTurner DaveCTurner added >enhancement :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Apr 8, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@pdubois

pdubois commented Apr 9, 2020

The proposed solution would shield users from performing inconsistent backups on separate media. It would be especially helpful when the backup media are of a different nature, such as tapes, much slower disks, or devices that are not always connected.
However, it might not shield users from other sources of inconsistency, such as the underlying storage for the snapshots randomly becoming inconsistent. I believe that the mechanism put in place should immediately prevent backing up inconsistent data once an inconsistency is detected.
The initially proposed approach would help only if it immediately detected consistency issues on the storage used for the initial snapshots. The idea is to prevent any possibility of backing up inconsistent data (e.g. onto tapes). If that is not the case, and there are issues with the underlying storage where the initial snapshots are made (e.g. the underlying disk degrades progressively, or is unreliable and returns random values when read), then the backup would not be usable. It would be a good thing, if possible, to detect that situation as soon as possible.

I also think there is another scenario that is not covered, where users need and want to reassure themselves. Once bitten, twice shy. Users recently had a case where the storage backing their indices was lost, and they may have been taking snapshots onto the same device, or onto another device that they suspected could also be affected by the same problem (e.g. ransomware). Therefore I believe it is very important, at every step of the backup process (live indices → snapshot on first-level backup → snapshot on secondary-level backup), to detect any inconsistency as early as possible, and never to overwrite data that was initially consistent when the source is not guaranteed to be consistent.
In other words, stable and consistent backup storage at any level of the process should not be taken as a given here, and the new procedure should take that fact into account.

Another use case that is not covered would be a separate tool, like a backup checker, that would help to check integrity without having to restore each index individually (the only workaround I know of). If no powerful enough cluster, or indeed no running cluster at all, is available, this would probably shorten the recovery time, by making you certain that what you will restore is not corrupted once a powerful enough cluster does become available.
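To make that last point concrete, here is a minimal sketch of the kind of standalone check I have in mind: record a digest of every file when the backup is taken, then re-verify the copy later without needing a running cluster. It only checks that the copied bytes are unchanged; it knows nothing about Elasticsearch's repository format, and the paths are made up.

```python
# Minimal sketch, not an Elasticsearch tool: it only verifies that the copied
# files are bit-for-bit unchanged since the manifest was written.
import hashlib
import json
from pathlib import Path


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict) -> list:
    """Return the relative paths whose contents no longer match the manifest."""
    current = build_manifest(root)
    return [path for path, digest in manifest.items() if current.get(path) != digest]


# Write the manifest next to the backup when it is taken...
backup = Path("/mnt/backup-media/my_fs_repo")  # made-up path
Path("manifest.json").write_text(json.dumps(build_manifest(backup)))
# ...and check it again before trusting the backup for a restore.
damaged = verify(backup, json.loads(Path("manifest.json").read_text()))
print(damaged or "backup contents unchanged")
```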

@DaveCTurner
Contributor Author

the underlying support for snapshot randomly not being consistent

@pdubois I think you are confusing detecting a corruption (the subject of #52622) with reacting to discovering that the repository is corrupt (the subject of this proposal). Detecting a corruption is relatively straightforward, albeit very expensive. The question is what to do next once you've discovered your repository is corrupt. How do you think users should react? Do they typically have secondary backups from which they can recover? How are they currently taking these backups?

@rjernst rjernst added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 4, 2020
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 9, 2021
DaveCTurner added a commit that referenced this issue Feb 9, 2021
This commit spells out how important repository reliability is to
searchable snapshots, and also documents a procedure for taking a backup
of a snapshot repository.

Relates #54944
DaveCTurner added a commit that referenced this issue Feb 9, 2021
This commit spells out how important repository reliability is to
searchable snapshots, and also documents a procedure for taking a backup
of a snapshot repository.

Relates #54944
DaveCTurner added a commit that referenced this issue Feb 9, 2021
This commit spells out how important repository reliability is to
searchable snapshots, and also documents a procedure for taking a backup
of a snapshot repository.

Relates #54944
@kunisen
Contributor

kunisen commented Feb 27, 2023

Thanks @DaveCTurner for guiding me from #93972 (comment) to this ticket.
Sorry I had one more concern here.

Assuming we have the following situation:

  • (A) We have frozen searchable snapshot indices stored in one repo
  • (B) We put the backup of the data into another repo
  • Due to some reason, e.g. an operational mistake, we lost (A), and we need to restore data from (B)

Is there a handy way to restore the data?

I vaguely remember that the stored searchable snapshot data relies on the bucket name / repository name.
So if we lose the data from the original object store bucket, then even if we mount a backup repo, it may not work seamlessly.
Sorry if I'm remembering that wrongly without testing... 🙏

If that's the case, then I feel it might be useful to have a way to, for example, forcibly unmount the original searchable snapshot which is already lost (i.e. #85088), and then remount it from the backup repository.

Appreciate your further input about this and thanks! 🙇

@DaveCTurner
Contributor Author

the stored searchable snapshot data relies on the bucket name / repository name.

This is not correct; we use a unique ID stored within the repository.
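In principle, restoring should therefore just be a matter of registering the backup copy as a repository under a new name and mounting the snapshot from it again. A rough client-side sketch (the repository, snapshot, and index names and the path are made up; the calls are the existing repository-registration and searchable-snapshot mount APIs):

```python
# Rough sketch; names and paths are illustrative only.
import requests

ES = "http://localhost:9200"

# Register the backup copy under a new repository name, read-only to be safe.
requests.put(f"{ES}/_snapshot/backup_repo", json={
    "type": "fs",
    "settings": {"location": "/mnt/backup-media/my_fs_repo", "readonly": True},
}).raise_for_status()

# Remount the searchable snapshot index from the backup repository
# (storage=shared_cache for a frozen-tier, partially mounted index).
requests.post(
    f"{ES}/_snapshot/backup_repo/my_snapshot/_mount",
    params={"storage": "shared_cache"},
    json={"index": "my_index"},
).raise_for_status()
```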
