Better procedure for backup of a snapshot repo #54944
Comments
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
The proposed solution would shield users from taking inconsistent backups on separate media. It would help in particular when the media used to back up the snapshots are of a different nature, such as tapes, much slower disks, or devices that are not always connected. I also think there is another scenario that is not covered: users who need and want to reassure themselves. Once bitten, twice shy. Users recently had a case where indices were lost, and it could be that they were taking the snapshot on the same device, or on another device that they thought might also be affected by a problem (e.g. ransomware). I therefore believe it is very important to detect any inconsistency as early as possible at every step of the backup process (live indices --> snapshot on first-level backup --> snapshot on second-level backup), and never to overwrite data that was initially consistent if the source is not guaranteed to be consistent. A use case that is not covered would be a separate tool, such as a backup checker, that checks integrity without having to restore each index individually (the only workaround to my knowledge). If a sufficiently powerful cluster, or any running cluster at all, is unavailable, this would probably shorten the recovery time, because you would already be certain that what you are restoring is not corrupted once a sufficiently powerful cluster becomes available.
@pdubois I think you are confusing detecting a corruption (the subject of #52622) with reacting to discovering that the repository is corrupt (the subject of this proposal). Detecting a corruption is relatively straightforward, albeit very expensive. The question is what to do next once you've discovered your repository is corrupt. How do you think users should react? Do they typically have secondary backups from which they can recover? How are they currently taking these backups?
This commit spells out how important repository reliability is to searchable snapshots, and also documents a procedure for taking a backup of a snapshot repository. Relates #54944
Thanks @DaveCTurner for guiding me from #93972 (comment) to this ticket. Assuming we have the following situation:
Is there a handy way to restore the data? I vaguely remember that the stored searchable snapshot data relies on the bucket name / repository name. If that's the case, then I feel it might be useful to have a way to, for example, forcibly unmount the original searchable snapshot, which is already lost (see #85088), and then remount it from the backup repository. I'd appreciate your further input on this, and thanks! 🙇
This is not correct; we use a unique ID stored within the repository.
The integrity of the files in a snapshot repository is vitally important. If the contents of any file in the repository are altered after it was first written then snapshots may become unrecoverable.
Many snapshot repositories have very good integrity: corruption of files in the major public cloud providers' blob stores is almost unheard-of. However, if you are using your own filesystem for the underlying blob store (e.g. using the shared-filesystem repository or Minio) then it is much more likely you will encounter repository corruption.
The usual mitigation for this is to take backups to separate media, but today we do not really offer a good procedure for taking such backups. Simply backing up the repository contents will not work if snapshots are being taken during the backup, because such backups are not atomic and may capture an inconsistent view of the repository. Some filesystems support atomic snapshots, which should work (but we don't document this), and the only other option is to stop all snapshot activity for the duration of the backup.
I propose adding support for temporarily putting repositories into "backup mode" in which it is safe to take a possibly-incremental filesystem-level backup of their contents in any order even while taking more snapshots. When entering backup mode we would capture the current index-N file, maybe by copying it to a new filename, and would then be careful not to delete any blobs that are referenced by this file. The user would back up the repository including this specially-named index-N file, and then explicitly leave backup mode, allowing us to process any pending deletes again. If the user's repository became unusable then they could restore it from such a backup and Elasticsearch would be able to revert the repository back to the state it was in when the backup was started.
(We'd need to carefully distinguish the case where the repository was restored from backup vs the case where the repository remained intact and the cluster itself was rebuilt, since in the latter case we should be able to expose later snapshots to the user -- it would not be acceptable to always automatically revert the repository back to the start of "backup mode".)
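To make the proposed semantics concrete, here is a minimal toy model in Python. It is only a sketch of the idea, not Elasticsearch code: the class `ToyRepository`, its method names, the JSON layout of the index file, and the copy name `backup-index` are all hypothetical, and each snapshot is reduced to a single blob. It shows the two behaviours described above: the current index-N file is captured on entering backup mode, and deletes of blobs it references are deferred until backup mode is left.

```python
import json


class ToyRepository:
    """Hypothetical in-memory stand-in for a snapshot repository blob store."""

    def __init__(self):
        self.generation = 0
        # index-N lists the blobs referenced by the current repository state.
        self.blobs = {"index-0": json.dumps([]).encode()}
        self.pinned = None            # blobs protected while in backup mode
        self.pending_deletes = set()  # deletes deferred by backup mode

    def current_refs(self):
        return json.loads(self.blobs[f"index-{self.generation}"])

    def _write_index(self, refs):
        # Write index-(N+1), then drop the superseded index-N.
        old = f"index-{self.generation}"
        self.generation += 1
        self.blobs[f"index-{self.generation}"] = json.dumps(sorted(refs)).encode()
        self.blobs.pop(old, None)

    def _delete_blob(self, name):
        # In backup mode, blobs referenced by the captured index are kept.
        if self.pinned is not None and name in self.pinned:
            self.pending_deletes.add(name)
        else:
            self.blobs.pop(name, None)

    def add_snapshot(self, name, data):
        self.blobs[name] = data
        self._write_index(set(self.current_refs()) | {name})

    def delete_snapshot(self, name):
        self._write_index(set(self.current_refs()) - {name})
        self._delete_blob(name)

    def enter_backup_mode(self):
        # Capture the current index-N under a new (hypothetical) name.
        self.blobs["backup-index"] = self.blobs[f"index-{self.generation}"]
        self.pinned = set(self.current_refs())

    def leave_backup_mode(self):
        # Process the deferred deletes and discard the captured index.
        self.pinned = None
        for name in self.pending_deletes:
            self.blobs.pop(name, None)
        self.pending_deletes.clear()
        self.blobs.pop("backup-index", None)


repo = ToyRepository()
repo.add_snapshot("snap-a", b"A")
repo.enter_backup_mode()
repo.delete_snapshot("snap-a")     # deferred: "snap-a" is still on disk
repo.add_snapshot("snap-b", b"B")  # new snapshots are still allowed
# ... an external tool would copy repo.blobs here, in any order ...
repo.leave_backup_mode()           # "snap-a" is now really deleted
```

In this model, a copy of the blobs taken at any point between `enter_backup_mode()` and `leave_backup_mode()` contains `backup-index` together with every blob it references, which is the property that would let the repository be reverted to the state it was in when the backup started. The toy does not address how to tell a restored repository apart from an intact one.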