Prevent index closing while snapshot is restoring the indices #16321
@ywelsch please could you take a look at this?
I verified the issue on master. It is present in all ES versions. To fix this, there are two possible ways:
@clintongormley wdyt?
At least for this particular use case, this option is desirable because the users didn't realize that the restore was still in progress (they would like the restore to proceed), not because they wanted to cancel a specific restore.
@ppf2 what's your take on closing an index during snapshot (in contrast to restore)? Currently, the close operation succeeds, but we fail the shards that are closed.
For this specific report from the field, the indices did close, but the restore status from the cluster state shows a shard apparently stuck in INIT state. Reopening the index did not resolve the issue, but deleting the index helped get the restore out of that stuck state. So somehow the close operation did not successfully fail the shards, but left the restore procedure in a STARTED state, thinking that it is still initializing the restore on one of the shards.
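For reference, the workaround described above amounts to something like the following sketch; the index name `logs-2016.01` and the localhost endpoint are illustrative, not from the original report:

```sh
# Reopening the closed index did not clear the stuck restore state:
curl -XPOST 'http://localhost:9200/logs-2016.01/_open'

# Deleting the index did get the restore out of the stuck state:
curl -XDELETE 'http://localhost:9200/logs-2016.01'
```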
@ppf2 I think you misunderstood me, I'm not claiming that the close operation successfully failed the restore process. My question was related to the comment of yours saying that it would be preferable to fail closing an index if there are shards of that index still being restored. I wanted to know whether you think the same applies to the snapshot case. So my question was: Should closing an index fail while the index is being snapshotted? My goal here is to find out what the expected behavior should be in the following situations:
@clintongormley care to chime in?
@ywelsch Ah sorry, misinterpreted :) I am thinking that for deletions, we can cancel the snapshot operation (only on shards of that index) and perform the necessary cleanup so that it doesn't end up with partial files in the repository. Same thing with restore: cancel the restore of the shards for that index, and do some cleanup in the target cluster. For closing, maybe we can just prevent the user from doing so and allow the snapshot or restore operation to complete? But yeah, let's get input from @clintongormley @imotov
Here are my initial thoughts: For restore:
For snapshot:
Note: If the user does not want to wait for a non-partial snapshot to finish before executing close/delete, they have to cancel the snapshot operation by deleting the in-progress snapshot.
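Concretely, cancelling an in-progress snapshot is done by deleting it through the snapshot API; the repository and snapshot names here are illustrative:

```sh
# Deleting an in-progress snapshot aborts it and cleans up its partial files
curl -XDELETE 'http://localhost:9200/_snapshot/my_backup/snapshot_1'
```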
+1 on the restore part, but I don't think we should abort the snapshot if an index is closed or deleted. That might lead to unexpected data loss. When users use partial snapshots, they have a certain set of partially available indices in mind (basically, the indices that were partially available at the beginning of the snapshot). The proposed behavior arbitrarily extends this to any index that happens to be copied over when the close or delete operation is performed. I think we should keep a lock on the shard and finish the snapshot operation before closing it.
@imotov Can you elaborate a bit more on the use cases for partial snapshots? The only one I have in mind is the following: as part of a general backup mechanism, hourly/daily snapshots are triggered (e.g. by a cron job). The idea of partial snapshots would then be to snapshot as much as possible, even if some indices/shards are unavailable at that point in time. My proposed solution would be in line with that idea.
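As a sketch of that use case, a cron-driven backup might create partial snapshots like this; the repository name and snapshot name are illustrative:

```sh
# Periodic backup (e.g. from cron): snapshot everything that is available,
# tolerating unavailable shards, by setting "partial": true
curl -XPUT 'http://localhost:9200/_snapshot/my_backup/snap_2016-02-01?wait_for_completion=true' -d '{
  "partial": true
}'
```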
@ywelsch I see. I thought about partial snapshots as an emergency override that someone would deploy during a catastrophic event, when they have a half-working cluster and would like to make a snapshot of whatever they have left before taking some drastic recovery measures. During these highly stressful events, the user might inadvertently close a half backed-up index while thinking that they have a full copy.
@ywelsch I agree with your proposals for restore, but I think we should fail to close or delete an index while a snapshot is in progress (partial or otherwise). I realise that this might mean the user is blocked from deleting an index while a background cron job is doing a snapshot. Perhaps the exception can include a message about how to cancel the current snapshot, and provide the snapshot_id etc. required to perform the cancellation in a script-friendly format. That way, if the delete/close happens in a script, the user can code around the blocked action.
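To illustrate, a script could then code around the blocked action along these lines. This is purely hypothetical: it assumes the proposed exception carried a machine-readable field identifying the blocking snapshot, which does not exist today, and the index and repository names are made up.

```sh
# Hypothetical sketch: try to delete the index; if the delete is blocked by a
# running snapshot, cancel that snapshot and retry. Assumes the (proposed, not
# yet existing) error response exposes the blocking snapshot's name.
response=$(curl -s -XDELETE 'http://localhost:9200/my_index')
snapshot=$(echo "$response" | jq -r '.error.blocking_snapshot // empty')
if [ -n "$snapshot" ]; then
  curl -XDELETE "http://localhost:9200/_snapshot/my_backup/$snapshot"
  curl -XDELETE 'http://localhost:9200/my_index'
fi
```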
@imotov @clintongormley: Let me just recapitulate a bit: For restore, we all agree that:
For snapshot, we disagree. Currently, deleting an index that is being snapshotted results in two outcomes, depending on whether the snapshot was started as partial or not:
In both cases, the delete operation succeeds and takes priority over snapshotting. We currently even have a test verifying this behavior. In light of that, let me make my case again:
In conclusion, the snapshot delete/close case needs more discussion before I feel comfortable implementing it. To avoid blocking too long on this discussion, I can in the meantime make a PR for the restore case.
@ywelsch has convinced me of his argument:
Is there any way, or any set of steps, to avoid closing the index while performing a restore? If closing is the only way, then it makes the user's life miserable to close indexes one by one in order to perform a restore.
On 1.7.1. This may be related to #15432, but it is unclear in #15432 what the cause was, so I am filing a separate ticket for this.
In short, it looks like when a snapshot restore is running, if the indices are closed while it is operating on a shard, it will leave the snapshot restore request in a STARTED state.
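A minimal reproduction of the sequence described above might look like this; the repository, snapshot, and index names are illustrative:

```sh
# Kick off a restore of the snapshot
curl -XPOST 'http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore'

# While shards are still being restored, close one of the target indices;
# on 1.7.1 this can leave the restore request stuck in STARTED
curl -XPOST 'http://localhost:9200/my_index/_close'
```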
And shards that are in INIT state as reported by the restore request:
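(The original status output is elided. On the versions in question, the in-progress restore, including the per-shard states, is reported in the restore section of the cluster state; it can be inspected with a request like the following, with host and port being illustrative.)

```sh
# The in-progress restore, including shards stuck in INIT,
# shows up in the "restore" section of the cluster state
curl -XGET 'http://localhost:9200/_cluster/state?pretty'
```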
So when the end user tries to kick off another restore, it will fail:
This is because only one snapshot restore request can run at any point in time.
It would be a good idea for us to implement a check that prevents users from closing indices while the restore operation is working on their shards; this would help prevent this type of issue.