
Prevent index closing while snapshot is restoring the indices #16321

Closed
ppf2 opened this issue Jan 29, 2016 · 15 comments

Labels
>bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

@ppf2 (Member) commented Jan 29, 2016

Observed on 1.7.1. This may be related to #15432, but it is unclear in #15432 what the cause was, so I am filing a separate ticket for this.

In short, it looks like when a snapshot restore is running and the indices being restored are closed while it is operating on a shard, the restore request is left in a STARTED state:

    "restore" : {
      "snapshots" : [ {
        "snapshot" : "snapshot_name",
        "repository" : "repository_name",
        "state" : "STARTED",

And the shards are left in the INIT state, as reported by the restore status:

    {
      "index" : "pricing_2015121502",
      "shard" : 1,
      "state" : "INIT"
    }

As a result, when the end user tries to kick off another restore, it fails:

failed to restore snapshot
org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [repository_name:snapshot_name] Restore process is already running in this cluster
    at org.elasticsearch.snapshots.RestoreService$1.execute(RestoreService.java:174)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:374)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:196)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:162)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

This is because only one snapshot restore request can run in the cluster at any point in time.

It would be a good idea to implement a check that prevents users from closing indices while a restore operation is working on their shards; that would help prevent this type of issue.
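
For illustration, here is a minimal, self-contained Java sketch of the kind of guard being proposed: reject a close request when any target index still has shards in an in-progress restore. The class, method, and type names are hypothetical and do not correspond to actual Elasticsearch code.

    import java.util.Map;
    import java.util.Set;

    // Hypothetical sketch of the proposed safeguard: reject closing indices whose
    // shards are still being restored. Names and types are illustrative only and
    // do not correspond to real Elasticsearch classes.
    public class CloseGuardSketch {

        enum ShardRestoreState { INIT, STARTED, SUCCESS, FAILURE }

        // Throws if any index in the close request still has a shard whose restore
        // has not reached a terminal state.
        static void checkIndexClosing(Set<String> indicesToClose,
                                      Map<String, Map<Integer, ShardRestoreState>> restoresInProgress) {
            for (String index : indicesToClose) {
                Map<Integer, ShardRestoreState> shards = restoresInProgress.get(index);
                if (shards == null) {
                    continue; // index is not part of any running restore
                }
                for (Map.Entry<Integer, ShardRestoreState> shard : shards.entrySet()) {
                    ShardRestoreState state = shard.getValue();
                    if (state == ShardRestoreState.INIT || state == ShardRestoreState.STARTED) {
                        throw new IllegalStateException("cannot close index [" + index
                                + "]: shard [" + shard.getKey() + "] is still being restored (" + state + ")");
                    }
                }
            }
        }

        public static void main(String[] args) {
            // Shard 1 of pricing_2015121502 is stuck in INIT, so closing must be rejected.
            Map<String, Map<Integer, ShardRestoreState>> restores =
                    Map.of("pricing_2015121502", Map.of(1, ShardRestoreState.INIT));
            try {
                checkIndexClosing(Set.of("pricing_2015121502"), restores);
            } catch (IllegalStateException e) {
                System.out.println(e.getMessage());
            }
        }
    }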

ppf2 added the help wanted, adoptme, :Analytics/Aggregations, :Distributed Coordination/Snapshot/Restore, and >bug labels and removed the :Analytics/Aggregations, help wanted, and adoptme labels on Jan 29, 2016
@clintongormley (Contributor) commented:

@ywelsch, could you please take a look at this?

@ywelsch (Contributor) commented Feb 3, 2016

I verified the issue on master. It is present on all ES versions.

There are two possible ways to fix this:

  • Fail closing an index if there are shards of that index still being restored.
  • Always let the index close operation succeed but fail the restore process of the shards of that index. This is more in line with what we currently do for index deletes that happen during the snapshot restore process.

@clintongormley wdyt?

@ppf2 (Member, Author) commented Feb 3, 2016

> Fail closing an index if there are shards of that index still being restored.

At least for this particular use case, this option is desirable: the user did not realize that the restore was still in progress (and would like the restore to proceed), rather than wanting to cancel a specific restore.

@ywelsch (Contributor) commented Feb 4, 2016

@ppf2 what's your take on closing an index during a snapshot (as opposed to a restore)? Currently, the close operation succeeds but we fail the shards that are closed.

@ppf2 (Member, Author) commented Feb 4, 2016

> Currently, the close operation succeeds but we fail the shards that are closed.

For this specific report from the field, the indices did close, but the restore status in the cluster state showed a shard apparently stuck in the INIT state. Reopening the index did not resolve the issue, but deleting the index got the restore out of that stuck state. So the close operation somehow did not fail the shards; it left the restore procedure in a STARTED state, as if it were still initializing the restore on one of the shards:

    {
      "index" : "pricing_2015121502",
      "shard" : 1,
      "state" : "INIT"
    }

@ywelsch (Contributor) commented Feb 4, 2016

@ppf2 I think you misunderstood me; I'm not claiming that the close operation successfully failed the restore process. My question was related to your comment saying that it would be preferable to fail closing an index if there are shards of that index still being restored. I wanted to know whether you think the same applies to the snapshot case. So my question was: should closing an index fail while the index is being snapshotted?

My goal here is to find out what the expected behavior should be in the following situations:

  • Deleting an index during snapshot operation of that index
  • Closing an index during snapshot operation of that index
  • Deleting an index during restore operation of that index
  • Closing an index during restore operation of that index

@clintongormley care to chime in?

@ppf2 (Member, Author) commented Feb 4, 2016

@ywelsch Ah sorry misinterpreted :)

I am thinking that for deletions, we can cancel the snapshot operation (only on shards of that index) and perform the necessary cleanup so that it doesn't end up with partial files in the repository. Same for restore: cancel the restore of the shards of that index and do some cleanup in the target cluster. For closing, maybe we can just prevent the close and allow the snapshot or restore operation to complete? But yeah, let's get input from @clintongormley @imotov.

@ywelsch (Contributor) commented Feb 5, 2016

Here are my initial thoughts:

For restore:

  • Closing an index that is being restored from a snapshot should fail the close operation. There is no good reason to allow this, as closing an index that is being restored makes the index unusable (it cannot be recovered). What the user probably intends to do is to delete the index.
  • Deleting an index that is being restored should cancel the restore operation for that index and delete the index; this gives a simple way to cancel the restore process for a single index. An open point for me is whether the restore information returned after the restore process finishes should take the deleted index into account in its statistics and status.

For snapshot:

  • Closing an index that is being snapshotted should only succeed if the snapshot is marked as partial. If so, the index should be closed and the snapshot should abort snapshotting the shards of that index. If shards of the closed index have already been successfully snapshotted in the meantime, they remain in the snapshot.
  • Deleting an index that is being snapshotted should only succeed if the snapshot is marked as partial. If so, the index should be deleted and the snapshot should abort snapshotting the shards of that index. If shards of the deleted index have already been successfully snapshotted in the meantime, they remain in the snapshot.

Note: if the user does not want to wait for a non-partial snapshot to finish before executing a close/delete, they have to cancel the snapshot operation by deleting the in-progress snapshot.

@imotov (Contributor) commented Feb 5, 2016

+1 on the restore part, but I don't think we should abort the snapshot if an index is closed or deleted. That might lead to unexpected data loss. When users run a partial snapshot, they have a certain set of partially available indices in mind (basically, indices that were partially available at the beginning of the snapshot). The proposed behavior arbitrarily extends this to any index that happens to be mid-copy when the close or delete operation is performed. I think we should keep a lock on the shard and finish the snapshot operation before closing it.

@ywelsch (Contributor) commented Feb 5, 2016

@imotov Can you elaborate a bit more on the use cases for partial snapshots? The only one I have in mind is the following: As part of a general backup mechanism, hourly / daily snapshots are triggered (e.g. by cron job). The idea of partial snapshots would then be to snapshot as much as possible, even if some indices / shards are unavailable at that point in time. My proposed solution would be in line with that idea.

@imotov (Contributor) commented Feb 9, 2016

@ywelsch I see. I thought of partial snapshots as an emergency override that someone would use during a catastrophic event, when they have a half-working cluster and would like to make a snapshot of whatever they have left before taking drastic recovery measures. During these highly stressful events the user might inadvertently close a half-backed-up index while thinking they have a full copy.

@clintongormley (Contributor) commented:

@ywelsch I agree with your proposals for restore, but I think we should fail to close or delete an index while a snapshot is in progress (partial or otherwise).

I realise that this might mean that the user is blocked from deleting an index while a background cron job is doing a snapshot. Perhaps the exception can include a message about how to cancel the current snapshot, and provide the snapshot_id etc. required to perform the cancellation in a script-friendly format. That way, if the delete/close happens in a script, the user can code around the blocked action.

@ywelsch (Contributor) commented Mar 3, 2016

@imotov @clintongormley: Let me just recapitulate a bit:

For restore, we all agree that:

  • Deleting an index that is being restored should cancel the restore operation for that index and delete the index. This is the same behavior as before.
  • Closing an index that is being restored from a snapshot should fail the close operation. The previous behavior was to force-close the index, thereby rendering it unusable, as it could not be recovered anymore upon opening.

For snapshot, we disagree. Currently, deleting an index that is being snapshotted results in two outcomes, depending on whether the snapshot was started as partial or not:

  • If it was started as partial and an index is deleted during snapshotting, the index is deleted and the snapshot still completes, but with snapshot state PARTIAL.
  • If it was started as non-partial, the index is deleted and the snapshot completes with snapshot state FAILED.

In both cases, the delete operation succeeds and takes priority over snapshotting. We currently even have a test (SharedClusterSnapshotRestoreIT.testDeleteIndexDuringSnapshot) that checks exactly both of these scenarios and asserts the current behavior.
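
As a quick illustration of the current behavior just described (an illustrative, self-contained sketch, not Elasticsearch code; the class, enum, and method names are hypothetical), the final snapshot state after an index is deleted mid-snapshot depends only on whether the snapshot was started with the partial flag:

    // Illustrative sketch of the current delete-during-snapshot outcome described
    // above; class, enum, and method names are hypothetical.
    public class DeleteDuringSnapshotSketch {

        enum SnapshotState { SUCCESS, PARTIAL, FAILED }

        // Today the delete always succeeds; the snapshot's final state depends only
        // on whether it was started as a partial snapshot.
        static SnapshotState stateAfterIndexDeleted(boolean startedAsPartial) {
            return startedAsPartial ? SnapshotState.PARTIAL : SnapshotState.FAILED;
        }

        public static void main(String[] args) {
            System.out.println("partial snapshot -> " + stateAfterIndexDeleted(true));  // PARTIAL
            System.out.println("full snapshot    -> " + stateAfterIndexDeleted(false)); // FAILED
        }
    }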

In light of that, let me make my case again:

  • My suggestion is to change the behavior only for the case where the snapshot is not started as partial. In that case I want snapshotting to take priority and the delete to fail, as the user explicitly requested a full snapshot of everything they specified. I apply the same reasoning to close.
  • Your suggestions break much harder with the current way of doing snapshots: you're essentially saying that snapshots should always take priority over deletes/closes (independently of whether they were started as partial or not). As I said before, this does not play nicely with daily background snapshots (e.g. on cloud services). As for the disaster scenario outlined by @imotov, I don't think doing snapshots during a disaster is exactly the scenario we want to optimize for. @clintongormley, I'm missing any argument for why you think we should go this way.

In conclusion, the snapshot delete/close case needs more discussion before I feel comfortable implementing it. To avoid blocking too long on this discussion, I can make a PR for the restore case in the meantime.

@clintongormley (Contributor) commented:

@ywelsch has convinced me of his argument:

  • Deleting an index during restore should cancel the restore of that index
  • Closing an index during restore should fail the close request
  • Closing or deleting an index during a full snapshot should fail the close/delete request
  • Closing or deleting an index during a partial snapshot should succeed and mark the snapshot as partial.
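
Read as a decision table, the agreed behavior can be summarized by the following self-contained Java sketch (illustrative only; the enum and method names are hypothetical and not Elasticsearch code):

    // Hypothetical encoding of the behavior agreed on in this thread. Class,
    // enum, and method names are illustrative only.
    public class AgreedBehaviorSketch {

        enum Operation { CLOSE, DELETE }
        enum RunningProcess { RESTORE, FULL_SNAPSHOT, PARTIAL_SNAPSHOT }
        enum Outcome {
            FAIL_REQUEST,                  // the close/delete request is rejected
            CANCEL_RESTORE_AND_DELETE,     // restore of that index is cancelled, index is deleted
            PROCEED_AND_MARK_PARTIAL       // close/delete succeeds, snapshot is marked as partial
        }

        static Outcome decide(Operation op, RunningProcess process) {
            switch (process) {
                case RESTORE:
                    // Deleting cancels the restore of that index; closing is rejected.
                    return op == Operation.DELETE ? Outcome.CANCEL_RESTORE_AND_DELETE
                                                  : Outcome.FAIL_REQUEST;
                case FULL_SNAPSHOT:
                    // A non-partial snapshot takes priority over both close and delete.
                    return Outcome.FAIL_REQUEST;
                case PARTIAL_SNAPSHOT:
                    // The operation succeeds and the snapshot is marked as partial.
                    return Outcome.PROCEED_AND_MARK_PARTIAL;
                default:
                    throw new IllegalArgumentException("unknown process: " + process);
            }
        }

        public static void main(String[] args) {
            for (RunningProcess process : RunningProcess.values()) {
                for (Operation op : Operation.values()) {
                    System.out.println(op + " during " + process + " -> " + decide(op, process));
                }
            }
        }
    }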

@wonderland14 commented:

Is there any way to avoid closing the index while performing a restore? If closing is the only way, it makes the user's life miserable to close indexes one by one in order to perform a restore.
