
Prevent index closing while snapshot is restoring the indices #16321

Closed
ppf2 opened this issue Jan 29, 2016 · 15 comments

Labels
>bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

@ppf2 (Member) commented Jan 29, 2016

Observed on 1.7.1. This may be related to #15432, but it is unclear in #15432 what the cause was, so I am filing a separate ticket for this.

In short, it looks like when a snapshot restore is running and the indices being restored are closed while it is operating on a shard, the restore request is left in a STARTED state:

    "restore" : {
      "snapshots" : [ {
        "snapshot" : "snapshot_name",
        "repository" : "repository_name",
        "state" : "STARTED",

And the shards are left in the INIT state, as reported by the restore status:

    {
      "index" : "pricing_2015121502",
      "shard" : 1,
      "state" : "INIT"
    }

As a result, when the end user tries to kick off another restore, it fails:

failed to restore snapshot
org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [repository_name:snapshot_name] Restore process is already running in this cluster
    at org.elasticsearch.snapshots.RestoreService$1.execute(RestoreService.java:174)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:374)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:196)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:162)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

This is because only one snapshot restore request can run in the cluster at any point in time.

It would be a good idea to implement a check that prevents users from closing indices while a restore operation is working on their shards; that would help prevent this type of issue.
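
For illustration, here is a minimal, self-contained Java sketch of the kind of guard being proposed: reject a close request when any target index still has shards in an in-progress restore. The class, method, and type names are hypothetical and do not correspond to actual Elasticsearch code.

    import java.util.Map;
    import java.util.Set;

    // Hypothetical sketch of the proposed safeguard: reject closing indices whose
    // shards are still being restored. Names and types are illustrative only and
    // do not correspond to real Elasticsearch classes.
    public class CloseGuardSketch {

        enum ShardRestoreState { INIT, STARTED, SUCCESS, FAILURE }

        // Throws if any index in the close request still has a shard whose restore
        // has not reached a terminal state.
        static void checkIndexClosing(Set<String> indicesToClose,
                                      Map<String, Map<Integer, ShardRestoreState>> restoresInProgress) {
            for (String index : indicesToClose) {
                Map<Integer, ShardRestoreState> shards = restoresInProgress.get(index);
                if (shards == null) {
                    continue; // index is not part of any running restore
                }
                for (Map.Entry<Integer, ShardRestoreState> shard : shards.entrySet()) {
                    ShardRestoreState state = shard.getValue();
                    if (state == ShardRestoreState.INIT || state == ShardRestoreState.STARTED) {
                        throw new IllegalStateException("cannot close index [" + index
                                + "]: shard [" + shard.getKey() + "] is still being restored (" + state + ")");
                    }
                }
            }
        }

        public static void main(String[] args) {
            // Shard 1 of pricing_2015121502 is stuck in INIT, so closing must be rejected.
            Map<String, Map<Integer, ShardRestoreState>> restores =
                    Map.of("pricing_2015121502", Map.of(1, ShardRestoreState.INIT));
            try {
                checkIndexClosing(Set.of("pricing_2015121502"), restores);
            } catch (IllegalStateException e) {
                System.out.println(e.getMessage());
            }
        }
    }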

ppf2 added the help wanted, adoptme, :Analytics/Aggregations, :Distributed Coordination/Snapshot/Restore, and >bug labels and removed the :Analytics/Aggregations, help wanted, and adoptme labels on Jan 29, 2016
@clintongormley (Contributor) commented:

@ywelsch, could you please take a look at this?

@ywelsch (Contributor) commented Feb 3, 2016

I verified the issue on master. It is present on all ES versions.

There are two possible ways to fix this:

  • Fail closing an index if there are shards of that index still being restored.
  • Always let the index close operation succeed but fail the restore process of the shards of that index. This is more in line with what we currently do for index deletes that happen during the snapshot restore process.

@clintongormley wdyt?

@ppf2 (Member, Author) commented Feb 3, 2016

> Fail closing an index if there are shards of that index still being restored.

At least for this particular use case, this option is desirable: the user did not realize that the restore was still in progress (and would like the restore to proceed), rather than wanting to cancel a specific restore.

@ywelsch (Contributor) commented Feb 4, 2016

@ppf2 what's your take on closing an index during a snapshot (as opposed to a restore)? Currently, the close operation succeeds but we fail the shards that are closed.

@ppf2 (Member, Author) commented Feb 4, 2016

> Currently, the close operation succeeds but we fail the shards that are closed.

For this specific report from the field, the indices did close, but the restore status in the cluster state showed a shard apparently stuck in the INIT state. Reopening the index did not resolve the issue, but deleting the index got the restore out of that stuck state. So the close operation somehow did not fail the shards; it left the restore procedure in a STARTED state, as if it were still initializing the restore on one of the shards:

    {
      "index" : "pricing_2015121502",
      "shard" : 1,
      "state" : "INIT"
    }

@ywelsch (Contributor) commented Feb 4, 2016

@ppf2 I think you misunderstood me; I'm not claiming that the close operation successfully failed the restore process. My question was related to your comment saying that it would be preferable to fail closing an index if there are shards of that index still being restored. I wanted to know whether you think the same applies to the snapshot case. So my question was: should closing an index fail while the index is being snapshotted?

My goal here is to find out what the expected behavior should be in the following situations:

  • Deleting an index during snapshot operation of that index
  • Closing an index during snapshot operation of that index
  • Deleting an index during restore operation of that index
  • Closing an index during restore operation of that index

@clintongormley care to chime in?

@ppf2 (Member, Author) commented Feb 4, 2016

@ywelsch Ah sorry misinterpreted :)

I am thinking that for deletions, we can cancel the snapshot operation (only on shards of that index) and perform the necessary cleanup so that it doesn't end up with partial files in the repository. Same for restore: cancel the restore of the shards of that index and do some cleanup in the target cluster. For closing, maybe we can just prevent the close and allow the snapshot or restore operation to complete? But yeah, let's get input from @clintongormley @imotov.

@ywelsch (Contributor) commented Feb 5, 2016

Here are my initial thoughts:

For restore:

  • Closing an index that is being restored from a snapshot should fail the close operation. There is no good reason to allow this, as closing an index that is being restored makes the index unusable (it cannot be recovered). What the user probably intends to do is to delete the index.
  • Deleting an index that is being restored should cancel the restore operation for that index and delete the index; this gives a simple way to cancel the restore process for a single index. An open point for me is whether the restore information returned after the restore process finishes should take the deleted index into account in its statistics and status.

For snapshot:

  • Closing an index that is being snapshotted should only succeed if the snapshot is marked as partial. If so, the index should be closed and the snapshot should abort snapshotting the shards of that index. If shards of the closed index have already been successfully snapshotted in the meantime, they remain in the snapshot.
  • Deleting an index that is being snapshotted should only succeed if the snapshot is marked as partial. If so, the index should be deleted and the snapshot should abort snapshotting the shards of that index. If shards of the deleted index have already been successfully snapshotted in the meantime, they remain in the snapshot.

Note: if the user does not want to wait for a non-partial snapshot to finish before executing a close/delete, they have to cancel the snapshot operation by deleting the in-progress snapshot.

@imotov (Contributor) commented Feb 5, 2016

+1 on the restore part, but I don't think we should abort the snapshot if an index is closed or deleted. That might lead to unexpected data loss. When users run a partial snapshot, they have a certain set of partially available indices in mind (basically, indices that were partially available at the beginning of the snapshot). The proposed behavior arbitrarily extends this to any index that happens to be mid-copy when the close or delete operation is performed. I think we should keep a lock on the shard and finish the snapshot operation before closing it.

@ywelsch (Contributor) commented Feb 5, 2016

@imotov Can you elaborate a bit more on the use cases for partial snapshots? The only one I have in mind is the following: As part of a general backup mechanism, hourly / daily snapshots are triggered (e.g. by cron job). The idea of partial snapshots would then be to snapshot as much as possible, even if some indices / shards are unavailable at that point in time. My proposed solution would be in line with that idea.

@imotov (Contributor) commented Feb 9, 2016

@ywelsch I see. I thought of partial snapshots as an emergency override that someone would use during a catastrophic event, when they have a half-working cluster and would like to make a snapshot of whatever they have left before taking drastic recovery measures. During these highly stressful events the user might inadvertently close a half-backed-up index while thinking they have a full copy.

@clintongormley (Contributor) commented:

@ywelsch I agree with your proposals for restore, but I think we should fail to close or delete an index while a snapshot is in progress (partial or otherwise).

I realise that this might mean that the user is blocked from deleting an index while a background cron job is doing a snapshot. Perhaps the exception can include a message about how to cancel the current snapshot, and provide the snapshot_id etc. required to perform the cancellation in a script-friendly format. That way, if the delete/close happens in a script, the user can code around the blocked action.

@ywelsch (Contributor) commented Mar 3, 2016

@imotov @clintongormley: Let me just recapitulate a bit:

For restore, we all agree that:

  • Deleting an index that is being restored should cancel the restore operation for that index and delete the index. This is the same behavior as before.
  • Closing an index that is being restored from a snapshot should fail the close operation. The previous behavior was to force-close the index, thereby rendering it unusable, as it could not be recovered anymore upon opening.

For snapshot, we disagree. Currently, deleting an index that is being snapshotted results in two outcomes, depending on whether the snapshot was started as partial or not:

  • If it was started as partial and an index is deleted during snapshotting, the index is deleted and the snapshot still completes, but with snapshot state PARTIAL.
  • If it was started as non-partial, the index is deleted and the snapshot completes with snapshot state FAILED.

In both cases, the delete operation succeeds and takes priority over snapshotting. We currently even have a test (SharedClusterSnapshotRestoreIT.testDeleteIndexDuringSnapshot) that checks exactly both of these scenarios and asserts the current behavior.
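
As a quick illustration of the current behavior just described (an illustrative, self-contained sketch, not Elasticsearch code; the class, enum, and method names are hypothetical), the final snapshot state after an index is deleted mid-snapshot depends only on whether the snapshot was started with the partial flag:

    // Illustrative sketch of the current delete-during-snapshot outcome described
    // above; class, enum, and method names are hypothetical.
    public class DeleteDuringSnapshotSketch {

        enum SnapshotState { SUCCESS, PARTIAL, FAILED }

        // Today the delete always succeeds; the snapshot's final state depends only
        // on whether it was started as a partial snapshot.
        static SnapshotState stateAfterIndexDeleted(boolean startedAsPartial) {
            return startedAsPartial ? SnapshotState.PARTIAL : SnapshotState.FAILED;
        }

        public static void main(String[] args) {
            System.out.println("partial snapshot -> " + stateAfterIndexDeleted(true));  // PARTIAL
            System.out.println("full snapshot    -> " + stateAfterIndexDeleted(false)); // FAILED
        }
    }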

In light of that, let me make my case again:

  • My suggestion is to change the behavior only for the case where the snapshot is not started as partial. In that case I want snapshotting to take priority and the delete to fail, as the user explicitly requested a full snapshot of everything they specified. I apply the same reasoning to close.
  • Your suggestions break much harder with the current way of doing snapshots: you're essentially saying that snapshots should always take priority over deletes/closes (independently of whether they were started as partial or not). As I said before, this does not play nicely with daily background snapshots (e.g. on cloud services). As for the disaster scenario outlined by @imotov, I don't think doing snapshots during a disaster is exactly the scenario we want to optimize for. @clintongormley, I'm missing any argument for why you think we should go this way.

In conclusion, the snapshot delete/close case needs more discussion before I feel comfortable implementing it. To avoid blocking too long on this discussion, I can make a PR for the restore case in the meantime.

@clintongormley (Contributor) commented:

@ywelsch has convinced me of his argument:

  • Deleting an index during restore should cancel the restore of that index
  • Closing an index during restore should fail the close request
  • Closing or deleting an index during a full snapshot should fail the close/delete request
  • Closing or deleting an index during a partial snapshot should succeed and mark the snapshot as partial.
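
Read as a decision table, the agreed behavior can be summarized by the following self-contained Java sketch (illustrative only; the enum and method names are hypothetical and not Elasticsearch code):

    // Hypothetical encoding of the behavior agreed on in this thread. Class,
    // enum, and method names are illustrative only.
    public class AgreedBehaviorSketch {

        enum Operation { CLOSE, DELETE }
        enum RunningProcess { RESTORE, FULL_SNAPSHOT, PARTIAL_SNAPSHOT }
        enum Outcome {
            FAIL_REQUEST,                  // the close/delete request is rejected
            CANCEL_RESTORE_AND_DELETE,     // restore of that index is cancelled, index is deleted
            PROCEED_AND_MARK_PARTIAL       // close/delete succeeds, snapshot is marked as partial
        }

        static Outcome decide(Operation op, RunningProcess process) {
            switch (process) {
                case RESTORE:
                    // Deleting cancels the restore of that index; closing is rejected.
                    return op == Operation.DELETE ? Outcome.CANCEL_RESTORE_AND_DELETE
                                                  : Outcome.FAIL_REQUEST;
                case FULL_SNAPSHOT:
                    // A non-partial snapshot takes priority over both close and delete.
                    return Outcome.FAIL_REQUEST;
                case PARTIAL_SNAPSHOT:
                    // The operation succeeds and the snapshot is marked as partial.
                    return Outcome.PROCEED_AND_MARK_PARTIAL;
                default:
                    throw new IllegalArgumentException("unknown process: " + process);
            }
        }

        public static void main(String[] args) {
            for (RunningProcess process : RunningProcess.values()) {
                for (Operation op : Operation.values()) {
                    System.out.println(op + " during " + process + " -> " + decide(op, process));
                }
            }
        }
    }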

@wonderland14 commented:

Is there any way to avoid closing the index while performing a restore? If closing is the only way, it makes the user's life miserable to close indexes one by one in order to perform a restore.
