
"Wait for snapshot" action in Delete phase of ILM doesn't check if index was backed up #57809

Closed
yuliacech opened this issue Jun 8, 2020 · 3 comments · Fixed by #100911

Labels: :Data Management/ILM+SLM (Index and Snapshot lifecycle management) · >enhancement · Team:Data Management (Meta label for data/management team)

Comments

@yuliacech (Contributor)

Hello team,

while adding a field for the "wait for snapshot policy" to the Delete phase in the Index Lifecycle Management UI, I noticed that this action does not in fact ensure that a snapshot of the index exists before the index is deleted. This can lead to irreversible data loss of documents in the managed index.

How to recreate this behaviour:

  1. Create a repository for snapshots
  2. Create a snapshot policy that backs up an index
  3. Create a different index (this is the managed index) with an alias for rollover
  4. Create an index lifecycle policy that deletes the managed index with "wait for snapshot" option
  5. After the conditions for the Delete phase are met and the snapshot policy has executed, the managed index is deleted. The deleted index can't be restored, because the snapshot contains a different index.
Console commands:
PUT /_snapshot/my_repo
{
  "type": "fs",
  "settings": {
    "location": "./my_repo_test",
    "compress": true
  }
}
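
(Not part of the original steps: the repository can be verified up front with the standard verify-repository API, which confirms the nodes can write to the location:)

POST /_snapshot/my_repo/_verify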

(index to be backed up is my_snapshot_index; snapshots are created every minute and deleted after 10 minutes)

PUT _slm/policy/my_snapshot_policy
{
  "name": "<snapshot-{now}>",
  "schedule": "0 * * * * ?",
  "repository": "my_repo",
  "config": {
    "indices": [
      "my_snapshot_index"
    ]
  },
  "retention": {
    "expire_after": "10m",
    "min_count": 1,
    "max_count": 3
  }
}
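
(Convenience addition, not part of the original steps: the policy can be triggered on demand with the SLM execute API instead of waiting for the schedule, and the get-policy API shows last_success afterwards:)

POST /_slm/policy/my_snapshot_policy/_execute

GET /_slm/policy/my_snapshot_policy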

PUT /my_test_index-1
PUT /my_test_index-1/_alias/my_alias

(rollover after 3 docs; delete after my_snapshot_policy has created a snapshot)

PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "30d",
            "max_docs": 3
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "delete": {
        "min_age": "0d",
        "actions": {
          "wait_for_snapshot": {
            "policy": "my_snapshot_policy"
          },
          "delete": {}
        }
      }
    }
  }
}
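
(One piece of wiring is implicit in the steps above: the managed index has to reference the ILM policy and the rollover alias, or ILM never manages it. A sketch of the missing calls, using the standard index.lifecycle.* settings; index three documents to hit max_docs, then watch the index with ILM explain:)

PUT /my_test_index-1/_settings
{
  "index.lifecycle.name": "my_policy",
  "index.lifecycle.rollover_alias": "my_alias"
}

POST /my_alias/_doc?refresh
{
  "message": "repeat three times to trigger rollover"
}

GET /my_test_index-*/_ilm/explain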
@yuliacech added the >enhancement and needs:triage (Requires assignment of a team area label) labels on Jun 8, 2020
@cbuescher added the :Data Management/ILM+SLM (Index and Snapshot lifecycle management) label on Jun 9, 2020
@elasticmachine (Collaborator)

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@elasticmachine added the Team:Data Management (Meta label for data/management team) label on Jun 9, 2020
@cbuescher removed the Team:Data Management and needs:triage labels on Jun 9, 2020
@elasticsearchmachine added the Team:Data Management label on Sep 1, 2023
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)

@gmarouli (Contributor)

After a small investigation, we saw that there are a couple of things that could be done to fix this:

  1. Before we check whether the SLM policy has finished executing, we check whether the policy applies to this index. If it does not, we treat it as a misconfiguration error and inform the user that the policy does not apply to this index, so there is no guarantee that a "fresh" snapshot of it exists. We will not delete the index until the policy is updated to apply to it, or the step is skipped (I need to play around with this to provide the exact steps).
  2. If the policy is correct, we want to make sure that there has been a recent successful run, either by checking that there is a snapshot that includes this index or by checking the cluster state versions. This requires a bit more thought, but it should be possible.

Looking at the two ideas above, the first one is "quite" simple, and it will solve this issue for the majority of our users. We will move forward with that one first and see the impact.
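
(In the meantime, a user can manually confirm whether an SLM policy covers a managed index by inspecting the policy's config.indices; a minimal check, reusing the policy name from the reproduction above:)

GET _slm/policy/my_snapshot_policy

(If config.indices in the response does not match the managed index, as in the reproduction where it lists only my_snapshot_index, wait_for_snapshot offers no protection for that index.)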

ppf2 added a commit that referenced this issue Sep 26, 2023
The current description of wait for snapshot can be misleading/ambiguous.

Ref: #57809
ppf2 added a commit that referenced this issue Sep 26, 2023
Clarify wait_for_snapshot action. It doesn't ensure that a snapshot of the index is available before deletion.

Ref: #57809
elasticsearchmachine pushed a commit that referenced this issue Oct 18, 2023
…pshot of that SLM policy (#100911)

The `WaitForSnapshotStep` used to check if the SLM policy has been
executed after the index has entered the delete phase, but it did not
check if the SLM policy included this index.

The result is that, if the user's SLM policy did not include this index,
the index would enter the `WaitForSnapshotStep`, wait for a snapshot to
be taken (a snapshot that would not include the index), and then ILM
would delete the index.

See the exact reproduction path:
#57809

**Solution** With this PR, after the step finds a successful SLM run, it
verifies that the snapshot taken by SLM contains this index. If not, it
throws an error; otherwise it proceeds.

ILM explain will report:

```
"step_info": {
        "type": "illegal_state_exception",
        "reason": "the last successful snapshot of policy 'hourly-snapshots' does not include index '.ds-my-other-stream-2023.10.16-000001'"
      }
```

**Backwards compatibility concerns** In this PR, the
`WaitForSnapshotStep` changed from `ClusterStateWaitStep` to
`AsyncWaitStep`. We do not think this is going to cause an issue. This
was tested manually with the following steps:
- Run a master node with the old version.
- While ILM is executing `wait-for-snapshot`, shut down the node.
- Start the node again with the new version of ES.
- ES was able to pick up the step and continue with the new code.

We believe that this covers bwc concerns.

Fixes: #57809
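
For anyone who hits the new step_info error, the likely remedy is to widen the SLM policy so its config.indices covers the managed index; a hedged sketch using the policy name from the error above (the schedule, repository, and index pattern here are illustrative, not taken from the PR):

```
PUT _slm/policy/hourly-snapshots
{
  "name": "<snapshot-{now}>",
  "schedule": "0 0 * * * ?",
  "repository": "my_repo",
  "config": {
    "indices": [ ".ds-my-other-stream-*" ]
  }
}
```

If ILM reports the step as errored after the policy update, `POST <index>/_ilm/retry` is the standard way to re-run it.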
gmarouli added a commit to gmarouli/elasticsearch that referenced this issue Oct 18, 2023
…pshot of that SLM policy (elastic#100911)

gmarouli added a commit to gmarouli/elasticsearch that referenced this issue Oct 18, 2023
…pshot of that SLM policy (elastic#100911)

(cherry picked from commit 5697fcf)

# Conflicts:
#	x-pack/plugin/core/src/test/java/org/elasticsearch/xpack/core/ilm/WaitForSnapshotStepTests.java
elasticsearchmachine pushed a commit that referenced this issue Oct 18, 2023
…pshot of that SLM policy (#100911) (#101027)

gmarouli added a commit that referenced this issue Oct 18, 2023
…est snapshot of that SLM policy (#100911) (#101030)

* `WaitForSnapshotStep` verifies if the index belongs to the latest snapshot of that SLM policy (#100911)

(cherry picked from commit 5697fcf)