
"Wait for snapshot" action in Delete phase of ILM doesn't check if index was backed up #57809

Closed
yuliacech opened this issue Jun 8, 2020 · 3 comments · Fixed by #100911

Labels: :Data Management/ILM+SLM (Index and Snapshot lifecycle management) · >enhancement · Team:Data Management (Meta label for data/management team)

Comments

@yuliacech (Contributor)

Hello team,

while adding a field for the "wait for snapshot policy" to the Delete phase in the Index Lifecycle Management UI, I noticed that this action does not in fact ensure that a snapshot of the index exists before the index is deleted. This can lead to irreversible data loss of documents in the managed index.

How to recreate this behaviour:

  1. Create a repository for snapshots
  2. Create a snapshot policy that backs up an index
  3. Create a different index (this is the managed index) with an alias for rollover
  4. Create an index lifecycle policy that deletes the managed index with "wait for snapshot" option
  5. After the conditions for the Delete phase are met and the snapshot policy has executed, the managed index is deleted. The deleted index can't be restored, because the snapshot contains a different index.
Console commands:
PUT /_snapshot/my_repo
{
  "type": "fs",
  "settings": {
    "location": "./my_repo_test",
    "compress": true
  }
}
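
(Not part of the original steps: the repository can be verified up front with the standard verify-repository API, which confirms the nodes can write to the location:)

POST /_snapshot/my_repo/_verify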

(index to be backed up is my_snapshot_index; snapshots are created every minute and deleted after 10 minutes)

PUT _slm/policy/my_snapshot_policy
{
  "name": "<snapshot-{now}>",
  "schedule": "0 * * * * ?",
  "repository": "my_repo",
  "config": {
    "indices": [
      "my_snapshot_index"
    ]
  },
  "retention": {
    "expire_after": "10m",
    "min_count": 1,
    "max_count": 3
  }
}
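
(Convenience addition, not part of the original steps: the policy can be triggered on demand with the SLM execute API instead of waiting for the schedule, and the get-policy API shows last_success afterwards:)

POST /_slm/policy/my_snapshot_policy/_execute

GET /_slm/policy/my_snapshot_policy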

PUT /my_test_index-1
PUT /my_test_index-1/_alias/my_alias

(rollover after 3 docs; delete after my_snapshot_policy has created a snapshot)

PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "30d",
            "max_docs": 3
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "delete": {
        "min_age": "0d",
        "actions": {
          "wait_for_snapshot": {
            "policy": "my_snapshot_policy"
          },
          "delete": {}
        }
      }
    }
  }
}
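
(One piece of wiring is implicit in the steps above: the managed index has to reference the ILM policy and the rollover alias, or ILM never manages it. A sketch of the missing calls, using the standard index.lifecycle.* settings; index three documents to hit max_docs, then watch the index with ILM explain:)

PUT /my_test_index-1/_settings
{
  "index.lifecycle.name": "my_policy",
  "index.lifecycle.rollover_alias": "my_alias"
}

POST /my_alias/_doc?refresh
{
  "message": "repeat three times to trigger rollover"
}

GET /my_test_index-*/_ilm/explain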
@yuliacech added the >enhancement and needs:triage (Requires assignment of a team area label) labels on Jun 8, 2020
@cbuescher added the :Data Management/ILM+SLM (Index and Snapshot lifecycle management) label on Jun 9, 2020
@elasticmachine (Collaborator)

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@elasticmachine added the Team:Data Management (Meta label for data/management team) label on Jun 9, 2020
@cbuescher removed the Team:Data Management and needs:triage labels on Jun 9, 2020
@elasticsearchmachine added the Team:Data Management label on Sep 1, 2023
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)

@gmarouli (Contributor)

After a small investigation, we saw that there are a couple of things that could be done to fix this:

  1. Before we check whether the SLM policy has finished executing, we check whether the policy applies to this index. If it does not, we treat it as a misconfiguration error and inform the user that the policy does not apply to this index, so there is no guarantee that a "fresh" snapshot of it exists. We will not delete the index until the policy is updated to apply to it, or the step is skipped (I need to play around with this to provide the exact steps).
  2. If the policy is correct, we want to make sure that there has been a recent successful run, either by checking that there is a snapshot that includes this index or by checking the cluster state versions. This requires a bit more thought, but it should be possible.

Looking at the two ideas above, the first one is "quite" simple, and it will solve this issue for the majority of our users. We will move forward with that one first and see the impact.
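
(In the meantime, a user can manually confirm whether an SLM policy covers a managed index by inspecting the policy's config.indices; a minimal check, reusing the policy name from the reproduction above:)

GET _slm/policy/my_snapshot_policy

(If config.indices in the response does not match the managed index, as in the reproduction where it lists only my_snapshot_index, wait_for_snapshot offers no protection for that index.)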

ppf2 added a commit that referenced this issue Sep 26, 2023
The current description of wait for snapshot can be misleading/ambiguous.

Ref: #57809
ppf2 added a commit that referenced this issue Sep 26, 2023
Clarify wait_for_snapshot action. It doesn't ensure that a snapshot of the index is available before deletion.

Ref: #57809
elasticsearchmachine pushed a commit that referenced this issue Oct 18, 2023
…pshot of that SLM policy (#100911)

The `WaitForSnapshotStep` used to check if the SLM policy has been
executed after the index has entered the delete phase, but it did not
check if the SLM policy included this index.

The result is that, if the user's SLM policy did not include this index,
the index would enter the `WaitForSnapshotStep`, wait for a snapshot to
be taken (a snapshot that would not include the index), and then ILM
would delete the index.

See the exact reproduction path:
#57809

**Solution** With this PR, after the step finds a successful SLM run, it
verifies that the snapshot taken by SLM contains this index. If not, it
throws an error; otherwise it proceeds.

ILM explain will report:

```
"step_info": {
        "type": "illegal_state_exception",
        "reason": "the last successful snapshot of policy 'hourly-snapshots' does not include index '.ds-my-other-stream-2023.10.16-000001'"
      }
```

**Backwards compatibility concerns** In this PR, the
`WaitForSnapshotStep` changed from `ClusterStateWaitStep` to
`AsyncWaitStep`. We do not think this is going to cause an issue. This
was tested manually with the following steps:
- Run a master node with the old version.
- While ILM is executing `wait-for-snapshot`, shut down the node.
- Start the node again with the new version of ES.
- ES was able to pick up the step and continue with the new code.

We believe that this covers bwc concerns.

Fixes: #57809
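
For anyone who hits the new step_info error, the likely remedy is to widen the SLM policy so its config.indices covers the managed index; a hedged sketch using the policy name from the error above (the schedule, repository, and index pattern here are illustrative, not taken from the PR):

```
PUT _slm/policy/hourly-snapshots
{
  "name": "<snapshot-{now}>",
  "schedule": "0 0 * * * ?",
  "repository": "my_repo",
  "config": {
    "indices": [ ".ds-my-other-stream-*" ]
  }
}
```

If ILM reports the step as errored after the policy update, `POST <index>/_ilm/retry` is the standard way to re-run it.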
gmarouli added a commit to gmarouli/elasticsearch that referenced this issue Oct 18, 2023
…pshot of that SLM policy (elastic#100911)

gmarouli added a commit to gmarouli/elasticsearch that referenced this issue Oct 18, 2023
…pshot of that SLM policy (elastic#100911)

(cherry picked from commit 5697fcf)

# Conflicts:
#	x-pack/plugin/core/src/test/java/org/elasticsearch/xpack/core/ilm/WaitForSnapshotStepTests.java
elasticsearchmachine pushed a commit that referenced this issue Oct 18, 2023
…pshot of that SLM policy (#100911) (#101027)

gmarouli added a commit that referenced this issue Oct 18, 2023
…est snapshot of that SLM policy (#100911) (#101030)

* `WaitForSnapshotStep` verifies if the index belongs to the latest snapshot of that SLM policy (#100911)

(cherry picked from commit 5697fcf)