Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WaitForSnapshotStep verifies if the index belongs to the latest snapshot of that SLM policy #100911

Merged

Conversation

gmarouli
Copy link
Contributor

@gmarouli gmarouli commented Oct 16, 2023

The WaitForSnapshotStep used to check if the SLM policy has been executed after the index has entered the delete phase, but it did not check if the SLM policy included this index.

The result of this is that if the user used an SLM policy that did not include this index, when the index would enter the WaitForSnapshotStep, it would wait for a snapshot to be taken, a snapshot that would not include the index, and then ILM would delete the index.

See the exact reproduction path: #57809

Solution
This PR, after it finds a successful SLM run, it verifies if the snapshot taken by SLM contains this index. If not it throws an error, otherwise it proceeds.

ILM explain will report:

"step_info": {
        "type": "illegal_state_exception",
        "reason": "the last successful snapshot of policy 'hourly-snapshots' does not include index '.ds-my-other-stream-2023.10.16-000001'"
      }

Backwards compatibility concerns
In this PR, the WaitForSnapshotStep changed from ClusterStateWaitStep to AsyncWaitStep. We do not think this is gonna cause an issue. This was tested manually by the following steps:

  • Run a master node with the old version.
  • When ILM is executing wait-for-snapshot, we shutdown the node
  • We start the node again with the new version os ES
  • ES was able to pick up the step and continue with the new code.

We believe that this covers bwc concerns.

Fixes: #57809

@gmarouli gmarouli added >enhancement :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Oct 16, 2023
@elasticsearchmachine elasticsearchmachine added v8.12.0 Team:Data Management Meta label for data/management team labels Oct 16, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Copy link
Collaborator

Hi @gmarouli, I've created a changelog YAML for you.

Copy link
Contributor

@joegallo joegallo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code mostly looks very good. I've added a couple of admittedly trivial comments.

@gmarouli
Copy link
Contributor Author

Thank you for review @joegallo, you keep me sharp 🤓 .

@elasticsearchmachine
Copy link
Collaborator

Hi @gmarouli, I've updated the changelog YAML for you.

@gmarouli gmarouli added v7.17.15 auto-backport Automatically create backport pull requests when merged v8.11.1 labels Oct 17, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @gmarouli, I've updated the changelog YAML for you.

@gmarouli
Copy link
Contributor Author

About the backport labels, I think we should try to backport it, I do realise that backporting to 7.17.x might be tricky, that's why I would like to timebox it at least give it a try see if it's worth it.

@joegallo
Copy link
Contributor

joegallo commented Oct 17, 2023

Regarding backporting, I think it should be pretty straightforward (hopefully!), a lot of the ILM code has been pretty stable for a while. A tricky bit is that there's no generally available EmptyInfo on 7.17, I think that was introduced pretty recently via #100179. Maybe it's worth pre-gaming a >non-issue PR that only introduces the EmptyInfo like #100179 did?

edit: or... looking at how it's used in AsyncWaitStep, you should probably just use null there instead for the backport...

Copy link
Contributor

@joegallo joegallo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@gmarouli
Copy link
Contributor Author

@elasticmachine update branch

@gmarouli gmarouli added the auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Oct 18, 2023
@elasticsearchmachine elasticsearchmachine merged commit 5697fcf into elastic:main Oct 18, 2023
13 checks passed
@gmarouli gmarouli deleted the fix-ilm-delete-snapshot-wait-v2 branch October 18, 2023 07:00
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.11
7.17 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 100911

gmarouli added a commit to gmarouli/elasticsearch that referenced this pull request Oct 18, 2023
…pshot of that SLM policy (elastic#100911)

The `WaitForSnapshotStep` used to check if the SLM policy has been
executed after the index has entered the delete phase, but it did not
check if the SLM policy included this index.

The result of this is that if the user used an SLM policy that did not
include this index, when the index would enter the
`WaitForSnapshotStep`, it would wait for a snapshot to be taken, a
snapshot that would not include the index, and then ILM would delete the
index.

See the exact reproduction path:
elastic#57809

**Solution** This PR, after it finds a successful SLM run, it verifies
if the snapshot taken by SLM contains this index. If not it throws an
error, otherwise it proceeds.

ILM explain will report:

```
"step_info": {
        "type": "illegal_state_exception",
        "reason": "the last successful snapshot of policy 'hourly-snapshots' does not include index '.ds-my-other-stream-2023.10.16-000001'"
      }
```

**Backwards compatibility concerns** In this PR, the
`WaitForSnapshotStep` changed from `ClusterStateWaitStep` to
`AsyncWaitStep`. We do not think this is gonna cause an issue. This was
tested manually by the following steps: - Run a master node with the old
version. - When ILM is executing `wait-for-snapshot`, we shutdown the
node - We start the node again with the new version os ES - ES was able
to pick up the step and continue with the new code.

We believe that this covers bwc concerns.

Fixes: elastic#57809
@gmarouli
Copy link
Contributor Author

💚 All backports created successfully

Status Branch Result
7.17

Questions ?

Please refer to the Backport tool documentation

gmarouli added a commit to gmarouli/elasticsearch that referenced this pull request Oct 18, 2023
…pshot of that SLM policy (elastic#100911)

The `WaitForSnapshotStep` used to check if the SLM policy has been
executed after the index has entered the delete phase, but it did not
check if the SLM policy included this index.

The result of this is that if the user used an SLM policy that did not
include this index, when the index would enter the
`WaitForSnapshotStep`, it would wait for a snapshot to be taken, a
snapshot that would not include the index, and then ILM would delete the
index.

See the exact reproduction path:
elastic#57809

**Solution** This PR, after it finds a successful SLM run, it verifies
if the snapshot taken by SLM contains this index. If not it throws an
error, otherwise it proceeds.

ILM explain will report:

```
"step_info": {
        "type": "illegal_state_exception",
        "reason": "the last successful snapshot of policy 'hourly-snapshots' does not include index '.ds-my-other-stream-2023.10.16-000001'"
      }
```

**Backwards compatibility concerns** In this PR, the
`WaitForSnapshotStep` changed from `ClusterStateWaitStep` to
`AsyncWaitStep`. We do not think this is gonna cause an issue. This was
tested manually by the following steps: - Run a master node with the old
version. - When ILM is executing `wait-for-snapshot`, we shutdown the
node - We start the node again with the new version os ES - ES was able
to pick up the step and continue with the new code.

We believe that this covers bwc concerns.

Fixes: elastic#57809
(cherry picked from commit 5697fcf)

# Conflicts:
#	x-pack/plugin/core/src/test/java/org/elasticsearch/xpack/core/ilm/WaitForSnapshotStepTests.java
elasticsearchmachine pushed a commit that referenced this pull request Oct 18, 2023
…pshot of that SLM policy (#100911) (#101027)

The `WaitForSnapshotStep` used to check if the SLM policy has been
executed after the index has entered the delete phase, but it did not
check if the SLM policy included this index.

The result of this is that if the user used an SLM policy that did not
include this index, when the index would enter the
`WaitForSnapshotStep`, it would wait for a snapshot to be taken, a
snapshot that would not include the index, and then ILM would delete the
index.

See the exact reproduction path:
#57809

**Solution** This PR, after it finds a successful SLM run, it verifies
if the snapshot taken by SLM contains this index. If not it throws an
error, otherwise it proceeds.

ILM explain will report:

```
"step_info": {
        "type": "illegal_state_exception",
        "reason": "the last successful snapshot of policy 'hourly-snapshots' does not include index '.ds-my-other-stream-2023.10.16-000001'"
      }
```

**Backwards compatibility concerns** In this PR, the
`WaitForSnapshotStep` changed from `ClusterStateWaitStep` to
`AsyncWaitStep`. We do not think this is gonna cause an issue. This was
tested manually by the following steps: - Run a master node with the old
version. - When ILM is executing `wait-for-snapshot`, we shutdown the
node - We start the node again with the new version os ES - ES was able
to pick up the step and continue with the new code.

We believe that this covers bwc concerns.

Fixes: #57809
gmarouli added a commit that referenced this pull request Oct 18, 2023
…est snapshot of that SLM policy (#100911) (#101030)

* `WaitForSnapshotStep` verifies if the index belongs to the latest snapshot of that SLM policy (#100911)

The `WaitForSnapshotStep` used to check if the SLM policy has been
executed after the index has entered the delete phase, but it did not
check if the SLM policy included this index.

The result of this is that if the user used an SLM policy that did not
include this index, when the index would enter the
`WaitForSnapshotStep`, it would wait for a snapshot to be taken, a
snapshot that would not include the index, and then ILM would delete the
index.

See the exact reproduction path:
#57809

**Solution** This PR, after it finds a successful SLM run, it verifies
if the snapshot taken by SLM contains this index. If not it throws an
error, otherwise it proceeds.

ILM explain will report:

```
"step_info": {
        "type": "illegal_state_exception",
        "reason": "the last successful snapshot of policy 'hourly-snapshots' does not include index '.ds-my-other-stream-2023.10.16-000001'"
      }
```

**Backwards compatibility concerns** In this PR, the
`WaitForSnapshotStep` changed from `ClusterStateWaitStep` to
`AsyncWaitStep`. We do not think this is gonna cause an issue. This was
tested manually by the following steps: - Run a master node with the old
version. - When ILM is executing `wait-for-snapshot`, we shutdown the
node - We start the node again with the new version os ES - ES was able
to pick up the step and continue with the new code.

We believe that this covers bwc concerns.

Fixes: #57809
(cherry picked from commit 5697fcf)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >bug :Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team v7.17.15 v8.11.1 v8.12.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"Wait for snapshot" action in Delete phase of ILM doesn't check if index was backed up
5 participants