"Wait for snapshot" action in Delete phase of ILM doesn't check if index was backed up #57809
Labels
:Data Management/ILM+SLM (Index and Snapshot lifecycle management)
>enhancement
Team:Data Management (Meta label for data/management team)
Comments
yuliacech added the >enhancement and needs:triage labels on Jun 8, 2020
cbuescher added the :Data Management/ILM+SLM label on Jun 9, 2020
Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)
cbuescher removed the Team:Data Management and needs:triage labels on Jun 9, 2020
elasticsearchmachine added the Team:Data Management label on Sep 1, 2023
Pinging @elastic/es-data-management (Team:Data Management)
After a small investigation we saw that there are a couple of things that could be done to fix this:
Looking at the two ideas above, the first one is "quite" simple and it will solve this issue for the majority of our users. We will move forward with that one first and we can see the impact.
ppf2 added a commit that referenced this issue on Sep 26, 2023: "The current description of wait for snapshot can be misleading/ambiguous. Ref: #57809"
ppf2 added a commit that referenced this issue on Sep 26, 2023: "Clarify wait_for_snapshot action. It doesn't ensure that a snapshot of the index is available before deletion. Ref: #57809"
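For context, the wait_for_snapshot action lives in the delete phase of an ILM policy and names an SLM policy to wait for; a minimal sketch of such a policy (the policy names here are illustrative):

```
PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "30d",
        "actions": {
          "wait_for_snapshot": { "policy": "my_snapshot_policy" },
          "delete": {}
        }
      }
    }
  }
}
```

As the commit message notes, at the time this only waited for the named SLM policy to complete a run; it did not verify that the run actually included the index being deleted.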
This was referenced Oct 13, 2023
elasticsearchmachine pushed a commit that referenced this issue on Oct 18, 2023

`WaitForSnapshotStep` verifies if the index belongs to the latest snapshot of that SLM policy (#100911)

The `WaitForSnapshotStep` used to check whether the SLM policy had been executed after the index entered the delete phase, but it did not check whether the SLM policy included this index. As a result, if the user used an SLM policy that did not include this index, the index would enter the `WaitForSnapshotStep`, wait for a snapshot to be taken (a snapshot that did not include the index), and then ILM would delete the index. See the exact reproduction path: #57809

**Solution**

After finding a successful SLM run, this PR verifies that the snapshot taken by SLM contains this index. If not, it throws an error; otherwise, it proceeds. ILM explain will report:

```
"step_info": {
  "type": "illegal_state_exception",
  "reason": "the last successful snapshot of policy 'hourly-snapshots' does not include index '.ds-my-other-stream-2023.10.16-000001'"
}
```

**Backwards compatibility concerns**

In this PR, the `WaitForSnapshotStep` changed from `ClusterStateWaitStep` to `AsyncWaitStep`. We do not think this is going to cause an issue. This was tested manually with the following steps:

- Run a master node with the old version.
- While ILM is executing `wait-for-snapshot`, shut down the node.
- Start the node again with the new version of ES.
- ES was able to pick up the step and continue with the new code.

We believe that this covers the bwc concerns.

Fixes: #57809
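Given the new step_info above, one way to confirm whether the latest successful run of an SLM policy actually covered a given index is to look up the policy's last successful snapshot and then inspect that snapshot's contents; a minimal sketch, assuming a repository named my_repository and a snapshot name like the one reported under last_success (both illustrative):

```
# The response includes last_success.snapshot_name for the policy
GET _slm/policy/hourly-snapshots

# The snapshot info lists the indices the snapshot actually contains
GET _snapshot/my_repository/hourly-snap-2023.10.16-abc123
```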
gmarouli added a commit to gmarouli/elasticsearch that referenced this issue on Oct 18, 2023: `WaitForSnapshotStep` verifies if the index belongs to the latest snapshot of that SLM policy (elastic#100911); same description as above.
gmarouli added a commit to gmarouli/elasticsearch that referenced this issue on Oct 18, 2023: `WaitForSnapshotStep` verifies if the index belongs to the latest snapshot of that SLM policy (elastic#100911); same description as above (cherry picked from commit 5697fcf, with conflicts in x-pack/plugin/core/src/test/java/org/elasticsearch/xpack/core/ilm/WaitForSnapshotStepTests.java).
elasticsearchmachine pushed a commit that referenced this issue on Oct 18, 2023: `WaitForSnapshotStep` verifies if the index belongs to the latest snapshot of that SLM policy (#100911) (#101027); same description as above.
gmarouli added a commit that referenced this issue on Oct 18, 2023: `WaitForSnapshotStep` verifies if the index belongs to the latest snapshot of that SLM policy (#100911) (#101030); same description as above (cherry picked from commit 5697fcf).
Hello team,
while adding a field for "wait for snapshot policy" to the Delete phase in the Index Lifecycle Management UI, I noticed that this action does not in fact ensure that a snapshot of the index exists before the index is deleted. This can lead to irreversible loss of the documents in the managed index.
How to recreate this behaviour:
- Create an SLM policy, my_snapshot_policy (index to be backed up: my_snapshot_index; snapshots created every minute and deleted after 10 min).
- Create an ILM policy (rollover after 3 docs, delete after my_snapshot_policy created a snapshot), as sketched below.
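The console commands were collapsed in the original issue; a minimal sketch of the setup described above, assuming an fs repository named my_repository and an ILM-managed index that is not among the indices the SLM policy snapshots (only my_snapshot_index and my_snapshot_policy come from the issue; every other name is illustrative):

```
# Snapshot repository (illustrative name and location)
PUT _snapshot/my_repository
{
  "type": "fs",
  "settings": { "location": "backups" }
}

# SLM policy: snapshots only my_snapshot_index every minute, retained for 10 minutes
PUT _slm/policy/my_snapshot_policy
{
  "schedule": "0 * * * * ?",
  "name": "<minute-snap-{now/m}>",
  "repository": "my_repository",
  "config": { "indices": ["my_snapshot_index"] },
  "retention": { "expire_after": "10m" }
}

# ILM policy: rollover after 3 docs, then wait for my_snapshot_policy before deleting
PUT _ilm/policy/my_ilm_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": { "rollover": { "max_docs": 3 } }
      },
      "delete": {
        "min_age": "0ms",
        "actions": {
          "wait_for_snapshot": { "policy": "my_snapshot_policy" },
          "delete": {}
        }
      }
    }
  }
}
```

Because wait_for_snapshot only checked that my_snapshot_policy had run after the index entered the delete phase, an index outside the policy's config.indices could be deleted without ever having been snapshotted.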