Avoid counting snapshot failures twice in SLM #136759
Conversation
We came across a scenario where 3 snapshot failures were counted as 5 "invocations since last success", resulting in a premature yellow SLM health indicator. The three snapshot failures completed at virtually the same time. Our theory is that the listener of the first snapshot failure already processed the other two snapshot failures (incrementing `invocationsSinceLastSuccess`), but the listeners of those other two snapshots then incremented that field too. There were two warning logs indicating that the snapshots weren't found in the registered set, confirming our hypothesis. We simply avoid incrementing `invocationsSinceLastSuccess` if the listener failed with an exception and the snapshot isn't registered anymore, assuming that another listener has already incremented the field.
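For illustration only (the names below, such as `FailureCounter` and `recordFailure`, are hypothetical and not the actual `SnapshotLifecycleTask` code), a minimal sketch of the idea: a failure is only counted by the listener that actually deregisters the snapshot, so a late listener for an already-processed failure becomes a no-op.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the double-count guard, not the actual SnapshotLifecycleTask implementation.
class FailureCounter {
    // Snapshots whose outcome has not been processed yet.
    private final Set<String> registeredSnapshots = ConcurrentHashMap.newKeySet();
    private final AtomicLong invocationsSinceLastSuccess = new AtomicLong();

    void register(String snapshotId) {
        registeredSnapshots.add(snapshotId);
    }

    // Called from each snapshot-failure listener.
    void recordFailure(String snapshotId) {
        // Set#remove returns true only for the caller that actually deregisters the snapshot;
        // a listener that finds it already gone assumes another listener has counted the failure.
        if (registeredSnapshots.remove(snapshotId)) {
            invocationsSinceLastSuccess.incrementAndGet();
        }
    }

    long invocationsSinceLastSuccess() {
        return invocationsSinceLastSuccess.get();
    }
}
```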
Pinging @elastic/es-data-management (Team:Data Management)

Hi @nielsbauman, I've created a changelog YAML for you.
Pull Request Overview

Prevents double-counting snapshot failures toward `invocationsSinceLastSuccess` in SLM when multiple snapshot failure listeners process the same failures concurrently.

- Introduces a `snapshotIsRegistered` flag to gate incrementing `invocationsSinceLastSuccess`
- Adds `initiatingSnapshot` to the test setup for the failure cleanup scenario

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTask.java | Adds a `snapshotIsRegistered` boolean and guards the failure invocation increment to avoid double counting. |
| x-pack/plugin/slm/src/test/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTaskTests.java | Updates the test input list to include `initiatingSnapshot` for scenario coverage. |
x-pack/plugin/slm/src/test/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTaskTests.java

```diff
                 snapshotInfoFailure1.snapshotId(),
-                snapshotInfoFailure2.snapshotId()
+                snapshotInfoFailure2.snapshotId(),
+                initiatingSnapshot
```
This test was failing with my other changes. I think this test was originally wrong; since this `WriteJobStatus` runs for `initiatingSnapshot`, it should also be present in the registered snapshots. Please correct me if my reasoning is wrong!
I think you are right, the test was passing before because of this exact bug that you are fixing.
Thanks Niels for looking into this! The fix looks good; my only comment is that it's probably worth adding a test to cover this scenario to prevent regression (and to absolutely prove our theory!). You can check out `SLMSnapshotBlockingIntegTests` and `SLMStatDisruptionIT` for some similar test examples.
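As an aside (an editorial sketch against the hypothetical `FailureCounter` shown earlier, not the real SLM test infrastructure such as `SLMStatDisruptionIT`), a regression-style check for the double-count scenario would replay the same failure through two listeners and assert it is counted only once:

```java
// Editorial sketch only; exercises the hypothetical FailureCounter above, not the real SLM classes.
public class DoubleCountRegressionSketch {
    public static void main(String[] args) {
        FailureCounter counter = new FailureCounter();
        counter.register("snap-1");

        // First listener processes the failure of snap-1 and deregisters it.
        counter.recordFailure("snap-1");
        // A second, late listener reports the same failure again.
        counter.recordFailure("snap-1");

        // Without the registered-set guard this would be 2; with it, the failure is counted once.
        if (counter.invocationsSinceLastSuccess() != 1) {
            throw new AssertionError("snapshot failure was double-counted");
        }
        System.out.println("failure counted exactly once");
    }
}
```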
Thanks for having a look, @samxbr. I think that's a great suggestion. However, I've been trying to create an integration test for some time now, but haven't managed to do so... I tried in both the test classes you suggested, and I came closest in … The timing is pretty tricky, as the …
          
Have you tried using …? I am approving this PR for now, feel free to merge if the test is taking too long (I can give it a try later when I have time too). I don't want to block this fix because it can potentially prevent some OBS alerts.
Yeah, I tried using …

elasticsearch/server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java, Lines 1914 to 1930 in c832243

`cleanupOldMetadata` is where we call `FinalizeSnapshotContext#onDone`.

So, I modified …:

```diff
diff --git a/x-pack/plugin/slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMStatDisruptionIT.java b/x-pack/plugin/slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMStatDisruptionIT.java
index 7c9eb5652e9..6e3333a1910 100644
--- a/x-pack/plugin/slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMStatDisruptionIT.java
+++ b/x-pack/plugin/slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMStatDisruptionIT.java
@@ -260,13 +260,12 @@ public class SLMStatDisruptionIT extends AbstractSnapshotIntegTestCase {
                 fsc.clusterMetadata(),
                 fsc.snapshotInfo(),
                 fsc.repositoryMetaVersion(),
-                fsc,
-                () -> {
+                ActionListener.runBefore(fsc, () -> {
                     // run the passed lambda before calling the usual callback
                     // this is where the cluster can be restarted before SLM is called back with the snapshotInfo
                     beforeResponseRunnable.run();
-                    fsc.onDone();
-                }
+                }),
+                fsc::onDone
             );
             super.finalizeSnapshot(newFinalizeContext);
         }
```

which does what I want; it actually blocks the SLM listener from being run. However, that results in another issue. Due to this line:

elasticsearch/server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java, Line 998 in 74cf26b

we can only "finalize" one snapshot at a time per repository. That means that if we block the SLM listener until the second snapshot is also registered as failed, we get a deadlock, as the second snapshot will never be registered as failed if the first one isn't fully "finalized" yet. In short, what we need is the ability to block somewhere before we submit this cluster state task:

elasticsearch/x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTask.java, Lines 300 to 303 in 74cf26b

but I wasn't able to find a way to do that. If you have any suggestions, I'd be more than happy to hear them! I'll go ahead and merge this PR tomorrow. We can always add a test retrospectively if we come up with one.
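For context only (a generic sketch of the listener-wrapping pattern used in the diff above; Elasticsearch's `ActionListener.runBefore` is the real counterpart, and this is not its source), the idea is that the hook runs immediately before the wrapped listener is notified on either path, which is what holds back the SLM callback in the test:

```java
// Generic sketch of a "run a hook before delegating" listener, analogous in spirit to the
// ActionListener.runBefore wrapping in the diff above; not Elasticsearch's implementation.
interface Listener<T> {
    void onResponse(T response);

    void onFailure(Exception e);

    static <T> Listener<T> runBefore(Listener<T> delegate, Runnable before) {
        return new Listener<T>() {
            @Override
            public void onResponse(T response) {
                before.run();               // e.g. restart a node before the delegate sees the result
                delegate.onResponse(response);
            }

            @Override
            public void onFailure(Exception e) {
                before.run();
                delegate.onFailure(e);
            }
        };
    }
}
```

Wrapping the delegate this way means the hook runs on both the success and failure paths before the wrapped listener is invoked, which is why it can hold the listener back in the test.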
    
          💚 Backport successful