
Conversation

@nielsbauman
Contributor

We came across a scenario where 3 snapshot failures were counted as 5 "invocations since last success", resulting in a premature yellow SLM health indicator. The three snapshot failures completed at virtually the same time. Our theory is that the listener of the first snapshot failure already processed the other two snapshot failures (incrementing the invocationsSinceLastSuccess), but the listeners of those other two snapshots then incremented that field too. There were two warning logs indicating that the snapshots weren't found in the registered set, confirming our hypothesis.

We simply avoid incrementing invocationsSinceLastSuccess if the listener failed with an exception and the snapshot isn't registered anymore, assuming that another listener has already incremented the field.
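The guard can be sketched in isolation as follows. This is a minimal plain-Java illustration, not the actual SnapshotLifecycleTask code; the names (onSnapshotFailure, the registered set) are illustrative. The idea is that only the listener that actually deregisters a snapshot counts its failure, so concurrent listeners cannot double count.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class RegisteredGuardDemo {
    static final Set<String> registered = ConcurrentHashMap.newKeySet();
    static final AtomicInteger invocationsSinceLastSuccess = new AtomicInteger();

    static void onSnapshotFailure(String snapshotId) {
        // remove() returns true only for the first caller; later callers see
        // that another listener has already accounted for this failure.
        if (registered.remove(snapshotId)) {
            invocationsSinceLastSuccess.incrementAndGet();
        }
    }

    public static void main(String[] args) {
        registered.add("snap-1");
        registered.add("snap-2");
        registered.add("snap-3");
        // The first listener sweeps up all three failures...
        onSnapshotFailure("snap-1");
        onSnapshotFailure("snap-2");
        onSnapshotFailure("snap-3");
        // ...and the other two listeners then fire for their own snapshots.
        onSnapshotFailure("snap-2");
        onSnapshotFailure("snap-3");
        System.out.println(invocationsSinceLastSuccess.get()); // prints 3, not 5
    }
}
```

Without the `registered.remove` check, the same run would count 5 failures, which is exactly the mismatch described above.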

@nielsbauman nielsbauman requested review from Copilot and samxbr October 17, 2025 16:20
@nielsbauman added the >bug, :Data Management/ILM+SLM, auto-backport, and v9.2.1 labels on Oct 17, 2025
@elasticsearchmachine elasticsearchmachine added v9.3.0 Team:Data Management Meta label for data/management team labels Oct 17, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Collaborator

Hi @nielsbauman, I've created a changelog YAML for you.

Contributor

Copilot AI left a comment


Pull Request Overview

Prevents double-counting snapshot failures toward invocationsSinceLastSuccess in SLM when multiple snapshot failure listeners process the same failures concurrently.

  • Introduces snapshotIsRegistered flag to gate incrementing invocationsSinceLastSuccess
  • Adds initiatingSnapshot to test setup for failure cleanup scenario

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTask.java Adds snapshotIsRegistered boolean and guards failure invocation increment to avoid double counting.
x-pack/plugin/slm/src/test/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTaskTests.java Updates test input list to include initiatingSnapshot for scenario coverage.


  snapshotInfoFailure1.snapshotId(),
- snapshotInfoFailure2.snapshotId()
+ snapshotInfoFailure2.snapshotId(),
+ initiatingSnapshot
Copy link
Contributor Author


This test was failing with my other changes. I think this test was originally wrong; since this WriteJobStatus runs for initiatingSnapshot, it should also be present in the registered snapshots. Please correct me if my reasoning is wrong!

Copy link
Contributor


I think you are right, the test was passing before because of this exact bug that you are fixing.

Contributor

@samxbr samxbr left a comment


Thanks Niels for looking into this! The fix looks good; my only comment is that it's probably worth adding a test covering this scenario to prevent regression (and to absolutely prove our theory!). You can check out SLMSnapshotBlockingIntegTests and SLMStatDisruptionIT for some similar test examples.

@nielsbauman
Contributor Author

Thanks for having a look, @samxbr. I think that's a great suggestion. However, I've been trying to create an integration test for some time now, but haven't managed to do so... I tried in both of the test classes you suggested, and I came closest in SLMStatDisruptionIT. I wrote a test that failed regularly on main, but not always - i.e. it was flaky.

The timing is pretty tricky, as the SnapshotsService needs to finish two (or more) snapshots before the SLM listeners are invoked. I tried lots of different things, but I don't know the Snapshot and/or SLM code well enough to figure this out. Perhaps you can give it a try if you want?

@samxbr
Contributor

samxbr commented Oct 20, 2025


Have you tried using TestRestartBeforeListenersRepo? I think we can do something similar, add a block after the snapshot is completed but before the listener has been called. In the test we can wait for 2 snapshots to be blocked before the listener, unblock one snapshot listener, wait for it to complete and validate the stats, then unblock the second one and do the same.
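The proposed choreography can be sketched with plain latches. This is a hypothetical illustration (the names gate1/gate2/bothBlocked are invented, and real SLM test plumbing is not shown): two "snapshots" finish and each parks before its listener runs; the test releases them one at a time and could validate stats in between.

```java
import java.util.concurrent.CountDownLatch;

public class BlockBeforeListenerDemo {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch bothBlocked = new CountDownLatch(2);
        CountDownLatch gate1 = new CountDownLatch(1);
        CountDownLatch gate2 = new CountDownLatch(1);
        StringBuilder log = new StringBuilder();

        Thread snap1 = new Thread(() -> {
            bothBlocked.countDown();      // snapshot finished, listener pending
            await(gate1);                 // the test controls when the listener runs
            log.append("listener-1 ");
        });
        Thread snap2 = new Thread(() -> {
            bothBlocked.countDown();
            await(gate2);
            log.append("listener-2 ");
        });
        snap1.start();
        snap2.start();

        bothBlocked.await();              // wait until both are parked pre-listener
        gate1.countDown();                // unblock the first listener...
        snap1.join();                     // ...and validate the stats here
        gate2.countDown();                // then unblock the second and do the same
        snap2.join();
        System.out.println(log.toString().trim());
    }

    static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```

As the later discussion shows, the hard part in the real test is finding a hook where the listeners can actually be parked at this point.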

I am approving this PR for now, feel free to merge if the test is taking too long (I can give it a try later when I have time too). I don't want to block this fix because it can potentially prevent some OBS alerts.

@nielsbauman
Contributor Author

Yeah, I tried using TestRestartBeforeListenersRepo. However, I don't actually believe it does anything useful in its current form: it blocks in FinalizeSnapshotContext.onDone, but that Runnable is only executed after the FinalizeSnapshotContext listener (the action/API/response listener) is invoked. See this code snippet:

// Report success, then clean up.
.<RepositoryData>andThen((l, rootBlobUpdateResult) -> {
    l.onResponse(rootBlobUpdateResult.newRepositoryData());
    cleanupOldMetadata(
        rootBlobUpdateResult.oldRepositoryData(),
        rootBlobUpdateResult.newRepositoryData(),
        finalizeSnapshotContext,
        writeShardGens
    );
})
// Finally subscribe the context as the listener, wrapping exceptions if needed
.addListener(
    finalizeSnapshotContext.delegateResponse(
        (l, e) -> l.onFailure(new SnapshotException(metadata.name(), snapshotId, "failed to update snapshot in repository", e))
    )
);

cleanupOldMetadata is where we call FinalizeSnapshotContext#onDone.
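The ordering problem can be boiled down to a few lines. This is a plain-Java sketch with invented names (finalizeSnapshot, OnDoneOrderingDemo), not the real repository code: the response listener fires first, and the onDone hook only runs afterwards, so a block placed in onDone cannot delay the listener.

```java
import java.util.function.Consumer;

public class OnDoneOrderingDemo {
    // Simplified finalize flow mirroring the snippet above: report success to
    // the listener first, then run cleanup, which is where onDone fires.
    static void finalizeSnapshot(Consumer<String> listener, Runnable onDone) {
        listener.accept("repository-data"); // the listener has already run...
        onDone.run();                       // ...before any block placed here takes effect
    }

    public static void main(String[] args) {
        finalizeSnapshot(
            value -> System.out.println("listener invoked"),
            () -> System.out.println("onDone hook (too late to block the listener)")
        );
    }
}
```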

So, I modified TestRestartBeforeListenersRepo to block in the listener instead, like so:

diff --git a/x-pack/plugin/slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMStatDisruptionIT.java b/x-pack/plugin/slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMStatDisruptionIT.java
index 7c9eb5652e9..6e3333a1910 100644
--- a/x-pack/plugin/slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMStatDisruptionIT.java
+++ b/x-pack/plugin/slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMStatDisruptionIT.java
@@ -260,13 +260,12 @@ public class SLMStatDisruptionIT extends AbstractSnapshotIntegTestCase {
                 fsc.clusterMetadata(),
                 fsc.snapshotInfo(),
                 fsc.repositoryMetaVersion(),
-                fsc,
-                () -> {
+                ActionListener.runBefore(fsc, () -> {
                     // run the passed lambda before calling the usual callback
                     // this is where the cluster can be restarted before SLM is called back with the snapshotInfo
                     beforeResponseRunnable.run();
-                    fsc.onDone();
-                }
+                }),
+                fsc::onDone
             );
             super.finalizeSnapshot(newFinalizeContext);
         }

which does what I want; it actually blocks the SLM listener from being run.
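For readers unfamiliar with the helper used in the diff: Elasticsearch's ActionListener.runBefore wraps a listener so a runnable executes before the wrapped listener is notified. A minimal plain-Java sketch of that idea (Consumer standing in for ActionListener, names illustrative):

```java
import java.util.function.Consumer;

public class RunBeforeDemo {
    // Wrap a callback so `before` runs first; this is the hook that lets the
    // test park execution before the SLM listener sees the snapshot result.
    static <T> Consumer<T> runBefore(Consumer<T> delegate, Runnable before) {
        return value -> {
            before.run();           // e.g. block here until the test says go
            delegate.accept(value);
        };
    }

    public static void main(String[] args) {
        Consumer<String> listener = v -> System.out.println("listener got: " + v);
        Consumer<String> wrapped = runBefore(listener, () -> System.out.println("before hook"));
        wrapped.accept("snapshot-info");
    }
}
```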

However, that results in another issue. Due to this line:


we can only "finalize" one snapshot at a time per repository. That means that if we block the SLM listener until the second snapshot is also registered as failed, we get a deadlock, as the second snapshot will never be registered as failed if the first one isn't fully "finalized" yet.

In short, what we need is the ability to block somewhere before we submit this cluster state task:

submitUnbatchedTask(
    clusterService,
    "slm-record-failure-" + policyMetadata.getPolicy().getId(),
    WriteJobStatus.failure(

but I wasn't able to find a way to do that. If you have any suggestions, I'd be more than happy to hear them! I'll go ahead and merge this PR tomorrow. We can always add a test retrospectively if we come up with one.

@nielsbauman nielsbauman merged commit e319760 into elastic:main Oct 21, 2025
34 checks passed
@nielsbauman nielsbauman deleted the slm-invocations-fix branch October 21, 2025 06:55
@elasticsearchmachine
Collaborator

💚 Backport successful

Branch: 9.2

nielsbauman added a commit to nielsbauman/elasticsearch that referenced this pull request Oct 21, 2025
elasticsearchmachine pushed a commit that referenced this pull request Oct 21, 2025
chrisparrinello pushed a commit to chrisparrinello/elasticsearch that referenced this pull request Oct 24, 2025
fzowl pushed a commit to voyage-ai/elasticsearch that referenced this pull request Nov 3, 2025