
Conversation

@nielsbauman
Contributor

We came across a scenario where 3 snapshot failures were counted as 5 "invocations since last success", resulting in a premature yellow SLM health indicator. The three snapshot failures completed at virtually the same time. Our theory is that the listener of the first snapshot failure already processed the other two snapshot failures (incrementing the invocationsSinceLastSuccess), but the listeners of those other two snapshots then incremented that field too. There were two warning logs indicating that the snapshots weren't found in the registered set, confirming our hypothesis.

We simply avoid incrementing invocationsSinceLastSuccess if the listener failed with an exception and the snapshot isn't registered anymore, assuming that another listener has already incremented the field.
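The guard can be sketched in isolation as follows. This is a minimal plain-Java illustration, not the actual SnapshotLifecycleTask code; the names (onSnapshotFailure, the registered set) are illustrative. The idea is that only the listener that actually deregisters a snapshot counts its failure, so concurrent listeners cannot double count.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class RegisteredGuardDemo {
    static final Set<String> registered = ConcurrentHashMap.newKeySet();
    static final AtomicInteger invocationsSinceLastSuccess = new AtomicInteger();

    static void onSnapshotFailure(String snapshotId) {
        // remove() returns true only for the first caller; later callers see
        // that another listener has already accounted for this failure.
        if (registered.remove(snapshotId)) {
            invocationsSinceLastSuccess.incrementAndGet();
        }
    }

    public static void main(String[] args) {
        registered.add("snap-1");
        registered.add("snap-2");
        registered.add("snap-3");
        // The first listener sweeps up all three failures...
        onSnapshotFailure("snap-1");
        onSnapshotFailure("snap-2");
        onSnapshotFailure("snap-3");
        // ...and the other two listeners then fire for their own snapshots.
        onSnapshotFailure("snap-2");
        onSnapshotFailure("snap-3");
        System.out.println(invocationsSinceLastSuccess.get()); // prints 3, not 5
    }
}
```

Without the `registered.remove` check, the same run would count 5 failures, which is exactly the mismatch described above.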

@nielsbauman nielsbauman requested review from Copilot and samxbr October 17, 2025 16:20
@nielsbauman added the >bug, :Data Management/ILM+SLM, auto-backport, and v9.2.1 labels on Oct 17, 2025
@elasticsearchmachine elasticsearchmachine added v9.3.0 Team:Data Management Meta label for data/management team labels Oct 17, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Collaborator

Hi @nielsbauman, I've created a changelog YAML for you.

Contributor

Copilot AI left a comment


Pull Request Overview

Prevents double-counting snapshot failures toward invocationsSinceLastSuccess in SLM when multiple snapshot failure listeners process the same failures concurrently.

  • Introduces snapshotIsRegistered flag to gate incrementing invocationsSinceLastSuccess
  • Adds initiatingSnapshot to test setup for failure cleanup scenario

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTask.java Adds snapshotIsRegistered boolean and guards failure invocation increment to avoid double counting.
x-pack/plugin/slm/src/test/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTaskTests.java Updates test input list to include initiatingSnapshot for scenario coverage.


  snapshotInfoFailure1.snapshotId(),
- snapshotInfoFailure2.snapshotId()
+ snapshotInfoFailure2.snapshotId(),
+ initiatingSnapshot
Copy link
Contributor Author


This test was failing with my other changes. I think this test was originally wrong; since this WriteJobStatus runs for initiatingSnapshot, it should also be present in the registered snapshots. Please correct me if my reasoning is wrong!

Copy link
Contributor


I think you are right, the test was passing before because of this exact bug that you are fixing.

Contributor

@samxbr samxbr left a comment


Thanks Niels for looking into this! The fix looks good; my only comment is that it's probably worth adding a test covering this scenario to prevent regression (and to absolutely prove our theory!). You can check out SLMSnapshotBlockingIntegTests and SLMStatDisruptionIT for some similar test examples.

@nielsbauman
Contributor Author

Thanks for having a look, @samxbr. I think that's a great suggestion. However, I've been trying to create an integration test for some time now, but haven't managed to do so... I tried in both of the test classes you suggested, and I came closest in SLMStatDisruptionIT. I wrote a test that failed regularly on main, but not always - i.e. it was flaky.

The timing is pretty tricky, as the SnapshotsService needs to finish two (or more) snapshots before the SLM listeners are invoked. I tried lots of different things, but I don't know the Snapshot and/or SLM code well enough to figure this out. Perhaps you can give it a try if you want?

@samxbr
Contributor

samxbr commented Oct 20, 2025


Have you tried using TestRestartBeforeListenersRepo? I think we can do something similar, add a block after the snapshot is completed but before the listener has been called. In the test we can wait for 2 snapshots to be blocked before the listener, unblock one snapshot listener, wait for it to complete and validate the stats, then unblock the second one and do the same.
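The proposed choreography can be sketched with plain latches. This is a hypothetical illustration (the names gate1/gate2/bothBlocked are invented, and real SLM test plumbing is not shown): two "snapshots" finish and each parks before its listener runs; the test releases them one at a time and could validate stats in between.

```java
import java.util.concurrent.CountDownLatch;

public class BlockBeforeListenerDemo {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch bothBlocked = new CountDownLatch(2);
        CountDownLatch gate1 = new CountDownLatch(1);
        CountDownLatch gate2 = new CountDownLatch(1);
        StringBuilder log = new StringBuilder();

        Thread snap1 = new Thread(() -> {
            bothBlocked.countDown();      // snapshot finished, listener pending
            await(gate1);                 // the test controls when the listener runs
            log.append("listener-1 ");
        });
        Thread snap2 = new Thread(() -> {
            bothBlocked.countDown();
            await(gate2);
            log.append("listener-2 ");
        });
        snap1.start();
        snap2.start();

        bothBlocked.await();              // wait until both are parked pre-listener
        gate1.countDown();                // unblock the first listener...
        snap1.join();                     // ...and validate the stats here
        gate2.countDown();                // then unblock the second and do the same
        snap2.join();
        System.out.println(log.toString().trim());
    }

    static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```

As the later discussion shows, the hard part in the real test is finding a hook where the listeners can actually be parked at this point.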

I am approving this PR for now, feel free to merge if the test is taking too long (I can give it a try later when I have time too). I don't want to block this fix because it can potentially prevent some OBS alerts.

@nielsbauman
Contributor Author

Yeah, I tried using TestRestartBeforeListenersRepo. However, I don't actually believe it does anything useful in its current form: it blocks in FinalizeSnapshotContext.onDone, but that Runnable is only executed after the FinalizeSnapshotContext listener (the action/API/response listener) is invoked. See this code snippet:

// Report success, then clean up.
.<RepositoryData>andThen((l, rootBlobUpdateResult) -> {
    l.onResponse(rootBlobUpdateResult.newRepositoryData());
    cleanupOldMetadata(
        rootBlobUpdateResult.oldRepositoryData(),
        rootBlobUpdateResult.newRepositoryData(),
        finalizeSnapshotContext,
        writeShardGens
    );
})
// Finally subscribe the context as the listener, wrapping exceptions if needed
.addListener(
    finalizeSnapshotContext.delegateResponse(
        (l, e) -> l.onFailure(new SnapshotException(metadata.name(), snapshotId, "failed to update snapshot in repository", e))
    )
);

cleanupOldMetadata is where we call FinalizeSnapshotContext#onDone.
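The ordering problem can be boiled down to a few lines. This is a plain-Java sketch with invented names (finalizeSnapshot, OnDoneOrderingDemo), not the real repository code: the response listener fires first, and the onDone hook only runs afterwards, so a block placed in onDone cannot delay the listener.

```java
import java.util.function.Consumer;

public class OnDoneOrderingDemo {
    // Simplified finalize flow mirroring the snippet above: report success to
    // the listener first, then run cleanup, which is where onDone fires.
    static void finalizeSnapshot(Consumer<String> listener, Runnable onDone) {
        listener.accept("repository-data"); // the listener has already run...
        onDone.run();                       // ...before any block placed here takes effect
    }

    public static void main(String[] args) {
        finalizeSnapshot(
            value -> System.out.println("listener invoked"),
            () -> System.out.println("onDone hook (too late to block the listener)")
        );
    }
}
```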

So, I modified TestRestartBeforeListenersRepo to block in the listener instead, like so:

diff --git a/x-pack/plugin/slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMStatDisruptionIT.java b/x-pack/plugin/slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMStatDisruptionIT.java
index 7c9eb5652e9..6e3333a1910 100644
--- a/x-pack/plugin/slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMStatDisruptionIT.java
+++ b/x-pack/plugin/slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMStatDisruptionIT.java
@@ -260,13 +260,12 @@ public class SLMStatDisruptionIT extends AbstractSnapshotIntegTestCase {
                 fsc.clusterMetadata(),
                 fsc.snapshotInfo(),
                 fsc.repositoryMetaVersion(),
-                fsc,
-                () -> {
+                ActionListener.runBefore(fsc, () -> {
                     // run the passed lambda before calling the usual callback
                     // this is where the cluster can be restarted before SLM is called back with the snapshotInfo
                     beforeResponseRunnable.run();
-                    fsc.onDone();
-                }
+                }),
+                fsc::onDone
             );
             super.finalizeSnapshot(newFinalizeContext);
         }

which does what I want; it actually blocks the SLM listener from being run.
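For readers unfamiliar with the helper used in the diff: Elasticsearch's ActionListener.runBefore wraps a listener so a runnable executes before the wrapped listener is notified. A minimal plain-Java sketch of that idea (Consumer standing in for ActionListener, names illustrative):

```java
import java.util.function.Consumer;

public class RunBeforeDemo {
    // Wrap a callback so `before` runs first; this is the hook that lets the
    // test park execution before the SLM listener sees the snapshot result.
    static <T> Consumer<T> runBefore(Consumer<T> delegate, Runnable before) {
        return value -> {
            before.run();           // e.g. block here until the test says go
            delegate.accept(value);
        };
    }

    public static void main(String[] args) {
        Consumer<String> listener = v -> System.out.println("listener got: " + v);
        Consumer<String> wrapped = runBefore(listener, () -> System.out.println("before hook"));
        wrapped.accept("snapshot-info");
    }
}
```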

However, that results in another issue. Due to this line:


we can only "finalize" one snapshot at a time per repository. That means that if we block the SLM listener until the second snapshot is also registered as failed, we get a deadlock, as the second snapshot will never be registered as failed if the first one isn't fully "finalized" yet.

In short, what we need is the ability to block somewhere before we submit this cluster state task:

submitUnbatchedTask(
    clusterService,
    "slm-record-failure-" + policyMetadata.getPolicy().getId(),
    WriteJobStatus.failure(

but I wasn't able to find a way to do that. If you have any suggestions, I'd be more than happy to hear them! I'll go ahead and merge this PR tomorrow. We can always add a test retrospectively if we come up with one.

@nielsbauman nielsbauman merged commit e319760 into elastic:main Oct 21, 2025
34 checks passed
@nielsbauman nielsbauman deleted the slm-invocations-fix branch October 21, 2025 06:55
@elasticsearchmachine
Collaborator

💚 Backport successful

Branch: 9.2

nielsbauman added a commit to nielsbauman/elasticsearch that referenced this pull request Oct 21, 2025
elasticsearchmachine pushed a commit that referenced this pull request Oct 21, 2025
chrisparrinello pushed a commit to chrisparrinello/elasticsearch that referenced this pull request Oct 24, 2025
fzowl pushed a commit to voyage-ai/elasticsearch that referenced this pull request Nov 3, 2025