Fix a flaky test for issue 20058 #20240

liuguoqingfz · 2025-12-15T15:27:11Z

Description

Fix a flaky test caused by a race condition: the snapshot sometimes starts while one primary shard is still initializing or relocating, so createFullSnapshot() observes totalShards=11 but only successfulShards=10.

Related Issues

Resolves #20058

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

Tests
- Improved snapshot cloning test stability through enhanced cluster health monitoring and lock file synchronization polling mechanisms.


coderabbitai · 2025-12-15T15:27:35Z

Walkthrough

This pull request stabilizes the CloneSnapshotIT test by wrapping timing-sensitive operations with assertBusy polling. Health checks are added around index creation, snapshotting, and cloning; lock file count assertions are converted from direct checks to polled assertions with 60-second timeouts to accommodate asynchronous operations.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Test Flakiness Fix `server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java` | Wrapped cluster health polling and lock file assertions with `assertBusy` to handle timing variations; added health readiness checks (Yellow status, no initializing/relocating shards) around index creation, snapshotting, and cloning operations with 60s timeout |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

- Verify that the 60-second assertBusy timeout is appropriate for the remote store operations and cluster health stabilization
- Confirm that assertBusy is applied to all critical timing-sensitive assertion points (lock file counts and health checks)
- Ensure the polling logic doesn't unintentionally mask legitimate test failures

Suggested labels

bug

Suggested reviewers

dbwiddis
cwperks
gbbafna
sachinpkale
msfroh
andrross

Poem

🐰 A flaky test once danced with timing's whim,
Now wrapped in patient polls to right its limb,
With sixty seconds grace to let tasks bloom,
No more will random races seal its doom! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly describes the main change (fixing a flaky test for issue 20058), which aligns with the code modifications that add assertBusy-wrapped health checks and assertions. |
| Description check | ✅ Passed | The description explains the timing/race condition causing the flakiness and links to issue #20058. However, it lacks specific technical details about the fix strategy (assertBusy wrapping, health polling). |
| Linked Issues check | ✅ Passed | The PR directly addresses issue #20058 by fixing the flaky CloneSnapshotIT tests through assertBusy-wrapped health checks and lock file count assertions, resolving the documented race condition. |
| Out of Scope Changes check | ✅ Passed | All changes are narrowly focused on stabilizing the CloneSnapshotIT test by adding assertBusy polling mechanisms; no unrelated modifications are present. |


coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

server/src/internalClusterTest/java/org/opensearch/snapshots/ConcurrentSnapshotsIT.java (2)

205-218: Remove unnecessary fully qualified class names.

ClusterState and SnapshotDeletionsInProgress are already imported at the top of this file (lines 44-45), and notNullValue should be added as a static import for consistency with the existing matcher imports. Using FQNs here is inconsistent with the rest of the codebase.

-        // Stronger ordering: wait until cluster state shows the repository is in-use due to deletion
-        assertBusy(() -> {
-            final org.opensearch.cluster.ClusterState state = client().admin().cluster().prepareState().get().getState();
-            final org.opensearch.cluster.SnapshotDeletionsInProgress deletions = state.custom(
-                org.opensearch.cluster.SnapshotDeletionsInProgress.TYPE
-            );
-
-            assertThat("SnapshotDeletionsInProgress must be present once delete starts", deletions, org.hamcrest.Matchers.notNullValue());
-            assertThat(
-                deletions.getEntries().stream().map(org.opensearch.cluster.SnapshotDeletionsInProgress.Entry::repository).toList(),
-                hasItem(equalTo(repoName))
-            );
-        });
+        // Stronger ordering: wait until cluster state shows the repository is in-use due to deletion
+        assertBusy(() -> {
+            final ClusterState state = client().admin().cluster().prepareState().get().getState();
+            final SnapshotDeletionsInProgress deletions = state.custom(SnapshotDeletionsInProgress.TYPE);
+
+            assertThat("SnapshotDeletionsInProgress must be present once delete starts", deletions, notNullValue());
+            assertThat(
+                deletions.getEntries().stream().map(SnapshotDeletionsInProgress.Entry::repository).toList(),
+                hasItem(equalTo(repoName))
+            );
+        });

Add notNullValue to the static imports:

import static org.hamcrest.Matchers.notNullValue;

224-228: Clean up fully qualified names and add import for ExceptionsHelper.

The approach of catching Exception and unwrapping via ExceptionsHelper.unwrapCause() is correct for handling transport-layer wrapping, but the FQNs should be replaced with imports for consistency.

         // Transport can wrap the real exception; assert on the unwrapped root cause
-        final Exception ex = assertThrows(Exception.class, () -> updateRepository(repoName, "mock", newSettings));
-        final Throwable cause = org.opensearch.ExceptionsHelper.unwrapCause(ex);
-        assertThat(cause, org.hamcrest.Matchers.instanceOf(IllegalStateException.class));
+        final Exception ex = assertThrows(Exception.class, () -> updateRepository(repoName, "mock", newSettings));
+        final Throwable cause = ExceptionsHelper.unwrapCause(ex);
+        assertThat(cause, instanceOf(IllegalStateException.class));
         assertEquals("trying to modify or unregister repository that is currently used", cause.getMessage());

Add ExceptionsHelper to the imports:

import org.opensearch.ExceptionsHelper;

Note: instanceOf is already statically imported at line 93.
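To illustrate why the assertion targets the unwrapped cause rather than the caught exception, here is a minimal, self-contained sketch of cause unwrapping. It is a simplified model, not the `ExceptionsHelper.unwrapCause()` implementation: the `Wrapper` marker interface, `TransportWrapperException`, and the depth limit of 10 are hypothetical names and values chosen for illustration.

```java
public class UnwrapCauseSketch {
    /** Hypothetical marker for exceptions that merely wrap another one. */
    interface Wrapper {}

    /** Hypothetical transport-style wrapper exception. */
    static final class TransportWrapperException extends RuntimeException implements Wrapper {
        TransportWrapperException(Throwable cause) {
            super("wrapped", cause);
        }
    }

    /** Peels off wrapper exceptions until a non-wrapper (or a wrapper without a cause) is reached. */
    static Throwable unwrap(Throwable t) {
        Throwable result = t;
        int depth = 0;
        // The depth limit guards against pathological cause chains.
        while (result instanceof Wrapper && result.getCause() != null && depth++ < 10) {
            result = result.getCause();
        }
        return result;
    }

    public static void main(String[] args) {
        Throwable root = new IllegalStateException("trying to modify or unregister repository that is currently used");
        Throwable wrapped = new TransportWrapperException(new TransportWrapperException(root));

        Throwable cause = unwrap(wrapped);
        // prints "IllegalStateException: trying to modify or unregister repository that is currently used"
        System.out.println(cause.getClass().getSimpleName() + ": " + cause.getMessage());
    }
}
```

Asserting on `wrapped` directly would see only the outer wrapper type, which is why the test checks the type and message of the unwrapped cause.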

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0d7f787 and 2ea5728.

📒 Files selected for processing (2)

server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (1 hunks)
server/src/internalClusterTest/java/org/opensearch/snapshots/ConcurrentSnapshotsIT.java (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

server/src/internalClusterTest/java/org/opensearch/snapshots/ConcurrentSnapshotsIT.java (2)

server/src/main/java/org/opensearch/cluster/SnapshotDeletionsInProgress.java (1)

SnapshotDeletionsInProgress (59-391)

libs/core/src/main/java/org/opensearch/ExceptionsHelper.java (1)

ExceptionsHelper (76-476)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: gradle-check
GitHub Check: detect-breaking-change
GitHub Check: Analyze (java)

🔇 Additional comments (6)

server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (6)

180-189: Excellent fix for the race condition!

This assertBusy wrapper correctly addresses the flaky test issue described in the PR. By waiting for no initializing or relocating shards before taking the snapshot, you prevent the race condition where createFullSnapshot() would observe fewer successful shards than total shards.

193-193: Correct use of assertBusy for async file operations.

Wrapping the lock file count assertion in assertBusy properly handles the asynchronous nature of file creation/cleanup operations in the remote store.

197-206: Consistent stabilization pattern applied.

This health check ensures the cluster is stable after indexing additional documents and before creating the shallow snapshot, preventing the same race condition at this stage of the test.

210-210: LGTM!

Consistent application of assertBusy for lock file assertions.

214-223: Good defensive programming for conditional path.

Adding cluster stabilization after the optional index deletion ensures the subsequent clone operation doesn't encounter race conditions regardless of which test path is taken.

230-230: Completes the stabilization pattern correctly.

The final assertBusy wrapper for the lock file count after cloning completes the comprehensive fix. The logging statement on line 229 is a helpful addition for debugging if flakiness reoccurs.


coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (1)

157-231: The other flaky tests also lack pre-snapshot health checks.

testShallowCloneNameAvailability (lines 258 and 261) and testCloneAfterRepoShallowSettingDisabled (line 324) both call createFullSnapshot immediately after indexing, without the assertBusy cluster health validation that was added to testCloneShallowSnapshotIndex. They should be updated with similar health checks to prevent race conditions during snapshot and clone operations.

🧹 Nitpick comments (1)

server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (1)
184-184: Consider adding an import for ClusterHealthStatus.

The fully qualified class name org.opensearch.cluster.health.ClusterHealthStatus.YELLOW is used in three places. Adding a static import would improve readability:
+import static org.opensearch.cluster.health.ClusterHealthStatus.YELLOW;
Then use:
-                .setWaitForStatus(org.opensearch.cluster.health.ClusterHealthStatus.YELLOW)
+                .setWaitForStatus(YELLOW)
Also applies to: 201-201, 218-218

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ea5728 and 1dc6303.

📒 Files selected for processing (1)

server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)

GitHub Check: gradle-check
GitHub Check: Analyze (java)
GitHub Check: assemble (21, ubuntu-24.04-arm)
GitHub Check: assemble (25, ubuntu-latest)
GitHub Check: assemble (25, windows-latest)
GitHub Check: assemble (25, ubuntu-24.04-arm)
GitHub Check: assemble (21, windows-latest)
GitHub Check: assemble (21, ubuntu-latest)
GitHub Check: detect-breaking-change
GitHub Check: precommit (25, ubuntu-latest)
GitHub Check: precommit (21, ubuntu-latest)
GitHub Check: precommit (25, windows-latest)
GitHub Check: precommit (25, ubuntu-24.04-arm)
GitHub Check: precommit (21, macos-15)
GitHub Check: precommit (21, ubuntu-24.04-arm)
GitHub Check: precommit (21, macos-15-intel)
GitHub Check: precommit (21, windows-latest)
GitHub Check: precommit (21, windows-2025, true)
GitHub Check: precommit (25, macos-15)
GitHub Check: precommit (25, macos-15-intel)

🔇 Additional comments (2)

server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (2)

180-189: Excellent fix for the race condition.

The health checks with assertBusy directly address the timing issue where snapshots could start while shards are initializing or relocating. Waiting for YELLOW status with no initializing/relocating shards before snapshot operations ensures cluster stability.

Also applies to: 197-206, 214-223
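The readiness condition described in this comment can be written as a single predicate. The sketch below is a self-contained model only: `Status`, `Health`, and `readyForSnapshot` are hypothetical names, and the enum is ordered worst to best purely so the comparison is easy to read. It does not use or mirror the OpenSearch health API.

```java
public class SnapshotReadiness {
    /** Hypothetical health levels, ordered from worst to best so ordinals can be compared. */
    enum Status { RED, YELLOW, GREEN }

    /** Hypothetical, simplified view of a cluster health response. */
    static final class Health {
        final Status status;
        final int initializingShards;
        final int relocatingShards;

        Health(Status status, int initializingShards, int relocatingShards) {
            this.status = status;
            this.initializingShards = initializingShards;
            this.relocatingShards = relocatingShards;
        }
    }

    /** True only when the cluster is at least YELLOW and no shard is initializing or relocating. */
    static boolean readyForSnapshot(Health h) {
        return h.status.ordinal() >= Status.YELLOW.ordinal()
            && h.initializingShards == 0
            && h.relocatingShards == 0;
    }

    public static void main(String[] args) {
        // One shard still initializing: a snapshot taken now could miss it.
        System.out.println(readyForSnapshot(new Health(Status.YELLOW, 1, 0))); // prints "false"
        // Stable cluster: safe to snapshot.
        System.out.println(readyForSnapshot(new Health(Status.GREEN, 0, 0)));  // prints "true"
    }
}
```

Note that YELLOW alone is not sufficient in this model: the shard counters must also be zero, which is the part that targets the totalShards/successfulShards mismatch from the PR description.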

193-193: Good fix for asynchronous lock file operations.

Converting the lock file assertions to use assertBusy properly handles the asynchronous nature of remote store lock file creation/cleanup.

Minor observation: The logger statement at line 229 executes once before the assertBusy loop, so it may log a count that differs from the eventual assertion result during test flakiness investigations.

Also applies to: 210-210, 230-230
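The logging observation above can be reproduced with a small, self-contained sketch. It assumes a simulated counter in place of the real lock file count; `awaitCount` and the supplier are hypothetical names for illustration, not code from the test.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

public class PolledCountLogging {
    /** Polls until the supplier returns the expected count and returns the value that matched. */
    static int awaitCount(IntSupplier countSupplier, int expected, int maxAttempts) {
        int observed = -1;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            observed = countSupplier.getAsInt(); // re-read the count on every attempt
            System.out.println("attempt " + attempt + ": lock file count = " + observed);
            if (observed == expected) {
                return observed;
            }
            try {
                Thread.sleep(10); // short pause between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while polling", e);
            }
        }
        throw new AssertionError("expected " + expected + " but last observed " + observed);
    }

    public static void main(String[] args) {
        // Simulated asynchronous counter: each read reports one more file, capped at 2.
        AtomicInteger reads = new AtomicInteger();
        IntSupplier lockFileCount = () -> Math.min(reads.getAndIncrement(), 2);

        int loggedOnce = lockFileCount.getAsInt();      // read once, before polling starts
        int matched = awaitCount(lockFileCount, 2, 10); // read repeatedly while polling

        // prints "logged before polling: 0, asserted value: 2"
        System.out.println("logged before polling: " + loggedOnce + ", asserted value: " + matched);
    }
}
```

Logging inside the polled block, as `awaitCount` does here, keeps the logged value and the asserted value in step on every attempt.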

github-actions · 2025-12-15T16:59:42Z

✅ Gradle check result for 1dc6303: SUCCESS

codecov · 2025-12-15T17:00:43Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.16%. Comparing base (1022486) to head (1dc6303).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #20240      +/-   ##
============================================
- Coverage     73.20%   73.16%   -0.05%     
+ Complexity    71766    71760       -6     
============================================
  Files          5795     5795              
  Lines        328302   328303       +1     
  Branches      47283    47283              
============================================
- Hits         240345   240212     -133     
- Misses        68628    68834     +206     
+ Partials      19329    19257      -72


andrross · 2025-12-16T00:58:23Z

server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java

        createIndex(remoteStoreEnabledIndexName, remoteStoreEnabledIndexSettings);
        indexRandomDocs(remoteStoreEnabledIndexName, randomIntBetween(5, 10));

+        assertBusy(


I don't think assertBusy() will do anything here since you're not asserting anything. I believe it will only retry AssertionErrors, and your code block would throw a different exception upon timeout or failure. I think you don't actually need assertBusy() here since the client call will block until that status is reached. I'd also consider looking at whether one of the ensureYellow() helper methods in the base class can be used here.
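The retry behavior this comment describes can be modeled in a few lines. The sketch below is a simplified stand-in, not the OpenSearch `assertBusy()` implementation: `retryOnAssertionError`, `CheckedRunnable`, the fixed 5 ms pause, and the exception messages are all assumptions made for illustration. It shows that a block throwing `AssertionError` is retried, while any other exception escapes on the first attempt.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RetryOnAssertionError {
    /** A block of code that may throw checked exceptions. */
    interface CheckedRunnable {
        void run() throws Exception;
    }

    /**
     * Simplified polling helper: retries while the block throws AssertionError,
     * and lets any other exception propagate immediately.
     */
    static void retryOnAssertionError(CheckedRunnable block, long timeoutMillis) throws Exception {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            try {
                block.run();
                return; // block passed, stop polling
            } catch (AssertionError e) {
                if (System.nanoTime() >= deadline) {
                    throw e; // out of time: surface the last assertion failure
                }
                Thread.sleep(5);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Case 1: the block asserts, so it is retried until it passes.
        AtomicInteger assertionAttempts = new AtomicInteger();
        retryOnAssertionError(() -> {
            if (assertionAttempts.incrementAndGet() < 3) {
                throw new AssertionError("not ready yet");
            }
        }, 1_000);
        System.out.println("asserting block ran " + assertionAttempts.get() + " times"); // prints 3

        // Case 2: the block throws a non-assertion exception, so there is no retry.
        AtomicInteger otherAttempts = new AtomicInteger();
        try {
            retryOnAssertionError(() -> {
                otherAttempts.incrementAndGet();
                throw new IllegalStateException("health request timed out");
            }, 1_000);
        } catch (IllegalStateException e) {
            System.out.println("non-assertion block ran " + otherAttempts.get() + " time"); // prints 1
        }
    }
}
```

Under this model, wrapping a blocking health request that contains no assertions adds nothing: the request either returns or fails with a non-assertion exception that is never retried.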

liuguoqingfz requested a review from a team as a code owner December 15, 2025 15:27

github-actions bot added the labels ">test-failure" (Test failure from CI, local build, etc.), "autocut", and "flaky-test" (Random test failure that succeeds on second run) on Dec 15, 2025

coderabbitai bot reviewed Dec 15, 2025


make testCloneShallowSnapshotIndex() establish the missing precondition: cluster at least YELLOW + no initializing shards + no relocating shards before each snapshot

Signed-off-by: Joe Liu <guoqing4@illinois.edu>

1dc6303

liuguoqingfz force-pushed the flakytest-20058 branch from 2ea5728 to 1dc6303 Compare December 15, 2025 15:45

coderabbitai bot reviewed Dec 15, 2025


andrross reviewed Dec 16, 2025

