Skip to content

Conversation

@liuguoqingfz
Copy link
Contributor

@liuguoqingfz liuguoqingfz commented Dec 15, 2025

Description

Fix a flaky test where there's a timing/race where the snapshot sometimes starts while one primary shard is still initializing/relocating, so createFullSnapshot() observes totalShards=11 but only successfulShards=10.

Related Issues

Resolves #20058

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Tests
    • Improved snapshot cloning test stability through enhanced cluster health monitoring and lock file synchronization polling mechanisms.

✏️ Tip: You can customize this high-level summary in your review settings.

@liuguoqingfz liuguoqingfz requested a review from a team as a code owner December 15, 2025 15:27
@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run labels Dec 15, 2025
@coderabbitai
Copy link

coderabbitai bot commented Dec 15, 2025

Walkthrough

This pull request stabilizes the CloneSnapshotIT test by wrapping timing-sensitive operations with assertBusy polling. Health checks are added around index creation, snapshotting, and cloning; lock file count assertions are converted from direct checks to polled assertions with 60-second timeouts to accommodate asynchronous operations.

Changes

Cohort / File(s) Summary
Test Flakiness Fix
server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java
Wrapped cluster health polling and lock file assertions with assertBusy to handle timing variations; added health readiness checks (Yellow status, no initializing/relocating shards) around index creation, snapshotting, and cloning operations with 60s timeout

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

  • Verify that the 60-second assertBusy timeout is appropriate for the remote store operations and cluster health stabilization
  • Confirm that assertBusy is applied to all critical timing-sensitive assertion points (lock file counts and health checks)
  • Ensure the polling logic doesn't unintentionally mask legitimate test failures

Suggested labels

bug

Suggested reviewers

  • dbwiddis
  • cwperks
  • gbbafna
  • sachinpkale
  • msfroh
  • andrross

Poem

🐰 A flaky test once danced with timing's whim,
Now wrapped in patient polls to right its limb,
With sixty seconds grace to let tasks bloom,
No more will random races seal its doom! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change—fixing a flaky test for issue 20058—which aligns with the code modifications that add assertBusy-wrapped health checks and assertions.
Description check ✅ Passed The description explains the timing/race condition causing the flakiness and links to issue #20058. However, it lacks specific technical details about the fix strategy (assertBusy wrapping, health polling).
Linked Issues check ✅ Passed The PR directly addresses issue #20058 by fixing the flaky CloneSnapshotIT tests through assertBusy-wrapped health checks and lock file count assertions, resolving the documented race condition.
Out of Scope Changes check ✅ Passed All changes are narrowly focused on stabilizing the CloneSnapshotIT test by adding assertBusy polling mechanisms; no unrelated modifications are present.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
server/src/internalClusterTest/java/org/opensearch/snapshots/ConcurrentSnapshotsIT.java (2)

205-218: Remove unnecessary fully qualified class names.

ClusterState and SnapshotDeletionsInProgress are already imported at the top of this file (lines 44-45), and notNullValue should be added as a static import for consistency with the existing matcher imports. Using FQNs here is inconsistent with the rest of the codebase.

-        // Stronger ordering: wait until cluster state shows the repository is in-use due to deletion
-        assertBusy(() -> {
-            final org.opensearch.cluster.ClusterState state = client().admin().cluster().prepareState().get().getState();
-            final org.opensearch.cluster.SnapshotDeletionsInProgress deletions = state.custom(
-                org.opensearch.cluster.SnapshotDeletionsInProgress.TYPE
-            );
-
-            assertThat("SnapshotDeletionsInProgress must be present once delete starts", deletions, org.hamcrest.Matchers.notNullValue());
-            assertThat(
-                deletions.getEntries().stream().map(org.opensearch.cluster.SnapshotDeletionsInProgress.Entry::repository).toList(),
-                hasItem(equalTo(repoName))
-            );
-        });
+        // Stronger ordering: wait until cluster state shows the repository is in-use due to deletion
+        assertBusy(() -> {
+            final ClusterState state = client().admin().cluster().prepareState().get().getState();
+            final SnapshotDeletionsInProgress deletions = state.custom(SnapshotDeletionsInProgress.TYPE);
+
+            assertThat("SnapshotDeletionsInProgress must be present once delete starts", deletions, notNullValue());
+            assertThat(
+                deletions.getEntries().stream().map(SnapshotDeletionsInProgress.Entry::repository).toList(),
+                hasItem(equalTo(repoName))
+            );
+        });

Add notNullValue to the static imports:

import static org.hamcrest.Matchers.notNullValue;

224-228: Clean up fully qualified names and add import for ExceptionsHelper.

The approach of catching Exception and unwrapping via ExceptionsHelper.unwrapCause() is correct for handling transport-layer wrapping, but the FQNs should be replaced with imports for consistency.

         // Transport can wrap the real exception; assert on the unwrapped root cause
-        final Exception ex = assertThrows(Exception.class, () -> updateRepository(repoName, "mock", newSettings));
-        final Throwable cause = org.opensearch.ExceptionsHelper.unwrapCause(ex);
-        assertThat(cause, org.hamcrest.Matchers.instanceOf(IllegalStateException.class));
+        final Exception ex = assertThrows(Exception.class, () -> updateRepository(repoName, "mock", newSettings));
+        final Throwable cause = ExceptionsHelper.unwrapCause(ex);
+        assertThat(cause, instanceOf(IllegalStateException.class));
         assertEquals("trying to modify or unregister repository that is currently used", cause.getMessage());

Add ExceptionsHelper to the imports:

import org.opensearch.ExceptionsHelper;

Note: instanceOf is already statically imported at line 93.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0d7f787 and 2ea5728.

📒 Files selected for processing (2)
  • server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (1 hunks)
  • server/src/internalClusterTest/java/org/opensearch/snapshots/ConcurrentSnapshotsIT.java (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/internalClusterTest/java/org/opensearch/snapshots/ConcurrentSnapshotsIT.java (2)
server/src/main/java/org/opensearch/cluster/SnapshotDeletionsInProgress.java (1)
  • SnapshotDeletionsInProgress (59-391)
libs/core/src/main/java/org/opensearch/ExceptionsHelper.java (1)
  • ExceptionsHelper (76-476)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: gradle-check
  • GitHub Check: detect-breaking-change
  • GitHub Check: Analyze (java)
🔇 Additional comments (6)
server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (6)

180-189: Excellent fix for the race condition!

This assertBusy wrapper correctly addresses the flaky test issue described in the PR. By waiting for no initializing or relocating shards before taking the snapshot, you prevent the race condition where createFullSnapshot() would observe fewer successful shards than total shards.


193-193: Correct use of assertBusy for async file operations.

Wrapping the lock file count assertion in assertBusy properly handles the asynchronous nature of file creation/cleanup operations in the remote store.


197-206: Consistent stabilization pattern applied.

This health check ensures the cluster is stable after indexing additional documents and before creating the shallow snapshot, preventing the same race condition at this stage of the test.


210-210: LGTM!

Consistent application of assertBusy for lock file assertions.


214-223: Good defensive programming for conditional path.

Adding cluster stabilization after the optional index deletion ensures the subsequent clone operation doesn't encounter race conditions regardless of which test path is taken.


230-230: Completes the stabilization pattern correctly.

The final assertBusy wrapper for the lock file count after cloning completes the comprehensive fix. The logging statement on line 229 is a helpful addition for debugging if flakiness reoccurs.

…on: cluster at least YELLOW + no initializing shards + no relocating shards before each snapshot

Signed-off-by: Joe Liu <guoqing4@illinois.edu>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (1)

157-231: The other flaky tests also lack pre-snapshot health checks.

testShallowCloneNameAvailability (line 258, 261) and testCloneAfterRepoShallowSettingDisabled (line 324) both call createFullSnapshot immediately after indexing, without the assertBusy cluster health validation that was added to testCloneShallowSnapshotIndex. They should be updated with similar health checks to prevent race conditions during snapshot and clone operations.

🧹 Nitpick comments (1)
server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (1)

184-184: Consider adding an import for ClusterHealthStatus.

The fully qualified class name org.opensearch.cluster.health.ClusterHealthStatus.YELLOW is used in three places. Adding a static import would improve readability:

+import static org.opensearch.cluster.health.ClusterHealthStatus.YELLOW;

Then use:

-                .setWaitForStatus(org.opensearch.cluster.health.ClusterHealthStatus.YELLOW)
+                .setWaitForStatus(YELLOW)

Also applies to: 201-201, 218-218

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ea5728 and 1dc6303.

📒 Files selected for processing (1)
  • server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: gradle-check
  • GitHub Check: Analyze (java)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: detect-breaking-change
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (25, macos-15-intel)
🔇 Additional comments (2)
server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (2)

180-189: Excellent fix for the race condition.

The health checks with assertBusy directly address the timing issue where snapshots could start while shards are initializing or relocating. Waiting for YELLOW status with no initializing/relocating shards before snapshot operations ensures cluster stability.

Also applies to: 197-206, 214-223


193-193: Good fix for asynchronous lock file operations.

Converting the lock file assertions to use assertBusy properly handles the asynchronous nature of remote store lock file creation/cleanup.

Minor observation: The logger statement at line 229 executes once before the assertBusy loop, so it may log a count that differs from the eventual assertion result during test flakiness investigations.

Also applies to: 210-210, 230-230

@github-actions
Copy link
Contributor

✅ Gradle check result for 1dc6303: SUCCESS

@codecov
Copy link

codecov bot commented Dec 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.16%. Comparing base (1022486) to head (1dc6303).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20240      +/-   ##
============================================
- Coverage     73.20%   73.16%   -0.05%     
+ Complexity    71766    71760       -6     
============================================
  Files          5795     5795              
  Lines        328302   328303       +1     
  Branches      47283    47283              
============================================
- Hits         240345   240212     -133     
- Misses        68628    68834     +206     
+ Partials      19329    19257      -72     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

createIndex(remoteStoreEnabledIndexName, remoteStoreEnabledIndexSettings);
indexRandomDocs(remoteStoreEnabledIndexName, randomIntBetween(5, 10));

assertBusy(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think assertBusy() will do anything here since you're not asserting anything. I believe it will only retry AssertionErrors, and your code block would throw a different exception upon timeout or failure. I think you don't actually need assertBusy() here since the client call will block until that status is reached. I'd also consider looking at whether one of the ensureYellow() helper methods in the base class can be used here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

autocut flaky-test Random test failure that succeeds on second run >test-failure Test failure from CI, local build, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for CloneSnapshotIT

2 participants