Fix a flaky test for issue 20058 #20240
base: main
Conversation
Walkthrough: This pull request stabilizes the flaky snapshot tests by waiting for cluster health (at least YELLOW, with no initializing or relocating shards) before each snapshot and by retrying asynchronous assertions with `assertBusy`.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 0
🧹 Nitpick comments (2)
server/src/internalClusterTest/java/org/opensearch/snapshots/ConcurrentSnapshotsIT.java (2)
205-218: Remove unnecessary fully qualified class names.
`ClusterState` and `SnapshotDeletionsInProgress` are already imported at the top of this file (lines 44-45), and `notNullValue` should be added as a static import for consistency with the existing matcher imports. Using FQNs here is inconsistent with the rest of the codebase.

```diff
-        // Stronger ordering: wait until cluster state shows the repository is in-use due to deletion
-        assertBusy(() -> {
-            final org.opensearch.cluster.ClusterState state = client().admin().cluster().prepareState().get().getState();
-            final org.opensearch.cluster.SnapshotDeletionsInProgress deletions = state.custom(
-                org.opensearch.cluster.SnapshotDeletionsInProgress.TYPE
-            );
-
-            assertThat("SnapshotDeletionsInProgress must be present once delete starts", deletions, org.hamcrest.Matchers.notNullValue());
-            assertThat(
-                deletions.getEntries().stream().map(org.opensearch.cluster.SnapshotDeletionsInProgress.Entry::repository).toList(),
-                hasItem(equalTo(repoName))
-            );
-        });
+        // Stronger ordering: wait until cluster state shows the repository is in-use due to deletion
+        assertBusy(() -> {
+            final ClusterState state = client().admin().cluster().prepareState().get().getState();
+            final SnapshotDeletionsInProgress deletions = state.custom(SnapshotDeletionsInProgress.TYPE);
+
+            assertThat("SnapshotDeletionsInProgress must be present once delete starts", deletions, notNullValue());
+            assertThat(
+                deletions.getEntries().stream().map(SnapshotDeletionsInProgress.Entry::repository).toList(),
+                hasItem(equalTo(repoName))
+            );
+        });
```

Add `notNullValue` to the static imports:

```java
import static org.hamcrest.Matchers.notNullValue;
```
224-228: Clean up fully qualified names and add import for ExceptionsHelper.

The approach of catching `Exception` and unwrapping via `ExceptionsHelper.unwrapCause()` is correct for handling transport-layer wrapping, but the FQNs should be replaced with imports for consistency.

```diff
         // Transport can wrap the real exception; assert on the unwrapped root cause
-        final Exception ex = assertThrows(Exception.class, () -> updateRepository(repoName, "mock", newSettings));
-        final Throwable cause = org.opensearch.ExceptionsHelper.unwrapCause(ex);
-        assertThat(cause, org.hamcrest.Matchers.instanceOf(IllegalStateException.class));
+        final Exception ex = assertThrows(Exception.class, () -> updateRepository(repoName, "mock", newSettings));
+        final Throwable cause = ExceptionsHelper.unwrapCause(ex);
+        assertThat(cause, instanceOf(IllegalStateException.class));
         assertEquals("trying to modify or unregister repository that is currently used", cause.getMessage());
```

Add `ExceptionsHelper` to the imports:

```java
import org.opensearch.ExceptionsHelper;
```

Note: `instanceOf` is already statically imported at line 93.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java` (1 hunks)
- `server/src/internalClusterTest/java/org/opensearch/snapshots/ConcurrentSnapshotsIT.java` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/internalClusterTest/java/org/opensearch/snapshots/ConcurrentSnapshotsIT.java (2)
- server/src/main/java/org/opensearch/cluster/SnapshotDeletionsInProgress.java (1)
  - `SnapshotDeletionsInProgress` (59-391)
- libs/core/src/main/java/org/opensearch/ExceptionsHelper.java (1)
  - `ExceptionsHelper` (76-476)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: gradle-check
- GitHub Check: detect-breaking-change
- GitHub Check: Analyze (java)
🔇 Additional comments (6)
server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (6)
180-189: Excellent fix for the race condition!

This `assertBusy` wrapper correctly addresses the flaky test issue described in the PR. By waiting for no initializing or relocating shards before taking the snapshot, you prevent the race condition where `createFullSnapshot()` would observe fewer successful shards than total shards.
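For readers outside the PR, a minimal sketch of such a pre-snapshot wait, assuming the standard OpenSearch integration-test client API (the exact builder calls in the PR may differ):

```java
import org.opensearch.action.admin.cluster.health.ClusterHealthResponse;
import org.opensearch.cluster.health.ClusterHealthStatus;
import org.opensearch.common.unit.TimeValue;

// Block until the cluster is at least YELLOW with no shard movement, so the
// snapshot that follows sees every primary shard in the STARTED state.
ClusterHealthResponse health = client().admin()
    .cluster()
    .prepareHealth()
    .setWaitForStatus(ClusterHealthStatus.YELLOW)
    .setWaitForNoInitializingShards(true)
    .setWaitForNoRelocatingShards(true)
    .setTimeout(TimeValue.timeValueSeconds(30))
    .get();
assertFalse("timed out waiting for a stable cluster", health.isTimedOut());
```

Note that the health call itself blocks until the requested status is reached or the timeout fires, which is why a later review comment questions wrapping it in `assertBusy`.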
193-193: Correct use of assertBusy for async file operations.

Wrapping the lock file count assertion in `assertBusy` properly handles the asynchronous nature of file creation/cleanup operations in the remote store.
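As a general illustration of the idiom (`assertBusy` comes from `OpenSearchTestCase`; the lock-file helper and counts below are hypothetical):

```java
import java.util.concurrent.TimeUnit;

// assertBusy re-runs the lambda until it stops throwing AssertionError or the
// timeout elapses, which suits eventually-consistent remote-store file state.
assertBusy(() -> {
    String[] lockFiles = getLockFilesInRemoteStore(snapshotName, remoteStoreRepoName); // hypothetical helper
    assertEquals("expected one lock file per primary shard", numPrimaries, lockFiles.length);
}, 30, TimeUnit.SECONDS);
```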
197-206: Consistent stabilization pattern applied.

This health check ensures the cluster is stable after indexing additional documents and before creating the shallow snapshot, preventing the same race condition at this stage of the test.
210-210: LGTM!

Consistent application of `assertBusy` for lock file assertions.
214-223: Good defensive programming for conditional path.

Adding cluster stabilization after the optional index deletion ensures the subsequent clone operation doesn't encounter race conditions regardless of which test path is taken.
230-230: Completes the stabilization pattern correctly.

The final `assertBusy` wrapper for the lock file count after cloning completes the comprehensive fix. The logging statement on line 229 is a helpful addition for debugging if flakiness reoccurs.
…on: cluster at least YELLOW + no initializing shards + no relocating shards before each snapshot

Signed-off-by: Joe Liu <guoqing4@illinois.edu>
Force-pushed from 2ea5728 to 1dc6303.
Actionable comments posted: 0
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (1)
157-231: The other flaky tests also lack pre-snapshot health checks.

`testShallowCloneNameAvailability` (lines 258, 261) and `testCloneAfterRepoShallowSettingDisabled` (line 324) both call `createFullSnapshot` immediately after indexing, without the `assertBusy` cluster health validation that was added to `testCloneShallowSnapshotIndex`. They should be updated with similar health checks to prevent race conditions during snapshot and clone operations.
🧹 Nitpick comments (1)
server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (1)
184-184: Consider adding an import for ClusterHealthStatus.

The fully qualified class name `org.opensearch.cluster.health.ClusterHealthStatus.YELLOW` is used in three places. Adding a static import would improve readability:

```diff
+import static org.opensearch.cluster.health.ClusterHealthStatus.YELLOW;
```

Then use:

```diff
-            .setWaitForStatus(org.opensearch.cluster.health.ClusterHealthStatus.YELLOW)
+            .setWaitForStatus(YELLOW)
```

Also applies to: 201-201, 218-218
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java` (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: gradle-check
- GitHub Check: Analyze (java)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: detect-breaking-change
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (25, macos-15-intel)
🔇 Additional comments (2)
server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (2)
180-189: Excellent fix for the race condition.

The health checks with `assertBusy` directly address the timing issue where snapshots could start while shards are initializing or relocating. Waiting for `YELLOW` status with no initializing/relocating shards before snapshot operations ensures cluster stability.

Also applies to: 197-206, 214-223
193-193: Good fix for asynchronous lock file operations.

Converting the lock file assertions to use `assertBusy` properly handles the asynchronous nature of remote store lock file creation/cleanup.

Minor observation: The logger statement at line 229 executes once before the `assertBusy` loop, so it may log a count that differs from the eventual assertion result during test flakiness investigations.

Also applies to: 210-210, 230-230
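If that ever matters during a flakiness investigation, one option (a sketch, not the PR's code) is to move the logging inside the retried block so each attempt records the count it actually asserts on:

```java
assertBusy(() -> {
    String[] lockFiles = getLockFilesInRemoteStore(cloneSnapshotName, remoteStoreRepoName); // hypothetical helper
    logger.info("lock files visible on this attempt: {}", lockFiles.length); // logged on every retry
    assertEquals(expectedLockFileCount, lockFiles.length);
});
```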
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@             Coverage Diff              @@
##               main   #20240      +/-   ##
============================================
- Coverage     73.20%   73.16%   -0.05%
+ Complexity    71766    71760       -6
============================================
  Files          5795     5795
  Lines        328302   328303       +1
  Branches      47283    47283
============================================
- Hits         240345   240212     -133
- Misses        68628    68834     +206
+ Partials      19329    19257      -72
```

☔ View full report in Codecov by Sentry.
```java
createIndex(remoteStoreEnabledIndexName, remoteStoreEnabledIndexSettings);
indexRandomDocs(remoteStoreEnabledIndexName, randomIntBetween(5, 10));

assertBusy(
```
I don't think assertBusy() will do anything here since you're not asserting anything. I believe it will only retry AssertionErrors, and your code block would throw a different exception upon timeout or failure. I think you don't actually need assertBusy() here since the client call will block until that status is reached. I'd also consider looking at whether one of the ensureYellow() helper methods in the base class can be used here.
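For reference, a sketch of the simpler shape this comment suggests, assuming the `ensureYellow` helper from the integration-test base class (variable names hypothetical):

```java
// ensureYellow blocks until the named index reaches at least YELLOW health and
// fails the test on timeout, so no retry loop is needed around it.
ensureYellow(remoteStoreEnabledIndexName);
createFullSnapshot(snapshotRepoName, sourceSnapshotName);
```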
Description
Fixes a flaky test with a timing race: the snapshot sometimes starts while one primary shard is still initializing/relocating, so createFullSnapshot() observes totalShards=11 but only successfulShards=10.
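Concretely, the failing pattern looks roughly like this (a simplified sketch; variable names hypothetical):

```java
// Without a prior health wait, a still-initializing primary is skipped by the
// snapshot, so successfulShards() can briefly lag totalShards() (e.g., 10 vs 11).
SnapshotInfo snapshotInfo = createFullSnapshot(repoName, snapshotName);
assertEquals(snapshotInfo.totalShards(), snapshotInfo.successfulShards());
```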
Related Issues
Resolves #20058
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.