-
Notifications
You must be signed in to change notification settings - Fork 14.9k
MINOR: Fix flaky test in ReplicaManagerTest #21244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MINOR: Fix flaky test in ReplicaManagerTest #21244
Conversation
Refer to apache#20082 (comment). Refactored the test to fix a race condition caused by dynamic Mockito stubbing during test execution. The previous implementation used `doReturn(false)` and `reset()` on a spy object while a background thread was running, causing a `ClassCastException`. This patch replaces that logic with a thread-safe `AtomicBoolean` and `doAnswer` approach to toggle the mock's behavior safely.
| try { | ||
| val spiedPartition = spy(Partition(tpId, time, replicaManager)) | ||
|
|
||
| val blockPromotion = new AtomicBoolean(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will the following changes be sufficient?
val newReplicaFolder = replicaManager.logManager.liveLogDirs.filterNot(_ == firstLogDir).head
// Prevent promotion of future replica
doReturn(false).when(spiedPartition).maybeReplaceCurrentWithFutureReplica()
replicaManager.alterReplicaLogDirs(Map(tp -> newReplicaFolder.getAbsolutePath))There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the spied call might still not be visible to the ReplicaAlterLogDirsThread. IIUC, the visibility across threads can only be enforced through some synchronization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gaurav-narula for the explanation on visibility!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the change @Parkerhiphop. You might want to initialise this as new AtomicBoolean(true) and avoid invoking blockPromotion.set(true) explicitly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gaurav-narula Thanks! That's a great refinement! It makes the code much cleaner.
I have updated the code to new AtomicBoolean(true) and removed the explicit set(true) call.
The logic remains the same (blocking the first attempt via CAS), and I verified it with the loop test again (same result as before)
gaurav-narula
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. Left some comments
|
|
||
| // Prevent promotion of future replica | ||
| doReturn(false).when(spiedPartition).maybeReplaceCurrentWithFutureReplica() | ||
| blockPromotion.set(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is racey. What if the test thread is interrupted and ReplicaAlterDirsThread invokes spiedPartition.maybeReplaceCurrentWithFutureReplica() before this line executes?
Perhaps consider a CAS within doAnswer instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to use CAS if (blockPromotion.compareAndSet(true, false))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the CAS suggestion.
I have addressed the race condition and simplified the logic as suggested:
- Execution Order: Moved
blockPromotion.set(true)beforealterReplicaLogDirsto ensure the flag is set before the background thread starts. - CAS: Adopted
compareAndSet(true, false)withindoAnswerto automatically unblock the promotion after the first attempt.
To verify the stability, I ran the following test command with no failures (also updated in the PR description):
Test Command
N=100; I=0; while [ $I -lt $N ] && ./gradlew cleanTest core:test --tests ReplicaManagerTest -PmaxParallelForks=4 \
; do (( I=$I+1 )); echo "Completed run: $I"; sleep 1; done
Test Result
BUILD SUCCESSFUL in 12s
151 actionable tasks: 2 executed, 149 up-to-date
Consider enabling configuration cache to speed up this build: https://docs.gradle.org/9.2.1/userguide/configuration_cache_enabling.html
Completed run: 100
| try { | ||
| val spiedPartition = spy(Partition(tpId, time, replicaManager)) | ||
|
|
||
| val blockPromotion = new AtomicBoolean(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the spied call might still not be visible to the ReplicaAlterLogDirsThread. IIUC, the visibility across threads can only be enforced through some synchronization.
8351ce6 to
99dc779
Compare
|
Apologies for the force push. |
|
the fatal error is caused by |
Refer to
#20082 (comment).
Refactored the test to fix a race condition caused by dynamic Mockito
stubbing during test execution.
The previous implementation used
doReturn(false)andreset()on aspy object while a background thread was running, causing a
ClassCastException.This patch replaces that logic with a thread-safe
AtomicBooleananddoAnswerapproach to toggle the mock's behavior safely.Test Command
Test Result
Reviewers: Gaurav Narula gaurav_narula2@apple.com, Chia-Ping Tsai
chia7712@gmail.com, PoAn Yang payang@apache.org