KAFKA-19458: resume cleaning on future replica dir change #20082

gaurav-narula · 2025-07-01T16:12:16Z

ReplicaManager#alterReplicaLogDirs does not resume the log cleaner when it handles an AlterReplicaLogDirs request for a topic partition that already has an AlterReplicaLogDirs in progress. This leads to a resource leak: cleaning for the topic partition remains paused even after the log directory has been altered.

This change ensures we invoke LogManager#resumeCleaning if the future replica directory has changed.

github-actions · 2025-07-09T03:32:58Z

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

gaurav-narula · 2025-07-14T09:56:04Z

CC: @junrao @chia7712 can you please take a look?

github-actions · 2025-07-16T03:46:45Z

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

junrao

@gaurav-narula : Thanks for the PR. Nice catch! Left a couple of comments.

junrao · 2025-07-16T22:17:30Z

core/src/main/scala/kafka/server/ReplicaManager.scala

              if (partition.futureReplicaDirChanged(destinationDir)) {
                replicaAlterLogDirsManager.removeFetcherForPartitions(Set(topicPartition))
                partition.removeFutureLocalReplica()
+                logManager.resumeCleaning(topicPartition)


In theory, it's possible that immediately after the partition.futureReplicaDirChanged check, the future replica catches up and is promoted to the current replica. During that promotion, logManager.resumeCleaning() has already been called, so calling it a second time here will lead to an IllegalStateException. We could check for the future replica again after replicaAlterLogDirsManager.removeFetcherForPartitions(). At that point, if the future replica still exists, it is guaranteed to remain there afterward.
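The re-check suggested above can be sketched in isolation. This is a minimal, self-contained illustration and not the code that was merged: `CleanerState`, `PartitionSketch`, `RecheckSketch`, the `stopFetcher` hook, and the Boolean return value of `removeFutureLocalReplica` are all hypothetical stand-ins. The only behaviour taken from the comment is that resuming cleaning for a partition that is not paused fails with an IllegalStateException.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the cleaner bookkeeping; not Kafka's LogManager.
final class CleanerState {
  private val paused = mutable.Set.empty[String]

  def pauseCleaning(tp: String): Unit = synchronized { paused += tp }

  // As described in the review comment: resuming when not paused is an error.
  def resumeCleaning(tp: String): Unit = synchronized {
    if (!paused.remove(tp))
      throw new IllegalStateException(s"Cleaning for $tp is not paused")
  }

  def isPaused(tp: String): Boolean = synchronized { paused.contains(tp) }
}

// Hypothetical stand-in for a partition with an optional future replica.
final class PartitionSketch(tp: String, cleaner: CleanerState) {
  private var futureDir: Option[String] = None

  // Starting a move pauses cleaning and records the destination directory.
  def startMove(dir: String): Unit = synchronized {
    cleaner.pauseCleaning(tp)
    futureDir = Some(dir)
  }

  def futureReplicaDirChanged(newDir: String): Boolean = synchronized {
    futureDir.exists(_ != newDir)
  }

  // Promotion path: drops the future replica and resumes cleaning itself.
  def promoteFuture(): Unit = synchronized {
    if (futureDir.isDefined) {
      futureDir = None
      cleaner.resumeCleaning(tp)
    }
  }

  // Returns true only if a future replica was actually removed (assumed API).
  def removeFutureLocalReplica(): Boolean = synchronized {
    val existed = futureDir.isDefined
    futureDir = None
    existed
  }
}

object RecheckSketch {
  // stopFetcher stands in for removing the fetcher; a promotion may still
  // complete before it returns, which is the race described above.
  def cancelMove(p: PartitionSketch, cleaner: CleanerState, tp: String,
                 newDir: String, stopFetcher: () => Unit): Unit = {
    if (p.futureReplicaDirChanged(newDir)) {
      stopFetcher()
      // Re-check: resume only if the future replica was still there.
      if (p.removeFutureLocalReplica()) cleaner.resumeCleaning(tp)
    }
  }
}
```

In this sketch, passing `() => p.promoteFuture()` as `stopFetcher` simulates the promotion winning the race. The re-check then sees no future replica and skips the second `resumeCleaning` call, whereas an unconditional call at that point would throw.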

junrao · 2025-07-16T22:28:28Z

core/src/main/scala/kafka/log/LogManager.scala

                 remoteStorageSystemEnable: Boolean,
-                 val initialTaskDelayMs: Long) extends Logging {
+                 val initialTaskDelayMs: Long,
+                 cleanerFactory: (CleanerConfig, util.List[File], ConcurrentMap[TopicPartition, UnifiedLog], LogDirFailureChannel, Time) => LogCleaner = (cleanerConfig, files, map, logDirFailureChannel, time) => new LogCleaner(cleanerConfig, files, map, logDirFailureChannel, time)) extends Logging {


Quite a long line. Could we format it better?

junrao

@gaurav-narula : Thanks for the updated PR. LGTM

`ReplicaManager#alterReplicaLogDirs` does not resume log cleaner while handling an `AlterReplicaLogDirs` request for a topic partition which already has an `AlterReplicaLogDirs` in progress, leading to a resource leak where the cleaning for topic partitions remains paused even after the log directory has been altered. This change ensures we invoke `LogManager#resumeCleaning` if the future replica directory has changed.

Reviewers: Jun Rao <junrao@gmail.com>

chia7712 · 2025-12-22T12:24:01Z

core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala

+      replicaManager.alterReplicaLogDirs(Map(tp -> newReplicaFolder.getAbsolutePath))
+
+      // Prevent promotion of future replica
+      doReturn(false).when(spiedPartition).maybeReplaceCurrentWithFutureReplica()


hi all,

java.lang.ClassCastException: class java.lang.Boolean cannot be cast to class org.apache.kafka.storage.internals.log.UnifiedLog (java.lang.Boolean is in module java.base of loader 'bootstrap'; org.apache.kafka.storage.internals.log.UnifiedLog is in unnamed module of loader 'app')
    at kafka.cluster.Partition.futureLocalLogOrException(Partition.scala:408)
    at kafka.server.ReplicaManager.futureLocalLogOrException(ReplicaManager.scala:587)
    at kafka.server.ReplicaManagerTest.testReplicaAlterLogDirsMultipleReassignmentDoesNotBlockLogCleaner(ReplicaManagerTest.scala:5534)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)

The replica thread interfered with the doReturn(false) stubbing, causing Mockito to apply the stubbed return value to the wrong method.

You can reproduce the error with the following command:

N=100; I=0; while [ $I -lt $N ] && ./gradlew cleanTest core:test --tests ReplicaManagerTest -PmaxParallelForks=4 \
; do (( I=$I+1 )); echo "Completed run: $I"; sleep 1; done

Will file a patch for it 😄

Hi, I've opened #21244 to fix this flaky test.
Thanks for the detailed explanation here, it was very helpful!

Refer to #20082 (comment).

Refactored the test to fix a race condition caused by dynamic Mockito stubbing during test execution. The previous implementation used `doReturn(false)` and `reset()` on a spy object while a background thread was running, causing a `ClassCastException`. This patch replaces that logic with a thread-safe `AtomicBoolean` and `doAnswer` approach to toggle the mock's behavior safely.

## Test Command

```
N=100; I=0; while [ $I -lt $N ] && ./gradlew cleanTest core:test --tests ReplicaManagerTest -PmaxParallelForks=4 \
; do (( I=$I+1 )); echo "Completed run: $I"; sleep 1; done
```

## Test Result

```
BUILD SUCCESSFUL in 12s
151 actionable tasks: 2 executed, 149 up-to-date
Consider enabling configuration cache to speed up this build: https://docs.gradle.org/9.2.1/userguide/configuration_cache_enabling.html
Completed run: 100
```

Reviewers: Gaurav Narula <gaurav_narula2@apple.com>, Chia-Ping Tsai <chia7712@gmail.com>, PoAn Yang <payang@apache.org>
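The idea behind the fix, installing the overriding behaviour once before any background thread starts and flipping only an `AtomicBoolean` afterwards, can be shown without Mockito. The sketch below is a Mockito-free illustration of that toggle and not the test code from #21244: `PartitionLike`, `GatedPartition`, and `ToggleSketch` are hypothetical names, and a plain subclass takes the place of the spy with `doAnswer`.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical stand-in for the spied partition; not Kafka's Partition class.
class PartitionLike {
  // Pretend the real promotion always succeeds.
  def maybeReplaceCurrentWithFutureReplica(): Boolean = true
}

// The override is fixed at construction time, so nothing is re-stubbed while
// another thread is calling it. Only the flag changes afterwards.
final class GatedPartition(blockPromotion: AtomicBoolean) extends PartitionLike {
  override def maybeReplaceCurrentWithFutureReplica(): Boolean =
    if (blockPromotion.get()) false
    else super.maybeReplaceCurrentWithFutureReplica()
}

object ToggleSketch {
  // Starts a background thread that polls until promotion succeeds,
  // standing in for the replica thread mentioned in the thread above.
  def runUntilPromoted(p: PartitionLike): Thread = {
    val t = new Thread(() => {
      while (!p.maybeReplaceCurrentWithFutureReplica()) Thread.sleep(1)
    })
    t.start()
    t
  }
}
```

While the flag is `true` the background thread keeps polling. Calling `set(false)` on the flag from the test thread lets the next poll reach the real method, and no stubbing call ever overlaps with a concurrent invocation.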

github-actions bot added the triage (PRs from the community), core (Kafka Broker), and small (Small PRs) labels Jul 1, 2025

junrao added the ci-approved label Jul 8, 2025

github-actions bot added the needs-attention label Jul 9, 2025

gaurav-narula force-pushed the KAFKA-19458 branch from 28dc019 to 83b70cf July 14, 2025 16:02

github-actions bot removed the needs-attention label Jul 15, 2025

github-actions bot added the needs-attention label Jul 16, 2025

junrao reviewed Jul 16, 2025

github-actions bot removed the needs-attention and triage (PRs from the community) labels Jul 17, 2025

gaurav-narula force-pushed the KAFKA-19458 branch from 83b70cf to 0b427e4 July 17, 2025 17:01

github-actions bot removed the small (Small PRs) label Jul 17, 2025

Address review comments

9773727

gaurav-narula force-pushed the KAFKA-19458 branch from 0b427e4 to 9773727 July 17, 2025 17:03

junrao approved these changes Jul 17, 2025

junrao merged commit 12761c0 into apache:trunk Jul 17, 2025
20 checks passed

chia7712 reviewed Dec 22, 2025

Parkerhiphop mentioned this pull request Jan 4, 2026

MINOR: Fix flaky test in ReplicaManagerTest #21244

Merged
