Prevent coordinator from getting stuck if leadership changes during coordinator run #14385
Description
If the current leader coordinator is asked to stop being leader, the following happens:
- `DruidCoordinator.balancerExec` (used for strategy cost computations) is shut down.
- This affects the `BalanceSegments` duty, which can exit abnormally or even get stuck in the race conditions explained below.

✅ Case 1: `balancerExec.submit()` after `balancerExec.shutdown()`, `BalanceSegments` exits abnormally

Typical sequence of events:
- `balancerExec` is shut down.
- `CostBalancerStrategy.findNewSegmentHomeBalancer()` or any other strategy method is invoked.
- `balancerExec.submit()` is invoked with `computeCost()` tasks.
- This throws a `RejectedExecutionException` and ends the coordinator run as desired.

❌ Case 2: `balancerExec.submit()` before `balancerExec.shutdown()`, `BalanceSegments` gets stuck

Typical sequence of events:
- `BalanceSegments` duty is in progress.
- `CostBalancerStrategy.findNewSegmentHomeBalancer()` is invoked for some segment.
- `computeCost()` tasks for, say, 5 servers are submitted to the executor.
- `balancerExec` is shut down.
- Since `computeCost()` tasks do not handle interrupts, the 3 tasks already picked up finish execution normally.
- `findNewSegmentHomeBalancer` waits indefinitely for the futures to finish.

✅ Case 3: Change in `balancerComputeThreads` dynamic config

A change in this config also results in a shutdown of the `balancerExec`. But this shutdown is never done concurrently with the coordinator duties and thus doesn't cause the coordinator to get stuck.

Changes
- Add a timeout of 1 minute to `resultFuture.get()`. 1 minute is the typical time for a full coordinator run and is more than enough time for cost computations of a single segment.
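
The two races and the effect of a bounded `get()` can be reproduced outside Druid with a plain `ExecutorService`. The sketch below is illustrative only and is not the code from this PR: the class and method names are hypothetical, a single-threaded pool stands in for `balancerExec`, and it assumes the executor is stopped with `shutdownNow()` (the mention of interrupts suggests this, but the description does not state it). It also uses a short timeout instead of 1 minute so that it finishes quickly.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; none of these names come from the Druid code base.
class BalancerExecShutdownDemo {

    // Case 1: submitting to an executor that is already shut down is rejected.
    static boolean submitAfterShutdownIsRejected() {
        ExecutorService exec = Executors.newFixedThreadPool(1);
        exec.shutdown();
        try {
            exec.submit((Callable<Double>) () -> 1.0);
            return false;
        } catch (RejectedExecutionException e) {
            return true;
        }
    }

    // Case 2: a task queued before shutdownNow() is never run, so its future
    // never completes. A bounded get() turns the hang into a TimeoutException.
    static String getWithTimeoutAfterShutdownNow(long timeoutMillis) throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(1);
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // Occupies the only worker thread and ignores interrupts,
        // standing in for a cost computation that is already running.
        Future<Double> running = exec.submit(() -> {
            started.countDown();
            boolean released = false;
            while (!released) {
                try {
                    release.await();
                    released = true;
                } catch (InterruptedException ignored) {
                    // Keep going: this task does not handle interrupts.
                }
            }
            return 1.0;
        });

        // Stays in the queue because the single worker is busy.
        Future<Double> queued = exec.submit(() -> 2.0);

        started.await();
        exec.shutdownNow();   // interrupts the worker and drains the queue
        release.countDown();
        running.get();        // the picked-up task still finishes normally

        try {
            queued.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return "completed";
        } catch (TimeoutException e) {
            return "timed out";
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(submitAfterShutdownIsRejected());
        System.out.println(getWithTimeoutAfterShutdownNow(200));
    }
}
```

Calling `queued.get()` without a timeout in the second method would block forever, which mirrors the stuck coordinator in Case 2. With the timeout, the caller gets a `TimeoutException` and can end the run.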