Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent coordinator from getting stuck if leadership changes during coordinator run #14385

Merged
merged 2 commits into from
Jun 8, 2023

Conversation

kfaraz
Copy link
Contributor

@kfaraz kfaraz commented Jun 7, 2023

Description

If the current leader coordinator is asked to stop being leader, the following happens:

  • The DruidCoordinator.balancerExec (used for strategy cost computations) is shutdown
  • The currently running duty finishes execution normally and no more duties are executed
  • An exception to this is the BalanceSegments duty, which can exit abnormally or even get stuck in the race conditions explained below.

✅ Case 1: balancerExec.submit() after balancerExec.shutdown(), BalanceSegments exits abnormally

Typical sequence of events:

  • Current coordinator stops being leader and balancerExec is shutdown
  • CostBalancerStrategy.findNewSegmentHomeBalancer() or any other strategy method is invoked
  • balancerExec.submit() is invoked with computeCost() tasks
  • Since the executor has already been shutdown, submission of new tasks throws a RejectedExecutionException and ends the coordinator run as desired

❌ Case 2: balancerExec.submit() before balancerExec.shutdown(), BalanceSegments gets stuck

Typical sequence of events:

  • BalanceSegments duty is in progress
  • CostBalancerStrategy.findNewSegmentHomeBalancer() is invoked for some segment
  • computeCost() tasks for, say 5 servers, are submitted to the executor
  • Executor picks up 3 of these tasks and starts executing them
  • Coordinator stops being leader and balancerExec is shutdown
  • Since the computeCost() tasks do not handle interrupts, the 3 picked up tasks finish execution normally
  • But the 2 remaining tasks are never picked up by the executor as it is already shutdown
  • The method findNewSegmentHomeBalancer waits indefinitely for the futures to finish

✅ Case 3: Change in balancerComputeThreads dynamic config

A change in this config also results in a shutdown of the balancerExec. But this shutdown is never done concurrently with the coordinator duties and thus doesn't cause the coordinator to get stuck.

Changes

  • Add a timeout of 1 minute to the resultFuture.get(). 1 minute is the typical time for a full coordinator run and is more than enough time for cost computations of a single segment.
  • Raise an alert if an exception is encountered while computing costs and if the executor has not been shutdown. This is because a shutdown is intentional and does not require an alert.

@kfaraz kfaraz merged commit 12e8fa5 into apache:master Jun 8, 2023
@kfaraz kfaraz deleted the fix_balancer_future branch June 8, 2023 09:59
@abhishekagarwal87 abhishekagarwal87 added this to the 27.0 milestone Jul 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants