
zcoo (Contributor) commented Jul 24, 2025

Purpose

Coordinator Server supports high availability.

Linked issue: close #188

Brief change log

Tests

See: com.alibaba.fluss.server.coordinator.CoordinatorServerElectionTest

API and Format

Documentation

Currently this is implemented with one active (leader) coordinator server and several standby coordinator servers. When several coordinator servers start up at the same time, only one can successfully preempt the ZK node, win the election, and become the active coordinator server (leader). Once it fails over, the standby servers take over through a leader election process.
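
For context, a minimal sketch (assuming Apache Curator, a hypothetical znode path, and a numeric server id) of what "preempting the ZK node" could look like: the first server to create the ephemeral node becomes the active coordinator, and the others fall back to standby when the create fails.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;

import java.nio.charset.StandardCharsets;

public class CoordinatorLeaderPreemptionSketch {

    // Hypothetical znode path for illustration; the actual path in this PR may differ.
    private static final String ACTIVE_COORDINATOR_PATH = "/coordinators/active";

    /**
     * Tries to become the active coordinator by creating an ephemeral znode.
     * Returns true if this server won the election; false if another server
     * already holds the node, i.e. this server stays standby.
     */
    public static boolean tryBecomeActive(CuratorFramework client, int serverId) throws Exception {
        try {
            client.create()
                    .creatingParentsIfNeeded()
                    .withMode(CreateMode.EPHEMERAL)
                    .forPath(
                            ACTIVE_COORDINATOR_PATH,
                            String.valueOf(serverId).getBytes(StandardCharsets.UTF_8));
            return true;
        } catch (KeeperException.NodeExistsException e) {
            // Another coordinator server created the node first and is the active leader.
            return false;
        }
    }
}
```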

zcoo (Contributor Author) commented Jul 24, 2025

not ready for review

@zcoo zcoo force-pushed the 20250723_coordinator_ha branch 4 times, most recently from 3afaff9 to 5bf08aa Compare July 25, 2025 07:26
michaelkoepf (Contributor)

@zcoo if it is not ready for review, you can click Convert to draft in the right sidebar and, optionally, add a [WIP] prefix to the PR title.

zcoo (Contributor Author) commented Jul 28, 2025

@michaelkoepf I get it, thank you.

@zcoo zcoo marked this pull request as draft July 28, 2025 01:50
@zcoo zcoo force-pushed the 20250723_coordinator_ha branch 5 times, most recently from 830229d to 9d8f97e Compare July 29, 2025 01:44
@zcoo zcoo marked this pull request as ready for review July 29, 2025 07:10
zcoo (Contributor Author) commented Jul 29, 2025

Ready for review now! @wuchong @swuferhong

LiebingYu (Contributor) commented Jul 29, 2025

I have a question here. Do we need to properly handle coordinatorEpoch and coordinatorEpochZkVersion in this PR, just like Kafka does? In my opinion:

  • coordinatorEpoch prevents TabletServers from processing requests from zombie CoordinatorServers in the event of a split-brain scenario among CoordinatorServers.
  • coordinatorEpochZkVersion (KAFKA-6082) prevents zombie CoordinatorServers from modifying the ZK state during a split-brain scenario.

Should we take these two concerns into consideration in the same way as Kafka?
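
For illustration, a minimal sketch of what coordinatorEpoch fencing on a TabletServer could look like; the class, field, and method names here are hypothetical, not the actual Fluss API.

```java
// Hypothetical sketch of coordinator-epoch fencing on a TabletServer; not the actual Fluss API.
public class CoordinatorEpochFencingSketch {

    // Highest coordinator epoch this TabletServer has seen so far.
    private int latestCoordinatorEpoch = -1;

    /**
     * Returns true if the request should be processed, false if it carries a stale epoch
     * and therefore comes from a zombie CoordinatorServer.
     */
    public synchronized boolean validateAndMaybeUpdateEpoch(int requestCoordinatorEpoch) {
        if (requestCoordinatorEpoch < latestCoordinatorEpoch) {
            // Request from a coordinator that has already lost leadership: reject it.
            return false;
        }
        // Accept the request and remember the newest epoch.
        latestCoordinatorEpoch = requestCoordinatorEpoch;
        return true;
    }
}
```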

zcoo (Contributor Author) commented Jul 29, 2025

I have a question here. Do we need to properly handle coordinatorEpoch and coordinatorEpochZkVersion in this PR, just like Kafka does? In my opinion:

  • coordinatorEpoch prevents TabletServers from processing requests from zombie CoordinatorServers in the event of a split-brain scenario among CoordinatorServers.
  • coordinatorEpochZkVersion(KAFKA-6082) prevents zombie CoordinatorServers from modifying the ZK state during a split-brain scenario.

Should we take these two concerns into consideration in the same way as Kafka?

Thanks for the reminder. I considered coordinatorEpoch and coordinatorEpochZkVersion, but ultimately did not implement them in this PR to reduce its complexity. I think we can handle them later.

Do you have any suggestions?
@LB-Yu @wuchong @swuferhong

LiebingYu (Contributor)

I have a question here. Do we need to properly handle coordinatorEpoch and coordinatorEpochZkVersion in this PR, just like Kafka does? In my opinion:

  • coordinatorEpoch prevents TabletServers from processing requests from zombie CoordinatorServers in the event of a split-brain scenario among CoordinatorServers.
  • coordinatorEpochZkVersion(KAFKA-6082) prevents zombie CoordinatorServers from modifying the ZK state during a split-brain scenario.

Should we take these two concerns into consideration in the same way as Kafka?

Thanks for reminder. I considered coordinatorEpoch and coordinatorEpochZkVersion , but ultimately did not implement them in this PR to simplify its complexity. I think we can handle it later.

Do you have any suggestions? @LB-Yu @wuchong @swuferhong

IMO, we can split them into different PRs, but they should be merged together; otherwise, we might introduce other metadata inconsistency issues when introducing HA. What's your opinion?

Comment on lines 75 to 81
// Do not return, otherwise the leader will be released immediately.
while (true) {
    try {
        Thread.sleep(1000);
    } catch (InterruptedException e) {
    }
}
Contributor

Why don't we use LeaderLatch here if we need to hold the leadership?

Contributor Author

Very good suggestion! It seems LeaderLatch is more suitable in this scenario, and I will try to use it instead.
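
For reference, a minimal sketch of what a LeaderLatch-based election could look like, using Curator's LeaderLatch and LeaderLatchListener; the election path, the id format, and the callback wiring are assumptions for illustration, not the actual CoordinatorLeaderElection implementation in this PR.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;

public class LeaderLatchElectionSketch implements AutoCloseable {

    private final LeaderLatch leaderLatch;

    public LeaderLatchElectionSketch(CuratorFramework client, int serverId, Runnable onBecomeLeader) {
        // Hypothetical election path; the latch keeps the participant registered,
        // so there is no need for the busy-wait loop quoted above.
        this.leaderLatch =
                new LeaderLatch(client, "/coordinators/leader_election", String.valueOf(serverId));
        this.leaderLatch.addListener(
                new LeaderLatchListener() {
                    @Override
                    public void isLeader() {
                        // Invoked when this server wins the election, initially or after a failover.
                        onBecomeLeader.run();
                    }

                    @Override
                    public void notLeader() {
                        // Invoked when leadership is lost, e.g. after a ZK session expiration;
                        // leader-only services should be stopped or resigned here.
                    }
                });
    }

    public void start() throws Exception {
        leaderLatch.start();
    }

    @Override
    public void close() throws Exception {
        leaderLatch.close();
    }
}
```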

@zcoo zcoo force-pushed the 20250723_coordinator_ha branch 2 times, most recently from 84c1db7 to 18a3ff8 Compare August 8, 2025 06:55
Comment on lines 178 to 192
if (tryElectCoordinatorLeaderOnce()) {
    startCoordinatorLeaderService();
} else {
    // standby
    CoordinatorLeaderElection coordinatorLeaderElection =
            new CoordinatorLeaderElection(zkClient.getCuratorClient(), serverId);
    coordinatorLeaderElection.startElectLeader(
            () -> {
                try {
                    startCoordinatorLeaderService();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
}
Contributor

This logic looks strange to me. Why do we explicitly create the path for the first election, but use LeaderLatch for the standby Coordinator? Since we’re already using the framework, perhaps we could standardize on LeaderLatch?

Also, this logic seems problematic. If cs-1 becomes the leader on the first startup and later loses leadership due to a network issue, it appears it will never participate in the election again, because no LeaderLatch was registered for the node that became leader during the first startup.

Contributor Author

Yes, I think we don't need special logic in the first election round. I have changed it.

LiebingYu (Contributor)

Hi @zcoo. Overall, I have some concerns about several issues:

  1. The mechanisms for leader election, startup, and resignation are not sufficiently robust. For reference, Kafka Controller provides a comprehensive implementation for controller lifecycle management:
Startup:
  Register ControllerChangeHandler and check whether the /controller path exists
  elect()
    Get activeControllerId from ZK
    If activeControllerId != -1, it means another broker has already become the controller, so just return
    Attempt to register as controller in ZK: registerControllerAndIncrementControllerEpoch
    If successful: call onControllerFailover()
    If failed:
      If it’s ControllerMovedException, call maybeResign (election failed)
      Otherwise, call triggerControllerMove (startup process failed)

When /controller node is created or its data changes:
  Call maybeResign()

When /controller node is deleted:
  processReelect
    maybeResign
    elect
  2. There is no management of epochs. Combined with the lack of a leader resignation mechanism, this becomes extremely risky in split-brain scenarios: two leaders could be working simultaneously, leading to metadata inconsistencies (see the epoch-fencing sketch after this list).

  3. After introducing multiple coordinators, should we re-examine potential race conditions in metadata management and ZK access to avoid the risks associated with split-brain situations?
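
To make the epoch concern concrete, a minimal sketch of KAFKA-6082-style fencing using Curator's transaction API: every metadata write is bundled with a version check on the coordinator-epoch znode, so a zombie coordinator whose epoch znode has since been bumped cannot modify the ZK state. The paths and class name are hypothetical.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.api.transaction.CuratorOp;

public class EpochFencedZkWriterSketch {

    // Hypothetical znode, analogous to Kafka's /controller_epoch.
    private static final String COORDINATOR_EPOCH_PATH = "/coordinators/epoch";

    private final CuratorFramework client;
    // zkVersion of the epoch znode captured when this coordinator won the election.
    private final int expectedEpochZkVersion;

    public EpochFencedZkWriterSketch(CuratorFramework client, int expectedEpochZkVersion) {
        this.client = client;
        this.expectedEpochZkVersion = expectedEpochZkVersion;
    }

    /**
     * Writes metadata only if the epoch znode still has the expected version; if another
     * coordinator has incremented the epoch, the check op fails and ZooKeeper rejects the
     * whole transaction, fencing the zombie coordinator's write.
     */
    public void writeMetadata(String path, byte[] data) throws Exception {
        CuratorOp checkEpoch =
                client.transactionOp()
                        .check()
                        .withVersion(expectedEpochZkVersion)
                        .forPath(COORDINATOR_EPOCH_PATH);
        CuratorOp setData = client.transactionOp().setData().forPath(path, data);
        client.transaction().forOperations(checkEpoch, setData);
    }
}
```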

@polyzos polyzos force-pushed the main branch 3 times, most recently from d88c76c to 434a4f4 Compare August 31, 2025 15:13
@zcoo zcoo force-pushed the 20250723_coordinator_ha branch from 74aca0b to 85c6d38 Compare September 2, 2025 02:54
zcoo (Contributor Author) commented Sep 2, 2025

@LiebingYu Sure. I think we need to do these things, and I will introduce the epoch mechanism in this PR soon.

@zcoo zcoo force-pushed the 20250723_coordinator_ha branch from 0d260c9 to ca027a1 Compare October 20, 2025 07:23
@zcoo zcoo force-pushed the 20250723_coordinator_ha branch 2 times, most recently from c98d28a to 90730f1 Compare December 15, 2025 07:28
@zcoo zcoo force-pushed the 20250723_coordinator_ha branch from 90730f1 to 8539a53 Compare January 6, 2026 07:10
@zcoo zcoo force-pushed the 20250723_coordinator_ha branch from 8539a53 to 14c6203 Compare January 6, 2026 07:27
@zcoo zcoo force-pushed the 20250723_coordinator_ha branch from 14c6203 to c851189 Compare January 9, 2026 06:15