
zcoo (Contributor) commented Jul 24, 2025

Purpose

Coordinator Server supports high availability.

Linked issue: close #188

Brief change log

Tests

See: com.alibaba.fluss.server.coordinator.CoordinatorServerElectionTest

API and Format

Documentation

Currently this is implemented with one active (leader) coordinator server and several standby coordinator servers. When several coordinator servers start up at the same time, only one can successfully preempt the ZK node, win the election, and become the active coordinator server (leader). Once it fails over, the standby servers take over through a leader election process.
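
For context, a minimal sketch (assuming Apache Curator, a hypothetical znode path, and a numeric server id) of what "preempting the ZK node" could look like: the first server to create the ephemeral node becomes the active coordinator, and the others fall back to standby when the create fails.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;

import java.nio.charset.StandardCharsets;

public class CoordinatorLeaderPreemptionSketch {

    // Hypothetical znode path for illustration; the actual path in this PR may differ.
    private static final String ACTIVE_COORDINATOR_PATH = "/coordinators/active";

    /**
     * Tries to become the active coordinator by creating an ephemeral znode.
     * Returns true if this server won the election; false if another server
     * already holds the node, i.e. this server stays standby.
     */
    public static boolean tryBecomeActive(CuratorFramework client, int serverId) throws Exception {
        try {
            client.create()
                    .creatingParentsIfNeeded()
                    .withMode(CreateMode.EPHEMERAL)
                    .forPath(
                            ACTIVE_COORDINATOR_PATH,
                            String.valueOf(serverId).getBytes(StandardCharsets.UTF_8));
            return true;
        } catch (KeeperException.NodeExistsException e) {
            // Another coordinator server created the node first and is the active leader.
            return false;
        }
    }
}
```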

zcoo (Contributor Author) commented Jul 24, 2025

not ready for review

@zcoo zcoo force-pushed the 20250723_coordinator_ha branch 4 times, most recently from 3afaff9 to 5bf08aa Compare July 25, 2025 07:26
michaelkoepf (Contributor)

@zcoo if it is not ready for review, you can click Convert to draft in the right sidebar and, optionally, add a [WIP] prefix to the PR title.

zcoo (Contributor Author) commented Jul 28, 2025

@michaelkoepf I get it, thank you.

@zcoo zcoo marked this pull request as draft July 28, 2025 01:50
@zcoo zcoo force-pushed the 20250723_coordinator_ha branch 5 times, most recently from 830229d to 9d8f97e Compare July 29, 2025 01:44
@zcoo zcoo marked this pull request as ready for review July 29, 2025 07:10
zcoo (Contributor Author) commented Jul 29, 2025

Ready for review now! @wuchong @swuferhong

LiebingYu (Contributor) commented Jul 29, 2025

I have a question here. Do we need to properly handle coordinatorEpoch and coordinatorEpochZkVersion in this PR, just like Kafka does? In my opinion:

  • coordinatorEpoch prevents TabletServers from processing requests from zombie CoordinatorServers in the event of a split-brain scenario among CoordinatorServers.
  • coordinatorEpochZkVersion (KAFKA-6082) prevents zombie CoordinatorServers from modifying the ZK state during a split-brain scenario.

Should we take these two concerns into consideration in the same way as Kafka?
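
For illustration, a minimal sketch of what coordinatorEpoch fencing on a TabletServer could look like; the class, field, and method names here are hypothetical, not the actual Fluss API.

```java
// Hypothetical sketch of coordinator-epoch fencing on a TabletServer; not the actual Fluss API.
public class CoordinatorEpochFencingSketch {

    // Highest coordinator epoch this TabletServer has seen so far.
    private int latestCoordinatorEpoch = -1;

    /**
     * Returns true if the request should be processed, false if it carries a stale epoch
     * and therefore comes from a zombie CoordinatorServer.
     */
    public synchronized boolean validateAndMaybeUpdateEpoch(int requestCoordinatorEpoch) {
        if (requestCoordinatorEpoch < latestCoordinatorEpoch) {
            // Request from a coordinator that has already lost leadership: reject it.
            return false;
        }
        // Accept the request and remember the newest epoch.
        latestCoordinatorEpoch = requestCoordinatorEpoch;
        return true;
    }
}
```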

zcoo (Contributor Author) commented Jul 29, 2025

I have a question here. Do we need to properly handle coordinatorEpoch and coordinatorEpochZkVersion in this PR, just like Kafka does? In my opinion:

  • coordinatorEpoch prevents TabletServers from processing requests from zombie CoordinatorServers in the event of a split-brain scenario among CoordinatorServers.
  • coordinatorEpochZkVersion(KAFKA-6082) prevents zombie CoordinatorServers from modifying the ZK state during a split-brain scenario.

Should we take these two concerns into consideration in the same way as Kafka?

Thanks for the reminder. I considered coordinatorEpoch and coordinatorEpochZkVersion, but ultimately did not implement them in this PR to reduce its complexity. I think we can handle them later.

Do you have any suggestions?
@LB-Yu @wuchong @swuferhong

LiebingYu (Contributor)

I have a question here. Do we need to properly handle coordinatorEpoch and coordinatorEpochZkVersion in this PR, just like Kafka does? In my opinion:

  • coordinatorEpoch prevents TabletServers from processing requests from zombie CoordinatorServers in the event of a split-brain scenario among CoordinatorServers.
  • coordinatorEpochZkVersion(KAFKA-6082) prevents zombie CoordinatorServers from modifying the ZK state during a split-brain scenario.

Should we take these two concerns into consideration in the same way as Kafka?

Thanks for reminder. I considered coordinatorEpoch and coordinatorEpochZkVersion , but ultimately did not implement them in this PR to simplify its complexity. I think we can handle it later.

Do you have any suggestions? @LB-Yu @wuchong @swuferhong

IMO, we can split them into different PRs, but they should be merged together; otherwise, we might introduce other metadata inconsistency issues when introducing HA. What's your opinion?

Comment on lines 75 to 81
// Do not return, otherwise the leader will be released immediately.
while (true) {
    try {
        Thread.sleep(1000);
    } catch (InterruptedException e) {
    }
}
Contributor

Why don't we use LeaderLatch here if we need to hold the leadership?

Contributor Author

Very good suggestion! It seems LeaderLatch is more suitable in this scenario, and I will try to use it instead.
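
For reference, a minimal sketch of what a LeaderLatch-based election could look like, using Curator's LeaderLatch and LeaderLatchListener; the election path, the id format, and the callback wiring are assumptions for illustration, not the actual CoordinatorLeaderElection implementation in this PR.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;

public class LeaderLatchElectionSketch implements AutoCloseable {

    private final LeaderLatch leaderLatch;

    public LeaderLatchElectionSketch(CuratorFramework client, int serverId, Runnable onBecomeLeader) {
        // Hypothetical election path; the latch keeps the participant registered,
        // so there is no need for the busy-wait loop quoted above.
        this.leaderLatch =
                new LeaderLatch(client, "/coordinators/leader_election", String.valueOf(serverId));
        this.leaderLatch.addListener(
                new LeaderLatchListener() {
                    @Override
                    public void isLeader() {
                        // Invoked when this server wins the election, initially or after a failover.
                        onBecomeLeader.run();
                    }

                    @Override
                    public void notLeader() {
                        // Invoked when leadership is lost, e.g. after a ZK session expiration;
                        // leader-only services should be stopped or resigned here.
                    }
                });
    }

    public void start() throws Exception {
        leaderLatch.start();
    }

    @Override
    public void close() throws Exception {
        leaderLatch.close();
    }
}
```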

@zcoo zcoo force-pushed the 20250723_coordinator_ha branch 2 times, most recently from 84c1db7 to 18a3ff8 Compare August 8, 2025 06:55
Comment on lines 178 to 192
if (tryElectCoordinatorLeaderOnce()) {
    startCoordinatorLeaderService();
} else {
    // standby
    CoordinatorLeaderElection coordinatorLeaderElection =
            new CoordinatorLeaderElection(zkClient.getCuratorClient(), serverId);
    coordinatorLeaderElection.startElectLeader(
            () -> {
                try {
                    startCoordinatorLeaderService();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
}
Contributor

This logic looks strange to me. Why do we explicitly create the path for the first election, but use LeaderLatch for the standby Coordinator? Since we’re already using the framework, perhaps we could standardize on LeaderLatch?

Also, this logic seems problematic. If cs-1 becomes the leader on the first startup and later loses leadership due to a network issue, it appears it will never participate in the election again, because no LeaderLatch was registered for the node that became leader during the first startup.

Contributor Author

Yes, I think we don't need special logic in the first election round. I have changed it.

LiebingYu (Contributor)

Hi @zcoo. Overall, I have some concerns about several issues:

  1. The mechanisms for leader election, startup, and resignation are not sufficiently robust. For reference, Kafka Controller provides a comprehensive implementation for controller lifecycle management:
Startup:
  Register ControllerChangeHandler and check whether the /controller path exists
  elect()
    Get activeControllerId from ZK
    If activeControllerId != -1, it means another broker has already become the controller, so just return
    Attempt to register as controller in ZK: registerControllerAndIncrementControllerEpoch
    If successful: call onControllerFailover()
    If failed:
      If it’s ControllerMovedException, call maybeResign (election failed)
      Otherwise, call triggerControllerMove (startup process failed)

When /controller node is created or its data changes:
  Call maybeResign()

When /controller node is deleted:
  processReelect
    maybeResign
    elect
  2. There is no management of epochs. Combined with the lack of a leader resignation mechanism, this becomes extremely risky in split-brain scenarios: two leaders could be working simultaneously, leading to metadata inconsistencies (see the epoch-fencing sketch after this list).

  3. After introducing multiple coordinators, should we re-examine potential race conditions in metadata management and ZK access to avoid the risks associated with split-brain situations?
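
To make the epoch concern concrete, a minimal sketch of KAFKA-6082-style fencing using Curator's transaction API: every metadata write is bundled with a version check on the coordinator-epoch znode, so a zombie coordinator whose epoch znode has since been bumped cannot modify the ZK state. The paths and class name are hypothetical.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.api.transaction.CuratorOp;

public class EpochFencedZkWriterSketch {

    // Hypothetical znode, analogous to Kafka's /controller_epoch.
    private static final String COORDINATOR_EPOCH_PATH = "/coordinators/epoch";

    private final CuratorFramework client;
    // zkVersion of the epoch znode captured when this coordinator won the election.
    private final int expectedEpochZkVersion;

    public EpochFencedZkWriterSketch(CuratorFramework client, int expectedEpochZkVersion) {
        this.client = client;
        this.expectedEpochZkVersion = expectedEpochZkVersion;
    }

    /**
     * Writes metadata only if the epoch znode still has the expected version; if another
     * coordinator has incremented the epoch, the check op fails and ZooKeeper rejects the
     * whole transaction, fencing the zombie coordinator's write.
     */
    public void writeMetadata(String path, byte[] data) throws Exception {
        CuratorOp checkEpoch =
                client.transactionOp()
                        .check()
                        .withVersion(expectedEpochZkVersion)
                        .forPath(COORDINATOR_EPOCH_PATH);
        CuratorOp setData = client.transactionOp().setData().forPath(path, data);
        client.transaction().forOperations(checkEpoch, setData);
    }
}
```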

@polyzos polyzos force-pushed the main branch 3 times, most recently from d88c76c to 434a4f4 Compare August 31, 2025 15:13
@zcoo zcoo force-pushed the 20250723_coordinator_ha branch from 74aca0b to 85c6d38 Compare September 2, 2025 02:54
zcoo (Contributor Author) commented Sep 2, 2025

@LiebingYu Sure. I think we need to do these things, and I will introduce the epoch mechanism in this PR soon.

@zcoo zcoo force-pushed the 20250723_coordinator_ha branch from 0d260c9 to ca027a1 Compare October 20, 2025 07:23
@zcoo zcoo force-pushed the 20250723_coordinator_ha branch 2 times, most recently from c98d28a to 90730f1 Compare December 15, 2025 07:28
@zcoo zcoo force-pushed the 20250723_coordinator_ha branch from 90730f1 to 8539a53 Compare January 6, 2026 07:10
@zcoo zcoo force-pushed the 20250723_coordinator_ha branch from 8539a53 to 14c6203 Compare January 6, 2026 07:27
@zcoo zcoo force-pushed the 20250723_coordinator_ha branch from 14c6203 to c851189 Compare January 9, 2026 06:15