Upon being elected as master, prefer joins' node info to existing cluster state #19743

bleskes · 2016-08-02T11:45:13Z

When we introduced persistent node ids we were concerned that people may copy data folders from one node to another, resulting in two nodes competing for the same id in the cluster. To solve this we elected to not allow an incoming join if a different node with the same id already exists in the cluster, or if some other node already has the same transport address as the incoming join. The rationale there was that it is better to prefer existing nodes and that we can rely on node fault detection to remove any node from the cluster that isn't correct any more, making room for the node that wants to join (and will keep trying).

Sadly there were two problems with this:

One was minor and easy to fix: we didn't allow for the case where the existing node has the same network address as the incoming one but a different ephemeral id (after a node restart). This confused the logic in AllocationService in these rare cases. The cluster is good enough to detect this and recover later on, but it's not clean.
The other: the assumption that Node Fault Detection will clean up is wrong when the node just won an election (it wasn't master before) and needs to process the incoming joins in order to commit the cluster state and assume its mastership. In those cases, Node Fault Detection isn't active.

This PR fixes these two problems and prefers incoming nodes to existing nodes when finishing an election.
On top of that, at the request of @ywelsch, AllocationService synchronization between the nodes of the cluster and its routing table is now explicit rather than something we do all the time.
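
To make the new rule concrete, here is a minimal sketch of how a freshly elected master could resolve conflicts between joins and the existing cluster state. It is not the code from this PR: `ElectionJoinSketch`, `Node` and `applyJoinsOnElection` are hypothetical names, and `Node` is a simplified stand-in for `DiscoveryNode`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ElectionJoinSketch {

    /** Hypothetical, simplified stand-in for a cluster node (not the real DiscoveryNode). */
    static final class Node {
        final String id;          // persistent node id
        final String ephemeralId; // changes on every restart
        final String address;     // transport address

        Node(String id, String ephemeralId, String address) {
            this.id = id;
            this.ephemeralId = ephemeralId;
            this.address = address;
        }
    }

    /**
     * Assumed behaviour when finishing an election: every incoming join evicts any
     * existing node that shares its id or its transport address, then takes its place.
     */
    static Map<String, Node> applyJoinsOnElection(Map<String, Node> existing, List<Node> joins) {
        Map<String, Node> result = new LinkedHashMap<>(existing);
        for (Node join : joins) {
            result.values().removeIf(n -> n.id.equals(join.id) || n.address.equals(join.address));
            result.put(join.id, join);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Node> existing = new LinkedHashMap<>();
        existing.put("n1", new Node("n1", "e1", "10.0.0.1:9300"));
        existing.put("n2", new Node("n2", "e2", "10.0.0.2:9300"));

        List<Node> joins = new ArrayList<>();
        joins.add(new Node("n1", "e1-restarted", "10.0.0.1:9300")); // same id, new ephemeral id
        joins.add(new Node("n3", "e3", "10.0.0.2:9300"));           // new id, same address as n2

        Map<String, Node> merged = applyJoinsOnElection(existing, joins);
        System.out.println(merged.keySet()); // prints [n1, n3]
    }
}
```

In this sketch the stale `n2` entry is dropped right away instead of waiting for fault detection, which is the point of preferring joins while no fault detection is running.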


ywelsch · 2016-08-03T06:41:26Z

core/src/main/java/org/elasticsearch/cluster/node/DiscoveryNodes.java

+         * @param nodeId id of the wanted node
+         * @return wanted node if it exists. Otherwise <code>null</code>
+         */
+        public DiscoveryNode get(String nodeId) {


can you add @Nullable?

ywelsch · 2016-08-03T08:30:00Z

@bleskes I left some comments but the overall change looks good. While I think that separating out deassociateDeadNodes from reroute is great, I feel less enthusiastic about how we handle electPrimariesAndUnassignedDanglingReplicas. That method goes conceptually together with cancelShard (the method that moves shards to unassigned). I wonder if we can marry them together. How about leaving electPrimariesAndUnassignedDanglingReplicas in reroute for now and tackle that in a future PR?

ywelsch · 2016-08-03T08:34:54Z

core/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java

+     * unassigned an shards that are associated with nodes that are no longer part of the cluster, potentially promoting replicas
+     * if needed.
+     */
+    public RoutingAllocation.Result deassociateDeadNodes(ClusterState clusterState, boolean reroute, String reason) {


how about adding an overloaded version of the method where reroute is true (like we have for startedShards / failedShards)?

alternatively, we could make it even more explicit when reroute is not called, i.e. have deassociateDeadNodes always do the reroute and add a method deassociateDeadNodesWithoutReroute for the rare cases where we don't reroute.

tja. can do if you feel strongly about it. To me it feels a bit like an overkill

ok, let's leave it as is.
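
For reference, the overload suggested above (and not adopted) could look roughly like the sketch below. The types are hypothetical stand-ins: the real method takes a `ClusterState` and returns a `RoutingAllocation.Result`, while this version only returns the list of steps it would run.

```java
import java.util.ArrayList;
import java.util.List;

public class DeassociateOverloadSketch {

    /** Convenience overload: assumes the common case where a reroute should follow. */
    static List<String> deassociateDeadNodes(String reason) {
        return deassociateDeadNodes(true, reason);
    }

    /** Explicit variant: the caller decides whether a reroute follows. */
    static List<String> deassociateDeadNodes(boolean reroute, String reason) {
        List<String> steps = new ArrayList<>();
        steps.add("deassociate dead nodes (" + reason + ")");
        if (reroute) {
            steps.add("reroute");
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(deassociateDeadNodes("node left"));
        System.out.println(deassociateDeadNodes(false, "elected as master"));
    }
}
```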

bleskes · 2016-08-04T20:39:09Z

@ywelsch thanks. I pushed an update.

ywelsch · 2016-08-04T20:50:30Z

LGTM. Thanks @bleskes!

Slims the public interface of RoutingNodes down to 4 methods to update routing entries:
- initializeShard() -> initializes an unassigned shard
- startShard() -> starts an initializing shard / completes relocation of a shard
- relocateShard() -> starts relocation of a started shard
- failShard() -> fails/cancels an assigned shard

In the spirit of PR #19743, where deassociateDeadNodes was moved to its own public method to be only called when nodes have actually left the cluster and not on every reroute step, this commit also removes electPrimariesAndUnassignedDanglingReplicas from AllocationService and folds it into the shard failure logic. This means that an active replica is promoted to primary in the same method where the primary was failed. Previously we would scan in each reroute iteration for active replicas to be promoted to primary.
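
The four transitions, with replica promotion folded into the failure path, can be sketched as a small state machine. This is an illustrative simplification under stated assumptions, not the RoutingNodes implementation: `RoutingSketch`, `Shard` and `State` are hypothetical, a relocation is modelled on a single copy, and only STARTED copies count as active.

```java
import java.util.ArrayList;
import java.util.List;

public class RoutingSketch {
    enum State { UNASSIGNED, INITIALIZING, STARTED, RELOCATING }

    /** Hypothetical stand-in for one copy of a shard. */
    static final class Shard {
        State state = State.UNASSIGNED;
        boolean primary;
        Shard(boolean primary) { this.primary = primary; }
    }

    final List<Shard> copies = new ArrayList<>(); // all copies of a single shard id

    Shard add(boolean primary) {
        Shard s = new Shard(primary);
        copies.add(s);
        return s;
    }

    private static void require(boolean ok, String msg) {
        if (!ok) throw new IllegalStateException(msg);
    }

    void initializeShard(Shard s) {
        require(s.state == State.UNASSIGNED, "only an unassigned shard can be initialized");
        s.state = State.INITIALIZING;
    }

    void startShard(Shard s) {
        // starts an initializing shard, or completes a relocation (simplified)
        require(s.state == State.INITIALIZING || s.state == State.RELOCATING, "shard cannot be started");
        s.state = State.STARTED;
    }

    void relocateShard(Shard s) {
        require(s.state == State.STARTED, "only a started shard can be relocated");
        s.state = State.RELOCATING;
    }

    void failShard(Shard s) {
        require(s.state != State.UNASSIGNED, "only an assigned shard can be failed");
        s.state = State.UNASSIGNED;
        if (s.primary) {
            // promote an active replica in the same step, instead of scanning on every reroute
            for (Shard c : copies) {
                if (c != s && c.state == State.STARTED) {
                    c.primary = true;
                    s.primary = false;
                    break;
                }
            }
        }
    }

    public static void main(String[] args) {
        RoutingSketch r = new RoutingSketch();
        Shard primary = r.add(true);
        Shard replica = r.add(false);
        r.initializeShard(primary); r.startShard(primary);
        r.initializeShard(replica); r.startShard(replica);
        r.failShard(primary);
        System.out.println(replica.primary); // prints true
    }
}
```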

bleskes added 12 commits July 30, 2016 11:30

make election stop not be a failure

cdd2e8f

allow joining nodes, conflicting with existing nodes to elect a master

ea76e82

failing allocation test

ff2eabc

merge from master

35a07e0

trimming assigned shards to conflicting nodes

ef6b987

fix tests

e896bf3

fix tests

665c285

fix test

98d7a3d

fix local discovery

e6329af

allow for stale cluster states even if we didn't explictly remove nodes

f2d9d03

line lengths

6717c91

Merge remote-tracking branch 'upstream/master' into not_master_node_c…

f1a9857

…ollision

bleskes added the >bug, :Allocation, :Distributed Coordination/Discovery-Plugins and v5.0.0-beta1 labels Aug 2, 2016

bleskes assigned ywelsch Aug 2, 2016

ywelsch reviewed Aug 3, 2016

bleskes added 2 commits August 4, 2016 21:03

feedback

9eb88f3

merge from master

d1f7827

nullable first!

bbe5055

bleskes merged commit 609a199 into elastic:master Aug 5, 2016

bleskes deleted the not_master_node_collision branch August 5, 2016 06:58

ywelsch mentioned this pull request Aug 8, 2016

Simplify RoutingNodes interface #19870

Merged

lcawl added the :Distributed Indexing/Distributed label and removed the :Allocation label Feb 13, 2018
