Skip to content

Conversation

@bleskes
Copy link
Contributor

@bleskes bleskes commented Jul 30, 2016

During our master elections, nodes "vote" for a master being issuing a join request to it. Since this is done in an async fashion, joins may arrive before the master itself has realized it had won the election. Therefore we start accumulating node joins on every node at election start (we don't know the result yet). When the election finish nodes that did not become the master (i.e., joined another node which won the election) need to potentially process and fail any incoming join request they may have received during the election. This is currently achieved by always issuing a cluster state update task that is doomed to fail, even if no pending joins are actually there. That aspect results in confusing (debug) log messages, making it seems like something is wrong. For example (note that NotMasterException)

[2016-07-30 22:25:53,040][DEBUG][cluster.service          ] [node_t1] processing [zen-disco-process-pending-joins [{node_t0}{4SqBTyYNQ82J9c75Cs7jtg}{kutaNSYbTZCSybvqczgWCA}{127.0.0.1}{127.0.0.1:9400} elected]]: execute
[2016-07-30 22:25:53,041][DEBUG][transport                ] [node_t1] connected to node [{node_t0}{4SqBTyYNQ82J9c75Cs7jtg}{kutaNSYbTZCSybvqczgWCA}{127.0.0.1}{127.0.0.1:9400}]
[2016-07-30 22:25:53,045][DEBUG][cluster.service          ] [node_t1] cluster state update task [zen-disco-process-pending-joins [{node_t0}{4SqBTyYNQ82J9c75Cs7jtg}{kutaNSYbTZCSybvqczgWCA}{127.0.0.1}{127.0.0.1:9400} elected]] failed
NotMasterException[Node [{node_t1}{eAQts270TiGFpoCDE-0PQQ}{or5bsv2ET220su78DLJk5g}{127.0.0.1}{127.0.0.1:9401}] not master for join request]
[2016-07-30 22:25:53,048][DEBUG][cluster.service          ] [node_t1] processing [zen-disco-process-pending-joins [{node_t0}{4SqBTyYNQ82J9c75Cs7jtg}{kutaNSYbTZCSybvqczgWCA}{127.0.0.1}{127.0.0.1:9400} elected]]: took [7ms] no change in cluster_state

This PR cleans up the logic a bit to only use failure where there are actual joins that are failed. The result is cleaner logs as well:

[2016-07-30 22:23:12,880][DEBUG][cluster.service          ] [node_t1] processing [zen-disco-election-stop [{node_t0}{jMR5HCpOQnOM4pGeFkUjng}{B5WIZQAdQk2cWbjGZ21mvQ}{127.0.0.1}{127.0.0.1:9400} elected]]: execute
[2016-07-30 22:23:12,881][DEBUG][cluster.service          ] [node_t1] processing [zen-disco-election-stop [{node_t0}{jMR5HCpOQnOM4pGeFkUjng}{B5WIZQAdQk2cWbjGZ21mvQ}{127.0.0.1}{127.0.0.1:9400} elected]]: took [0s] no change in cluster_state
[2016-07-30 22:23:12,881][DEBUG][transport                ] [node_t1] connected to node [{node_t0}{jMR5HCpOQnOM4pGeFkUjng}{B5WIZQAdQk2cWbjGZ21mvQ}{127.0.0.1}{127.0.0.1:9400}]

@bleskes bleskes added >enhancement :Distributed Coordination/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure v5.0.0-beta1 labels Jul 30, 2016
@bleskes
Copy link
Contributor Author

bleskes commented Jul 30, 2016

@jasontedor welcome back. Do you mind taking a look?

@jasontedor jasontedor self-assigned this Jul 31, 2016
@jasontedor
Copy link
Member

Nice. LGTM.

@bleskes bleskes merged commit 7c6527e into elastic:master Aug 1, 2016
@bleskes bleskes deleted the node_join_logging branch August 1, 2016 11:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure >enhancement v5.0.0-alpha5

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants