You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This was spotted while debugging a recent code_update_test instability. This is not present in main right now, because we currently never close connections, but will reappear when we implement #2672.
In summary:
Node 3 becomes candidate, and opens channels to every other node to ask for votes.
Node 3 sends key_exchange_init to nodes it wasn't previously talking to.
Node 2 receives this and executes consume_initiator_key_share(), transitioning to WAITING_FOR_FINAL.
Node 2 sends a key_exchange_response message to Node 3.
Before Node 3 receives this key_exchange_response it receives an Append Entries from someone who already won in this term and transitions to follower, closing its outgoing connections.
Node 3 receives key_exchange_response and considers its channel with Node 2 established, but its key_exchange_final response is dropped (because the host-side connection has already been closed).
Node 2 remains in WAITING_FOR_FINAL.
Next time Node 2 tries to talk to 3, it fails (state is not ESTABLISHED, call initiate(), early out because we're in state WAITING_FOR_FINAL).
So nodes 2 and 3 are permanently unable to communicate, and progress which relies on one of them voting for the other + sending append entries can never happen.
Transformed logs showing the steps above:
1)
45:24.844 |03| Becoming candidate n[3=3692]: 3 (raft.h:2419)
2)
45:24.857 |03| Creating new outbound channel to n[2=f53b] (127.216.212.151:42357) (channels.h:940)
45:24.861 |03| Added node connection with n[2=f53b] (127.216.212.151:42357) (node_connections.h:378)
45:24.861 |03| Send request vote from n[3=3692] to n[2=f53b] at 78 (raft.h:2219)
45:24.861 |03| Queuing send of 1 message to n[2=f53b] - status is 0 (channels.h:764)
45:24.861 |03| Initiating node channel with n[2=f53b]. (channels.h:725)
45:24.862 |03| node send to n[2=f53b] [1035] (node_connections.h:321)
45:24.862 |03| key_exchange_init -> n[2=f53b] node serial: 83E423FC96F6DB5787741C2FDD235677 (channels.h:753)
3)
45:24.869 |02| consume_initiator_key_share from n[3=3692], status == 0 (channels.h:539)
5)
45:24.876 |03| Becoming follower n[3=3692]: 3 (raft.h:2516)
45:24.876 |03| Removed outgoing node connection with n[2=f53b] (node_connections.h:413)
45:24.876 |03| Removed outgoing node connection with n[5=4c79] (node_connections.h:413)
6)
45:24.876 |03| consume_responder_key_share from n[2=f53b], status == 1 (channels.h:443)
45:24.876 |03| key_exchange_response from n[2=f53b] (node_to_node.h:281)
45:24.878 |03| From responder n[2=f53b]: version=1 ks=<SNIP> (channels.h:461)
45:24.880 |03| Unknown node connection n[2=f53b] (node_connections.h:401)
45:24.880 |03| key_exchange_final -> n[2=f53b]: ks=<SNIP> (channels.h:522)
45:24.884 |03| Node channel with n[2=f53b] is now established. (channels.h:698)
45:24.884 |03| Unknown node connection n[2=f53b] (node_connections.h:401)
45:24.884 |03| -> n[2=f53b]: node msg with nonce=1 (channels.h:797)
8)
45:26.803 |02| Becoming candidate n[2=f53b]: 4 (raft.h:2419)
45:26.809 |02| Added node connection with n[3=3692] (127.186.152.85:44463) (node_connections.h:378)
45:26.809 |02| Early return from initiate with n[3=3692] - status is 2 (channels.h:721)
45:26.809 |02| Queuing send of 1 message to n[3=3692] - status is 2 (channels.h:764)
The fix may be as simple as removing that early-return from initiate() (you restart on ESTABLISHED, so you should also restart on WAITING_FOR_FINAL?), but I'd like to get a consistent/minimal repro to confirm that,
The text was updated successfully, but these errors were encountered:
This was spotted while debugging a recent
code_update_test
instability. This is not present in main right now, because we currently never close connections, but will reappear when we implement #2672.In summary:
key_exchange_init
to nodes it wasn't previously talking to.consume_initiator_key_share()
, transitioning toWAITING_FOR_FINAL
.key_exchange_response
message to Node 3.key_exchange_response
it receives an Append Entries from someone who already won in this term and transitions to follower, closing its outgoing connections.key_exchange_response
and considers its channel with Node 2 established, but itskey_exchange_final
response is dropped (because the host-side connection has already been closed).WAITING_FOR_FINAL
.ESTABLISHED
, callinitiate()
, early out because we're in stateWAITING_FOR_FINAL
).So nodes 2 and 3 are permanently unable to communicate, and progress which relies on one of them voting for the other + sending append entries can never happen.
Transformed logs showing the steps above:
The fix may be as simple as removing that early-return from
initiate()
(you restart onESTABLISHED
, so you should also restart onWAITING_FOR_FINAL
?), but I'd like to get a consistent/minimal repro to confirm that,The text was updated successfully, but these errors were encountered: