
Closing connections during key exchange leads to permanently blocked node-to-node channel #2684

Closed
eddyashton opened this issue Jun 17, 2021 · 2 comments
Labels: 1.x-backport (Highlights PRs targeting the 1.x branch), bug

@eddyashton (Member)

This was spotted while debugging a recent code_update_test instability. It is not present in main right now, because we currently never close connections, but it will reappear when we implement #2672.

In summary:

  1. Node 3 becomes candidate, and opens channels to every other node to ask for votes.
  2. Node 3 sends key_exchange_init to nodes it wasn't previously talking to.
  3. Node 2 receives this, executes consume_initiator_key_share(), and transitions to WAITING_FOR_FINAL.
  4. Node 2 sends a key_exchange_response message to Node 3.
  5. Before Node 3 receives this key_exchange_response, it receives an Append Entries from a node that has already won the election in this term, so it transitions to follower and closes its outgoing connections.
  6. Node 3 receives the key_exchange_response and considers its channel with Node 2 established, but its key_exchange_final response is dropped (because the host-side connection has already been closed).
  7. Node 2 remains in WAITING_FOR_FINAL.
  8. The next time Node 2 tries to talk to Node 3, it fails: the channel state is not ESTABLISHED, so it calls initiate(), which returns early because the channel is in WAITING_FOR_FINAL.

So nodes 2 and 3 are permanently unable to communicate, and any progress that relies on one of them voting for the other, or sending it append entries, can never happen.
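
To make the failure mode concrete, here is a minimal, self-contained sketch of the deadlock. This is hypothetical C++ for illustration only: ChannelState and Channel are made-up names, not the real channels.h types; only the WAITING_FOR_FINAL/ESTABLISHED states and the early return in initiate() are taken from the steps above.

```cpp
// Illustrative sketch only - not the actual channels.h implementation.
#include <cstdio>

enum class ChannelState
{
  INACTIVE,          // no key exchange attempted yet
  INITIATED,         // sent key_exchange_init, waiting for the response
  WAITING_FOR_FINAL, // answered an init, waiting for key_exchange_final
  ESTABLISHED        // handshake complete, channel usable
};

struct Channel
{
  ChannelState state = ChannelState::INACTIVE;

  // Called whenever we want to send to a peer but the channel is not
  // ESTABLISHED. The early return on WAITING_FOR_FINAL is the problem: if
  // the peer's key_exchange_final was dropped (its host-side connection was
  // closed), nothing ever moves us out of WAITING_FOR_FINAL again.
  void initiate()
  {
    if (state == ChannelState::WAITING_FOR_FINAL)
    {
      std::puts("Early return from initiate - waiting for final");
      return; // <-- step 8: Node 2 gives up here, forever
    }

    std::puts("Sending key_exchange_init");
    state = ChannelState::INITIATED;
  }

  // Step 3 on Node 2: consume the initiator's key share and answer it.
  void consume_initiator_key_share()
  {
    std::puts("Sending key_exchange_response");
    state = ChannelState::WAITING_FOR_FINAL;
  }
};

int main()
{
  Channel node2_to_node3;

  // Step 3: Node 2 receives key_exchange_init from Node 3 and responds.
  node2_to_node3.consume_initiator_key_share();

  // Steps 5-6: Node 3 closed its outgoing connection, so Node 2 never
  // receives key_exchange_final and stays in WAITING_FOR_FINAL.

  // Step 8: every later attempt by Node 2 to talk to Node 3 dead-ends here.
  node2_to_node3.initiate();
  node2_to_node3.initiate();
}
```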

Transformed logs showing the steps above:

1)
45:24.844 |03|       Becoming candidate n[3=3692]: 3 (raft.h:2419)

2)
45:24.857 |03|       Creating new outbound channel to n[2=f53b] (127.216.212.151:42357) (channels.h:940)
45:24.861 |03|       Added node connection with n[2=f53b] (127.216.212.151:42357) (node_connections.h:378)
45:24.861 |03|       Send request vote from n[3=3692] to n[2=f53b] at 78 (raft.h:2219)
45:24.861 |03|       Queuing send of 1 message to n[2=f53b] - status is 0 (channels.h:764)
45:24.861 |03|       Initiating node channel with n[2=f53b]. (channels.h:725)
45:24.862 |03|       node send to n[2=f53b] [1035] (node_connections.h:321)
45:24.862 |03|       key_exchange_init -> n[2=f53b] node serial: 83E423FC96F6DB5787741C2FDD235677 (channels.h:753)

3)
45:24.869 |02|     consume_initiator_key_share from n[3=3692], status == 0 (channels.h:539)

5)
45:24.876 |03|       Becoming follower n[3=3692]: 3 (raft.h:2516)
45:24.876 |03|       Removed outgoing node connection with n[2=f53b] (node_connections.h:413)
45:24.876 |03|       Removed outgoing node connection with n[5=4c79] (node_connections.h:413)

6)
45:24.876 |03|       consume_responder_key_share from n[2=f53b], status == 1 (channels.h:443)
45:24.876 |03|       key_exchange_response from n[2=f53b] (node_to_node.h:281)
45:24.878 |03|       From responder n[2=f53b]: version=1 ks=<SNIP> (channels.h:461)
45:24.880 |03|       Unknown node connection n[2=f53b] (node_connections.h:401)
45:24.880 |03|       key_exchange_final -> n[2=f53b]: ks=<SNIP> (channels.h:522)
45:24.884 |03|       Node channel with n[2=f53b] is now established. (channels.h:698)
45:24.884 |03|       Unknown node connection n[2=f53b] (node_connections.h:401)
45:24.884 |03|       -> n[2=f53b]: node msg with nonce=1 (channels.h:797)

8)
45:26.803 |02|     Becoming candidate n[2=f53b]: 4 (raft.h:2419)
45:26.809 |02|     Added node connection with n[3=3692] (127.186.152.85:44463) (node_connections.h:378)
45:26.809 |02|     Early return from initiate with n[3=3692] - status is 2 (channels.h:721)
45:26.809 |02|     Queuing send of 1 message to n[3=3692] - status is 2 (channels.h:764)

The fix may be as simple as removing that early return from initiate() (we restart on ESTABLISHED, so we should probably also restart on WAITING_FOR_FINAL?), but I'd like to get a consistent/minimal repro to confirm that.
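
Sketched against the hypothetical Channel above (again, not the real channels.h code, just illustrating the idea), that would look roughly like:

```cpp
// Illustrative only: restart the key exchange from WAITING_FOR_FINAL as well
// as from ESTABLISHED, instead of returning early.
void initiate()
{
  if (
    state == ChannelState::ESTABLISHED ||
    state == ChannelState::WAITING_FOR_FINAL)
  {
    // Drop the half-finished (or stale) handshake and start over, so a node
    // stuck in WAITING_FOR_FINAL can still recover the channel.
    state = ChannelState::INACTIVE;
  }

  std::puts("Sending key_exchange_init");
  state = ChannelState::INITIATED;
}
```

Whether that restart is safe (e.g. with key shares racing from both sides) is exactly what a minimal repro would help confirm.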

@eddyashton eddyashton added the bug label Jun 17, 2021
@achamayou achamayou added the 1.x-backport Highlights PRs targeting the 1.x branch label Jun 17, 2021
@achamayou (Member)

@eddyashton I believe this was addressed by #2801?

@eddyashton (Member, Author)

Yes, this is now resolved.
