Fix node crash #1996

pmconrad · 2019-09-19T05:29:39Z

Theory:
Node is connecting to a peer that has recently requested a firewall check and was then disconnected. initiate_connect_to creates a new peer_connection, inserts it into _handshaking_connections, then creates accept_or_connect_task to create the actual stcp connection, wherein sock.connect() will yield.
Suppose that while we're waiting for the connection, we receive the result of the firewall check from another peer. on_check_firewall_reply_message will look up the originating peer (to which we're still connecting) in _handshaking_connections and call send_message on it. This goes down to stcp_socket.writesome, which sends the message through _send_aes for encryption.
The problem is, because the connection has not been set up yet and there has been no key exchange, _send_aes hasn't been initialized yet.

…lized

pmconrad · 2019-09-19T14:29:01Z

There is one detail missing in OP:
Due to the recent fc changes, instances of std::array<unsigned char,N> are no longer zero-initialized per default. 6399020 plus bitshares/bitshares-fc#165 (comment) should fix that. ~~Requires another fc bump.~~ bumped

abitmore

Although I don't think the new _ready_for_sending promise is necessary (if the std::array issue is not there or has been fixed), it looks harmless, so I agree to add it.

libraries/protocol/address.cpp

abitmore · 2019-09-19T15:36:08Z

libraries/net/message_oriented_connection.cpp

@@ -251,6 +255,7 @@ namespace graphene { namespace net {
        }
        ~verify_no_send_in_progress() { var = false; }
      } _verify_no_send_in_progress(_send_message_in_progress);


Since the assert above will be skipped in release build, the elog above doesn't make sense. If we want to avoid two tasks calling send_message at the same time, we need a lock.

There is another elog below which looks suspicious as well.

I haven't seen any indication that this happens (i. e. two tasks calling send_message in parallel), so I wouldn't change this at this time.

The check below (MAX_MESSAGE_SIZE) is OK IMO. We will start sending the oversized message, but the remote end will probably disconnect us. If they don't it's fine too.
Note that a too large message is an indication of a problem in the chain layer - it should catch oversized blocks and transactions. Protocol messages are much smaller than the limit.

Hm, actually... this should be verified for some message types. But out of scope here.

clockworkgr · 2019-09-19T17:43:47Z

Does this solve the case where requesting_peer is disconnected due to inactivity before the firewall check result? as per TG discussion?

libraries/net/message_oriented_connection.cpp

pmconrad · 2019-09-19T18:00:54Z

It currently looks as if the case where the peer disconnects is handled correctly. This change here is specifically about the case where the firewall check reply is received while the peer is being reconnected.

abitmore · 2019-09-19T18:05:28Z

while the peer is being reconnected.

Or another peer is being connected whose node_id is (mis-)initialized with the id of the disconnected peer.

abitmore · 2019-09-22T00:15:31Z

Added logging in dbg331b branch in order to catch the non-zero node_id issue (see 17633fb).

abitmore · 2019-09-22T12:13:42Z

P2P log indicating the node_id initialization issue:

2019-09-22T07:22:57 p2p:message read_loop on_check_firewall_me ] Peer 140551329897024 x.x.x.x:1776 wants us to check whether it is firewalled node.cpp:3287
...
2019-09-22T07:23:04 p2p:terminate_inactive_connections_loop terminate_inactive_c ] Disconnecting peer x.x.x.x:1776 because they didn't respond to my request for item d76c20c24dd284c027f367b3a92a11e11ccd62c2 node.cpp:924
...
2019-09-22T07:23:07 p2p:message read_loop on_check_firewall_re ] Sending firewall check reply to 140551329897024 x.x.x.x:1776 node.cpp:3374
2019-09-22T07:23:07 p2p:message read_loop on_check_firewall_re ] original_peer->direction: outbound original_peer->connection_initiation_time: 2019-09-22T07:23:05 original_peer->our_state: disconnected original_peer->their_state: disconnected original_peer->negotiation_status: connecting original_peer->user_agent: peer_node_id: 1fe2f6327cf0d9241cb5c38769d90d8dbb6cee5f8e607aa7b7d7ec8a8d2896b622 node.cpp:3382

Note: connection_initiation_time: 2019-09-22T07:23:05 indicates it's a new connection, while the address of the peer_connection object is the same as the old one (140551329897024), thus the (mis)initialized node_id field is exactly the same. The node then crashed with SIGSEGV when calling send_message().

pmconrad added 3 commits September 19, 2019 07:24

Prevent a connection from sending a message before it is fully initia…

e5897e7

…lized

Bump fc for yield-in-exception-handler fix

82f1e23

Fixes wrt array-initialization

6399020

abitmore reviewed Sep 19, 2019

View reviewed changes

libraries/net/message_oriented_connection.cpp Show resolved Hide resolved

Bump fc

7de9985

abitmore added this to the 3.3.2 bugfix release milestone Sep 21, 2019

abitmore approved these changes Sep 22, 2019

View reviewed changes

pmconrad merged commit 44da6da into bitshares:release Sep 22, 2019

pmconrad deleted the fix-node-crash branch September 22, 2019 13:48

pmconrad restored the fix-node-crash branch September 23, 2019 06:06

pmconrad deleted the fix-node-crash branch September 23, 2019 06:07

jmjatlanta mentioned this pull request Sep 23, 2019

Release Notes: Bitshares Core 3.3.2 #2009

Closed

abitmore mentioned this pull request Nov 3, 2019

Bump FC for 4.0.0 release #2041

Closed

17 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix node crash #1996

Fix node crash #1996

pmconrad commented Sep 19, 2019

pmconrad commented Sep 19, 2019 •

edited

Loading

abitmore left a comment

abitmore Sep 19, 2019

pmconrad Sep 21, 2019

clockworkgr commented Sep 19, 2019

pmconrad commented Sep 19, 2019

abitmore commented Sep 19, 2019

abitmore commented Sep 22, 2019

abitmore commented Sep 22, 2019 •

edited

Loading

Fix node crash #1996

Fix node crash #1996

Conversation

pmconrad commented Sep 19, 2019

pmconrad commented Sep 19, 2019 • edited Loading

abitmore left a comment

Choose a reason for hiding this comment

abitmore Sep 19, 2019

Choose a reason for hiding this comment

pmconrad Sep 21, 2019

Choose a reason for hiding this comment

clockworkgr commented Sep 19, 2019

pmconrad commented Sep 19, 2019

abitmore commented Sep 19, 2019

abitmore commented Sep 22, 2019

abitmore commented Sep 22, 2019 • edited Loading

pmconrad commented Sep 19, 2019 •

edited

Loading

abitmore commented Sep 22, 2019 •

edited

Loading