Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix node crash #1996

Merged
merged 4 commits into from
Sep 22, 2019
Merged

Fix node crash #1996

merged 4 commits into from
Sep 22, 2019

Conversation

pmconrad
Copy link
Contributor

Theory:
Node is connecting to a peer that has recently requested a firewall check and was then disconnected. initiate_connect_to creates a new peer_connection, inserts it into _handshaking_connections, then creates accept_or_connect_task to create the actual stcp connection, wherein sock.connect() will yield.
Suppose that while we're waiting for the connection, we receive the result of the firewall check from another peer. on_check_firewall_reply_message will look up the originating peer (to which we're still connecting) in _handshaking_connections and call send_message on it. This goes down to stcp_socket.writesome, which sends the message through _send_aes for encryption.
The problem is, because the connection has not been set up yet and there has been no key exchange, _send_aes hasn't been initialized yet.

@pmconrad
Copy link
Contributor Author

pmconrad commented Sep 19, 2019

There is one detail missing in OP:
Due to the recent fc changes, instances of std::array<unsigned char,N> are no longer zero-initialized per default. 6399020 plus bitshares/bitshares-fc#165 (comment) should fix that. Requires another fc bump. bumped

Copy link
Member

@abitmore abitmore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I don't think the new _ready_for_sending promise is necessary (if the std::array issue is not there or has been fixed), it looks harmless, so I agree to add it.

libraries/protocol/address.cpp Show resolved Hide resolved
@@ -251,6 +255,7 @@ namespace graphene { namespace net {
}
~verify_no_send_in_progress() { var = false; }
} _verify_no_send_in_progress(_send_message_in_progress);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the assert above will be skipped in release build, the elog above doesn't make sense. If we want to avoid two tasks calling send_message at the same time, we need a lock.

There is another elog below which looks suspicious as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't seen any indication that this happens (i. e. two tasks calling send_message in parallel), so I wouldn't change this at this time.

The check below (MAX_MESSAGE_SIZE) is OK IMO. We will start sending the oversized message, but the remote end will probably disconnect us. If they don't it's fine too.
Note that a too large message is an indication of a problem in the chain layer - it should catch oversized blocks and transactions. Protocol messages are much smaller than the limit.

Hm, actually... this should be verified for some message types. But out of scope here.

@clockworkgr
Copy link
Member

Does this solve the case where requesting_peer is disconnected due to inactivity before the firewall check result? as per TG discussion?

@pmconrad
Copy link
Contributor Author

It currently looks as if the case where the peer disconnects is handled correctly. This change here is specifically about the case where the firewall check reply is received while the peer is being reconnected.

@abitmore
Copy link
Member

while the peer is being reconnected.

Or another peer is being connected whose node_id is (mis-)initialized with the id of the disconnected peer.

@abitmore abitmore added this to the 3.3.2 bugfix release milestone Sep 21, 2019
@abitmore
Copy link
Member

Added logging in dbg331b branch in order to catch the non-zero node_id issue (see 17633fb).

@abitmore
Copy link
Member

abitmore commented Sep 22, 2019

P2P log indicating the node_id initialization issue:

2019-09-22T07:22:57 p2p:message read_loop on_check_firewall_me ] Peer 140551329897024 x.x.x.x:1776 wants us to check whether it is firewalled node.cpp:3287
...
2019-09-22T07:23:04 p2p:terminate_inactive_connections_loop terminate_inactive_c ] Disconnecting peer x.x.x.x:1776 because they didn't respond to my request for item d76c20c24dd284c027f367b3a92a11e11ccd62c2 node.cpp:924
...
2019-09-22T07:23:07 p2p:message read_loop on_check_firewall_re ] Sending firewall check reply to 140551329897024 x.x.x.x:1776 node.cpp:3374
2019-09-22T07:23:07 p2p:message read_loop on_check_firewall_re ] original_peer->direction: outbound original_peer->connection_initiation_time: 2019-09-22T07:23:05 original_peer->our_state: disconnected original_peer->their_state: disconnected original_peer->negotiation_status: connecting original_peer->user_agent: peer_node_id: 1fe2f6327cf0d9241cb5c38769d90d8dbb6cee5f8e607aa7b7d7ec8a8d2896b622 node.cpp:3382

Note: connection_initiation_time: 2019-09-22T07:23:05 indicates it's a new connection, while the address of the peer_connection object is the same as the old one (140551329897024), thus the (mis)initialized node_id field is exactly the same. The node then crashed with SIGSEGV when calling send_message().

@pmconrad pmconrad merged commit 44da6da into bitshares:release Sep 22, 2019
@pmconrad pmconrad deleted the fix-node-crash branch September 22, 2019 13:48
@pmconrad pmconrad restored the fix-node-crash branch September 23, 2019 06:06
@pmconrad pmconrad deleted the fix-node-crash branch September 23, 2019 06:07
@abitmore abitmore mentioned this pull request Nov 3, 2019
17 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants