Optimisations For The Network Protocol #366

kevaundray · 2018-08-29T15:49:13Z

Below I list four new commands to add to the network protocol, plus a method to remove peers that stall other peers as a possible solution to optimise the network protocol.

#1132 Problem: When nodeA asks nodeB for a header, if nodeB does not have the header, he will not tell nodeA. nodeA will wait for nodeB until some set of time, which slows down nodeA.

Solution: Once nodeB cannot find the header, he should send back a notfound message with the hashes of the inv that he could not find, along with a separate message of the headers that he could find.

#1133 Problem: When a node sends an incompatible message during the handshake or a malformed message, the node silently disconnects. This makes it hard to debug why two nodes are not communicating, or why one node has disconnected from another.

Solution: If a node disconnects from another node, due to the message that that node had sent, they should send a reject message, stating why the node was disconnected and a byte to signify the reason for the computer.

#1134 Problem: When a node disconnects from another node, nodes do not have a way to check this, subsequently nodes propagate bad peers to other nodes, wasting the other nodes' resources in trying to connect to bad peers and makes the initial synchronisation slow.

Solution: I have implemented a seednode which filters out the latest available p2p nodes and sends them over the wire via a addr message, but this is more of a band-aid solution. Also, I believe it is doing too much filtering, after monitoring it for a day or so, the list tops out at ~35 nodes, after that the seednode spends most of it's time being stalled filtering out a large number of bad peers.

A better solution would be to add the heartbeat messages into the p2p protocol; ping and pong. So a node sends a ping message with a nonce, the other node then replies with a pong message with the same nonce to signify that the connection is still alive.

#1134 Problem: Nodes have no way to disconnect peers who are slowing them down in terms of message delivery as stated above. For example, if I ask a peer for blocks/headers and that peer is taking more than 45 seconds, the node should disconnect and search for a faster peer to request the same resources from. The problem with that peer being slow, will increase if more peers keep connecting to it and asking for inventory.

Solution: Implement a inflight message tracker for each peer that the node has connected to. If a getheaders message is sent and a headers is not heard within 30 seconds for example, the node should disconnect from that peer.

An example implementation of this can be seen here: https://github.com/decentralisedkev/neo-go/blob/v2/pkg/p2p/peer/stall/stall.go

With these solutions in place, nodes will stop keeping a large number of bad peers or peers that are ran for a short period of time.

In order to implement these, the protocol version would need to be bumped.

The text was updated successfully, but these errors were encountered:

ixje · 2018-12-13T15:51:05Z

I seem to have missed this issue, but I like it!

w.r.t. the 1st issue/solution; I like having a reaction if nothing is found. I'm not so sure about sending 2 messages if you have partial results. What benefit would that solution have? Your client code knows what data it requested and can determine itself what it's missing right?
w.r.t the 2nd issue. I have encountered this and it would not only help during development but also in some node quality monitoring and statistics gathering. 👍
w.r.t the 3rd issue
@decentralisedkev Are you proposing that e.g. Node_A keeps all connections to its known addresses open, then uses ping/pong messages in order to keep an accurate address list for when it receives getaddr requests? Not sure I'm following your solution.
w.r.t. the 4th issue. This seems more like a client update than something that should embedded in the protocol, no? I recently did exactly this in neo-python because ping is not a reliable health measure. Nodes can have great ping, but still not respond to header/block requests.

neo/neo/Network/P2P/ProtocolHandler.cs

Lines 103 to 106 in 8310035

    
           case "notfound": 
        
           case "ping": 
        
           case "pong": 
        
           case "reject":

@erikzhang it looks like the proposed commands are already in there. What would be needed to get this further implemented? All proposed solutions seem backwards compatible as unknown commands can be just be ignored.

kevaundray · 2018-12-18T19:40:35Z

w.r.t. the 1st issue/solution; I like having a reaction if nothing is found. I'm not so sure about sending 2 messages if you have partial results. What benefit would that solution have? Your client code knows what data it requested and can determine itself what it's missing right?

For example, if I request a header from you, and you do not have it, then I will wait X amount of time before coming to the conclusion that you do not have it. My thought was that this was a waste of time, if a node does not have a piece of data, that is requested then ideally they should tell the requesting party. With this implemented, then we can assume that if we do not hear from a node within a certain amount of time, that they have stopped working also. The client(requestor) which is me, would know what is missing, however I would not know whether the node who I requested it from, has the item. A bit long winded, but I hope that makes sense.

Are you proposing that e.g. Node_A keeps all connections to its known addresses open, then uses ping/pong messages in order to keep an accurate address list for when it receives getaddr requests? Not sure I'm following your solution.

For this, I was suggesting that NodeA keeps a list of all of it's known addresses and using pings, we can check which ones have disconnected from NodeA. If NodeA does not know whether the known nodes in his list are still connected, he would inevitably end up propagating bad nodes, when NodeB sends NodeA a getaddr request.

This seems more like a client update than something that should embedded in the protocol, no? I recently did exactly this in neo-python because ping is not a reliable health measure. Nodes can have great ping, but still not respond to header/block requests.

To my understanding, ping is made to check whether the node is still connected, while the stall manager checks whether the connected node is experiencing a large amount of traffic. An example for me would be that, I am trying to sync with the network and I ask nodeD for blocks 100-200, in this scenario, I think there should be a way to determine whether nodeD is experiencing a large amount of traffic and subsequently, we should ask another node for blocks 100-200. In this way, nodeD does not hold me up because of his backlog.

When I wrote those solutions, they were meant as a stepping stone to efficiently filter out the bad/overloaded nodes in the network allowing for faster block download times and less network congestion.

Let me know if I got anything wrong or anything is written poorly.

ixje · 2018-12-19T06:58:40Z

For example, if I request a header from you, and you do not have it, then I will wait X amount of time before coming to the conclusion that you do not have it. My thought was that this was a waste of time, if a node does not have a piece of data, that is requested then ideally they should tell the requesting party. With this implemented, then we can assume that if we do not hear from a node within a certain amount of time, that they have stopped working also.

This part was clear before and can be solved with the notfound. Which I assume is just a Message with the command field set to notfound and no payload to keep it as small as possible.

The client(requestor) which is me, would know what is missing, however I would not know whether the node who I requested it from, has the item. A bit long winded, but I hope that makes sense.

This part I'm not really following. My main remark was that I don't see a reason for 2 messages on a partial hit. It can just send what it has and the client can determine itself what it's still missing.

====

For this, I was suggesting that NodeA keeps a list of all of it's known addresses and using pings, we can check which ones have disconnected from NodeA. If NodeA does not know whether the known nodes in his list are still connected, he would inevitably end up propagating bad nodes, when NodeB sends NodeA a getaddr request.

I don't think ping will be sufficient to determine good or bad. Some nodes might disconnect you simply because you've not given them blocksor headers recently, or because they think there's a better node for them. This approach would now label them as bad/dead while they could actually still be valid peers for others (e.g. ones that are in the same geographical region)

I think there are too many factors in play that you can let another node decide for you if an address is good or bad.

====

To my understanding, ping is made to check whether the node is still connected, while the stall manager checks whether the connected node is experiencing a large amount of traffic. An example for me would be that, I am trying to sync with the network and I ask nodeD for blocks 100-200, in this scenario, I think there should be a way to determine whether nodeD is experiencing a large amount of traffic and subsequently, we should ask another node for blocks 100-200. In this way, nodeD does not hold me up because of his backlog.

I think this section was clear. I just believe that this should be a client implementation not in the protocol such that a client can decide after which delay it wants to disconnect.

lock9 · 2019-08-12T00:27:33Z

I think we should divide this into 3 issues. @neo-project/core what is your opinion?

shargon · 2019-08-12T05:46:56Z

I think we should divide this into 3 issues.

+1

lock9 · 2019-09-30T20:15:13Z

@decentralisedkev I will create new issues for this, ok? I believe some of them are important and can be implemented sooner than others.

lock9 · 2019-09-30T22:26:26Z

@decentralisedkev issues created. I edited your issue to mention it. I'm closing this issue now, thank you for your collaboration!

erikzhang added the Discussion Initial issue state - proposed but not yet accepted label Aug 29, 2018

ixje mentioned this issue Dec 13, 2018

Network protocol suggestion: getfullblocks #522

Closed

vncoelho mentioned this issue Mar 5, 2019

Seed nodes are still lagging behind #620

Closed

This was referenced Sep 30, 2019

Send "notfound" message when inventory is not found. #1132

Closed

Send disconnection reason when closing connection with a node #1133

Open

Avoid bad peers propagation #1134

Open

lock9 closed this as completed Sep 30, 2019

lock9 mentioned this issue Oct 25, 2019

Neo 3 Opmizations update list #1189

Closed

31 tasks

Thacryba pushed a commit to simplitech/neo that referenced this issue Feb 17, 2020

Add two API (neo-project#366)

5137066

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimisations For The Network Protocol #366

Optimisations For The Network Protocol #366

kevaundray commented Aug 29, 2018 •

edited by lock9

Loading

ixje commented Dec 13, 2018

kevaundray commented Dec 18, 2018

ixje commented Dec 19, 2018

lock9 commented Aug 12, 2019

shargon commented Aug 12, 2019

lock9 commented Sep 30, 2019

lock9 commented Sep 30, 2019

Optimisations For The Network Protocol #366

Optimisations For The Network Protocol #366

Comments

kevaundray commented Aug 29, 2018 • edited by lock9 Loading

ixje commented Dec 13, 2018

kevaundray commented Dec 18, 2018

ixje commented Dec 19, 2018

lock9 commented Aug 12, 2019

shargon commented Aug 12, 2019

lock9 commented Sep 30, 2019

lock9 commented Sep 30, 2019

kevaundray commented Aug 29, 2018 •

edited by lock9

Loading