Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimisations For The Network Protocol #366

Closed
kevaundray opened this issue Aug 29, 2018 · 7 comments
Closed

Optimisations For The Network Protocol #366

kevaundray opened this issue Aug 29, 2018 · 7 comments
Labels
Discussion Initial issue state - proposed but not yet accepted Enhancement Type - Changes that may affect performance, usability or add new features to existing modules. Housekeeping Small enhancements that need to be done in order to keep the project organised P2P Module - peer-to-peer message exchange and network optimisations, at TCP or UDP level (not HTTP).

Comments

@kevaundray
Copy link

kevaundray commented Aug 29, 2018

Below I list four new commands to add to the network protocol, plus a method to remove peers that stall other peers as a possible solution to optimise the network protocol.

#1132 Problem: When nodeA asks nodeB for a header, if nodeB does not have the header, he will not tell nodeA. nodeA will wait for nodeB until some set of time, which slows down nodeA.

Solution: Once nodeB cannot find the header, he should send back a notfound message with the hashes of the inv that he could not find, along with a separate message of the headers that he could find.

#1133 Problem: When a node sends an incompatible message during the handshake or a malformed message, the node silently disconnects. This makes it hard to debug why two nodes are not communicating, or why one node has disconnected from another.

Solution: If a node disconnects from another node, due to the message that that node had sent, they should send a reject message, stating why the node was disconnected and a byte to signify the reason for the computer.

#1134 Problem: When a node disconnects from another node, nodes do not have a way to check this, subsequently nodes propagate bad peers to other nodes, wasting the other nodes' resources in trying to connect to bad peers and makes the initial synchronisation slow.

Solution: I have implemented a seednode which filters out the latest available p2p nodes and sends them over the wire via a addr message, but this is more of a band-aid solution. Also, I believe it is doing too much filtering, after monitoring it for a day or so, the list tops out at ~35 nodes, after that the seednode spends most of it's time being stalled filtering out a large number of bad peers.

A better solution would be to add the heartbeat messages into the p2p protocol; ping and pong. So a node sends a ping message with a nonce, the other node then replies with a pong message with the same nonce to signify that the connection is still alive.

#1134 Problem: Nodes have no way to disconnect peers who are slowing them down in terms of message delivery as stated above. For example, if I ask a peer for blocks/headers and that peer is taking more than 45 seconds, the node should disconnect and search for a faster peer to request the same resources from. The problem with that peer being slow, will increase if more peers keep connecting to it and asking for inventory.

Solution: Implement a inflight message tracker for each peer that the node has connected to. If a getheaders message is sent and a headers is not heard within 30 seconds for example, the node should disconnect from that peer.

An example implementation of this can be seen here: https://github.com/decentralisedkev/neo-go/blob/v2/pkg/p2p/peer/stall/stall.go

With these solutions in place, nodes will stop keeping a large number of bad peers or peers that are ran for a short period of time.

In order to implement these, the protocol version would need to be bumped.

@erikzhang erikzhang added the Discussion Initial issue state - proposed but not yet accepted label Aug 29, 2018
@ixje
Copy link
Contributor

ixje commented Dec 13, 2018

I seem to have missed this issue, but I like it!

  • w.r.t. the 1st issue/solution; I like having a reaction if nothing is found. I'm not so sure about sending 2 messages if you have partial results. What benefit would that solution have? Your client code knows what data it requested and can determine itself what it's missing right?

  • w.r.t the 2nd issue. I have encountered this and it would not only help during development but also in some node quality monitoring and statistics gathering. 👍

  • w.r.t the 3rd issue
    @decentralisedkev Are you proposing that e.g. Node_A keeps all connections to its known addresses open, then uses ping/pong messages in order to keep an accurate address list for when it receives getaddr requests? Not sure I'm following your solution.

  • w.r.t. the 4th issue. This seems more like a client update than something that should embedded in the protocol, no? I recently did exactly this in neo-python because ping is not a reliable health measure. Nodes can have great ping, but still not respond to header/block requests.

case "notfound":
case "ping":
case "pong":
case "reject":

@erikzhang it looks like the proposed commands are already in there. What would be needed to get this further implemented? All proposed solutions seem backwards compatible as unknown commands can be just be ignored.

@kevaundray
Copy link
Author

  • w.r.t. the 1st issue/solution; I like having a reaction if nothing is found. I'm not so sure about sending 2 messages if you have partial results. What benefit would that solution have? Your client code knows what data it requested and can determine itself what it's missing right?

For example, if I request a header from you, and you do not have it, then I will wait X amount of time before coming to the conclusion that you do not have it. My thought was that this was a waste of time, if a node does not have a piece of data, that is requested then ideally they should tell the requesting party. With this implemented, then we can assume that if we do not hear from a node within a certain amount of time, that they have stopped working also. The client(requestor) which is me, would know what is missing, however I would not know whether the node who I requested it from, has the item. A bit long winded, but I hope that makes sense.

  • Are you proposing that e.g. Node_A keeps all connections to its known addresses open, then uses ping/pong messages in order to keep an accurate address list for when it receives getaddr requests? Not sure I'm following your solution.

For this, I was suggesting that NodeA keeps a list of all of it's known addresses and using pings, we can check which ones have disconnected from NodeA. If NodeA does not know whether the known nodes in his list are still connected, he would inevitably end up propagating bad nodes, when NodeB sends NodeA a getaddr request.

  • This seems more like a client update than something that should embedded in the protocol, no? I recently did exactly this in neo-python because ping is not a reliable health measure. Nodes can have great ping, but still not respond to header/block requests.

To my understanding, ping is made to check whether the node is still connected, while the stall manager checks whether the connected node is experiencing a large amount of traffic. An example for me would be that, I am trying to sync with the network and I ask nodeD for blocks 100-200, in this scenario, I think there should be a way to determine whether nodeD is experiencing a large amount of traffic and subsequently, we should ask another node for blocks 100-200. In this way, nodeD does not hold me up because of his backlog.

When I wrote those solutions, they were meant as a stepping stone to efficiently filter out the bad/overloaded nodes in the network allowing for faster block download times and less network congestion.

Let me know if I got anything wrong or anything is written poorly.

@ixje
Copy link
Contributor

ixje commented Dec 19, 2018

For example, if I request a header from you, and you do not have it, then I will wait X amount of time before coming to the conclusion that you do not have it. My thought was that this was a waste of time, if a node does not have a piece of data, that is requested then ideally they should tell the requesting party. With this implemented, then we can assume that if we do not hear from a node within a certain amount of time, that they have stopped working also.

This part was clear before and can be solved with the notfound. Which I assume is just a Message with the command field set to notfound and no payload to keep it as small as possible.

The client(requestor) which is me, would know what is missing, however I would not know whether the node who I requested it from, has the item. A bit long winded, but I hope that makes sense.

This part I'm not really following. My main remark was that I don't see a reason for 2 messages on a partial hit. It can just send what it has and the client can determine itself what it's still missing.

====

For this, I was suggesting that NodeA keeps a list of all of it's known addresses and using pings, we can check which ones have disconnected from NodeA. If NodeA does not know whether the known nodes in his list are still connected, he would inevitably end up propagating bad nodes, when NodeB sends NodeA a getaddr request.

I don't think ping will be sufficient to determine good or bad. Some nodes might disconnect you simply because you've not given them blocksor headers recently, or because they think there's a better node for them. This approach would now label them as bad/dead while they could actually still be valid peers for others (e.g. ones that are in the same geographical region)

I think there are too many factors in play that you can let another node decide for you if an address is good or bad.

====

To my understanding, ping is made to check whether the node is still connected, while the stall manager checks whether the connected node is experiencing a large amount of traffic. An example for me would be that, I am trying to sync with the network and I ask nodeD for blocks 100-200, in this scenario, I think there should be a way to determine whether nodeD is experiencing a large amount of traffic and subsequently, we should ask another node for blocks 100-200. In this way, nodeD does not hold me up because of his backlog.

I think this section was clear. I just believe that this should be a client implementation not in the protocol such that a client can decide after which delay it wants to disconnect.

@lock9
Copy link
Contributor

lock9 commented Aug 12, 2019

I think we should divide this into 3 issues. @neo-project/core what is your opinion?

@lock9 lock9 added Enhancement Type - Changes that may affect performance, usability or add new features to existing modules. Housekeeping Small enhancements that need to be done in order to keep the project organised P2P Module - peer-to-peer message exchange and network optimisations, at TCP or UDP level (not HTTP). labels Aug 12, 2019
@shargon
Copy link
Member

shargon commented Aug 12, 2019

I think we should divide this into 3 issues.

+1

@lock9
Copy link
Contributor

lock9 commented Sep 30, 2019

@decentralisedkev I will create new issues for this, ok? I believe some of them are important and can be implemented sooner than others.

@lock9
Copy link
Contributor

lock9 commented Sep 30, 2019

@decentralisedkev issues created. I edited your issue to mention it. I'm closing this issue now, thank you for your collaboration!

@lock9 lock9 closed this as completed Sep 30, 2019
Thacryba pushed a commit to simplitech/neo that referenced this issue Feb 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Discussion Initial issue state - proposed but not yet accepted Enhancement Type - Changes that may affect performance, usability or add new features to existing modules. Housekeeping Small enhancements that need to be done in order to keep the project organised P2P Module - peer-to-peer message exchange and network optimisations, at TCP or UDP level (not HTTP).
Projects
None yet
Development

No branches or pull requests

5 participants