Fix premature disconnections from seeds #5057

Merged

Conversation


@chimp1984 chimp1984 commented Jan 6, 2021

Fixes issues with disconnections while a peer is still in the initial data request phase.

We have several Request/Response cycles at startup:

  1. PreliminaryGetDataRequest -> GetDataResponse
  2. GetUpdatedDataRequest -> GetDataResponse
  3. GetBlocksRequest -> GetBlocksResponse
  4. GetDaoStateHashesRequest -> GetDaoStateHashesResponse
  5. GetBlindVoteStateHashesRequest -> GetBlindVoteStateHashesResponse
  6. GetProposalStateHashesRequest -> GetProposalStateHashesResponse

We set the PeerType to INITIAL_DATA_REQUEST, but we reset it to PEER as soon as the seed sent the GetDataResponse for the GetUpdatedDataRequest. This was too early, because the PeerManager's connection management disconnects low-priority connections (PEER) if there are too many connections.
This led to the problem that the seed closed the connection before the peer could start the GetBlocksRequest.
The last GetDataResponse was also at risk of not arriving, as the connection might have been closed too quickly.

We changed that model so that we now count the requests and responses of this bootstrap phase, and once we reach the expected number we assume we are done and reset the PeerType to PEER. As we cannot fully rely on that, we also start a timer that resets it after 4 minutes.
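
As a rough illustration of that counting-plus-timer idea, a minimal sketch could look like the following. The class name, method names and the expected count of 6 are assumptions for illustration only, not the actual ConnectionState code:

```java
import java.util.Timer;
import java.util.TimerTask;

// Illustrative sketch only: track initial-data requests/responses per connection and
// fall back to a timer so the connection is always downgraded to PEER eventually.
class ConnectionStateSketch {
    // Hypothetical constants; the PR text mentions 6 request/response cycles and a 4 minute fallback.
    private static final int EXPECTED_REQUESTS = 6;
    private static final long MAX_INITIAL_DATA_EXCHANGE_MILLIS = 4 * 60 * 1000;

    enum PeerType { PEER, INITIAL_DATA_EXCHANGE }

    private PeerType peerType = PeerType.PEER;
    private int numRequests = 0;
    private int numResponses = 0;
    private Timer resetTimer;

    void onInitialDataRequest() {
        peerType = PeerType.INITIAL_DATA_EXCHANGE;
        numRequests++;
        startResetTimerIfNeeded();
        maybeResetToPeer();
    }

    void onInitialDataResponse() {
        peerType = PeerType.INITIAL_DATA_EXCHANGE;
        numResponses++;
        startResetTimerIfNeeded();
        maybeResetToPeer();
    }

    private void maybeResetToPeer() {
        // Once we saw the expected number of requests and responses we assume
        // bootstrapping is done and downgrade the connection to low priority.
        if (numRequests >= EXPECTED_REQUESTS && numResponses >= EXPECTED_REQUESTS) {
            resetToPeer();
        }
    }

    private void startResetTimerIfNeeded() {
        // As we cannot fully rely on the counters, reset after 4 minutes in any case.
        if (resetTimer == null) {
            resetTimer = new Timer(true);
            resetTimer.schedule(new TimerTask() {
                @Override
                public void run() {
                    resetToPeer();
                }
            }, MAX_INITIAL_DATA_EXCHANGE_MILLIS);
        }
    }

    private void resetToPeer() {
        peerType = PeerType.PEER;
        if (resetTimer != null) {
            resetTimer.cancel();
            resetTimer = null;
        }
    }
}
```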
When we exceed 20 connections at a normal peer, or 34 at seed nodes (with maxConnections 20), we start to disconnect peers with type INITIAL_DATA_EXCHANGE as well. We sort by the oldest date at which we sent or received a message, which increases the chance that the disconnected peer is already done with bootstrapping.
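
A minimal sketch of that disconnection rule, assuming a hypothetical initialDataExchangeTrigger threshold and connection interface (not the actual PeerManager API):

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only: once the total number of connections exceeds the trigger,
// disconnect INITIAL_DATA_EXCHANGE peers with the oldest last-message timestamp first,
// as those are the most likely to be done with bootstrapping.
class InitialDataExchangeTriggerSketch {
    interface Conn {
        boolean isInitialDataExchange();
        long getLastInitialDataExchangeMessageTimeStamp();
        void shutDown();
    }

    void checkInitialDataExchangeConnections(List<Conn> allConnections, int initialDataExchangeTrigger) {
        if (allConnections.size() <= initialDataExchangeTrigger)
            return;

        List<Conn> candidates = allConnections.stream()
                .filter(Conn::isInitialDataExchange)
                .sorted(Comparator.comparingLong(Conn::getLastInitialDataExchangeMessageTimeStamp))
                .collect(Collectors.toList());

        int numToDisconnect = allConnections.size() - initialDataExchangeTrigger;
        candidates.stream()
                .limit(numToDisconnect)
                .forEach(Conn::shutDown);
    }
}
```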

We also removed the removeSuperfluousSeedNodes method, which was a bit dangerous from the app side: it was protected only by the allowDisconnectSeedNodes flag, which was handled by the GetBlockHandlers, so there was a risk that we disconnect before we can request the StateHashes.

We have to take care that seed nodes do not become too weakly connected to healthy nodes, as they would then miss broadcast messages if all their connections were occupied by bootstrapping nodes. I doubt that the changes here have any impact on that behaviour, but as it is pretty complex we have to be careful with deployment.

I will test it on a small group of seed nodes first and add some more statistics to see which types of connections the seed has.

Remove unnecessary setPeerType calls. ConnectionState handles that now.
Only PeerManager sets isSeedNode, as we do not have the required dependency in ConnectionState.
At INITIAL_DATA_EXCHANGE we sort by lastInitialDataExchangeMessageTimeStamp.
In the 3rd attempt we filter for INITIAL_DATA_EXCHANGE peers. Before, we excluded 2 types, and as PEER had already been filtered out earlier we would only look up SEED_NODE. This was only called by non-seed nodes.
Rename maxConnectionsPeer to outBoundPeerTrigger
Rename maxConnectionsNonDirect to initialDataExchangeTrigger

Update comment.
We handle the connections by INITIAL_DATA_EXCHANGE, which covers the seed nodes as well. Having a parallel routine is risky and makes things more complex.
We handle it in ConnectionState by counting requests and responses and adding a timer.
@chimp1984 chimp1984 marked this pull request as draft January 6, 2021 03:00

cbeams commented Jan 6, 2021

Concept ACK. Described changes make sense. I've only scanned the code changes themselves, though, so not a real review. Thanks for turning this around so quickly, @chimp1984!

Improve statistics logging
chimp1984 added a commit to chimp1984/bisq that referenced this pull request Jan 6, 2021
It reports "Wanted but not invoked:" but when debugging into it, it is called. So it seems to be some mock setup issue.
Do check for closed socket after blocking read.
Move throttle code after blocking read.
@chimp1984 chimp1984 marked this pull request as ready for review January 6, 2021 21:52
@chimp1984

I deployed it to 3 seed nodes. Observing how they behave before extending to other seeds.

I added new statistics logging per connection. That helps to analyse the peer management and see where the bottlenecks are.

Jan-06 21:49:33.738 [SeedNodeMain] INFO  b.n.p.p.PeerManager: Connection statistics: 

Connection 1
Age: 1 minute, 45.838 seconds
Peer: [Seed node] sn2bisqad7ncazupgbd3dcedqh5ptirgwofw63djwpdtftwhddo75oid.onion:8000 
Type: PEER 
Direction: Outbound
UID: e2e209db-b50e-4caa-972d-cced1afb90f8
Time since last message: 0.800 seconds
Time for response: [PreliminaryGetDataRequest/Response: 18.532 seconds, GetDaoStateHashesRequest/Response: 0.550 seconds, GetProposalStateHashesRequest/Response: 0.684 seconds, GetBlindVoteStateHashesRequest/Response: 0.621 seconds, GetUpdatedDataRequest/Response: 1.549 seconds]
Sent data: 665.76 kB; {PreliminaryGetDataRequest=1, AddDataMessage=6, GetDaoStateHashesRequest=1, GetProposalStateHashesRequest=1, GetBlindVoteStateHashesRequest=1, RefreshOfferMessage=13, GetUpdatedDataRequest=1, GetPeersRequest=1, BundleOfEnvelopes=7}
Received data: 5.068 MB; {AddDataMessage=2, GetDaoStateHashesResponse=1, GetPeersResponse=1, GetDataResponse=2, RefreshOfferMessage=11, GetProposalStateHashesResponse=1, RemoveDataMessage=1, GetBlindVoteStateHashesResponse=1}
CPU time spent on sending messages: 0.179 seconds
CPU time spent on receiving messages: 3.171 seconds

Connection 2
Age: 1 minute, 45.427 seconds
Peer: [Seed node] wizseedscybbttk4bmb2lzvbuk2jtect37lcpva4l3twktmkzemwbead.onion:8000 
Type: INITIAL_DATA_EXCHANGE 
Direction: Outbound
UID: f65a92c4-45b7-497f-a79c-107393b1287d
Time since last message: 0.701 seconds
Time for response: [PreliminaryGetDataRequest/Response: 19.199 seconds, GetDaoStateHashesRequest/Response: 0.704 seconds, GetProposalStateHashesRequest/Response: 0.834 seconds, GetBlindVoteStateHashesRequest/Response: 0.690 seconds]
Sent data: 337.351 kB; {PreliminaryGetDataRequest=1, AddDataMessage=11, GetDaoStateHashesRequest=1, GetProposalStateHashesRequest=1, GetBlindVoteStateHashesRequest=1, RefreshOfferMessage=17, RemoveDataMessage=1, RemoveMailboxDataMessage=4, BundleOfEnvelopes=10}
Received data: 5.052 MB; {GetDaoStateHashesResponse=1, GetDataResponse=1, GetProposalStateHashesResponse=1, GetBlindVoteStateHashesResponse=1}
CPU time spent on sending messages: 0.090 seconds
CPU time spent on receiving messages: 0.457 seconds

Connection 3
Age: 1 minute, 38.621 seconds
Peer: devinsn2teu33efff62bnvwbxmfgbfjlgqsu3ad4b4fudx3a725eqnyd.onion:8000 
Type: INITIAL_DATA_EXCHANGE 
Direction: Inbound
UID: 81eb9315-5186-4fd6-a153-f9597883e2e6
Time since last message: 2.445 seconds
Time for response: [GetDaoStateHashesRequest awaiting response... , GetProposalStateHashesRequest awaiting response... , GetBlindVoteStateHashesRequest awaiting response... , GetUpdatedDataRequest/Response: 0.008 seconds]
Sent data: 1.989 kB; {GetDataResponse=1, RefreshOfferMessage=1, RemoveDataMessage=1}
Received data: 369.567 kB; {AddDataMessage=8, GetDaoStateHashesRequest=1, GetProposalStateHashesRequest=1, GetBlindVoteStateHashesRequest=1, RefreshOfferMessage=25, GetUpdatedDataRequest=1}
CPU time spent on sending messages: 0.011 seconds
CPU time spent on receiving messages: 0.147 seconds


This can happen from the main shutdown routine when a timeout gets triggered.
sqrrm previously approved these changes Jan 7, 2021

@sqrrm sqrrm left a comment

utACK

Will hold off with merging until test nodes have run for a bit


ripcurlx commented Jan 7, 2021

@chimp1984 Regarding the Codacy complaint. Do you want to keep the extra if statement to improve readability?

@chimp1984

@chimp1984 Regarding the Codacy complaint. Do you want to keep the extra if statement to improve readability?

Yes, I prefer to keep it as it is.

@sqrrm sqrrm left a comment

utACK
