This repository has been archived by the owner on Jun 20, 2024. It is now read-only.
Very slow TCP writes can hang a Weave process #445
Comments
What's the easiest way to reproduce this?
I can reproduce this by starting 25 Weave peers then letting them all connect at once. It appears to be another deadlock: the other side of the cycle is:
In other words, when the TCP buffers fill up between peers, the fact that we send out new updates directly from the receiving thread means that we stop reading more, so neither side makes progress. In this context, setting a deadline on writes is insufficient - we need to decide what to do with further outgoing messages, to avoid blocking. I can see three alternatives:
This was referenced Mar 16, 2015
rade added a commit to rade/weave that referenced this issue Mar 24, 2015
Introduce an intermediary - GossipSender - between GossipChannel and Connection. This accumulates gossip data (now represented by the new GossipData interface) from the former until it can be passed onto the latter, allowing the former to proceed when the latter may be blocked on i/o.

This
1) prevents deadlocks that arise from cycles in the communication topology
2) improves performance by allowing GossipChannel and its calling code to proceed when connection i/o is blocked
3) improves performance by accumulating GossipData - the accumulated data is often considerably more compact than the sum of all the accumulated bits, and it is transmitted in a single communication event.

In order to support accumulation, the GossipData interface has a Merge method. Furthermore, encoding is deferred until the data can actually be sent, since accumulation would be hard/impossible on already-encoded data. This required changes to some interfaces, most notably Gossiper.

The implementation of GossipData for topology gossip is TopologyGossipData. It carries a reference to Peers and a set of PeerNames, indexing into Peers and referencing the Peer entries which have changed. Previously updates were represented as a list of Peer entries. The indirection via PeerNames allows Encode to omit entries which have been removed. The change in representation required changes to the signature of some methods on Peers.

Note that all the above only applies to the "periodic gossip" portion of the Gossiper API, which topology gossip in fact uses for *all* gossip. GossipUnicast and GossipBroadcast are unchanged, but could conceivably receive the same treatment in the future.

Fixes weaveworks#445.
As seen in this stack trace, if the TCP write gets stuck for some reason, LocalPeer will make no progress, so no connections can be added or removed, the user cannot obtain status, etc.
I think it makes sense to set a deadline on writing similar to what we have for reading.