This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

Tear down connections on prolonged loss of UDP heartbeat #413

Merged
merged 4 commits into from
Feb 25, 2015


awh
Contributor

@awh awh commented Feb 23, 2015

Addresses #373 - detect UDP connectivity breakage.

Once a connection moves to the established state, indicating that the remote peer has received one of our heartbeats, a timer is started. If we do not receive a UDP heartbeat from the remote peer before the timer expires (the default timeout is three times the slow heartbeat interval), the connection is terminated:

connection shutting down due to error: timed out waiting for UDP heartbeat

At this point the existing connection resumption mechanism takes over. In normal operation, the timer is reset each time we receive a heartbeat.

Implementation notes:

  • It would be possible to merge establishedTimeout and heartbeatTimeout into a single timer; we would simply adjust duration and error messages depending on conn.established state.
  • Alternatively, the heartbeat timeout could run at the same time as the established timeout rather than being kicked off once we know our heartbeat has got through to the other side.

@@ -329,6 +332,9 @@ func (conn *LocalConnection) handleReceivedHeartbeat(remoteUDPAddr *net.UDPAddr)
conn.remoteUDPAddr = remoteUDPAddr
conn.receivedHeartbeat = true
conn.Unlock()
if conn.established {


@rade
Member

rade commented Feb 23, 2015

What is the rationale for starting the heartbeat timer when becoming 'established'? The latter tells us that our heartbeats did get through, which says nothing about the other direction, which is what the timer is watching for.

Wouldn't it be better to instead start the timer after sending 'established' to the other side?

@rade
Member

rade commented Feb 23, 2015

> Wouldn't it be better to instead start the timer after sending 'established' to the other side?

Or, why not start the timer straight away? That would simplify the code since then a) the timer would always exist, and b) we won't have to check for a condition to start it.

I think that would also allow us to get rid of the 'established' timeout, i.e. we could rely on the heartbeat timeout to tear down the connection instead.

@awh
Contributor Author

awh commented Feb 24, 2015

I was thinking of the case where we are originating from behind NAT; there's no way we can receive a heartbeat from the remote peer until our outbound connection is established, so I deferred starting the timer until that had happened.

I like the idea of simplifying down to a single timer very much, although it would have the effect of slightly reducing the rigour of the test (e.g. timely delivery of the established message over the TCP channel would no longer be checked) - are you happy with that? If not, I'd suggest treating the two cases as completely orthogonal and starting both timers in run so that we can eliminate the conditional checks...

@rade
Member

rade commented Feb 24, 2015

> there's no way we can receive a heartbeat from the remote peer until our outbound connection is established

"almost established", i.e. our UDP packet must have made it across, but the Established TCP message may not have turned up yet.

So technically the initiating side should have a slightly longer timeout, but I can't imagine it's worth the hassle.

> timely delivery of the established message over the TCP channel would no longer be checked

The ReadTimeout puts an upper bound on that.

@awh
Contributor Author

awh commented Feb 25, 2015

Above commits address comments to date.

@rade
Member

rade commented Feb 25, 2015

Great. Does it work? How have you tested this?

@awh
Contributor Author

awh commented Feb 25, 2015

Confirmed with manual testing on two DigitalOcean VMs by using

iptables -I FORWARD 1 -p udp --dport 6783 -j DROP

to prevent weave-related UDP traffic from being forwarded to the docker0 bridge. Tried the following cases:

  1. UDP traffic is blocked prior to launch; ensure that connection establishment times out but continues retrying
  2. Remove block; ensure connection establishes
  3. Restore block; ensure weave notices within configured timeout, tears down connection but continues retrying
  4. Remove block; ensure connection re-establishes
  5. Left running for half an hour to observe stability

I used a combination of docker logs weave and weave status to observe the state on both peers at each step.

Would feel a lot more comfortable with #229 in place! It could use soak/torture testing...

rade added a commit that referenced this pull request Feb 25, 2015
Tear down connections on prolonged loss of UDP heartbeat.

Closes #373.
@rade rade merged commit 47ef9ff into weaveworks:master Feb 25, 2015
@awh awh deleted the 373_detect_udp_connectivity_breakage branch February 25, 2015 20:36
@rade rade modified the milestone: 0.10.0 Apr 18, 2015