Tear down connections on prolonged loss of UDP heartbeat #413

awh · 2015-02-23T15:47:47Z

Addresses #373 - detect UDP connectivity breakage.

Once a connection moves to the established state, indicating that the remote peer has received one of our heartbeats, a timer is started. If we do not receive a UDP heartbeat from the remote peer within this time (by default, three times the slow heartbeat interval), the connection is terminated:

connection shutting down due to error: timed out waiting for UDP heartbeat

At this point the existing connection resumption mechanism takes over. In normal operation, the timer is reset each time we receive a heartbeat.

Implementation notes:

It would be possible to merge establishedTimeout and heartbeatTimeout into a single timer; we would simply adjust duration and error messages depending on conn.established state.
Alternatively, the heartbeat timeout could run at the same time as the established timeout rather than being kicked off once we know our heartbeat has got through to the other side.
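The first option above (one timer whose duration and error depend on the established state) could look roughly like the following. This is a sketch only: `nextTimeout`, the constant values and the wording of the established-timeout error are assumptions made for illustration, and only the heartbeat error text comes from this PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical durations; the real constants are not shown in this thread.
const (
	establishedTimeout = 30 * time.Second
	heartbeatTimeout   = 90 * time.Second
)

var (
	// Assumed wording, for illustration only.
	errEstablishedTimeout = errors.New("timed out waiting for connection to become established")
	// Wording taken from the PR description.
	errHeartbeatTimeout = errors.New("timed out waiting for UDP heartbeat")
)

// nextTimeout (hypothetical helper) returns the duration to arm the single
// timer with, and the error to report if it fires, for the given state.
func nextTimeout(established bool) (time.Duration, error) {
	if established {
		return heartbeatTimeout, errHeartbeatTimeout
	}
	return establishedTimeout, errEstablishedTimeout
}

func main() {
	for _, established := range []bool{false, true} {
		d, err := nextTimeout(established)
		fmt.Printf("established=%v: arm timer for %v, on expiry report %q\n", established, d, err)
	}
}
```

The trade-off is one timer to manage instead of two, at the cost of a branch on the established state each time the timer is re-armed.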

router/connection.go

@@ -329,6 +332,9 @@ func (conn *LocalConnection) handleReceivedHeartbeat(remoteUDPAddr *net.UDPAddr)
 	conn.remoteUDPAddr = remoteUDPAddr
 	conn.receivedHeartbeat = true
 	conn.Unlock()
+	if conn.established {


rade · 2015-02-23T21:38:40Z

What is the rationale for starting the heartbeat timer when becoming 'established'? The latter tells us that our heartbeats did get through, which says nothing about the other direction, which is what the timer is watching for.

Wouldn't it be better to instead start the timer after sending 'established' to the other side?

rade · 2015-02-23T21:50:25Z

> Wouldn't it be better to instead start the timer after sending 'established' to the other side?

Or, why not start the timer straight away? That would simplify the code since then a) the timer would always exist, and b) we won't have to check for a condition to start it.

I think that would also allow us to get rid of the 'established' timeout, i.e. we could rely on the heartbeat timeout to tear down the connection instead.

awh · 2015-02-24T17:44:05Z

I was thinking of the case where we are originating from behind NAT; there's no way we can receive a heartbeat from the remote peer until our outbound connection is established, so I deferred starting the timer until that had happened.

I like the idea of simplifying down to a single timer very much, although it would have the effect of slightly reducing the rigour of the test (e.g. timely delivery of the established message over the TCP channel would no longer be checked) - are you happy with that? If not, I'd suggest treating the two cases as completely orthogonal and starting both timers in run so that we can eliminate the conditional checks...

rade · 2015-02-24T18:00:16Z

> there's no way we can receive a heartbeat from the remote peer until our outbound connection is established

"almost established", i.e. our UDP packet must have made it across, but the Established TCP message may not have turned up yet.

So technically the initiating side should have a slightly longer timeout, but I can't imagine it's worth the hassle.

> timely delivery of the established message over the TCP channel would no longer be checked

The ReadTimeout puts an upper bound on that.

awh · 2015-02-25T11:23:22Z

Above commits address comments to date.

rade · 2015-02-25T15:52:59Z

Great. Does it work? How have you tested this?

awh · 2015-02-25T16:50:53Z

Confirmed with manual testing on two DigitalOcean VMs by using

iptables -I FORWARD 1 -p udp --dport 6783 -j DROP

to prevent weave-related UDP traffic from being forwarded to the docker0 bridge. Tried the following cases:

UDP traffic is blocked prior to launch; ensure that connection establishment times out but continues retrying
Remove block; ensure connection establishes
Restore block; ensure weave notices within configured timeout, tears down connection but continues retrying
Remove block; ensure connection re-establishes
Left running for half an hour to observe stability

I used a combination of docker logs weave and weave status to observe the state on both peers at each step.

Would feel a lot more comfortable with #229 in place! It could use soak/torture testing...

Tear down connections on prolonged loss of UDP heartbeat

0a0ed7a

rade reviewed Feb 23, 2015

awh added 3 commits February 24, 2015 18:31

Simplify to single heartbeat timer

7618ae0

Merge branch 'master' into 373_detect_udp_connectivity_breakage

d031e8d

Conflicts: router/connection.go

Removed obsolete constant

cd168f5

rade added a commit that referenced this pull request Feb 25, 2015

Merge pull request #413 from awh/373_detect_udp_connectivity_breakage

47ef9ff

Tear down connections on prolonged loss of UDP heartbeat. Closes #373.

rade merged commit 47ef9ff into weaveworks:master Feb 25, 2015

awh deleted the 373_detect_udp_connectivity_breakage branch February 25, 2015 20:36

rade modified the milestone: 0.10.0 Apr 18, 2015
