Skip to content
This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

bogus unestablished connections due to add/remove race condition #529

Closed
rade opened this issue Apr 8, 2015 · 2 comments · Fixed by #530
Closed

bogus unestablished connections due to add/remove race condition #529

rade opened this issue Apr 8, 2015 · 2 comments · Fixed by #530
Labels
Milestone

Comments

@rade
Copy link
Member

rade commented Apr 8, 2015

Connection.run() runs LocalPeer.AddConnection() in a separate goroutine. There's a comment explaining why that is both necessary and safe. The latter turns out to be wrong. If LocalConnection.actorLoop() terminates early on, the LocalConnection.RemoveConnection() call in the deferred LocalConection.shutdown() may overtake the AddConnection().

Note that this actually cannot happen that easily. Ordinarily LocalConnection.actorLoop() will only terminate when processing a LocalConnection.Shutdown(). And that simply cannot get invoked until after LocalPeer.AddConnection() has been called since only then it is possible to grab hold of a reference to this LocalConnection.

But there is one other termination case: failure to receive a UDP heartbeat for HeartbeatTimeout(30) seconds. This does require the LocalPeer.AddConnection() call to be delayed by at least the same amount of time, which should only happen in unusual circumstances, e.g. massive load, paging, weird VM pausing/stalling behaviour, etc. "Massive load" is how I ran into this, namely running 50 weaves on my machine.

I subsequently managed to reproduce the issue with just two weaves using

diff --git a/router/connection.go b/router/connection.go
index 9c47834..d9fc2cb 100644
--- a/router/connection.go
+++ b/router/connection.go
@@ -304,6 +304,7 @@ func (conn *LocalConnection) run(actionChan <-chan Connectio
        // main connection loop, and c) it guards against potential
        // deadlocks.
        go func() {
+               time.Sleep(HeartbeatTimeout + 1*time.Second)
                conn.Router.Ourself.AddConnection(conn)
                conn.receiveTCP(dec)
        }()

The logs on the two weaves show

weave 2015/04/08 22:22:36.562526 Our name is 7a:fb:3b:8e:25:78(weave2)
weave 2015/04/08 22:22:37.507724 ->[172.17.4.147:6783] attempting connection
weave 2015/04/08 22:22:37.508453 ->[172.17.4.147:6783] completed handshake with 7a:a5:14:62:2e:d5(weave1)
weave 2015/04/08 22:23:07.508836 ->[7a:a5:14:62:2e:d5(weave1)]: connection shutting down due to error: timed out waiting for UDP heartbeat
weave 2015/04/08 22:23:08.508874 ->[7a:a5:14:62:2e:d5(weave1)]: connection added (new peer)
weave 2015/04/08 22:22:35.205254 Our name is 7a:a5:14:62:2e:d5(weave1)
weave 2015/04/08 22:22:37.507948 ->[172.17.4.148:48097] connection accepted
weave 2015/04/08 22:22:37.508387 ->[172.17.4.148:48097] completed handshake with 7a:fb:3b:8e:25:78(weave2)
weave 2015/04/08 22:23:07.508680 ->[7a:fb:3b:8e:25:78(weave2)]: connection shutting down due to error: timed out waiting for UDP heartbeat
weave 2015/04/08 22:23:08.508758 ->[7a:fb:3b:8e:25:78(weave2)]: connection added (new peer)

and the status shows

Our name is 7a:fb:3b:8e:25:78(weave2)
...
Peers:
Peer 7a:fb:3b:8e:25:78(weave2) (v1) (UID 7828146128608141902)
   -> 7a:a5:14:62:2e:d5(weave1) [172.17.4.147:6783 (unestablished)]
Peer 7a:a5:14:62:2e:d5(weave1) (v0) (UID 8931467520931168315)
Our name is 7a:a5:14:62:2e:d5(weave1)
...
Peers:
Peer 7a:a5:14:62:2e:d5(weave1) (v1) (UID 8931467520931168315)
   -> 7a:fb:3b:8e:25:78(weave2) [172.17.4.148:48097 (unestablished)]
Peer 7a:fb:3b:8e:25:78(weave2) (v0) (UID 7828146128608141902)
@rade rade added the bug label Apr 8, 2015
@rade
Copy link
Member Author

rade commented Apr 8, 2015

This could well be the root cause of #515. And #341.

bboreham added a commit that referenced this issue Apr 9, 2015
…_race

Very subtle, but it looks ok to me.  Fixes #529
@bboreham
Copy link
Contributor

bboreham commented Apr 9, 2015

Prior to recent decoupling of gossip send, LocalPeer.AddConnection() could get delayed by the outgoing TCP buffer being full.

@rade rade modified the milestone: 0.10.0 Apr 18, 2015
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants