This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

All peers become "unestablished" #515

Closed
thomascramer opened this issue Apr 5, 2015 · 39 comments · Fixed by #565

Comments

@thomascramer

I'm not sure whether the cause is the weave image that was pushed out last week or the fact that we are now testing with more instances, but at some point, when we spin up new weave instances or restart existing ones (i.e. weave stop, leave it down for a while, then start it again, to simulate a rolling update, a network interruption, or a system restart), the weave peers start having connectivity issues, and eventually all weave peers across all weave instances show up as "unestablished". Containers connected via weave are also unable to route traffic while this is happening. Sometimes waiting a couple of hours resolves the issue, but often the only way to recover seems to be to shut down instances until things are happy again. We are very eager to use weave, but we can't proceed with it in this state.

I'm currently on version:

# weave version
weave script (unreleased version)
weave router git-b00be096f78a
Unable to find zettio/weavedns:latest image.

I'm also working on EC2 instances inside a VPC; all the EC2 instances are in the same region and shouldn't have issues communicating with one another. We are currently trying to wire up 12 instances together (I have set -connlimit 50).

The last time it happened, I saw the following on one of my servers:

weave 2015/04/05 09:04:58.507960 ->[7a:4c:0a:20:2e:0b(load-xxx-instance2)]: Effective PMTU set to 8921
weave 2015/04/05 09:04:58.585725 ->[7a:4c:0a:20:2e:0b(load-xxx-instance2)]: Effective PMTU verified at 8921
weave 2015/04/05 09:05:40.208252 Discovered remote MAC 92:ba:fd:55:a1:5a at 7a:4c:0a:20:2e:0b(load-xxx-instance2)
weave 2015/04/05 09:05:55.784310 Discovered remote MAC 6a:9d:16:ad:36:82 at 7a:ce:9d:00:a1:1d(load-logstash-instance2)
weave 2015/04/05 09:06:11.643978 ->[10.0.2.161:41710] connection accepted
weave 2015/04/05 09:06:11.644630 ->[10.0.2.161:41710] connection shutting down due to error during handshake: Already have connection to 7a:b7:0a:c0:f3:f0(dev-yyy-instance3) at 10.0.2.161:6783
weave 2015/04/05 09:08:25.917098 ->[10.0.6.155:46605] connection accepted
weave 2015/04/05 09:08:25.917756 ->[10.0.6.155:46605] completed handshake with 7a:35:ff:e0:ad:29(demo-xxx-instance1)
weave 2015/04/05 09:08:25.917825 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection added (new peer)
weave 2015/04/05 09:08:25.932458 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection fully established
weave 2015/04/05 09:08:25.933251 EMSGSIZE on send, expecting PMTU update (IP packet was 60082 bytes, payload was 60074 bytes)
weave 2015/04/05 09:08:25.933283 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: Effective PMTU set to 8921
weave 2015/04/05 09:08:26.015285 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: Effective PMTU verified at 8921
weave 2015/04/05 09:08:48.961003 ->[7a:3d:12:77:03:8b(dev-xxx-instance1)]: connection shutting down due to error: read tcp4 10.0.2.5:40315: connection reset by peer
weave 2015/04/05 09:08:48.961140 ->[7a:3d:12:77:03:8b(dev-xxx-instance1)]: connection deleted
weave 2015/04/05 09:08:48.962933 ->[10.0.2.5:6783] attempting connection
weave 2015/04/05 09:08:48.963378 ->[10.0.2.5:44127] connection accepted
weave 2015/04/05 09:08:48.963923 ->[10.0.2.5:44127] completed handshake with 7a:3d:12:77:03:8b(dev-xxx-instance1)
weave 2015/04/05 09:08:48.964479 ->[7a:3d:12:77:03:8b(dev-xxx-instance1)]: connection added
weave 2015/04/05 09:08:48.966333 ->[10.0.2.5:6783] completed handshake with 7a:3d:12:77:03:8b(dev-xxx-instance1)
weave 2015/04/05 09:08:48.967129 ->[7a:3d:12:77:03:8b(dev-xxx-instance1)]: connection shutting down due to error: read tcp4 10.0.2.5:44127: connection reset by peer
weave 2015/04/05 09:08:48.967307 ->[7a:3d:12:77:03:8b(dev-xxx-instance1)]: connection deleted
weave 2015/04/05 09:08:48.971950 ->[7a:3d:12:77:03:8b(dev-xxx-instance1)]: connection added
weave 2015/04/05 09:08:50.047939 ->[10.0.2.5:44142] connection accepted
weave 2015/04/05 09:08:50.048831 ->[10.0.2.5:44142] completed handshake with 7a:3d:12:77:03:8b(dev-xxx-instance1)
weave 2015/04/05 09:08:50.048955 ->[7a:3d:12:77:03:8b(dev-xxx-instance1)]: connection shutting down due to error: Multiple connections to 7a:3d:12:77:03:8b(dev-xxx-instance1) added to 7a:1e:06:5f:50:07(dev-yyy-instance1)
weave 2015/04/05 09:08:51.195830 Received packet for unknown destination: 7a:3d:12:77:03:8b(dev-xxx-instance1)
weave 2015/04/05 09:08:51.967899 ->[10.0.2.5:44154] connection accepted
weave 2015/04/05 09:08:51.968583 ->[10.0.2.5:44154] completed handshake with 7a:3d:12:77:03:8b(dev-xxx-instance1)
weave 2015/04/05 09:08:51.968694 ->[7a:3d:12:77:03:8b(dev-xxx-instance1)]: connection shutting down due to error: Multiple connections to 7a:3d:12:77:03:8b(dev-xxx-instance1) added to 7a:1e:06:5f:50:07(dev-yyy-instance1)
weave 2015/04/05 09:08:52.193015 Received packet for unknown destination: 7a:3d:12:77:03:8b(dev-xxx-instance1)
weave 2015/04/05 09:08:53.168011 Received packet for unknown destination: 7a:3d:12:77:03:8b(dev-xxx-instance1)
weave 2015/04/05 09:08:53.196992 Received packet for unknown destination: 7a:3d:12:77:03:8b(dev-xxx-instance1)
weave 2015/04/05 09:08:53.409856 ->[10.0.6.42:59344] connection accepted
weave 2015/04/05 09:08:53.409887 ->[7a:0e:0c:7a:dd:fe(dev-zzz-instance1)]: connection shutting down due to error: read tcp4 10.0.6.42:56274: connection reset by peer
weave 2015/04/05 09:08:53.410055 ->[7a:0e:0c:7a:dd:fe(dev-zzz-instance1)]: connection deleted
weave 2015/04/05 09:08:53.410694 ->[10.0.6.42:59344] completed handshake with 7a:0e:0c:7a:dd:fe(dev-zzz-instance1)
weave 2015/04/05 09:08:53.410870 ->[7a:0e:0c:7a:dd:fe(dev-zzz-instance1)]: connection added
weave 2015/04/05 09:08:54.169060 Received packet for unknown destination: 7a:3d:12:77:03:8b(dev-xxx-instance1)
weave 2015/04/05 09:08:54.169121 Received packet for unknown destination: 7a:0e:0c:7a:dd:fe(dev-zzz-instance1)
weave 2015/04/05 09:08:54.659941 ->[10.0.2.5:44180] connection accepted
weave 2015/04/05 09:08:54.660499 ->[10.0.2.5:44180] completed handshake with 7a:3d:12:77:03:8b(dev-xxx-instance1)
weave 2015/04/05 09:08:54.660559 ->[7a:3d:12:77:03:8b(dev-xxx-instance1)]: connection deleted
weave 2015/04/05 09:08:54.660632 ->[7a:3d:12:77:03:8b(dev-xxx-instance1)]: connection added (new peer)
weave 2015/04/05 09:08:54.663561 ->[7a:3d:12:77:03:8b(dev-xxx-instance1)]: connection shutting down due to error: Multiple connections to 7a:3d:12:77:03:8b(dev-xxx-instance1) added to 7a:1e:06:5f:50:07(dev-yyy-instance1)
weave 2015/04/05 09:08:54.663941 Received packet for unknown destination: 7a:3d:12:77:03:8b(dev-xxx-instance1)
weave 2015/04/05 09:08:54.857093 Received packet for unknown destination: 7a:3d:12:77:03:8b(dev-xxx-instance1)
weave 2015/04/05 09:08:55.853065 Received packet for unknown destination: 7a:3d:12:77:03:8b(dev-xxx-instance1)

It looks like it was able to connect to the new server "demo-xxx-instance1", and afterwards machines started logging "Received packet for unknown destination".
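To make that pattern easier to see across several hosts, here is a rough Go sketch that tallies a few of the error messages per peer nickname. It is only an illustration, not part of weave: the function name tally, the marker list, and the regular expression are my own assumptions, based purely on the line format in the excerpt above, and the sample lines are copied from that excerpt.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// peerRe matches the "mac(nickname)" form seen in the log excerpt,
// e.g. "7a:3d:12:77:03:8b(dev-xxx-instance1)". The format is assumed
// from the excerpt, not taken from any weave specification.
var peerRe = regexp.MustCompile(`(?:[0-9a-f]{2}:){5}[0-9a-f]{2}\(([^)]+)\)`)

// markers are message fragments copied from the log lines in this report.
var markers = []string{
	"Received packet for unknown destination",
	"Multiple connections to",
	"connection reset by peer",
	"timed out waiting for UDP heartbeat",
}

// sample holds four lines copied verbatim from the first log excerpt.
const sample = `weave 2015/04/05 09:08:48.961003 ->[7a:3d:12:77:03:8b(dev-xxx-instance1)]: connection shutting down due to error: read tcp4 10.0.2.5:40315: connection reset by peer
weave 2015/04/05 09:08:50.048955 ->[7a:3d:12:77:03:8b(dev-xxx-instance1)]: connection shutting down due to error: Multiple connections to 7a:3d:12:77:03:8b(dev-xxx-instance1) added to 7a:1e:06:5f:50:07(dev-yyy-instance1)
weave 2015/04/05 09:08:51.195830 Received packet for unknown destination: 7a:3d:12:77:03:8b(dev-xxx-instance1)
weave 2015/04/05 09:08:52.193015 Received packet for unknown destination: 7a:3d:12:77:03:8b(dev-xxx-instance1)
`

// tally returns, for each marker, how many lines mention each peer.
// Only the first peer on a line is counted, and each line is attributed
// to the first marker it contains.
func tally(log string) map[string]map[string]int {
	counts := make(map[string]map[string]int)
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		line := sc.Text()
		for _, m := range markers {
			if !strings.Contains(line, m) {
				continue
			}
			peer := "(unknown)"
			if match := peerRe.FindStringSubmatch(line); match != nil {
				peer = match[1]
			}
			if counts[m] == nil {
				counts[m] = make(map[string]int)
			}
			counts[m][peer]++
			break
		}
	}
	return counts
}

func main() {
	counts := tally(sample)
	// Print in marker order, with peers sorted, so the output is stable.
	for _, m := range markers {
		peers := make([]string, 0, len(counts[m]))
		for p := range counts[m] {
			peers = append(peers, p)
		}
		sort.Strings(peers)
		for _, p := range peers {
			fmt.Printf("%-42s %-22s %d\n", m, p, counts[m][p])
		}
	}
}
```

Feeding each host's full weave log into tally instead of the sample should show which peers the "unknown destination" and "Multiple connections" messages cluster around, and whether that lines up with the new instance joining.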

On the first server that seemingly had issues, dev-xxx-instance1, its logs show:

weave 2015/04/05 09:08:25.923434 ->[10.0.6.155:36708] connection accepted
weave 2015/04/05 09:08:25.925782 ->[10.0.6.155:6783] completed handshake with 7a:35:ff:e0:ad:29(demo-xxx-instance1)
weave 2015/04/05 09:08:25.925918 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection added (new peer)
weave 2015/04/05 09:08:25.935814 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection fully established
weave 2015/04/05 09:08:25.936201 EMSGSIZE on send, expecting PMTU update (IP packet was 60082 bytes, payload was 60074 bytes)
weave 2015/04/05 09:08:25.936296 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: Effective PMTU set to 8921
weave 2015/04/05 09:08:25.967294 ->[10.0.6.155:36708] connection shutting down due to error during handshake: Already have connection to 7a:35:ff:e0:ad:29(demo-xxx-instance1) at 10.0.6.155:6783
weave 2015/04/05 09:08:25.983811 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection shutting down due to error: Failed to decode packet: decryption failed; Suspected replay attack detected when decrypting UDP packet
weave 2015/04/05 09:08:25.984022 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection deleted
weave 2015/04/05 09:08:25.987252 ->[10.0.6.155:6783] attempting connection
weave 2015/04/05 09:08:25.994715 ->[10.0.6.155:6783] completed handshake with 7a:35:ff:e0:ad:29(demo-xxx-instance1)
weave 2015/04/05 09:08:25.995021 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection added
weave 2015/04/05 09:08:26.496358 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection shutting down due to error: write tcp4 10.0.6.155:6783: connection reset by peer
weave 2015/04/05 09:08:26.497218 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection deleted
weave 2015/04/05 09:08:26.497509 ->[10.0.6.155:6783] attempting connection
weave 2015/04/05 09:08:26.504692 ->[10.0.6.155:6783] completed handshake with 7a:35:ff:e0:ad:29(demo-xxx-instance1)
weave 2015/04/05 09:08:26.505891 ->[10.0.6.155:36726] connection accepted
weave 2015/04/05 09:08:26.515607 ->[10.0.6.155:36726] completed handshake with 7a:35:ff:e0:ad:29(demo-xxx-instance1)
weave 2015/04/05 09:08:26.528760 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection added
weave 2015/04/05 09:08:26.528887 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection deleted
weave 2015/04/05 09:08:26.529033 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection added
weave 2015/04/05 09:08:26.541359 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection shutting down due to error: Multiple connections to 7a:35:ff:e0:ad:29(demo-xxx-instance1) added to 7a:3d:12:77:03:8b(dev-xxx-instance1)
weave 2015/04/05 09:08:26.606384 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection shutting down due to error: write tcp4 10.0.6.155:36726: connection reset by peer
weave 2015/04/05 09:08:26.606549 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection deleted
weave 2015/04/05 09:08:26.615125 ->[10.0.6.155:6783] attempting connection
weave 2015/04/05 09:08:26.615313 ->[10.0.6.155:36733] connection accepted
weave 2015/04/05 09:08:26.616143 ->[10.0.6.155:36733] completed handshake with 7a:35:ff:e0:ad:29(demo-xxx-instance1)
weave 2015/04/05 09:08:26.621874 ->[10.0.6.155:6783] completed handshake with 7a:35:ff:e0:ad:29(demo-xxx-instance1)
weave 2015/04/05 09:08:26.630949 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection added
weave 2015/04/05 09:08:26.642065 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection shutting down due to error: Multiple connections to 7a:35:ff:e0:ad:29(demo-xxx-instance1) added to 7a:3d:12:77:03:8b(dev-xxx-instance1)
weave 2015/04/05 09:08:27.121552 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection shutting down due to error: read tcp4 10.0.6.155:36733: connection reset by peer
weave 2015/04/05 09:08:27.121797 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection deleted
weave 2015/04/05 09:08:27.123131 ->[10.0.6.155:6783] attempting connection
weave 2015/04/05 09:08:27.127139 ->[10.0.6.155:36743] connection accepted
weave 2015/04/05 09:08:27.128378 ->[10.0.6.155:6783] completed handshake with 7a:35:ff:e0:ad:29(demo-xxx-instance1)
weave 2015/04/05 09:08:27.129155 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection added
weave 2015/04/05 09:08:27.132866 ->[10.0.6.155:36743] completed handshake with 7a:35:ff:e0:ad:29(demo-xxx-instance1)
weave 2015/04/05 09:08:27.133033 ->[7a:35:ff:e0:ad:29(demo-xxx-instance1)]: connection shutting down due to error: Multiple connections to 7a:35:ff:e0:ad:29(demo-xxx-instance1) added to 7a:3d:12:77:03:8b(dev-xxx-instance1)
weave 2015/04/05 09:08:48.199385 ->[7a:a3:3d:ff:e3:9b(load-xxx-instance1)]: connection shutting down due to error: timed out waiting for UDP heartbeat
weave 2015/04/05 09:08:48.199527 ->[7a:a3:3d:ff:e3:9b(load-xxx-instance1)]: connection deleted
weave 2015/04/05 09:08:48.201342 ->[10.0.2.189:6783] attempting connection
weave 2015/04/05 09:08:48.202971 ->[10.0.2.189:42206] connection accepted
weave 2015/04/05 09:08:48.204723 ->[10.0.2.189:42206] completed handshake with 7a:a3:3d:ff:e3:9b(load-xxx-instance1)
weave 2015/04/05 09:08:48.205011 ->[7a:a3:3d:ff:e3:9b(load-xxx-instance1)]: connection added
weave 2015/04/05 09:08:48.206166 ->[10.0.2.189:6783] completed handshake with 7a:a3:3d:ff:e3:9b(load-xxx-instance1)
weave 2015/04/05 09:08:48.207652 ->[7a:a3:3d:ff:e3:9b(load-xxx-instance1)]: connection shutting down due to error: write tcp4 10.0.2.189:6783: connection reset by peer
weave 2015/04/05 09:08:48.800278 ->[7a:44:3a:8c:91:fb(load-xxx-instance3)]: connection shutting down due to error: timed out waiting for UDP heartbeat
weave 2015/04/05 09:08:48.800409 ->[7a:44:3a:8c:91:fb(load-xxx-instance3)]: connection deleted
weave 2015/04/05 09:08:48.802190 ->[10.0.6.78:6783] attempting connection
weave 2015/04/05 09:08:48.806310 ->[10.0.6.78:44369] connection accepted
weave 2015/04/05 09:08:48.807851 ->[10.0.6.78:44369] completed handshake with 7a:44:3a:8c:91:fb(load-xxx-instance3)
weave 2015/04/05 09:08:48.807959 ->[7a:44:3a:8c:91:fb(load-xxx-instance3)]: connection added
weave 2015/04/05 09:08:48.808631 ->[10.0.6.78:6783] completed handshake with 7a:44:3a:8c:91:fb(load-xxx-instance3)
weave 2015/04/05 09:08:48.811043 ->[7a:44:3a:8c:91:fb(load-xxx-instance3)]: connection deleted
weave 2015/04/05 09:08:48.811107 ->[7a:44:3a:8c:91:fb(load-xxx-instance3)]: connection added
weave 2015/04/05 09:08:48.813256 ->[7a:44:3a:8c:91:fb(load-xxx-instance3)]: connection shutting down due to error: Multiple connections to 7a:44:3a:8c:91:fb(load-xxx-instance3) added to 7a:3d:12:77:03:8b(dev-xxx-instance1)
weave 2015/04/05 09:08:48.847521 ->[7a:4c:0a:20:2e:0b(load-xxx-instance2)]: connection shutting down due to error: timed out waiting for UDP heartbeat
weave 2015/04/05 09:08:48.847570 ->[7a:4c:0a:20:2e:0b(load-xxx-instance2)]: connection deleted
weave 2015/04/05 09:08:48.849113 ->[10.0.2.32:6783] attempting connection
weave 2015/04/05 09:08:48.871939 ->[10.0.2.32:6783] completed handshake with 7a:4c:0a:20:2e:0b(load-xxx-instance2)
weave 2015/04/05 09:08:48.872131 ->[7a:4c:0a:20:2e:0b(load-xxx-instance2)]: connection added

I also grabbed the dump from doing SIGQUIT:

weave 2015/04/05 10:35:00.572950 === received SIGQUIT ===
*** goroutine dump...
goroutine 1 [running]:
main.handleSignals(0xc209116200)
    /home/matthias/go/src/github.com/zettio/weave/weaver/main.go:209 +0x30c
main.main()
    /home/matthias/go/src/github.com/zettio/weave/weaver/main.go:128 +0x124d

goroutine 5 [syscall]:
os/signal.loop()
    /usr/local/go/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
    /usr/local/go/src/os/signal/signal_unix.go:27 +0x35

goroutine 17 [syscall, 122 minutes, locked to thread]:
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:2232 +0x1

goroutine 18 [select]:
github.com/zettio/weave/router.(*LocalPeer).actorLoop(0xc20910e180, 0xc2091260c0)
    /home/matthias/go/src/github.com/zettio/weave/router/local_peer.go:157 +0x101
created by github.com/zettio/weave/router.(*LocalPeer).Start
    /home/matthias/go/src/github.com/zettio/weave/router/local_peer.go:25 +0x7e

goroutine 19 [chan receive]:
github.com/zettio/weave/router.(*Routes).actorLoop(0xc2091142c0, 0xc209126120)
    /home/matthias/go/src/github.com/zettio/weave/router/routes.go:87 +0x47
created by github.com/zettio/weave/router.(*Routes).Start
    /home/matthias/go/src/github.com/zettio/weave/router/routes.go:32 +0x7e

goroutine 20 [select]:
github.com/zettio/weave/router.(*ConnectionMaker).queryLoop(0xc209112300, 0xc209126180)
    /home/matthias/go/src/github.com/zettio/weave/router/connection_maker.go:99 +0x162
created by github.com/zettio/weave/router.(*ConnectionMaker).Start
    /home/matthias/go/src/github.com/zettio/weave/router/connection_maker.go:45 +0x7e

goroutine 21 [chan receive]:
github.com/zettio/weave/router.(*NaClDecryptorInstance).advanceState(0xc20a919290, 0xc209f00000, 0xa, 0x0, 0x0, 0x0)
    /home/matthias/go/src/github.com/zettio/weave/router/crypto.go:378 +0xd9
github.com/zettio/weave/router.(*NaClDecryptor).decrypt(0xc20a91ce20, 0xc2094618c0, 0x36, 0x36, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/matthias/go/src/github.com/zettio/weave/router/crypto.go:353 +0x11c
github.com/zettio/weave/router.(*NaClDecryptor).IterateFrames(0xc20a91ce20, 0xc2094618c0, 0x36, 0x36, 0xc20946d230, 0x0, 0x0)
    /home/matthias/go/src/github.com/zettio/weave/router/crypto.go:336 +0x72
github.com/zettio/weave/router.(*Router).udpReader(0xc209116200, 0xc209124028, 0x7f8dadf5d820, 0xc209124018)
    /home/matthias/go/src/github.com/zettio/weave/router/router.go:242 +0x64c
created by github.com/zettio/weave/router.(*Router).listenUDP
    /home/matthias/go/src/github.com/zettio/weave/router/router.go:212 +0x324

goroutine 22 [IO wait]:
net.(*pollDesc).Wait(0xc20912a1b0, 0x72, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc20912a1b0, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).accept(0xc20912a150, 0x0, 0x7f8daf999730, 0xc208e4b2e8)
    /usr/local/go/src/net/fd_unix.go:419 +0x40b
net.(*TCPListener).AcceptTCP(0xc209124048, 0xc20a2740b8, 0x0, 0x0)
    /usr/local/go/src/net/tcpsock_posix.go:234 +0x4e
github.com/zettio/weave/router.func·026()
    /home/matthias/go/src/github.com/zettio/weave/router/router.go:179 +0x5e
created by github.com/zettio/weave/router.(*Router).listenTCP
    /home/matthias/go/src/github.com/zettio/weave/router/router.go:186 +0x2e4

goroutine 23 [syscall, locked to thread]:
code.google.com/p/gopacket/pcap._Cfunc_pcap_next_ex(0x2184f30, 0xc2091143e8, 0xc2091143f0, 0x0)
    /home/matthias/go/src/code.google.com/p/gopacket/pcap/:172 +0x43
code.google.com/p/gopacket/pcap.(*Handle).getNextBufPtrLocked(0xc2091143c0, 0xc20a278ef8, 0x0, 0x0)
    /home/matthias/go/src/code.google.com/p/gopacket/pcap/pcap.go:302 +0x6b
code.google.com/p/gopacket/pcap.(*Handle).ZeroCopyReadPacketData(0xc2091143c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /home/matthias/go/src/code.google.com/p/gopacket/pcap/pcap.go:334 +0xb0
github.com/zettio/weave/router.(*PcapIO).ReadPacket(0xc209124010, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/matthias/go/src/github.com/zettio/weave/router/pcap.go:64 +0x67
github.com/zettio/weave/router.func·025()
    /home/matthias/go/src/github.com/zettio/weave/router/router.go:114 +0x5d
created by github.com/zettio/weave/router.(*Router).sniff
    /home/matthias/go/src/github.com/zettio/weave/router/router.go:119 +0x40d

goroutine 24 [IO wait]:
net.(*pollDesc).Wait(0xc20912b170, 0x72, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc20912b170, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).accept(0xc20912b110, 0x0, 0x7f8daf999730, 0xc20900c720)
    /usr/local/go/src/net/fd_unix.go:419 +0x40b
net.(*TCPListener).AcceptTCP(0xc2091240b0, 0x559ade, 0x0, 0x0)
    /usr/local/go/src/net/tcpsock_posix.go:234 +0x4e
net/http.tcpKeepAliveListener.Accept(0xc2091240b0, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:1976 +0x4c
net/http.(*Server).Serve(0xc209126540, 0x7f8dadf5eea0, 0xc2091240b0, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:1728 +0x92
net/http.(*Server).ListenAndServe(0xc209126540, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:1718 +0x154
net/http.ListenAndServe(0xc20911e840, 0x5, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:1808 +0xba
main.handleHTTP(0xc209116200)
    /home/matthias/go/src/github.com/zettio/weave/weaver/main.go:195 +0x723
created by main.main
    /home/matthias/go/src/github.com/zettio/weave/weaver/main.go:127 +0x123d

goroutine 33562 [chan receive]:
github.com/zettio/weave/router.(*GossipSender).run(0xc20a54fb40)
    /home/matthias/go/src/github.com/zettio/weave/router/gossip.go:56 +0x54
created by github.com/zettio/weave/router.(*GossipSender).Start
    /home/matthias/go/src/github.com/zettio/weave/router/gossip.go:51 +0x6f

goroutine 33557 [IO wait]:
net.(*pollDesc).Wait(0xc209a0aed0, 0x72, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc209a0aed0, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).Read(0xc209a0ae70, 0xc20a283000, 0x1000, 0x1000, 0x0, 0x7f8daf999730, 0xc208dfe7a8)
    /usr/local/go/src/net/fd_unix.go:242 +0x40f
net.(*conn).Read(0xc209cb8bc8, 0xc20a283000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:121 +0xdc
bufio.(*Reader).fill(0xc20a970420)
    /usr/local/go/src/bufio/bufio.go:97 +0x1ce
bufio.(*Reader).Read(0xc20a970420, 0xc208d70560, 0x1, 0x9, 0x100, 0x0, 0x0)
    /usr/local/go/src/bufio/bufio.go:174 +0x26c
io.ReadAtLeast(0x7f8dadf5d690, 0xc20a970420, 0xc208d70560, 0x1, 0x9, 0x1, 0x0, 0x0, 0x0)
    /usr/local/go/src/io/io.go:298 +0xf1
io.ReadFull(0x7f8dadf5d690, 0xc20a970420, 0xc208d70560, 0x1, 0x9, 0x40, 0x0, 0x0)
    /usr/local/go/src/io/io.go:316 +0x6d
encoding/gob.decodeUintReader(0x7f8dadf5d690, 0xc20a970420, 0xc208d70560, 0x9, 0x9, 0x0, 0x1, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decode.go:121 +0x99
encoding/gob.(*Decoder).recvMessage(0xc20a280380, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:76 +0x55
encoding/gob.(*Decoder).decodeTypeSequence(0xc20a280380, 0xc2098a1100, 0x16)
    /usr/local/go/src/encoding/gob/decoder.go:140 +0x47
encoding/gob.(*Decoder).DecodeValue(0xc20a280380, 0x7cfee0, 0xc2098a1120, 0x16, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:208 +0x192
encoding/gob.(*Decoder).Decode(0xc20a280380, 0x7cfee0, 0xc2098a1120, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:185 +0x297
github.com/zettio/weave/router.(*LocalConnection).receiveTCP(0xc209e17340, 0xc20a280380)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:412 +0x152
github.com/zettio/weave/router.func·005()
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:308 +0x69
created by github.com/zettio/weave/router.(*LocalConnection).run
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:309 +0x518

goroutine 33661 [IO wait]:
net.(*pollDesc).Wait(0xc20a0cafb0, 0x72, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc20a0cafb0, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).Read(0xc20a0caf50, 0xc209f8d000, 0x1000, 0x1000, 0x0, 0x7f8daf999730, 0xc208e15220)
    /usr/local/go/src/net/fd_unix.go:242 +0x40f
net.(*conn).Read(0xc20a2740b8, 0xc209f8d000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:121 +0xdc
bufio.(*Reader).fill(0xc209324c60)
    /usr/local/go/src/bufio/bufio.go:97 +0x1ce
bufio.(*Reader).Read(0xc209324c60, 0xc208e713e0, 0x1, 0x9, 0x100, 0x0, 0x0)
    /usr/local/go/src/bufio/bufio.go:174 +0x26c
io.ReadAtLeast(0x7f8dadf5d690, 0xc209324c60, 0xc208e713e0, 0x1, 0x9, 0x1, 0x0, 0x0, 0x0)
    /usr/local/go/src/io/io.go:298 +0xf1
io.ReadFull(0x7f8dadf5d690, 0xc209324c60, 0xc208e713e0, 0x1, 0x9, 0x40, 0x0, 0x0)
    /usr/local/go/src/io/io.go:316 +0x6d
encoding/gob.decodeUintReader(0x7f8dadf5d690, 0xc209324c60, 0xc208e713e0, 0x9, 0x9, 0x0, 0x1, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decode.go:121 +0x99
encoding/gob.(*Decoder).recvMessage(0xc209ed2e80, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:76 +0x55
encoding/gob.(*Decoder).decodeTypeSequence(0xc209ed2e80, 0xc20a167b00, 0x16)
    /usr/local/go/src/encoding/gob/decoder.go:140 +0x47
encoding/gob.(*Decoder).DecodeValue(0xc209ed2e80, 0x7cfee0, 0xc20a167b00, 0x16, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:208 +0x192
encoding/gob.(*Decoder).Decode(0xc209ed2e80, 0x7cfee0, 0xc20a167b00, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:185 +0x297
github.com/zettio/weave/router.(*LocalConnection).receiveTCP(0xc20a0f20e0, 0xc209ed2e80)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:412 +0x152
github.com/zettio/weave/router.func·005()
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:308 +0x69
created by github.com/zettio/weave/router.(*LocalConnection).run
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:309 +0x518

goroutine 33672 [select]:
github.com/zettio/weave/router.(*Forwarder).run(0xc209405c80, 0xc209a47200, 0xc209a47260, 0x0)
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:244 +0x457
created by github.com/zettio/weave/router.(*Forwarder).Start
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:220 +0xfc

goroutine 33590 [IO wait]:
net.(*pollDesc).Wait(0xc20a28bfe0, 0x72, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc20a28bfe0, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).Read(0xc20a28bf80, 0xc20a077000, 0x1000, 0x1000, 0x0, 0x7f8daf999730, 0xc208e0ada0)
    /usr/local/go/src/net/fd_unix.go:242 +0x40f
net.(*conn).Read(0xc209b9fde8, 0xc20a077000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:121 +0xdc
bufio.(*Reader).fill(0xc20a070420)
    /usr/local/go/src/bufio/bufio.go:97 +0x1ce
bufio.(*Reader).Read(0xc20a070420, 0xc208d3f960, 0x1, 0x9, 0x100, 0x0, 0x0)
    /usr/local/go/src/bufio/bufio.go:174 +0x26c
io.ReadAtLeast(0x7f8dadf5d690, 0xc20a070420, 0xc208d3f960, 0x1, 0x9, 0x1, 0x0, 0x0, 0x0)
    /usr/local/go/src/io/io.go:298 +0xf1
io.ReadFull(0x7f8dadf5d690, 0xc20a070420, 0xc208d3f960, 0x1, 0x9, 0x40, 0x0, 0x0)
    /usr/local/go/src/io/io.go:316 +0x6d
encoding/gob.decodeUintReader(0x7f8dadf5d690, 0xc20a070420, 0xc208d3f960, 0x9, 0x9, 0x0, 0x1, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decode.go:121 +0x99
encoding/gob.(*Decoder).recvMessage(0xc209f6f480, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:76 +0x55
encoding/gob.(*Decoder).decodeTypeSequence(0xc209f6f480, 0xc209284800, 0x16)
    /usr/local/go/src/encoding/gob/decoder.go:140 +0x47
encoding/gob.(*Decoder).DecodeValue(0xc209f6f480, 0x7cfee0, 0xc209284800, 0x16, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:208 +0x192
encoding/gob.(*Decoder).Decode(0xc209f6f480, 0x7cfee0, 0xc209284800, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:185 +0x297
github.com/zettio/weave/router.(*LocalConnection).receiveTCP(0xc20954c700, 0xc209f6f480)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:412 +0x152
github.com/zettio/weave/router.func·005()
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:308 +0x69
created by github.com/zettio/weave/router.(*LocalConnection).run
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:309 +0x518

goroutine 33433 [chan receive]:
github.com/zettio/weave/router.(*GossipSender).run(0xc209f1c260)
    /home/matthias/go/src/github.com/zettio/weave/router/gossip.go:56 +0x54
created by github.com/zettio/weave/router.(*GossipSender).Start
    /home/matthias/go/src/github.com/zettio/weave/router/gossip.go:51 +0x6f

goroutine 33454 [IO wait]:
net.(*pollDesc).Wait(0xc20931ffe0, 0x72, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc20931ffe0, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).Read(0xc20931ff80, 0xc209142000, 0x1000, 0x1000, 0x0, 0x7f8daf999730, 0xc208e05be0)
    /usr/local/go/src/net/fd_unix.go:242 +0x40f
net.(*conn).Read(0xc20aa19f28, 0xc209142000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:121 +0xdc
bufio.(*Reader).fill(0xc2094a3920)
    /usr/local/go/src/bufio/bufio.go:97 +0x1ce
bufio.(*Reader).Read(0xc2094a3920, 0xc208fa7d20, 0x1, 0x9, 0x100, 0x0, 0x0)
    /usr/local/go/src/bufio/bufio.go:174 +0x26c
io.ReadAtLeast(0x7f8dadf5d690, 0xc2094a3920, 0xc208fa7d20, 0x1, 0x9, 0x1, 0x0, 0x0, 0x0)
    /usr/local/go/src/io/io.go:298 +0xf1
io.ReadFull(0x7f8dadf5d690, 0xc2094a3920, 0xc208fa7d20, 0x1, 0x9, 0x40, 0x0, 0x0)
    /usr/local/go/src/io/io.go:316 +0x6d
encoding/gob.decodeUintReader(0x7f8dadf5d690, 0xc2094a3920, 0xc208fa7d20, 0x9, 0x9, 0x0, 0x1, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decode.go:121 +0x99
encoding/gob.(*Decoder).recvMessage(0xc209ca9300, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:76 +0x55
encoding/gob.(*Decoder).decodeTypeSequence(0xc209ca9300, 0xc209285d00, 0x16)
    /usr/local/go/src/encoding/gob/decoder.go:140 +0x47
encoding/gob.(*Decoder).DecodeValue(0xc209ca9300, 0x7cfee0, 0xc209285d40, 0x16, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:208 +0x192
encoding/gob.(*Decoder).Decode(0xc209ca9300, 0x7cfee0, 0xc209285d40, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:185 +0x297
github.com/zettio/weave/router.(*LocalConnection).receiveTCP(0xc20a6600e0, 0xc209ca9300)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:412 +0x152
github.com/zettio/weave/router.func·005()
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:308 +0x69
created by github.com/zettio/weave/router.(*LocalConnection).run
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:309 +0x518

goroutine 33606 [select]:
github.com/zettio/weave/router.(*Forwarder).run(0xc20a281580, 0xc20a26acc0, 0xc20a26ad20, 0xc209e175e0)
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:244 +0x457
created by github.com/zettio/weave/router.(*Forwarder).Start
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:220 +0xfc

goroutine 33591 [select]:
github.com/zettio/weave/router.(*Forwarder).run(0xc20a194380, 0xc20930ff20, 0xc20930ff80, 0x0)
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:244 +0x457
created by github.com/zettio/weave/router.(*Forwarder).Start
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:220 +0xfc

goroutine 33602 [chan receive]:
github.com/zettio/weave/router.(*GossipSender).run(0xc20959de00)
    /home/matthias/go/src/github.com/zettio/weave/router/gossip.go:56 +0x54
created by github.com/zettio/weave/router.(*GossipSender).Start
    /home/matthias/go/src/github.com/zettio/weave/router/gossip.go:51 +0x6f

goroutine 33604 [IO wait]:
net.(*pollDesc).Wait(0xc20a91af40, 0x72, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc20a91af40, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).Read(0xc20a91aee0, 0xc20a4af000, 0x1000, 0x1000, 0x0, 0x7f8daf999730, 0xc208e2df60)
    /usr/local/go/src/net/fd_unix.go:242 +0x40f
net.(*conn).Read(0xc209cb9f98, 0xc20a4af000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:121 +0xdc
bufio.(*Reader).fill(0xc2092f19e0)
    /usr/local/go/src/bufio/bufio.go:97 +0x1ce
bufio.(*Reader).Read(0xc2092f19e0, 0xc208cc26a0, 0x1, 0x9, 0x100, 0x0, 0x0)
    /usr/local/go/src/bufio/bufio.go:174 +0x26c
io.ReadAtLeast(0x7f8dadf5d690, 0xc2092f19e0, 0xc208cc26a0, 0x1, 0x9, 0x1, 0x0, 0x0, 0x0)
    /usr/local/go/src/io/io.go:298 +0xf1
io.ReadFull(0x7f8dadf5d690, 0xc2092f19e0, 0xc208cc26a0, 0x1, 0x9, 0x40, 0x0, 0x0)
    /usr/local/go/src/io/io.go:316 +0x6d
encoding/gob.decodeUintReader(0x7f8dadf5d690, 0xc2092f19e0, 0xc208cc26a0, 0x9, 0x9, 0x0, 0x1, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decode.go:121 +0x99
encoding/gob.(*Decoder).recvMessage(0xc20a280e80, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:76 +0x55
encoding/gob.(*Decoder).decodeTypeSequence(0xc20a280e80, 0xc2092c8000, 0x16)
    /usr/local/go/src/encoding/gob/decoder.go:140 +0x47
encoding/gob.(*Decoder).DecodeValue(0xc20a280e80, 0x7cfee0, 0xc2092c80e0, 0x16, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:208 +0x192
encoding/gob.(*Decoder).Decode(0xc20a280e80, 0x7cfee0, 0xc2092c80e0, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:185 +0x297
github.com/zettio/weave/router.(*LocalConnection).receiveTCP(0xc209e17500, 0xc20a280e80)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:412 +0x152
github.com/zettio/weave/router.func·005()
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:308 +0x69
created by github.com/zettio/weave/router.(*LocalConnection).run
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:309 +0x518

goroutine 33605 [select]:
github.com/zettio/weave/router.(*Forwarder).run(0xc20a281500, 0xc20a26ac00, 0xc20a26ac60, 0x0)
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:244 +0x457
created by github.com/zettio/weave/router.(*Forwarder).Start
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:220 +0xfc

goroutine 33630 [chan receive]:
github.com/zettio/weave/router.(*GossipSender).run(0xc20a754720)
    /home/matthias/go/src/github.com/zettio/weave/router/gossip.go:56 +0x54
created by github.com/zettio/weave/router.(*GossipSender).Start
    /home/matthias/go/src/github.com/zettio/weave/router/gossip.go:51 +0x6f

goroutine 33455 [select]:
github.com/zettio/weave/router.(*Forwarder).run(0xc20975b080, 0xc20a149080, 0xc20a1490e0, 0x0)
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:244 +0x457
created by github.com/zettio/weave/router.(*Forwarder).Start
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:220 +0xfc

goroutine 33607 [chan receive]:
github.com/zettio/weave/router.(*GossipSender).run(0xc209f5a0a0)
    /home/matthias/go/src/github.com/zettio/weave/router/gossip.go:56 +0x54
created by github.com/zettio/weave/router.(*GossipSender).Start
    /home/matthias/go/src/github.com/zettio/weave/router/gossip.go:51 +0x6f

goroutine 33643 [select]:
github.com/zettio/weave/router.(*LocalConnection).actorLoop(0xc20a0f20e0, 0xc209a46180, 0x0, 0x0)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:347 +0x2a2
github.com/zettio/weave/router.(*LocalConnection).run(0xc20a0f20e0, 0xc209a46180, 0xc209a461e0, 0x8ea401)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:327 +0x8ba
created by github.com/zettio/weave/router.(*LocalConnection).Start
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:119 +0xdd

goroutine 33614 [chan receive]:
github.com/zettio/weave/router.(*GossipSender).run(0xc209bda080)
    /home/matthias/go/src/github.com/zettio/weave/router/gossip.go:56 +0x54
created by github.com/zettio/weave/router.(*GossipSender).Start
    /home/matthias/go/src/github.com/zettio/weave/router/gossip.go:51 +0x6f

goroutine 33618 [select]:
github.com/zettio/weave/router.(*Forwarder).run(0xc20962d100, 0xc209598300, 0xc209598360, 0xc20a0f29a0)
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:244 +0x457
created by github.com/zettio/weave/router.(*Forwarder).Start
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:220 +0xfc

goroutine 33592 [select]:
github.com/zettio/weave/router.(*Forwarder).run(0xc20a194400, 0xc20a1a8000, 0xc20a1a8060, 0xc20a6607e0)
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:244 +0x457
created by github.com/zettio/weave/router.(*Forwarder).Start
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:220 +0xfc

goroutine 33669 [select]:
github.com/zettio/weave/router.(*LocalConnection).actorLoop(0xc20a0f28c0, 0xc209a46420, 0x0, 0x0)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:347 +0x2a2
github.com/zettio/weave/router.(*LocalConnection).run(0xc20a0f28c0, 0xc209a46420, 0xc209a46480, 0x8ea400)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:327 +0x8ba
created by github.com/zettio/weave/router.(*LocalConnection).Start
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:119 +0xdd

goroutine 33673 [select]:
github.com/zettio/weave/router.(*Forwarder).run(0xc209405d00, 0xc209a472c0, 0xc209a47320, 0xc20a0f2fc0)
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:244 +0x457
created by github.com/zettio/weave/router.(*Forwarder).Start
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:220 +0xfc

goroutine 33438 [IO wait]:
net.(*pollDesc).Wait(0xc2097b1b10, 0x72, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc2097b1b10, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).Read(0xc2097b1ab0, 0xc20a072000, 0x1000, 0x1000, 0x0, 0x7f8daf999730, 0xc208e259e0)
    /usr/local/go/src/net/fd_unix.go:242 +0x40f
net.(*conn).Read(0xc20a275170, 0xc20a072000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:121 +0xdc
bufio.(*Reader).fill(0xc20a023080)
    /usr/local/go/src/bufio/bufio.go:97 +0x1ce
bufio.(*Reader).Read(0xc20a023080, 0xc208d7d5e0, 0x1, 0x9, 0x100, 0x0, 0x0)
    /usr/local/go/src/bufio/bufio.go:174 +0x26c
io.ReadAtLeast(0x7f8dadf5d690, 0xc20a023080, 0xc208d7d5e0, 0x1, 0x9, 0x1, 0x0, 0x0, 0x0)
    /usr/local/go/src/io/io.go:298 +0xf1
io.ReadFull(0x7f8dadf5d690, 0xc20a023080, 0xc208d7d5e0, 0x1, 0x9, 0x40, 0x0, 0x0)
    /usr/local/go/src/io/io.go:316 +0x6d
encoding/gob.decodeUintReader(0x7f8dadf5d690, 0xc20a023080, 0xc208d7d5e0, 0x9, 0x9, 0x0, 0x1, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decode.go:121 +0x99
encoding/gob.(*Decoder).recvMessage(0xc2097aca00, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:76 +0x55
encoding/gob.(*Decoder).decodeTypeSequence(0xc2097aca00, 0xc209888800, 0x16)
    /usr/local/go/src/encoding/gob/decoder.go:140 +0x47
encoding/gob.(*Decoder).DecodeValue(0xc2097aca00, 0x7cfee0, 0xc2098888e0, 0x16, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:208 +0x192
encoding/gob.(*Decoder).Decode(0xc2097aca00, 0x7cfee0, 0xc2098888e0, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:185 +0x297
github.com/zettio/weave/router.(*LocalConnection).receiveTCP(0xc20a660460, 0xc2097aca00)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:412 +0x152
github.com/zettio/weave/router.func·005()
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:308 +0x69
created by github.com/zettio/weave/router.(*LocalConnection).run
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:309 +0x518

goroutine 33622 [select]:
github.com/zettio/weave/router.(*LocalConnection).actorLoop(0xc20a0f2c40, 0xc209aca180, 0x0, 0x0)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:347 +0x2a2
github.com/zettio/weave/router.(*LocalConnection).run(0xc20a0f2c40, 0xc209aca180, 0xc209aca1e0, 0x8ea401)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:327 +0x8ba
created by github.com/zettio/weave/router.(*LocalConnection).Start
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:119 +0xdd

goroutine 33663 [chan receive]:
github.com/zettio/weave/router.(*GossipSender).run(0xc209ea4500)
    /home/matthias/go/src/github.com/zettio/weave/router/gossip.go:56 +0x54
created by github.com/zettio/weave/router.(*GossipSender).Start
    /home/matthias/go/src/github.com/zettio/weave/router/gossip.go:51 +0x6f

goroutine 33635 [chan receive]:
github.com/zettio/weave/router.(*GossipSender).run(0xc209ae3ec0)
    /home/matthias/go/src/github.com/zettio/weave/router/gossip.go:56 +0x54
created by github.com/zettio/weave/router.(*GossipSender).Start
    /home/matthias/go/src/github.com/zettio/weave/router/gossip.go:51 +0x6f

goroutine 33551 [select]:
github.com/zettio/weave/router.(*LocalConnection).actorLoop(0xc20954c540, 0xc208045f80, 0x0, 0x0)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:347 +0x2a2
github.com/zettio/weave/router.(*LocalConnection).run(0xc20954c540, 0xc208045f80, 0xc20a070000, 0x8ea400)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:327 +0x8ba
created by github.com/zettio/weave/router.(*LocalConnection).Start
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:119 +0xdd

goroutine 33470 [select]:
github.com/zettio/weave/router.(*LocalConnection).actorLoop(0xc20a6600e0, 0xc2094a3800, 0x0, 0x0)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:347 +0x2a2
github.com/zettio/weave/router.(*LocalConnection).run(0xc20a6600e0, 0xc2094a3800, 0xc2094a3860, 0x8ea400)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:327 +0x8ba
created by github.com/zettio/weave/router.(*LocalConnection).Start
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:119 +0xdd

goroutine 33456 [select]:
github.com/zettio/weave/router.(*Forwarder).run(0xc20975b100, 0xc20a149140, 0xc20a1491a0, 0xc209e167e0)
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:244 +0x457
created by github.com/zettio/weave/router.(*Forwarder).Start
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:220 +0xfc

goroutine 33603 [select]:
github.com/zettio/weave/router.(*LocalConnection).actorLoop(0xc209e17500, 0xc2092f18c0, 0x0, 0x0)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:347 +0x2a2
github.com/zettio/weave/router.(*LocalConnection).run(0xc209e17500, 0xc2092f18c0, 0xc2092f1920, 0x8ea400)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:327 +0x8ba
created by github.com/zettio/weave/router.(*LocalConnection).Start
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:119 +0xdd

goroutine 33440 [IO wait]:
net.(*pollDesc).Wait(0xc209eca060, 0x72, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc209eca060, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).Read(0xc209eca000, 0xc209fb7000, 0x1000, 0x1000, 0x0, 0x7f8daf999730, 0xc208e03ae0)
    /usr/local/go/src/net/fd_unix.go:242 +0x40f
net.(*conn).Read(0xc209b9fda8, 0xc209fb7000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:121 +0xdc
bufio.(*Reader).fill(0xc20a0700c0)
    /usr/local/go/src/bufio/bufio.go:97 +0x1ce
bufio.(*Reader).Read(0xc20a0700c0, 0xc208d3ec64, 0x1, 0x9, 0x100, 0x0, 0x0)
    /usr/local/go/src/bufio/bufio.go:174 +0x26c
io.ReadAtLeast(0x7f8dadf5d690, 0xc20a0700c0, 0xc208d3ec64, 0x1, 0x9, 0x1, 0x0, 0x0, 0x0)
    /usr/local/go/src/io/io.go:298 +0xf1
io.ReadFull(0x7f8dadf5d690, 0xc20a0700c0, 0xc208d3ec64, 0x1, 0x9, 0x40, 0x0, 0x0)
    /usr/local/go/src/io/io.go:316 +0x6d
encoding/gob.decodeUintReader(0x7f8dadf5d690, 0xc20a0700c0, 0xc208d3ec64, 0x9, 0x9, 0x0, 0x1, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decode.go:121 +0x99
encoding/gob.(*Decoder).recvMessage(0xc209f6f200, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:76 +0x55
encoding/gob.(*Decoder).decodeTypeSequence(0xc209f6f200, 0xc2098a0b00, 0x16)
    /usr/local/go/src/encoding/gob/decoder.go:140 +0x47
encoding/gob.(*Decoder).DecodeValue(0xc209f6f200, 0x7cfee0, 0xc2098a0b00, 0x16, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:208 +0x192
encoding/gob.(*Decoder).Decode(0xc209f6f200, 0x7cfee0, 0xc2098a0b00, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:185 +0x297
github.com/zettio/weave/router.(*LocalConnection).receiveTCP(0xc20954c540, 0xc209f6f200)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:412 +0x152
github.com/zettio/weave/router.func·005()
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:308 +0x69
created by github.com/zettio/weave/router.(*LocalConnection).run
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:309 +0x518

goroutine 33584 [chan receive]:
github.com/zettio/weave/router.(*GossipSender).run(0xc209ae3c20)
    /home/matthias/go/src/github.com/zettio/weave/router/gossip.go:56 +0x54
created by github.com/zettio/weave/router.(*GossipSender).Start
    /home/matthias/go/src/github.com/zettio/weave/router/gossip.go:51 +0x6f

goroutine 33441 [select]:
github.com/zettio/weave/router.(*Forwarder).run(0xc20962d080, 0xc209598240, 0xc2095982a0, 0x0)
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:244 +0x457
created by github.com/zettio/weave/router.(*Forwarder).Start
    /home/matthias/go/src/github.com/zettio/weave/router/forwarder.go:220 +0xfc

goroutine 33671 [IO wait]:
net.(*pollDesc).Wait(0xc209a44fb0, 0x72, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc209a44fb0, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).Read(0xc209a44f50, 0xc209929000, 0x1000, 0x1000, 0x0, 0x7f8daf999730, 0xc208e302a0)
    /usr/local/go/src/net/fd_unix.go:242 +0x40f
net.(*conn).Read(0xc209705b10, 0xc209929000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:121 +0xdc
bufio.(*Reader).fill(0xc209a46540)
    /usr/local/go/src/bufio/bufio.go:97 +0x1ce
bufio.(*Reader).Read(0xc209a46540, 0xc2099322a0, 0x1, 0x9, 0x100, 0x0, 0x0)
    /usr/local/go/src/bufio/bufio.go:174 +0x26c
io.ReadAtLeast(0x7f8dadf5d690, 0xc209a46540, 0xc2099322a0, 0x1, 0x9, 0x1, 0x0, 0x0, 0x0)
    /usr/local/go/src/io/io.go:298 +0xf1
io.ReadFull(0x7f8dadf5d690, 0xc209a46540, 0xc2099322a0, 0x1, 0x9, 0x40, 0x0, 0x0)
    /usr/local/go/src/io/io.go:316 +0x6d
encoding/gob.decodeUintReader(0x7f8dadf5d690, 0xc209a46540, 0xc2099322a0, 0x9, 0x9, 0x0, 0x1, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decode.go:121 +0x99
encoding/gob.(*Decoder).recvMessage(0xc209405000, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:76 +0x55
encoding/gob.(*Decoder).decodeTypeSequence(0xc209405000, 0xc20a0cde00, 0x16)
    /usr/local/go/src/encoding/gob/decoder.go:140 +0x47
encoding/gob.(*Decoder).DecodeValue(0xc209405000, 0x7cfee0, 0xc20a0cde20, 0x16, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:208 +0x192
encoding/gob.(*Decoder).Decode(0xc209405000, 0x7cfee0, 0xc20a0cde20, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:185 +0x297
github.com/zettio/weave/router.(*LocalConnection).receiveTCP(0xc20a0f28c0, 0xc209405000)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:412 +0x152
github.com/zettio/weave/router.func·005()
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:308 +0x69
created by github.com/zettio/weave/router.(*LocalConnection).run
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:309 +0x518

goroutine 33599 [IO wait]:
net.(*pollDesc).Wait(0xc20a0a1b80, 0x72, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc20a0a1b80, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).Read(0xc20a0a1b20, 0xc20a2b2000, 0x1000, 0x1000, 0x0, 0x7f8daf999730, 0xc208df87a0)
    /usr/local/go/src/net/fd_unix.go:242 +0x40f
net.(*conn).Read(0xc20a375a20, 0xc20a2b2000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:121 +0xdc
bufio.(*Reader).fill(0xc209aca360)
    /usr/local/go/src/bufio/bufio.go:97 +0x1ce
bufio.(*Reader).Read(0xc209aca360, 0xc208c54320, 0x1, 0x9, 0x100, 0x0, 0x0)
    /usr/local/go/src/bufio/bufio.go:174 +0x26c
io.ReadAtLeast(0x7f8dadf5d690, 0xc209aca360, 0xc208c54320, 0x1, 0x9, 0x1, 0x0, 0x0, 0x0)
    /usr/local/go/src/io/io.go:298 +0xf1
io.ReadFull(0x7f8dadf5d690, 0xc209aca360, 0xc208c54320, 0x1, 0x9, 0x40, 0x0, 0x0)
    /usr/local/go/src/io/io.go:316 +0x6d
encoding/gob.decodeUintReader(0x7f8dadf5d690, 0xc209aca360, 0xc208c54320, 0x9, 0x9, 0x0, 0x1, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decode.go:121 +0x99
encoding/gob.(*Decoder).recvMessage(0xc20962de00, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:76 +0x55
encoding/gob.(*Decoder).decodeTypeSequence(0xc20962de00, 0xc2098a1e00, 0x16)
    /usr/local/go/src/encoding/gob/decoder.go:140 +0x47
encoding/gob.(*Decoder).DecodeValue(0xc20962de00, 0x7cfee0, 0xc2098a1e60, 0x16, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:208 +0x192
encoding/gob.(*Decoder).Decode(0xc20962de00, 0x7cfee0, 0xc2098a1e60, 0x0, 0x0)
    /usr/local/go/src/encoding/gob/decoder.go:185 +0x297
github.com/zettio/weave/router.(*LocalConnection).receiveTCP(0xc20a0f2c40, 0xc20962de00)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:412 +0x152
github.com/zettio/weave/router.func·005()
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:308 +0x69
created by github.com/zettio/weave/router.(*LocalConnection).run
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:309 +0x518

goroutine 33553 [select]:
github.com/zettio/weave/router.(*LocalConnection).actorLoop(0xc20954c700, 0xc20a070300, 0x0, 0x0)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:347 +0x2a2
github.com/zettio/weave/router.(*LocalConnection).run(0xc20954c700, 0xc20a070300, 0xc20a070360, 0x8ea400)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:327 +0x8ba
created by github.com/zettio/weave/router.(*LocalConnection).Start
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:119 +0xdd

goroutine 33556 [select]:
github.com/zettio/weave/router.(*LocalConnection).actorLoop(0xc209e17340, 0xc20a970300, 0x0, 0x0)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:347 +0x2a2
github.com/zettio/weave/router.(*LocalConnection).run(0xc209e17340, 0xc20a970300, 0xc20a970360, 0x8ea401)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:327 +0x8ba
created by github.com/zettio/weave/router.(*LocalConnection).Start
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:119 +0xdd

goroutine 33512 [select]:
github.com/zettio/weave/router.(*LocalConnection).actorLoop(0xc20a660460, 0xc20a022de0, 0x0, 0x0)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:347 +0x2a2
github.com/zettio/weave/router.(*LocalConnection).run(0xc20a660460, 0xc20a022de0, 0xc20a022e40, 0x8ea401)
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:327 +0x8ba
created by github.com/zettio/weave/router.(*LocalConnection).Start
    /home/matthias/go/src/github.com/zettio/weave/router/connection.go:119 +0xdd

*** end
@rade
Member

rade commented Apr 5, 2015

Is this reproducible? And what's the minimum number of peers for which you see this happening?

@thomascramer
Author

I created an Ansible script and can easily reproduce this. From trying different options to find a fix, I'd say it is fairly brittle and easy to repro. The number of peers required doesn't seem to be fixed; I've had it happen on as few as 6 and as many as 12...

The typical steps I take are:

> weave stop # In some cases it is already stopped
> sleep 30 # Make sure servers have forgotten about me
> weave setup # Make sure I'm working 
> weave launch -password password123 -connlimit 50 1.2.3.4
> sleep 20 # Give the server a chance to come up...
> weave connect 1.2.3.4
> # The documentation isn't quite clear on the best practice here.
> # I've tried just using the peer in the launch statement, I've tried not doing that and using connect to a single peer.  
> # I've tried iterating over a list of peers that the host would most likely contact and connecting to each of them...
> # So I've tried the two steps above in different combinations....
> sleep 20 # Give time for the machines to gossip
> weave run $CIDR_STUFF_HERE -dt --name consul_host_x our_private_repo/consul_server
> docker exec consul_host_x consul join member1 member2

We do spin up consul inside of weave; from the weave blog, it seems that using consul in weave shouldn't be an issue...

Also adding more "sleep" into the operations seems to help... but not always....
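
For anyone scripting the steps above, a fixed `sleep` can be swapped for a poll of `weave status`. The following is only a minimal sketch: the function names `count_unestablished` and `wait_for_connections`, the default retry values, and the idea that "no `(unestablished)` lines" means "ready" are all assumptions made for illustration, not documented weave behaviour. It does not address the underlying problem reported here; it only makes a script fail fast instead of carrying on blindly.

```shell
#!/bin/sh
# Count the connections that status text (read from stdin) marks as
# unestablished. grep -c exits non-zero on zero matches, hence "|| true".
count_unestablished() {
    grep -c '(unestablished)' || true
}

# Poll `weave status` until it reports no unestablished connections.
# $1 = maximum number of polls, $2 = seconds between polls
# (both defaults are hypothetical values, not recommendations).
# Returns 0 once everything is established, 1 on timeout.
wait_for_connections() {
    max_tries=${1:-20}
    delay=${2:-3}
    try=0
    pending=unknown
    while [ "$try" -lt "$max_tries" ]; do
        # A failing `weave status` (router not up yet) counts as "not ready".
        if status=$(weave status 2>/dev/null); then
            pending=$(printf '%s\n' "$status" | count_unestablished)
            if [ "$pending" -eq 0 ]; then
                return 0
            fi
        fi
        try=$((try + 1))
        sleep "$delay"
    done
    echo "gave up: $pending unestablished connection(s) after $max_tries polls" >&2
    return 1
}
```

In the sequence above, `sleep 20 # Give time for the machines to gossip` could then become `wait_for_connections 20 3 || exit 1`.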

@thomascramer
Author

Removing a node and adding it back in also seems to correlate with the issue. I've had some success when I stop weave on an instance for a little over a minute and then re-launch it; but with only 30 seconds between stopping and starting an instance (a network outage could be even shorter than that), everything breaks down.

Also, this is the output from weave status an hour after it started having problems:

root@dev-yyy-instance1:~# weave status
weave router git-b00be096f78a
Encryption on
Our name is 7a:1e:06:5f:50:07(dev-yyy-instance1)
Sniffing traffic on &{829 65535 ethwe 2e:ff:43:f6:4c:8a up|broadcast|multicast}
MACs:
b2:23:f9:bd:cd:75 -> 7a:1e:06:5f:50:07(dev-yyy-instance1) (2015-04-05 20:48:17.536999386 +0000 UTC)
Peers:
Peer 7a:b7:0a:c0:f3:f0(dev-yyy-instance3) (v2897) (UID 12280626226082426067)
   -> 7a:3d:12:77:03:8b(dev-xxx-instance1) [10.0.2.5:6783 (unestablished)]
   -> 7a:44:3a:8c:91:fb(load-xxx-instance3) [10.0.6.78:51969 (unestablished)]
   -> 7a:55:42:45:1b:29(dev-zzz-instance2) [10.0.2.50:6783 (unestablished)]
   -> 7a:4c:0a:20:2e:0b(load-xxx-instance2) [10.0.2.32:45998 (unestablished)]
   -> 7a:0e:0c:7a:dd:fe(dev-zzz-instance1) [10.0.6.42:59599 (unestablished)]
   -> 7a:a3:3d:ff:e3:9b(load-xxx-instance1) [10.0.2.189:53664 (unestablished)]
   -> 7a:5a:e8:8a:8d:1b(load-logstash-instance1) [10.0.2.210:52629 (unestablished)]
   -> 7a:df:e9:0d:8b:cf(dev-yyy-instance2) [10.0.2.156:51111 (unestablished)]
   -> 7a:1f:66:f0:e7:03(demo-xxx-instance1) [10.0.6.155:6783 (unestablished)]
   -> 7a:ce:9d:00:a1:1d(load-logstash-instance2) [10.0.6.7:39382 (unestablished)]
   -> 7a:1e:06:5f:50:07(dev-yyy-instance1) [10.0.6.226:6783 (unestablished)]
Peer 7a:df:e9:0d:8b:cf(dev-yyy-instance2) (v2770) (UID 10007994579226185843)
   -> 7a:44:3a:8c:91:fb(load-xxx-instance3) [10.0.6.78:6783 (unestablished)]
   -> 7a:0e:0c:7a:dd:fe(dev-zzz-instance1) [10.0.6.42:40217 (unestablished)]
   -> 7a:1f:66:f0:e7:03(demo-xxx-instance1) [10.0.6.155:6783 (unestablished)]
   -> 7a:ce:9d:00:a1:1d(load-logstash-instance2) [10.0.6.7:44839 (unestablished)]
   -> 7a:b7:0a:c0:f3:f0(dev-yyy-instance3) [10.0.2.161:6783 (unestablished)]
   -> 7a:5a:e8:8a:8d:1b(load-logstash-instance1) [10.0.2.210:6783 (unestablished)]
   -> 7a:3d:12:77:03:8b(dev-xxx-instance1) [10.0.2.5:6783 (unestablished)]
   -> 7a:4c:0a:20:2e:0b(load-xxx-instance2) [10.0.2.32:6783 (unestablished)]
   -> 7a:a3:3d:ff:e3:9b(load-xxx-instance1) [10.0.2.189:39510 (unestablished)]
   -> 7a:55:42:45:1b:29(dev-zzz-instance2) [10.0.2.50:51418 (unestablished)]
   -> 7a:1e:06:5f:50:07(dev-yyy-instance1) [10.0.6.226:33462 (unestablished)]
Peer 7a:3d:12:77:03:8b(dev-xxx-instance1) (v2845) (UID 15191444818634156398)
   -> 7a:b7:0a:c0:f3:f0(dev-yyy-instance3) [10.0.2.161:42465 (unestablished)]
   -> 7a:5a:e8:8a:8d:1b(load-logstash-instance1) [10.0.2.210:6783 (unestablished)]
   -> 7a:1e:06:5f:50:07(dev-yyy-instance1) [10.0.6.226:6783 (unestablished)]
   -> 7a:a3:3d:ff:e3:9b(load-xxx-instance1) [10.0.2.189:6783 (unestablished)]
   -> 7a:1f:66:f0:e7:03(demo-xxx-instance1) [10.0.6.155:6783 (unestablished)]
   -> 7a:55:42:45:1b:29(dev-zzz-instance2) [10.0.2.50:60706 (unestablished)]
   -> 7a:0e:0c:7a:dd:fe(dev-zzz-instance1) [10.0.6.42:46969 (unestablished)]
   -> 7a:ce:9d:00:a1:1d(load-logstash-instance2) [10.0.6.7:50628 (unestablished)]
   -> 7a:df:e9:0d:8b:cf(dev-yyy-instance2) [10.0.2.156:50317 (unestablished)]
   -> 7a:44:3a:8c:91:fb(load-xxx-instance3) [10.0.6.78:6783 (unestablished)]
   -> 7a:4c:0a:20:2e:0b(load-xxx-instance2) [10.0.2.32:38836 (unestablished)]
Peer 7a:a3:3d:ff:e3:9b(load-xxx-instance1) (v2669) (UID 13735760652716010416)
   -> 7a:df:e9:0d:8b:cf(dev-yyy-instance2) [10.0.2.156:6783 (unestablished)]
   -> 7a:0e:0c:7a:dd:fe(dev-zzz-instance1) [10.0.6.42:32926 (unestablished)]
   -> 7a:1e:06:5f:50:07(dev-yyy-instance1) [10.0.6.226:55447 (unestablished)]
   -> 7a:b7:0a:c0:f3:f0(dev-yyy-instance3) [10.0.2.161:6783 (unestablished)]
   -> 7a:5a:e8:8a:8d:1b(load-logstash-instance1) [10.0.2.210:43860 (unestablished)]
   -> 7a:ce:9d:00:a1:1d(load-logstash-instance2) [10.0.6.7:40835 (unestablished)]
   -> 7a:44:3a:8c:91:fb(load-xxx-instance3) [10.0.6.78:49832 (unestablished)]
   -> 7a:55:42:45:1b:29(dev-zzz-instance2) [10.0.2.50:6783 (unestablished)]
   -> 7a:4c:0a:20:2e:0b(load-xxx-instance2) [10.0.2.32:37546 (unestablished)]
   -> 7a:1f:66:f0:e7:03(demo-xxx-instance1) [10.0.6.155:6783 (unestablished)]
   -> 7a:3d:12:77:03:8b(dev-xxx-instance1) [10.0.2.5:58923 (unestablished)]
Peer 7a:1f:66:f0:e7:03(demo-xxx-instance1) (v2764) (UID 8269526085528382263)
   -> 7a:a3:3d:ff:e3:9b(load-xxx-instance1) [10.0.2.189:51665 (unestablished)]
   -> 7a:5a:e8:8a:8d:1b(load-logstash-instance1) [10.0.2.210:56124 (unestablished)]
   -> 7a:44:3a:8c:91:fb(load-xxx-instance3) [10.0.6.78:6783 (unestablished)]
   -> 7a:0e:0c:7a:dd:fe(dev-zzz-instance1) [10.0.6.42:6783 (unestablished)]
   -> 7a:55:42:45:1b:29(dev-zzz-instance2) [10.0.2.50:6783 (unestablished)]
   -> 7a:ce:9d:00:a1:1d(load-logstash-instance2) [10.0.6.7:6783 (unestablished)]
   -> 7a:df:e9:0d:8b:cf(dev-yyy-instance2) [10.0.2.156:46253 (unestablished)]
   -> 7a:4c:0a:20:2e:0b(load-xxx-instance2) [10.0.2.32:6783 (unestablished)]
   -> 7a:b7:0a:c0:f3:f0(dev-yyy-instance3) [10.0.2.161:51642 (unestablished)]
   -> 7a:3d:12:77:03:8b(dev-xxx-instance1) [10.0.2.5:58675 (unestablished)]
   -> 7a:1e:06:5f:50:07(dev-yyy-instance1) [10.0.6.226:6783 (unestablished)]
Peer 7a:0e:0c:7a:dd:fe(dev-zzz-instance1) (v2896) (UID 3368015640829124851)
   -> 7a:1e:06:5f:50:07(dev-yyy-instance1) [10.0.6.226:6783 (unestablished)]
   -> 7a:ce:9d:00:a1:1d(load-logstash-instance2) [10.0.6.7:60667 (unestablished)]
   -> 7a:a3:3d:ff:e3:9b(load-xxx-instance1) [10.0.2.189:6783 (unestablished)]
   -> 7a:3d:12:77:03:8b(dev-xxx-instance1) [10.0.2.5:6783 (unestablished)]
   -> 7a:44:3a:8c:91:fb(load-xxx-instance3) [10.0.6.78:6783 (unestablished)]
   -> 7a:df:e9:0d:8b:cf(dev-yyy-instance2) [10.0.2.156:6783 (unestablished)]
   -> 7a:4c:0a:20:2e:0b(load-xxx-instance2) [10.0.2.32:44124 (unestablished)]
   -> 7a:55:42:45:1b:29(dev-zzz-instance2) [10.0.2.50:6783 (unestablished)]
   -> 7a:1f:66:f0:e7:03(demo-xxx-instance1) [10.0.6.155:38818 (unestablished)]
   -> 7a:b7:0a:c0:f3:f0(dev-yyy-instance3) [10.0.2.161:6783 (unestablished)]
   -> 7a:5a:e8:8a:8d:1b(load-logstash-instance1) [10.0.2.210:6783 (unestablished)]
Peer 7a:5a:e8:8a:8d:1b(load-logstash-instance1) (v2716) (UID 5845835772630068329)
   -> 7a:3d:12:77:03:8b(dev-xxx-instance1) [10.0.2.5:43249 (unestablished)]
   -> 7a:0e:0c:7a:dd:fe(dev-zzz-instance1) [10.0.6.42:44636 (unestablished)]
   -> 7a:55:42:45:1b:29(dev-zzz-instance2) [10.0.2.50:6783 (unestablished)]
   -> 7a:1e:06:5f:50:07(dev-yyy-instance1) [10.0.6.226:55088 (unestablished)]
   -> 7a:b7:0a:c0:f3:f0(dev-yyy-instance3) [10.0.2.161:6783 (unestablished)]
   -> 7a:4c:0a:20:2e:0b(load-xxx-instance2) [10.0.2.32:6783 (unestablished)]
   -> 7a:1f:66:f0:e7:03(demo-xxx-instance1) [10.0.6.155:6783 (unestablished)]
   -> 7a:a3:3d:ff:e3:9b(load-xxx-instance1) [10.0.2.189:6783 (unestablished)]
   -> 7a:ce:9d:00:a1:1d(load-logstash-instance2) [10.0.6.7:36174 (unestablished)]
   -> 7a:44:3a:8c:91:fb(load-xxx-instance3) [10.0.6.78:6783 (unestablished)]
   -> 7a:df:e9:0d:8b:cf(dev-yyy-instance2) [10.0.2.156:56727 (unestablished)]
Peer 7a:44:3a:8c:91:fb(load-xxx-instance3) (v2636) (UID 13509111914800438809)
   -> 7a:4c:0a:20:2e:0b(load-xxx-instance2) [10.0.2.32:56608 (unestablished)]
   -> 7a:0e:0c:7a:dd:fe(dev-zzz-instance1) [10.0.6.42:51060 (unestablished)]
   -> 7a:b7:0a:c0:f3:f0(dev-yyy-instance3) [10.0.2.161:6783 (unestablished)]
   -> 7a:1f:66:f0:e7:03(demo-xxx-instance1) [10.0.6.155:57325 (unestablished)]
   -> 7a:df:e9:0d:8b:cf(dev-yyy-instance2) [10.0.2.156:56028 (unestablished)]
   -> 7a:3d:12:77:03:8b(dev-xxx-instance1) [10.0.2.5:51330 (unestablished)]
   -> 7a:1e:06:5f:50:07(dev-yyy-instance1) [10.0.6.226:6783 (unestablished)]
   -> 7a:ce:9d:00:a1:1d(load-logstash-instance2) [10.0.6.7:6783 (unestablished)]
   -> 7a:a3:3d:ff:e3:9b(load-xxx-instance1) [10.0.2.189:6783 (unestablished)]
   -> 7a:5a:e8:8a:8d:1b(load-logstash-instance1) [10.0.2.210:41199 (unestablished)]
   -> 7a:55:42:45:1b:29(dev-zzz-instance2) [10.0.2.50:6783 (unestablished)]
Peer 7a:ce:9d:00:a1:1d(load-logstash-instance2) (v2636) (UID 2508379089085002628)
   -> 7a:a3:3d:ff:e3:9b(load-xxx-instance1) [10.0.2.189:6783 (unestablished)]
   -> 7a:1f:66:f0:e7:03(demo-xxx-instance1) [10.0.6.155:56345 (unestablished)]
   -> 7a:4c:0a:20:2e:0b(load-xxx-instance2) [10.0.2.32:6783 (unestablished)]
   -> 7a:df:e9:0d:8b:cf(dev-yyy-instance2) [10.0.2.156:6783 (unestablished)]
   -> 7a:5a:e8:8a:8d:1b(load-logstash-instance1) [10.0.2.210:6783 (unestablished)]
   -> 7a:1e:06:5f:50:07(dev-yyy-instance1) [10.0.6.226:6783 (unestablished)]
   -> 7a:b7:0a:c0:f3:f0(dev-yyy-instance3) [10.0.2.161:6783 (unestablished)]
   -> 7a:3d:12:77:03:8b(dev-xxx-instance1) [10.0.2.5:6783 (unestablished)]
   -> 7a:0e:0c:7a:dd:fe(dev-zzz-instance1) [10.0.6.42:6783 (unestablished)]
   -> 7a:55:42:45:1b:29(dev-zzz-instance2) [10.0.2.50:52673 (unestablished)]
   -> 7a:44:3a:8c:91:fb(load-xxx-instance3) [10.0.6.78:58201 (unestablished)]
Peer 7a:1e:06:5f:50:07(dev-yyy-instance1) (v2794) (UID 17443413885944306055)
   -> 7a:3d:12:77:03:8b(dev-xxx-instance1) [10.0.2.5:53530 (unestablished)]
   -> 7a:55:42:45:1b:29(dev-zzz-instance2) [10.0.2.50:6783 (unestablished)]
   -> 7a:df:e9:0d:8b:cf(dev-yyy-instance2) [10.0.2.156:6783 (unestablished)]
   -> 7a:4c:0a:20:2e:0b(load-xxx-instance2) [10.0.2.32:6783 (unestablished)]
   -> 7a:ce:9d:00:a1:1d(load-logstash-instance2) [10.0.6.7:60984 (unestablished)]
   -> 7a:b7:0a:c0:f3:f0(dev-yyy-instance3) [10.0.2.161:57986 (unestablished)]
   -> 7a:1f:66:f0:e7:03(demo-xxx-instance1) [10.0.6.155:54047 (unestablished)]
   -> 7a:44:3a:8c:91:fb(load-xxx-instance3) [10.0.6.78:52889 (unestablished)]
   -> 7a:a3:3d:ff:e3:9b(load-xxx-instance1) [10.0.2.189:6783 (unestablished)]
   -> 7a:5a:e8:8a:8d:1b(load-logstash-instance1) [10.0.2.210:6783 (unestablished)]
   -> 7a:0e:0c:7a:dd:fe(dev-zzz-instance1) [10.0.6.42:48803 (unestablished)]
Peer 7a:4c:0a:20:2e:0b(load-xxx-instance2) (v2558) (UID 14481987951028582289)
   -> 7a:0e:0c:7a:dd:fe(dev-zzz-instance1) [10.0.6.42:6783 (unestablished)]
   -> 7a:df:e9:0d:8b:cf(dev-yyy-instance2) [10.0.2.156:34731 (unestablished)]
   -> 7a:55:42:45:1b:29(dev-zzz-instance2) [10.0.2.50:44362 (unestablished)]
   -> 7a:44:3a:8c:91:fb(load-xxx-instance3) [10.0.6.78:6783 (unestablished)]
   -> 7a:1e:06:5f:50:07(dev-yyy-instance1) [10.0.6.226:60945 (unestablished)]
   -> 7a:5a:e8:8a:8d:1b(load-logstash-instance1) [10.0.2.210:43711 (unestablished)]
   -> 7a:1f:66:f0:e7:03(demo-xxx-instance1) [10.0.6.155:40903 (unestablished)]
   -> 7a:a3:3d:ff:e3:9b(load-xxx-instance1) [10.0.2.189:6783 (unestablished)]
   -> 7a:ce:9d:00:a1:1d(load-logstash-instance2) [10.0.6.7:48782 (unestablished)]
   -> 7a:b7:0a:c0:f3:f0(dev-yyy-instance3) [10.0.2.161:6783 (unestablished)]
   -> 7a:3d:12:77:03:8b(dev-xxx-instance1) [10.0.2.5:6783 (unestablished)]
Peer 7a:55:42:45:1b:29(dev-zzz-instance2) (v2926) (UID 12099675440449745044)
   -> 7a:4c:0a:20:2e:0b(load-xxx-instance2) [10.0.2.32:6783 (unestablished)]
   -> 7a:3d:12:77:03:8b(dev-xxx-instance1) [10.0.2.5:6783 (unestablished)]
   -> 7a:0e:0c:7a:dd:fe(dev-zzz-instance1) [10.0.6.42:56150 (unestablished)]
   -> 7a:44:3a:8c:91:fb(load-xxx-instance3) [10.0.6.78:60915 (unestablished)]
   -> 7a:5a:e8:8a:8d:1b(load-logstash-instance1) [10.0.2.210:55230 (unestablished)]
   -> 7a:b7:0a:c0:f3:f0(dev-yyy-instance3) [10.0.2.161:49934 (unestablished)]
   -> 7a:1e:06:5f:50:07(dev-yyy-instance1) [10.0.6.226:33078 (unestablished)]
   -> 7a:ce:9d:00:a1:1d(load-logstash-instance2) [10.0.6.7:6783 (unestablished)]
   -> 7a:df:e9:0d:8b:cf(dev-yyy-instance2) [10.0.2.156:6783 (unestablished)]
   -> 7a:a3:3d:ff:e3:9b(load-xxx-instance1) [10.0.2.189:38508 (unestablished)]
   -> 7a:1f:66:f0:e7:03(demo-xxx-instance1) [10.0.6.155:53074 (unestablished)]
Routes:
unicast:
7a:1e:06:5f:50:07 -> 00:00:00:00:00:00
7a:44:3a:8c:91:fb -> []
7a:ce:9d:00:a1:1d -> []
7a:1e:06:5f:50:07 -> []
7a:df:e9:0d:8b:cf -> []
7a:a3:3d:ff:e3:9b -> []
7a:0e:0c:7a:dd:fe -> []
7a:5a:e8:8a:8d:1b -> []
7a:1f:66:f0:e7:03 -> []
7a:3d:12:77:03:8b -> []
7a:b7:0a:c0:f3:f0 -> []
7a:4c:0a:20:2e:0b -> []
7a:55:42:45:1b:29 -> []
Reconnects:
10.0.2.161:6783 (next try at 2015-04-05 20:49:29.822982449 +0000 UTC)

weavedns container is not present; have you launched it?

I would have expected at least one instance to have been blacklisted, but not everything to lock up like this. Also, any insights on how to recover from this kind of issue, other than going around to servers and stopping weave one by one until it recovers, would be appreciated. I've tried a couple of other options like:

weave status | grep unestablished | sed -re 's/.*\[(.*):.*/\1/' | sort -u | xargs -r -L1 weave connect
AND
weave status | grep unestablished | sed -re 's/.*\[(.*):.*/\1/' | sort -u | xargs -r -L1 weave forget

But neither seems to do anything...
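For reference, the address-extraction half of both pipelines can be checked on its own. This is a minimal sketch: the function name `extract_unestablished` is made up for illustration, the two sample lines are copied from the status output above, and `-E` is used in place of `-r` (the two are equivalent in GNU sed).

```sh
# extract_unestablished: read `weave status` output on stdin and print the
# unique IP addresses of connections marked "(unestablished)".
extract_unestablished() {
    grep unestablished | sed -E 's/.*\[(.*):.*/\1/' | sort -u
}

# Sample input: two lines copied from the status output above.
# Prints 10.0.2.161 and 10.0.6.78, one per line.
extract_unestablished <<'EOF'
   -> 7a:b7:0a:c0:f3:f0(dev-yyy-instance3) [10.0.2.161:6783 (unestablished)]
   -> 7a:44:3a:8c:91:fb(load-xxx-instance3) [10.0.6.78:60915 (unestablished)]
EOF
```

Note that the regex drops the port, so only the IP address of each peer is passed on to `weave connect` or `weave forget`.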

@rade
Member

rade commented Apr 5, 2015

I've had it happen on as few as 6 and as many as 12...

Any smaller number than 6?

weave launch -password password123 -connlimit 50 1.2.3.4

Does the problem occur without encryption?

I've tried just using the peer in the launch statement, I've tried not doing that and using connect to a single peer.

The two are equivalent. In your situation I'd go for the former, since then you don't need the 'sleep'.

The documentation isn't quite clear on the best practice here.

The main purpose of connect is to connect to addresses that weren't known at the time weave was launched (since otherwise one could have just supplied those addresses to launch). Especially when those addresses refer to new peers which cannot reach our peer due to firewall restrictions (since otherwise one could have just supplied our peer's address when launching the new peers).

Suggestions welcome on what the docs for connect should say to make this clearer.

weave run $CIDR_STUFF_HERE -dt --name consul_host_x our_private_repo/consul_server
docker exec consul_host_x consul join member1 member2

The use of consul here should be coincidental. Indeed the problem should still arise without running any application containers. Would be nice to confirm that one way or the other.

More generally, it would be great if you provided exact steps that would allow us to reproduce the problem. Even if they contain steps like "do xyz a few times, until it breaks".

@thomascramer
Author

  • So far I've not been able to repro with smaller numbers...
  • I've not tried it unencrypted. I did review the logs and came across this on occasion: weave 2015/04/05 11:00:28.287196 ->[7a:1e:06:5f:50:07(dev-yyy-instance1)]: connection shutting down due to error: Failed to decode packet: decryption failed; Unable to decrypt UDP packet
  • I posted the basic steps; unfortunately I can't get them down to something exactly reproducible, as it is somewhat irregular when it happens (i.e. after how many servers it breaks). Regardless, it is something that I can get to happen consistently. I've mainly had some success with adding a lot of long sleep times throughout the process (something I've not done with other services that use gossip-based protocols like serf or consul).

@thomascramer
Author

I've had much more success when I've disabled the encryption.

@rade
Member

rade commented Apr 6, 2015

I've had much more success when I've disabled the encryption.

"much more" == cannot get it to break at all?

Note that as well as the log entry you mentioned, one of the stack traces shows a connection being stuck in the decryption code.

That doesn't necessarily mean that the crypto code is faulty; it could simply be a consequence of subtly altered timing. Nevertheless, knowing whether the problem occurs at all w/o crypto would be incredibly useful.

Similarly, as I mentioned it would be good to know whether the problem also arises w/o any application containers.

I've mainly had some success with adding a lot of long sleep times throughout the process

You really shouldn't have to do that.

So, taking all the above into account, does the problem arise when you run

weave stop
weave setup
weave launch

on one node and then repeatedly run

weave stop
weave setup
weave launch 1.2.3.4 #IP of first node

across, say, 8 nodes, and check the connectivity status by running weave status on the first node?

And if you cannot get it to break that way, try with encryption enabled.
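The sequence above could be scripted roughly as follows. This is only a sketch: reaching the nodes over ssh is an assumption, the host names are placeholders, and the helper `print_repro_plan` is hypothetical. It prints the commands rather than running them, so the plan can be reviewed first.

```sh
# print_repro_plan FIRST_NODE NODE...
# Prints (does not run) the commands for the sequence described above.
# ssh access and the host names are assumptions, not part of weave.
print_repro_plan() {
    first=$1
    shift
    # The first node is launched with no peers.
    echo "ssh $first 'weave stop; weave setup; weave launch'"
    # Every other node is launched pointing at the first node.
    for node in "$@"; do
        echo "ssh $node 'weave stop; weave setup; weave launch $first'"
    done
    # Finally, check connectivity from the first node.
    echo "ssh $first 'weave status'"
}

print_repro_plan 1.2.3.4 node2 node3
```

For the encrypted variant, the same `-password` option shown earlier in the thread would be added to each `weave launch`.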

@errordeveloper errordeveloper self-assigned this Apr 7, 2015
@errordeveloper
Contributor

@thomascramer: "I did create an ansible script and easily can reproduce this."

It would help a lot if you could share this, feel free to email help@weave.works.

@thomascramer
Author

Sorry, was busy yesterday and wasn't able to follow back here:

@rade: regarding:

"much more" == cannot get it to break at all?

I only tried it twice, so hardly exhaustive, but in those attempts it didn't break, and I was able to significantly reduce the wait times I had put in without issue. Unfortunately, insecure transmission of data isn't really an option for us.

I don't really have an active cluster I can easily test with, so it isn't that easy to break this process down to the simplest steps and get something easy to reproduce.

@errordeveloper: regarding the ansible scripts, I'll try to clean things up a bit in my scripts and send them to your email to see if they help inform...

@thomascramer
Author

Also, it didn't seem that we had that much of an issue back on the "git-066d8001dd6d" image, though I can't say for sure that it wasn't just coincidence. But maybe something that changed since then would be a good place to start?

@rade
Member

rade commented Apr 11, 2015

@thomascramer I've just published new images (git-59ae50eb4ef9). Please give these a whirl by grabbing the latest weave script and running weave setup.

@thomascramer
Author

I've run through my stuff a couple of times with the git-59ae50eb4ef9 images, and it seems to be working like a champ.

Looking at the weave script, but wanted to confirm, that if I wanted to keep to that version, the best way to "pin" it is to export VERSION=git-59ae50eb4ef9 before running the weave command options like setup or launch?

@rade
Member

rade commented Apr 13, 2015

the best way to "pin" it is to export VERSION=git-59ae50eb4ef9 before running the weave command

Yes, either that or invoking weave as VERSION=git-... weave ...
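The difference between the two forms is only in how long the setting lasts. The sketch below illustrates that with a stand-in command (`sh -c ...`) instead of the real weave script, so it shows shell behaviour, not anything weave-specific.

```sh
# Stand-in for the weave script: an external command that reports the
# VERSION value it sees in its environment.
report='echo "VERSION seen by command: ${VERSION:-<unset>}"'

unset VERSION

# Form 1: per-invocation prefix; applies to this one command only.
VERSION=git-59ae50eb4ef9 sh -c "$report"
sh -c "$report"    # VERSION is unset again here

# Form 2: export; applies to every later command run from this shell.
export VERSION=git-59ae50eb4ef9
sh -c "$report"
```

With the export form, every later `weave setup` or `weave launch` in the same shell session would pick up the pinned version; with the prefix form it has to be repeated on each invocation.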

@thomascramer
Author

May have spoken too soon.. I've been able to repro in another environment, double checking my setup here...

@rade
Member

rade commented Apr 14, 2015

@thomascramer

May have spoken too soon

And did you?

@thomascramer
Author

Yes, in my dev environment I'm still seeing some issues, though I'm not able to repro in the test environment I was testing with earlier. It does seem better since that fix, and a simple combo of consul and weave seems to start up just fine...

The best I've come up with so far is that when I activate a weave instance on a server that already has elasticsearch on it and it starts to replicate a lot of shards, everything again starts to shut down... This is proving trying to replicate in a test environment...

Not sure if it helps, but I do see a lot of errors like:

  • connection shutting down due to error: timed out waiting for UDP heartbeat
  • connection shutting down due to error: read tcp4 10.0.2.161:6783: connection reset by peer
  • connection shutting down due to error: Failed to decode packet: decryption failed; Unable to decrypt UDP packet

And I have confirmed that all my instances are using the new image...
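To see which of these errors dominates, the router log can be tallied by shutdown reason. A minimal sketch: the function name `tally_shutdown_errors` is made up, and the usage line assumes the router container is named "weave" and that its log is read with `docker logs`.

```sh
# tally_shutdown_errors: read router log lines on stdin and print each
# distinct "connection shutting down due to error" reason with its count,
# most frequent first.
tally_shutdown_errors() {
    sed -n 's/.*connection shutting down due to error: //p' |
        awk '{ count[$0]++ } END { for (e in count) printf "%d %s\n", count[e], e }' |
        sort -rn
}

# Example usage (assumes the router container is named "weave"):
#   docker logs weave 2>&1 | tally_shutdown_errors
```

Reasons that embed an address, such as the "read tcp4 ...: connection reset by peer" message, are counted separately per address.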

@rade
Member

rade commented Apr 14, 2015

That may be a different problem. The symptoms of the error that we fixed are weave status showing unestablished connections that never appear to go away. As per the title of this issue ;) Are you still seeing those specific symptoms?

@rade
Member

rade commented Apr 14, 2015

The best I've come up with so far is that when I activate a weave instance on a server that already has elasticsearch on it and it starts to replicate a lot of shards

plus

connection shutting down due to error: timed out waiting for UDP heartbeat

do point to this possibly being a simple load-induced problem, i.e. heartbeats go missing due to high system load and hence weave thinks the associated connections are broken and will attempt to re-establish them. weave should recover from that once the load subsides. Either way though, it would be a separate issue.

@thomascramer
Author

Yes, once I start to see a couple of instances show up as "unestablished", they all go "unestablished" quite rapidly, and it seems to stay that way unless I start to pull nodes out by manually running weave stop on some of them.

@rade
Member

rade commented Apr 14, 2015

ok. how easy is this to reproduce in your dev env? and could we get access to that?

@rade
Member

rade commented Apr 14, 2015

Actually, let me refine the symptoms of what we fixed... you also shouldn't see many reconnects at the end of weave status. E.g. notice that the status you posted contains just one reconnect.

If you do see lots of reconnects listed then that again would be consistent with just seeing a load-induced connectivity breakage.
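A quick way to compare the two numbers is to count them from a saved status dump. This is a rough sketch: `summarize_status` is a made-up helper, and it assumes the output layout seen in this thread (peer connections on lines starting with `->`, reconnect entries containing "next try at" after the `Reconnects:` header).

```sh
# summarize_status: read `weave status` output on stdin and print how many
# peer connections are listed, how many are unestablished, and how many
# reconnect entries follow the "Reconnects:" header.
summarize_status() {
    awk '
        /^Reconnects:/                  { in_reconnects = 1; next }
        in_reconnects && /next try at/  { reconnects++ }
        /^ *-> /                        { total++; if (/\(unestablished\)/) unestablished++ }
        END { printf "connections=%d unestablished=%d reconnects=%d\n", total, unestablished, reconnects }
    '
}

# Sample input: lines copied from the first status output in this issue.
# Prints: connections=2 unestablished=2 reconnects=1
summarize_status <<'EOF'
Peer 7a:55:42:45:1b:29(dev-zzz-instance2) (v2926) (UID 12099675440449745044)
   -> 7a:4c:0a:20:2e:0b(load-xxx-instance2) [10.0.2.32:6783 (unestablished)]
   -> 7a:3d:12:77:03:8b(dev-xxx-instance1) [10.0.2.5:6783 (unestablished)]
Reconnects:
10.0.2.161:6783 (next try at 2015-04-05 20:49:29.822982449 +0000 UTC)
EOF
```

Going by the description above, many unestablished connections with few or no reconnects would match the bug that was fixed, while a long reconnect list would point more towards load-induced breakage.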

@thomascramer
Author

It is rather easy for me to repro in my dev environment, but can't provide access to it... I was able to repro in a public test environment I could grant you access to, but since git-59ae50eb4ef9 it is proving harder to do it in there, but I'm still working at it...

This is my current weave status output (and it has basically been this way for about the last 30 minutes or so):

# weave status
weave router git-59ae50eb4ef9
Encryption on
Our name is 7a:1e:06:5f:50:07(dev-mmm-instance1)
Sniffing traffic on &{1005 65535 ethwe 6e:23:02:a7:1b:35 up|broadcast|multicast}
MACs:
6e:23:02:a7:1b:35 -> 7a:1e:06:5f:50:07(dev-mmm-instance1) (2015-04-14 06:33:15.314474143 +0000 UTC)
da:70:89:55:1c:06 -> 7a:1e:06:5f:50:07(dev-mmm-instance1) (2015-04-14 06:39:16.509054748 +0000 UTC)
9e:f0:47:7f:d8:db -> 7a:0e:0c:7a:dd:fe(dev-rrr-instance1) (2015-04-14 06:33:26.090087185 +0000 UTC)
0a:a3:17:c3:ba:73 -> 7a:3d:12:77:03:8b(dev-ooo-instance1) (2015-04-14 06:33:26.831898419 +0000 UTC)
ea:8d:af:85:00:bc -> 7a:1f:66:f0:e7:03(demo-ooo-instance1) (2015-04-14 06:33:31.792771633 +0000 UTC)
0e:27:85:53:25:bd -> 7a:1e:06:5f:50:07(dev-mmm-instance1) (2015-04-14 06:33:16.116991194 +0000 UTC)
02:36:1b:fd:0a:74 -> 7a:10:33:9a:ad:bd(load-ooo-instance3) (2015-04-14 06:33:26.799566544 +0000 UTC)
3e:43:d7:40:36:db -> 7a:10:33:9a:ad:bd(load-ooo-instance3) (2015-04-14 06:33:32.836233028 +0000 UTC)
d6:03:fb:5d:24:50 -> 7a:3d:12:77:03:8b(dev-ooo-instance1) (2015-04-14 06:33:46.920918627 +0000 UTC)
fe:6c:fc:b1:1b:19 -> 7a:7c:b5:90:a9:fa(load-ooo-instance2) (2015-04-14 06:33:21.127602836 +0000 UTC)
d2:1f:1d:72:bb:de -> 7a:0e:0c:7a:dd:fe(dev-rrr-instance1) (2015-04-14 06:33:33.39422005 +0000 UTC)
46:51:e4:54:6f:30 -> 7a:1e:06:5f:50:07(dev-mmm-instance1) (2015-04-14 06:39:13.053562277 +0000 UTC)
0e:94:d6:78:fc:f1 -> 7a:7c:b5:90:a9:fa(load-ooo-instance2) (2015-04-14 06:33:42.824163026 +0000 UTC)
9a:f9:85:8b:4a:4e -> 7a:1e:06:5f:50:07(dev-mmm-instance1) (2015-04-14 06:33:15.881065986 +0000 UTC)
Peers:
Peer 7a:0e:0c:7a:dd:fe(dev-rrr-instance1) (v3605) (UID 13364322988343335218)
   -> 7a:10:33:9a:ad:bd(load-ooo-instance3) [10.0.6.78:39780 (unestablished)]
   -> 7a:1e:06:5f:50:07(dev-mmm-instance1) [10.0.6.226:6783 (unestablished)]
   -> 7a:b7:0a:c0:f3:f0(dev-mmm-instance3) [10.0.2.161:6783 (unestablished)]
   -> 7a:df:e9:0d:8b:cf(dev-mmm-instance2) [10.0.2.156:6783 (unestablished)]
   -> 7a:7c:b5:90:a9:fa(load-ooo-instance2) [10.0.2.32:36817 (unestablished)]
   -> 7a:3d:12:77:03:8b(dev-ooo-instance1) [10.0.2.5:6783 (unestablished)]
   -> 7a:55:42:45:1b:29(dev-rrr-instance2) [10.0.2.50:33746 (unestablished)]
   -> 7a:1f:66:f0:e7:03(demo-ooo-instance1) [10.0.6.155:44540 (unestablished)]
Peer 7a:3d:12:77:03:8b(dev-ooo-instance1) (v14704) (UID 1116063947414095624)
   -> 7a:1f:66:f0:e7:03(demo-ooo-instance1) [10.0.6.155:6783 (unestablished)]
   -> 7a:55:42:45:1b:29(dev-rrr-instance2) [10.0.2.50:6783 (unestablished)]
   -> 7a:b7:0a:c0:f3:f0(dev-mmm-instance3) [10.0.2.161:41149 (unestablished)]
   -> 7a:1e:06:5f:50:07(dev-mmm-instance1) [10.0.6.226:52157 (unestablished)]
   -> 7a:df:e9:0d:8b:cf(dev-mmm-instance2) [10.0.2.156:47264 (unestablished)]
   -> 7a:0e:0c:7a:dd:fe(dev-rrr-instance1) [10.0.6.42:51261 (unestablished)]
   -> 7a:7c:b5:90:a9:fa(load-ooo-instance2) [10.0.2.32:6783 (unestablished)]
   -> 7a:10:33:9a:ad:bd(load-ooo-instance3) [10.0.6.78:6783 (unestablished)]
Peer 7a:10:33:9a:ad:bd(load-ooo-instance3) (v13758) (UID 13451469968417862438)
   -> 7a:55:42:45:1b:29(dev-rrr-instance2) [10.0.2.50:6783 (unestablished)]
   -> 7a:7c:b5:90:a9:fa(load-ooo-instance2) [10.0.2.32:51427 (unestablished)]
   -> 7a:df:e9:0d:8b:cf(dev-mmm-instance2) [10.0.2.156:6783 (unestablished)]
   -> 7a:3d:12:77:03:8b(dev-ooo-instance1) [10.0.2.5:50586 (unestablished)]
   -> 7a:b7:0a:c0:f3:f0(dev-mmm-instance3) [10.0.2.161:6783 (unestablished)]
   -> 7a:1f:66:f0:e7:03(demo-ooo-instance1) [10.0.6.155:45348 (unestablished)]
   -> 7a:0e:0c:7a:dd:fe(dev-rrr-instance1) [10.0.6.42:6783 (unestablished)]
   -> 7a:1e:06:5f:50:07(dev-mmm-instance1) [10.0.6.226:45529 (unestablished)]
Peer 7a:df:e9:0d:8b:cf(dev-mmm-instance2) (v1040) (UID 9650975846057162330)
   -> 7a:7c:b5:90:a9:fa(load-ooo-instance2) [10.0.2.32:6783 (unestablished)]
   -> 7a:55:42:45:1b:29(dev-rrr-instance2) [10.0.2.50:48699 (unestablished)]
   -> 7a:3d:12:77:03:8b(dev-ooo-instance1) [10.0.2.5:6783 (unestablished)]
   -> 7a:1f:66:f0:e7:03(demo-ooo-instance1) [10.0.6.155:6783 (unestablished)]
   -> 7a:b7:0a:c0:f3:f0(dev-mmm-instance3) [10.0.2.161:54749 (unestablished)]
   -> 7a:1e:06:5f:50:07(dev-mmm-instance1) [10.0.6.226:40470 (unestablished)]
   -> 7a:0e:0c:7a:dd:fe(dev-rrr-instance1) [10.0.6.42:34056 (unestablished)]
   -> 7a:10:33:9a:ad:bd(load-ooo-instance3) [10.0.6.78:39207 (unestablished)]
Peer 7a:55:42:45:1b:29(dev-rrr-instance2) (v3653) (UID 14737763866826679482)
   -> 7a:1e:06:5f:50:07(dev-mmm-instance1) [10.0.6.226:6783 (unestablished)]
   -> 7a:1f:66:f0:e7:03(demo-ooo-instance1) [10.0.6.155:6783 (unestablished)]
   -> 7a:0e:0c:7a:dd:fe(dev-rrr-instance1) [10.0.6.42:6783 (unestablished)]
   -> 7a:10:33:9a:ad:bd(load-ooo-instance3) [10.0.6.78:57028 (unestablished)]
   -> 7a:7c:b5:90:a9:fa(load-ooo-instance2) [10.0.2.32:6783 (unestablished)]
   -> 7a:df:e9:0d:8b:cf(dev-mmm-instance2) [10.0.2.156:6783 (unestablished)]
   -> 7a:3d:12:77:03:8b(dev-ooo-instance1) [10.0.2.5:50023 (unestablished)]
   -> 7a:b7:0a:c0:f3:f0(dev-mmm-instance3) [10.0.2.161:6783 (unestablished)]
Peer 7a:1e:06:5f:50:07(dev-mmm-instance1) (v386) (UID 3358896077549137594)
   -> 7a:3d:12:77:03:8b(dev-ooo-instance1) [10.0.2.5:6783 (unestablished)]
   -> 7a:10:33:9a:ad:bd(load-ooo-instance3) [10.0.6.78:6783 (unestablished)]
   -> 7a:55:42:45:1b:29(dev-rrr-instance2) [10.0.2.50:48576 (unestablished)]
   -> 7a:df:e9:0d:8b:cf(dev-mmm-instance2) [10.0.2.156:6783 (unestablished)]
   -> 7a:1f:66:f0:e7:03(demo-ooo-instance1) [10.0.6.155:35337 (unestablished)]
   -> 7a:0e:0c:7a:dd:fe(dev-rrr-instance1) [10.0.6.42:51008 (unestablished)]
   -> 7a:b7:0a:c0:f3:f0(dev-mmm-instance3) [10.0.2.161:41927 (unestablished)]
   -> 7a:7c:b5:90:a9:fa(load-ooo-instance2) [10.0.2.32:6783 (unestablished)]
Peer 7a:1f:66:f0:e7:03(demo-ooo-instance1) (v14531) (UID 8702599462315273547)
   -> 7a:1e:06:5f:50:07(dev-mmm-instance1) [10.0.6.226:6783 (unestablished)]
   -> 7a:55:42:45:1b:29(dev-rrr-instance2) [10.0.2.50:45396 (unestablished)]
   -> 7a:10:33:9a:ad:bd(load-ooo-instance3) [10.0.6.78:6783 (unestablished)]
   -> 7a:7c:b5:90:a9:fa(load-ooo-instance2) [10.0.2.32:57234 (unestablished)]
   -> 7a:0e:0c:7a:dd:fe(dev-rrr-instance1) [10.0.6.42:6783 (unestablished)]
   -> 7a:3d:12:77:03:8b(dev-ooo-instance1) [10.0.2.5:36150 (unestablished)]
   -> 7a:df:e9:0d:8b:cf(dev-mmm-instance2) [10.0.2.156:45289 (unestablished)]
   -> 7a:b7:0a:c0:f3:f0(dev-mmm-instance3) [10.0.2.161:6783 (unestablished)]
Peer 7a:b7:0a:c0:f3:f0(dev-mmm-instance3) (v687) (UID 2721317289170900899)
   -> 7a:df:e9:0d:8b:cf(dev-mmm-instance2) [10.0.2.156:6783 (unestablished)]
   -> 7a:3d:12:77:03:8b(dev-ooo-instance1) [10.0.2.5:6783 (unestablished)]
   -> 7a:55:42:45:1b:29(dev-rrr-instance2) [10.0.2.50:43850 (unestablished)]
   -> 7a:1f:66:f0:e7:03(demo-ooo-instance1) [10.0.6.155:56648 (unestablished)]
   -> 7a:10:33:9a:ad:bd(load-ooo-instance3) [10.0.6.78:36785 (unestablished)]
   -> 7a:7c:b5:90:a9:fa(load-ooo-instance2) [10.0.2.32:6783 (unestablished)]
   -> 7a:0e:0c:7a:dd:fe(dev-rrr-instance1) [10.0.6.42:37455 (unestablished)]
   -> 7a:1e:06:5f:50:07(dev-mmm-instance1) [10.0.6.226:6783 (unestablished)]
Peer 7a:7c:b5:90:a9:fa(load-ooo-instance2) (v14265) (UID 12194306702996906081)
   -> 7a:df:e9:0d:8b:cf(dev-mmm-instance2) [10.0.2.156:50253 (unestablished)]
   -> 7a:10:33:9a:ad:bd(load-ooo-instance3) [10.0.6.78:6783 (unestablished)]
   -> 7a:b7:0a:c0:f3:f0(dev-mmm-instance3) [10.0.2.161:51244 (unestablished)]
   -> 7a:0e:0c:7a:dd:fe(dev-rrr-instance1) [10.0.6.42:6783 (unestablished)]
   -> 7a:1e:06:5f:50:07(dev-mmm-instance1) [10.0.6.226:33406 (unestablished)]
   -> 7a:3d:12:77:03:8b(dev-ooo-instance1) [10.0.2.5:59105 (unestablished)]
   -> 7a:1f:66:f0:e7:03(demo-ooo-instance1) [10.0.6.155:6783 (unestablished)]
   -> 7a:55:42:45:1b:29(dev-rrr-instance2) [10.0.2.50:39216 (unestablished)]
Routes:
unicast:
7a:1e:06:5f:50:07 -> 00:00:00:00:00:00
broadcast:
7a:3d:12:77:03:8b -> []
7a:10:33:9a:ad:bd -> []
7a:1e:06:5f:50:07 -> []
7a:55:42:45:1b:29 -> []
7a:df:e9:0d:8b:cf -> []
7a:0e:0c:7a:dd:fe -> []
7a:1f:66:f0:e7:03 -> []
7a:b7:0a:c0:f3:f0 -> []
7a:7c:b5:90:a9:fa -> []
Reconnects:

@rade
Member

rade commented Apr 14, 2015

well, that does show all the symptoms :(

Can you post the complete logs for all three nodes somewhere?

@rade
Member

rade commented Apr 14, 2015

all three

I meant at least three, including the node from which you got the above status.

@rade
Member

rade commented Apr 14, 2015

I am curious whether the logs still show connection attempts going on. Or whether basically everything has ground to a halt.

@thomascramer
Author

They say they are attempting... But yes I can dump them on an s3 bucket, can you email some sort of account name or something I can grant access to? I think you were on the one help@weave email chain...

@rade
Member

rade commented Apr 14, 2015

They say they are attempting...

So you see a fairly continuous stream of connection attempts in the logs?

re dumping the log files... please put them somewhere and email help@weave.works with the details.

@rade rade added this to the 0.10.0 milestone Apr 14, 2015
@rade
Member

rade commented Apr 14, 2015

The logs do indeed contain a continuous stream of connection attempts. Looks like UDP connectivity between peers is very patchy. This could be due to the elasticsearch replication load. Does that load ever subside? I would expect connectivity to recover then.

@rade
Member

rade commented Apr 14, 2015

incidentally, when running with encryption weave has a lower tolerance for UDP packet loss.

@thomascramer
Author

Well, there isn't that much load per se, and the load does go away once the weave network breaks down; obviously there is no way for it to replicate if there is no route to host. Also, sometimes I can add the instances just fine, so maybe it is just a coincidence. Initially when I saw these issues I had seen some recovery after several hours, but I haven't had the luxury to sit and wait for it to come back to life.

Shutting down weave on a couple of nodes does seem to help it recover. Doing a rolling restart of weave slowly (adding a lot of sleep) also seems to help, but this typically requires bringing the weave service down for about a minute or so, bringing it back up, and then waiting another minute or so before starting up consul.

I haven't tried with the current image, but with the previous one it didn't seem as problematic without the encryption. However, one of our main points of interest is using weave with encryption.

@rade
Member

rade commented Apr 14, 2015

Reproduced.

$ NUM_WEAVES=10 bin/multiweave launch -password pass
a635ab81f19dbd511554e509fde5507ecaa74b769f193117d65c472a137d238d
5c0a1f05ed2c2176f3351a5fb3bbc75ec0b84c9bab413616df1177974f916252
5c7d3288da30cb92583785b6a6225f1f5c9d104af1289e64a84c20e7abd9c445
e42157e466bd0896535dbf9a659c4cb93136c6ab035dff85af032972f0549460
4aad85ebcd8c4a1274a1cec6838ea143cc280126cd9c75f57d34d3e7d8ffb5e8
a8ff4a963d68a371c6ab08ec28cf13588c71d6f0c86c00aff404bdfee2616d19
8029cbc86e81dceb92c8ae88d5a6351d61f83593b643b2f1eac6091f7cc5a5c1
4f47ef1a24818aacedae9c6d2fba0088b9acbb7dd2b0c022a2e70f13fb761c4b
cafc3a02e3877f6f733e6c02e6aae20d9c84d12977f2ee6852da460ac7ccd924
89caba07efa8acfea51239287a03a4e8144bce84545a258bd01aa089c2719c9b
connecting weave2 to 172.17.0.39
connecting weave3 to 172.17.0.40
connecting weave4 to 172.17.0.41
connecting weave5 to 172.17.0.42
connecting weave6 to 172.17.0.43
connecting weave7 to 172.17.0.44
connecting weave8 to 172.17.0.45
connecting weave9 to 172.17.0.46
connecting weave10 to 172.17.0.47

$ WEAVE_CONTAINER_NAME=weave1 weave status
weave router git-f9ea3f54d1b4
Encryption on
Our name is 7a:7a:cf:c3:a5:81(weave1)
Sniffing traffic on <nil>
MACs:
Peers:
Peer 7a:7a:cf:c3:a5:81(weave1) (v41) (UID 4185936310212770267)
   -> 7a:c3:17:53:31:eb(weave6) [172.17.0.44:6783]
   -> 7a:fd:58:1f:c9:7a(weave10) [172.17.0.48:45362 (unestablished)]
   -> 7a:63:7f:24:96:27(weave4) [172.17.0.42:6783 (unestablished)]
   -> 7a:c5:9e:a3:a9:4c(weave9) [172.17.0.47:43911 (unestablished)]
   -> 7a:de:61:7e:cc:f5(weave3) [172.17.0.41:6783 (unestablished)]
   -> 7a:d6:8c:9d:71:a9(weave8) [172.17.0.46:6783 (unestablished)]
   -> 7a:e4:72:04:bc:b9(weave5) [172.17.0.43:6783]
   -> 7a:77:d4:e2:28:dc(weave7) [172.17.0.45:42101 (unestablished)]
   -> 7a:1e:bf:17:f0:a0(weave2) [172.17.0.40:49528]
Peer 7a:63:7f:24:96:27(weave4) (v27) (UID 16479170325487000555)
   -> 7a:d6:8c:9d:71:a9(weave8) [172.17.0.46:6783 (unestablished)]
   -> 7a:1e:bf:17:f0:a0(weave2) [172.17.0.40:6783 (unestablished)]
   -> 7a:77:d4:e2:28:dc(weave7) [172.17.0.45:6783 (unestablished)]
   -> 7a:de:61:7e:cc:f5(weave3) [172.17.0.41:6783 (unestablished)]
   -> 7a:c3:17:53:31:eb(weave6) [172.17.0.44:37108 (unestablished)]
   -> 7a:fd:58:1f:c9:7a(weave10) [172.17.0.48:60916 (unestablished)]
   -> 7a:e4:72:04:bc:b9(weave5) [172.17.0.43:50465]
   -> 7a:c5:9e:a3:a9:4c(weave9) [172.17.0.47:6783 (unestablished)]
   -> 7a:7a:cf:c3:a5:81(weave1) [172.17.0.39:39935 (unestablished)]
Peer 7a:e4:72:04:bc:b9(weave5) (v39) (UID 90544003228960036)
   -> 7a:63:7f:24:96:27(weave4) [172.17.0.42:6783]
   -> 7a:77:d4:e2:28:dc(weave7) [172.17.0.45:51705 (unestablished)]
   -> 7a:7a:cf:c3:a5:81(weave1) [172.17.0.39:52795]
   -> 7a:d6:8c:9d:71:a9(weave8) [172.17.0.46:6783 (unestablished)]
   -> 7a:fd:58:1f:c9:7a(weave10) [172.17.0.48:42663 (unestablished)]
   -> 7a:c5:9e:a3:a9:4c(weave9) [172.17.0.47:6783 (unestablished)]
   -> 7a:de:61:7e:cc:f5(weave3) [172.17.0.41:58144 (unestablished)]
   -> 7a:c3:17:53:31:eb(weave6) [172.17.0.44:43442]
   -> 7a:1e:bf:17:f0:a0(weave2) [172.17.0.40:6783 (unestablished)]
Peer 7a:c3:17:53:31:eb(weave6) (v19) (UID 5336887308644303066)
   -> 7a:77:d4:e2:28:dc(weave7) [172.17.0.45:42957]
   -> 7a:d6:8c:9d:71:a9(weave8) [172.17.0.46:6783 (unestablished)]
   -> 7a:fd:58:1f:c9:7a(weave10) [172.17.0.48:51009 (unestablished)]
   -> 7a:1e:bf:17:f0:a0(weave2) [172.17.0.40:51049 (unestablished)]
   -> 7a:7a:cf:c3:a5:81(weave1) [172.17.0.39:38578 (unestablished)]
   -> 7a:de:61:7e:cc:f5(weave3) [172.17.0.41:45822 (unestablished)]
   -> 7a:c5:9e:a3:a9:4c(weave9) [172.17.0.47:55639 (unestablished)]
   -> 7a:e4:72:04:bc:b9(weave5) [172.17.0.43:6783]
   -> 7a:63:7f:24:96:27(weave4) [172.17.0.42:6783 (unestablished)]
Peer 7a:77:d4:e2:28:dc(weave7) (v20) (UID 1545220434078612623)
   -> 7a:de:61:7e:cc:f5(weave3) [172.17.0.41:6783 (unestablished)]
   -> 7a:c5:9e:a3:a9:4c(weave9) [172.17.0.47:44281 (unestablished)]
   -> 7a:c3:17:53:31:eb(weave6) [172.17.0.44:6783]
   -> 7a:d6:8c:9d:71:a9(weave8) [172.17.0.46:39503 (unestablished)]
   -> 7a:63:7f:24:96:27(weave4) [172.17.0.42:53533 (unestablished)]
   -> 7a:e4:72:04:bc:b9(weave5) [172.17.0.43:6783]
   -> 7a:7a:cf:c3:a5:81(weave1) [172.17.0.39:6783 (unestablished)]
   -> 7a:1e:bf:17:f0:a0(weave2) [172.17.0.40:6783 (unestablished)]
   -> 7a:fd:58:1f:c9:7a(weave10) [172.17.0.48:6783]
Peer 7a:d6:8c:9d:71:a9(weave8) (v15) (UID 12613709493398724114)
   -> 7a:c3:17:53:31:eb(weave6) [172.17.0.44:33603 (unestablished)]
   -> 7a:fd:58:1f:c9:7a(weave10) [172.17.0.48:6783 (unestablished)]
   -> 7a:63:7f:24:96:27(weave4) [172.17.0.42:47945 (unestablished)]
   -> 7a:de:61:7e:cc:f5(weave3) [172.17.0.41:57119 (unestablished)]
   -> 7a:c5:9e:a3:a9:4c(weave9) [172.17.0.47:42775 (unestablished)]
   -> 7a:e4:72:04:bc:b9(weave5) [172.17.0.43:53153 (unestablished)]
   -> 7a:1e:bf:17:f0:a0(weave2) [172.17.0.40:6783 (unestablished)]
   -> 7a:7a:cf:c3:a5:81(weave1) [172.17.0.39:42603 (unestablished)]
   -> 7a:77:d4:e2:28:dc(weave7) [172.17.0.45:6783 (unestablished)]
Peer 7a:fd:58:1f:c9:7a(weave10) (v19) (UID 14248038865278751315)
   -> 7a:c5:9e:a3:a9:4c(weave9) [172.17.0.47:6783 (unestablished)]
   -> 7a:77:d4:e2:28:dc(weave7) [172.17.0.45:41856 (unestablished)]
   -> 7a:de:61:7e:cc:f5(weave3) [172.17.0.41:6783 (unestablished)]
   -> 7a:e4:72:04:bc:b9(weave5) [172.17.0.43:6783 (unestablished)]
   -> 7a:c3:17:53:31:eb(weave6) [172.17.0.44:6783 (unestablished)]
   -> 7a:7a:cf:c3:a5:81(weave1) [172.17.0.39:6783 (unestablished)]
   -> 7a:63:7f:24:96:27(weave4) [172.17.0.42:6783 (unestablished)]
   -> 7a:1e:bf:17:f0:a0(weave2) [172.17.0.40:58576 (unestablished)]
   -> 7a:d6:8c:9d:71:a9(weave8) [172.17.0.46:37815 (unestablished)]
Peer 7a:1e:bf:17:f0:a0(weave2) (v35) (UID 5021194947216650000)
   -> 7a:fd:58:1f:c9:7a(weave10) [172.17.0.48:6783 (unestablished)]
   -> 7a:63:7f:24:96:27(weave4) [172.17.0.42:37245 (unestablished)]
   -> 7a:e4:72:04:bc:b9(weave5) [172.17.0.43:40721 (unestablished)]
   -> 7a:d6:8c:9d:71:a9(weave8) [172.17.0.46:40435 (unestablished)]
   -> 7a:c3:17:53:31:eb(weave6) [172.17.0.44:6783]
   -> 7a:77:d4:e2:28:dc(weave7) [172.17.0.45:44135 (unestablished)]
   -> 7a:c5:9e:a3:a9:4c(weave9) [172.17.0.47:6783 (unestablished)]
   -> 7a:7a:cf:c3:a5:81(weave1) [172.17.0.39:6783]
   -> 7a:de:61:7e:cc:f5(weave3) [172.17.0.41:52867]
Peer 7a:de:61:7e:cc:f5(weave3) (v30) (UID 1251886731187251561)
   -> 7a:e4:72:04:bc:b9(weave5) [172.17.0.43:6783]
   -> 7a:d6:8c:9d:71:a9(weave8) [172.17.0.46:6783 (unestablished)]
   -> 7a:fd:58:1f:c9:7a(weave10) [172.17.0.48:54953 (unestablished)]
   -> 7a:7a:cf:c3:a5:81(weave1) [172.17.0.39:52178 (unestablished)]
   -> 7a:63:7f:24:96:27(weave4) [172.17.0.42:57967 (unestablished)]
   -> 7a:c5:9e:a3:a9:4c(weave9) [172.17.0.47:38393 (unestablished)]
   -> 7a:c3:17:53:31:eb(weave6) [172.17.0.44:6783]
   -> 7a:77:d4:e2:28:dc(weave7) [172.17.0.45:47986 (unestablished)]
   -> 7a:1e:bf:17:f0:a0(weave2) [172.17.0.40:6783]
Peer 7a:c5:9e:a3:a9:4c(weave9) (v15) (UID 4458146123765981932)
   -> 7a:77:d4:e2:28:dc(weave7) [172.17.0.45:6783 (unestablished)]
   -> 7a:d6:8c:9d:71:a9(weave8) [172.17.0.46:6783 (unestablished)]
   -> 7a:c3:17:53:31:eb(weave6) [172.17.0.44:6783 (unestablished)]
   -> 7a:63:7f:24:96:27(weave4) [172.17.0.42:48868 (unestablished)]
   -> 7a:e4:72:04:bc:b9(weave5) [172.17.0.43:52938 (unestablished)]
   -> 7a:7a:cf:c3:a5:81(weave1) [172.17.0.39:6783 (unestablished)]
   -> 7a:fd:58:1f:c9:7a(weave10) [172.17.0.48:35634 (unestablished)]
   -> 7a:de:61:7e:cc:f5(weave3) [172.17.0.41:6783 (unestablished)]
   -> 7a:1e:bf:17:f0:a0(weave2) [172.17.0.40:49822 (unestablished)]
Routes:
unicast:
7a:77:d4:e2:28:dc -> 7a:e4:72:04:bc:b9
7a:7a:cf:c3:a5:81 -> 00:00:00:00:00:00
7a:e4:72:04:bc:b9 -> 7a:e4:72:04:bc:b9
7a:1e:bf:17:f0:a0 -> 7a:1e:bf:17:f0:a0
7a:de:61:7e:cc:f5 -> 7a:1e:bf:17:f0:a0
7a:c3:17:53:31:eb -> 7a:e4:72:04:bc:b9
7a:63:7f:24:96:27 -> 7a:e4:72:04:bc:b9
broadcast:
7a:7a:cf:c3:a5:81 -> [7a:1e:bf:17:f0:a0 7a:e4:72:04:bc:b9]
7a:63:7f:24:96:27 -> [7a:1e:bf:17:f0:a0]
7a:e4:72:04:bc:b9 -> [7a:1e:bf:17:f0:a0]
7a:1e:bf:17:f0:a0 -> [7a:e4:72:04:bc:b9]
7a:c3:17:53:31:eb -> [7a:1e:bf:17:f0:a0]
7a:77:d4:e2:28:dc -> [7a:1e:bf:17:f0:a0]
7a:d6:8c:9d:71:a9 -> []
7a:fd:58:1f:c9:7a -> []
7a:de:61:7e:cc:f5 -> [7a:e4:72:04:bc:b9]
7a:c5:9e:a3:a9:4c -> []
Reconnects:

And if I run status again a few minutes later then all connections are unestablished.

@rade
Member

rade commented Apr 14, 2015

With NUM_WEAVES=7 I saw a few unestablished connections that hung around for a while but then disappeared. With 10 the above reproduces quite reliably. YMMV depending on hardware spec.

@rade
Member

rade commented Apr 15, 2015

My initial suspicion was that crypto loses sync due to excessive packet loss. But it's hard to see how that would happen with just four peers and no traffic. (4 peers is where I see the first crypto errors appearing). Also, the first error I am seeing is usually "Unable to decrypt UDP packet", and with some debugging added I can see that the offset is 0. Odds are that this really is genuinely the first packet. So why is the decryption failing?

@bboreham
Contributor

Here's what I think is happening:

  • Two peers initiate a connection to each other.
  • Nonces and heartbeats and stuff are sent down.
  • One connection is dropped (because it is a duplicate).
  • A UDP packet arrives, and the router looks up a connection for that address.
  • No guarantee it will find the connection that matches the nonces sent by the sender.

The first few times this happens, the result is an 'unable to decrypt UDP packet' error. After a while, the same flaw causes the decryptor to stall, expecting a new nonce to come in. It's only when the other side times out and re-starts the connection that a nonce does come in, which lets the code do a few more things. But fundamentally it's all stymied at this point.
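The lookup ambiguity can be illustrated with a toy model. This is not weave's code or data structures: the table layout, connection names and nonce labels are all invented, and no real cryptography is involved. It only shows why finding a connection by remote address is not enough when two connections to the same address exist.

```sh
# Toy connection table on the receiving router: "remote_addr conn_id nonce".
# Two connections to the same remote address exist for a moment because both
# peers dialled each other. All names and values are made up.
conns='172.17.0.40 inbound nonce-B
172.17.0.40 outbound nonce-A'

# lookup_nonce ADDR: return the nonce of the first connection found for ADDR.
# The address alone cannot tell the two connections apart.
lookup_nonce() {
    printf '%s\n' "$conns" | awk -v addr="$1" '$1 == addr { print $3; exit }'
}

# try_decrypt SENDER_NONCE ADDR: "decryption" succeeds only when the nonce the
# sender used matches the nonce of the connection the lookup returned.
try_decrypt() {
    if [ "$1" = "$(lookup_nonce "$2")" ]; then
        echo "decrypted"
    else
        echo "Unable to decrypt UDP packet"
    fi
}

# The sender used the nonce from the other connection, so this prints
# "Unable to decrypt UDP packet".
try_decrypt nonce-A 172.17.0.40
```

In this model, the failure goes away if the lookup is keyed on something that identifies the specific connection rather than only the address; whether the actual fix took that form is not stated in this thread.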

@rade rade assigned rade and unassigned errordeveloper Apr 15, 2015
@thomascramer
Author

What is interesting about the issue is this: it would be one thing if a new server, X, came up, registered with servers A, B, and C, and for whatever reason caused timeout issues with A, B, and C. However, it seems like the issue starts to spread, and suddenly A, B, and C, which were connected and communicating just fine, start to "time out" and basically become unestablished to each other. I'm not that familiar with Go, though I have been looking at your actor setup; but wondering if there could also be an issue that when an actor or whatever is dealing with the "issues" from one "bad egg", it in fact hinders its ability to keep up with the other requests it is processing?

@thomascramer
Author

Also, this may be a dumb question, but is it possible to force the communication to be only tcp based? I don't see an obvious option for that, but it does look like it tries different options in communicating. Or would working solely via tcp be too much overhead?

@rade
Member

rade commented Apr 15, 2015

wondering if there could also be an issue that when an actor or whatever is dealing with the "issues" from one "bad egg"

#564

is it possible to force the communication to be only tcp based

#443

bboreham added a commit that referenced this issue Apr 16, 2015
Big improvement in observed behaviour.  LGTM; closes #515
@rade
Member

rade commented Apr 16, 2015

@thomascramer I've just published another new set of images (git-36b13b704df4), which should hopefully fix the problem. As before, please grab the latest weave script and run weave setup.

@thomascramer
Author

@rade, thanks. I've been putting it through its paces here today, and so far so good.

@awh awh added the bug label Oct 6, 2016