This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

new peers cannot join when connection limit has been reached #426

Open
bboreham opened this issue Feb 26, 2015 · 13 comments

Comments

@bboreham
Contributor

bboreham commented Feb 26, 2015

We've since increased the default limit to ~~30~~ ~~100~~ 200.

If you have 11 peers all fully-connected, then a 12th peer is unable to make a connection. This is because there is a limit of 10 connections per peer and all of them are at that limit.
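The arithmetic can be illustrated with a minimal sketch. This is not weave's connection logic; `simulate_joins` is a hypothetical helper that assumes peers join one at a time, that each peer tries to connect to every earlier peer, and that the limit counts all of a peer's connections regardless of direction.

```python
from typing import Dict, List, Set


def simulate_joins(num_peers: int, conn_limit: int) -> List[int]:
    """Return the peers that found no free slot on any earlier peer.

    Simplified model (an assumption, not weave's implementation):
    peers join in order 0, 1, 2, ... and a connection is made only
    when both ends are below conn_limit.
    """
    connections: Dict[int, Set[int]] = {}
    rejected: List[int] = []
    for new in range(num_peers):
        connections[new] = set()
        for existing in range(new):
            both_have_room = (
                len(connections[existing]) < conn_limit
                and len(connections[new]) < conn_limit
            )
            if both_have_room:
                connections[existing].add(new)
                connections[new].add(existing)
        # The very first peer has nobody to connect to, so skip it.
        if new > 0 and not connections[new]:
            rejected.append(new)
    return rejected


print(simulate_joins(11, 10))  # [] : 11 peers form a full mesh
print(simulate_joins(12, 10))  # [11] : the 12th peer is refused by everyone
```

With a limit of 10, the eleventh peer uses up the last free slot on each of the other ten, so the twelfth has nowhere to connect.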

Evidence:

weave status from one of 11 connected peers:

weave router git-495419e28c2d
Encryption off
Our name is 7e:ac:3f:97:e5:4d (testweave1)
Sniffing traffic on &{2513 1500 ethwe 7e:ac:3f:97:e5:4d up|broadcast}
MACs:
7e:ac:3f:97:e5:4d -> 7e:ac:3f:97:e5:4d (2015-02-26 11:08:16.482111269 +0000 UTC)
3e:0e:81:a4:96:8f -> 2a:7a:39:40:c1:d8 (2015-02-26 11:08:20.977117117 +0000 UTC)
32:28:c3:82:3e:1b -> a6:e6:bd:a7:b8:a0 (2015-02-26 11:08:21.188593636 +0000 UTC)
02:f9:02:d8:5f:38 -> c2:30:43:55:19:5f (2015-02-26 11:08:21.775328828 +0000 UTC)
a2:c1:e7:e1:96:d8 -> 7e:ac:3f:97:e5:4d (2015-02-26 11:08:17.32618667 +0000 UTC)
ce:42:ec:fe:e2:51 -> c2:e1:f5:a0:42:01 (2015-02-26 11:08:20.846974172 +0000 UTC)
f2:df:f3:4f:84:6e -> 5a:34:38:a6:05:8f (2015-02-26 11:08:21.193589022 +0000 UTC)
56:a9:91:8d:95:b3 -> 46:25:59:ec:27:5c (2015-02-26 11:08:21.200578214 +0000 UTC)
fa:11:59:92:31:b4 -> 0a:9a:bd:ac:d7:16 (2015-02-26 11:08:21.200831703 +0000 UTC)
7e:ad:b1:79:a4:7e -> fe:37:85:f1:94:99 (2015-02-26 11:08:21.201126642 +0000 UTC)
a6:5f:db:b4:0e:65 -> 62:7c:f7:a2:40:8b (2015-02-26 11:08:21.201333412 +0000 UTC)
ce:77:55:85:a0:c8 -> 2a:5a:9c:ac:05:a9 (2015-02-26 11:08:21.794248927 +0000 UTC)
Peers:
Peer 0a:9a:bd:ac:d7:16 (testweave7) (v28) (UID 12198133855536853396)
   -> c2:30:43:55:19:5f (testweave3) [172.17.2.124:56900]
   -> 2a:5a:9c:ac:05:a9 (testweave4) [172.17.2.125:6783]
   -> a6:e6:bd:a7:b8:a0 (testweave6) [172.17.2.127:46153]
   -> 5a:34:38:a6:05:8f (testweave8) [172.17.2.129:6783]
   -> 62:7c:f7:a2:40:8b (testweave10) [172.17.2.131:6783]
   -> 2a:7a:39:40:c1:d8 (testweave9) [172.17.2.130:47788]
   -> fe:37:85:f1:94:99 (testweave0) [172.17.2.121:44456]
   -> 7e:ac:3f:97:e5:4d (testweave1) [172.17.2.122:6783]
   -> 46:25:59:ec:27:5c (testweave5) [172.17.2.126:6783]
   -> c2:e1:f5:a0:42:01 (testweave2) [172.17.2.123:6783]
Peer 7e:ac:3f:97:e5:4d (testweave1) (v28) (UID 9345990439251632325)
   -> 2a:7a:39:40:c1:d8 (testweave9) [172.17.2.130:6783]
   -> a6:e6:bd:a7:b8:a0 (testweave6) [172.17.2.127:6783]
   -> 2a:5a:9c:ac:05:a9 (testweave4) [172.17.2.125:6783]
   -> fe:37:85:f1:94:99 (testweave0) [172.17.2.121:55913]
   -> c2:30:43:55:19:5f (testweave3) [172.17.2.124:6783]
   -> 62:7c:f7:a2:40:8b (testweave10) [172.17.2.131:43878]
   -> 0a:9a:bd:ac:d7:16 (testweave7) [172.17.2.128:50433]
   -> 5a:34:38:a6:05:8f (testweave8) [172.17.2.129:6783]
   -> c2:e1:f5:a0:42:01 (testweave2) [172.17.2.123:6783]
   -> 46:25:59:ec:27:5c (testweave5) [172.17.2.126:6783]
Peer c2:30:43:55:19:5f (testweave3) (v28) (UID 12798035255233588601)
   -> 2a:5a:9c:ac:05:a9 (testweave4) [172.17.2.125:6783]
   -> 7e:ac:3f:97:e5:4d (testweave1) [172.17.2.122:41621]
   -> fe:37:85:f1:94:99 (testweave0) [172.17.2.121:37034]
   -> 62:7c:f7:a2:40:8b (testweave10) [172.17.2.131:34423]
   -> 2a:7a:39:40:c1:d8 (testweave9) [172.17.2.130:45018]
   -> 0a:9a:bd:ac:d7:16 (testweave7) [172.17.2.128:6783]
   -> 5a:34:38:a6:05:8f (testweave8) [172.17.2.129:46177]
   -> a6:e6:bd:a7:b8:a0 (testweave6) [172.17.2.127:40224]
   -> 46:25:59:ec:27:5c (testweave5) [172.17.2.126:6783]
   -> c2:e1:f5:a0:42:01 (testweave2) [172.17.2.123:58442]
Peer 2a:5a:9c:ac:05:a9 (testweave4) (v34) (UID 16079008686669939436)
   -> 5a:34:38:a6:05:8f (testweave8) [172.17.2.129:58075]
   -> 62:7c:f7:a2:40:8b (testweave10) [172.17.2.131:48261]
   -> 2a:7a:39:40:c1:d8 (testweave9) [172.17.2.130:6783]
   -> a6:e6:bd:a7:b8:a0 (testweave6) [172.17.2.127:48315]
   -> c2:30:43:55:19:5f (testweave3) [172.17.2.124:40917]
   -> 7e:ac:3f:97:e5:4d (testweave1) [172.17.2.122:33079]
   -> fe:37:85:f1:94:99 (testweave0) [172.17.2.121:6783]
   -> 0a:9a:bd:ac:d7:16 (testweave7) [172.17.2.128:36718]
   -> 46:25:59:ec:27:5c (testweave5) [172.17.2.126:6783]
   -> c2:e1:f5:a0:42:01 (testweave2) [172.17.2.123:6783]
Peer fe:37:85:f1:94:99 (testweave0) (v27) (UID 4476469674713649865)
   -> 2a:5a:9c:ac:05:a9 (testweave4) [172.17.2.125:49958 (unestablished)]
   -> 2a:7a:39:40:c1:d8 (testweave9) [172.17.2.130:50302]
   -> 5a:34:38:a6:05:8f (testweave8) [172.17.2.129:55964]
   -> 0a:9a:bd:ac:d7:16 (testweave7) [172.17.2.128:6783]
   -> a6:e6:bd:a7:b8:a0 (testweave6) [172.17.2.127:6783]
   -> 62:7c:f7:a2:40:8b (testweave10) [172.17.2.131:6783]
   -> 7e:ac:3f:97:e5:4d (testweave1) [172.17.2.122:6783]
   -> c2:30:43:55:19:5f (testweave3) [172.17.2.124:6783]
   -> c2:e1:f5:a0:42:01 (testweave2) [172.17.2.123:56589]
   -> 46:25:59:ec:27:5c (testweave5) [172.17.2.126:6783]
Peer a6:e6:bd:a7:b8:a0 (testweave6) (v28) (UID 747023844343995687)
   -> 2a:7a:39:40:c1:d8 (testweave9) [172.17.2.130:57992]
   -> 5a:34:38:a6:05:8f (testweave8) [172.17.2.129:56465]
   -> fe:37:85:f1:94:99 (testweave0) [172.17.2.121:43702]
   -> 7e:ac:3f:97:e5:4d (testweave1) [172.17.2.122:33599]
   -> c2:30:43:55:19:5f (testweave3) [172.17.2.124:6783]
   -> 2a:5a:9c:ac:05:a9 (testweave4) [172.17.2.125:6783]
   -> 0a:9a:bd:ac:d7:16 (testweave7) [172.17.2.128:6783]
   -> 62:7c:f7:a2:40:8b (testweave10) [172.17.2.131:51252]
   -> 46:25:59:ec:27:5c (testweave5) [172.17.2.126:59566]
   -> c2:e1:f5:a0:42:01 (testweave2) [172.17.2.123:6783]
Peer 2a:7a:39:40:c1:d8 (testweave9) (v26) (UID 6096942486184792905)
   -> fe:37:85:f1:94:99 (testweave0) [172.17.2.121:6783]
   -> 62:7c:f7:a2:40:8b (testweave10) [172.17.2.131:6783]
   -> 5a:34:38:a6:05:8f (testweave8) [172.17.2.129:36469]
   -> 0a:9a:bd:ac:d7:16 (testweave7) [172.17.2.128:6783]
   -> a6:e6:bd:a7:b8:a0 (testweave6) [172.17.2.127:6783]
   -> 2a:5a:9c:ac:05:a9 (testweave4) [172.17.2.125:60148]
   -> c2:30:43:55:19:5f (testweave3) [172.17.2.124:6783]
   -> 7e:ac:3f:97:e5:4d (testweave1) [172.17.2.122:57432]
   -> c2:e1:f5:a0:42:01 (testweave2) [172.17.2.123:6783]
   -> 46:25:59:ec:27:5c (testweave5) [172.17.2.126:6783]
Peer 62:7c:f7:a2:40:8b (testweave10) (v33) (UID 2318310089864029143)
   -> fe:37:85:f1:94:99 (testweave0) [172.17.2.121:56714]
   -> 2a:7a:39:40:c1:d8 (testweave9) [172.17.2.130:38057]
   -> 0a:9a:bd:ac:d7:16 (testweave7) [172.17.2.128:47515]
   -> 5a:34:38:a6:05:8f (testweave8) [172.17.2.129:6783]
   -> a6:e6:bd:a7:b8:a0 (testweave6) [172.17.2.127:6783]
   -> c2:30:43:55:19:5f (testweave3) [172.17.2.124:6783]
   -> 7e:ac:3f:97:e5:4d (testweave1) [172.17.2.122:6783]
   -> 2a:5a:9c:ac:05:a9 (testweave4) [172.17.2.125:6783]
   -> c2:e1:f5:a0:42:01 (testweave2) [172.17.2.123:42999]
   -> 46:25:59:ec:27:5c (testweave5) [172.17.2.126:56942]
Peer 5a:34:38:a6:05:8f (testweave8) (v30) (UID 3150229553387776875)
   -> c2:30:43:55:19:5f (testweave3) [172.17.2.124:6783]
   -> fe:37:85:f1:94:99 (testweave0) [172.17.2.121:6783]
   -> 7e:ac:3f:97:e5:4d (testweave1) [172.17.2.122:46493]
   -> 0a:9a:bd:ac:d7:16 (testweave7) [172.17.2.128:52245]
   -> 2a:7a:39:40:c1:d8 (testweave9) [172.17.2.130:6783]
   -> 62:7c:f7:a2:40:8b (testweave10) [172.17.2.131:55504]
   -> a6:e6:bd:a7:b8:a0 (testweave6) [172.17.2.127:6783]
   -> 2a:5a:9c:ac:05:a9 (testweave4) [172.17.2.125:6783]
   -> 46:25:59:ec:27:5c (testweave5) [172.17.2.126:35803]
   -> c2:e1:f5:a0:42:01 (testweave2) [172.17.2.123:6783]
Peer c2:e1:f5:a0:42:01 (testweave2) (v34) (UID 17627454856843464028)
   -> 46:25:59:ec:27:5c (testweave5) [172.17.2.126:41859]
   -> 5a:34:38:a6:05:8f (testweave8) [172.17.2.129:45829]
   -> 62:7c:f7:a2:40:8b (testweave10) [172.17.2.131:6783]
   -> 0a:9a:bd:ac:d7:16 (testweave7) [172.17.2.128:34186]
   -> 7e:ac:3f:97:e5:4d (testweave1) [172.17.2.122:42741]
   -> c2:30:43:55:19:5f (testweave3) [172.17.2.124:6783]
   -> 2a:5a:9c:ac:05:a9 (testweave4) [172.17.2.125:39211]
   -> fe:37:85:f1:94:99 (testweave0) [172.17.2.121:6783]
   -> 2a:7a:39:40:c1:d8 (testweave9) [172.17.2.130:42466]
   -> a6:e6:bd:a7:b8:a0 (testweave6) [172.17.2.127:49836]
Peer 46:25:59:ec:27:5c (testweave5) (v28) (UID 14892158120879598741)
   -> 2a:5a:9c:ac:05:a9 (testweave4) [172.17.2.125:52043]
   -> 7e:ac:3f:97:e5:4d (testweave1) [172.17.2.122:59420]
   -> c2:30:43:55:19:5f (testweave3) [172.17.2.124:55530]
   -> a6:e6:bd:a7:b8:a0 (testweave6) [172.17.2.127:6783]
   -> fe:37:85:f1:94:99 (testweave0) [172.17.2.121:49016]
   -> 0a:9a:bd:ac:d7:16 (testweave7) [172.17.2.128:45583]
   -> 62:7c:f7:a2:40:8b (testweave10) [172.17.2.131:6783]
   -> 2a:7a:39:40:c1:d8 (testweave9) [172.17.2.130:47410]
   -> 5a:34:38:a6:05:8f (testweave8) [172.17.2.129:6783]
   -> c2:e1:f5:a0:42:01 (testweave2) [172.17.2.123:6783]
Routes:
unicast:
46:25:59:ec:27:5c -> 46:25:59:ec:27:5c
c2:e1:f5:a0:42:01 -> c2:e1:f5:a0:42:01
2a:7a:39:40:c1:d8 -> 2a:7a:39:40:c1:d8
a6:e6:bd:a7:b8:a0 -> a6:e6:bd:a7:b8:a0
2a:5a:9c:ac:05:a9 -> 2a:5a:9c:ac:05:a9
fe:37:85:f1:94:99 -> fe:37:85:f1:94:99
c2:30:43:55:19:5f -> c2:30:43:55:19:5f
62:7c:f7:a2:40:8b -> 62:7c:f7:a2:40:8b
0a:9a:bd:ac:d7:16 -> 0a:9a:bd:ac:d7:16
7e:ac:3f:97:e5:4d -> 00:00:00:00:00:00
5a:34:38:a6:05:8f -> 5a:34:38:a6:05:8f
broadcast:
c2:e1:f5:a0:42:01 -> []
46:25:59:ec:27:5c -> []
0a:9a:bd:ac:d7:16 -> []
7e:ac:3f:97:e5:4d -> [c2:e1:f5:a0:42:01 46:25:59:ec:27:5c a6:e6:bd:a7:b8:a0 2a:5a:9c:ac:05:a9 fe:37:85:f1:94:99 c2:30:43:55:19:5f 62:7c:f7:a2:40:8b 0a:9a:bd:ac:d7:16 5a:34:38:a6:05:8f 2a:7a:39:40:c1:d8]
c2:30:43:55:19:5f -> []
2a:5a:9c:ac:05:a9 -> []
fe:37:85:f1:94:99 -> []
a6:e6:bd:a7:b8:a0 -> []
2a:7a:39:40:c1:d8 -> []
62:7c:f7:a2:40:8b -> []
5a:34:38:a6:05:8f -> []
Reconnects:

Log from new peer:

weave 2015/02/26 11:11:58.716929 Command line options: map[iface:ethwe name:7a:91:bb:cb:02:e6 nickname:number12 wait:20]
weave 2015/02/26 11:11:58.717151 Command line peers: [172.17.2.123]
weave 2015/02/26 11:11:59.717809 Communication between peers is unencrypted.
weave 2015/02/26 11:11:59.718067 Our name is 7a:91:bb:cb:02:e6 (number12)
weave 2015/02/26 11:11:59.753879 Sniffing traffic on &{2557 65535 ethwe b6:a8:01:69:e4:a2 up|broadcast|multicast}
weave 2015/02/26 11:11:59.753929 Discovered our MAC b6:a8:01:69:e4:a2
weave 2015/02/26 11:11:59.755044 ->[172.17.2.123:6783] attempting connection
weave 2015/02/26 11:11:59.756032 ->[172.17.2.123:6783] completed handshake with c2:e1:f5:a0:42:01
weave 2015/02/26 11:11:59.756510 ->[c2:e1:f5:a0:42:01]: connection shutting down due to error: write tcp4 172.17.2.123:6783: connection reset by peer
weave 2015/02/26 11:11:59.756913 ->[c2:e1:f5:a0:42:01]: connection added
weave 2015/02/26 11:11:59.757022 Removed unreachable Peer c2:e1:f5:a0:42:01 (testweave2) (v0) (UID 17627454856843464028)
weave 2015/02/26 11:11:59.757130 ->[c2:e1:f5:a0:42:01]: connection deleted
weave 2015/02/26 11:12:00.377456 Discovered local MAC 4a:c9:42:10:f3:14
weave 2015/02/26 11:12:02.533374 ->[172.17.2.123:6783] attempting connection
weave 2015/02/26 11:12:02.534473 ->[172.17.2.123:6783] completed handshake with c2:e1:f5:a0:42:01
weave 2015/02/26 11:12:02.535093 ->[c2:e1:f5:a0:42:01]: connection shutting down due to error: write tcp4 172.17.2.123:6783: connection reset by peer
weave 2015/02/26 11:12:02.535223 ->[c2:e1:f5:a0:42:01]: connection added
weave 2015/02/26 11:12:02.535269 Removed unreachable Peer c2:e1:f5:a0:42:01 (testweave2) (v0) (UID 17627454856843464028)
weave 2015/02/26 11:12:02.535290 ->[c2:e1:f5:a0:42:01]: connection deleted

weave status from new peer:

weave router git-495419e28c2d
Encryption off
Our name is 7a:91:bb:cb:02:e6 (number12)
Sniffing traffic on &{2557 65535 ethwe b6:a8:01:69:e4:a2 up|broadcast|multicast}
MACs:
b6:a8:01:69:e4:a2 -> 7a:91:bb:cb:02:e6 (2015-02-26 11:11:59.753923742 +0000 UTC)
4a:c9:42:10:f3:14 -> 7a:91:bb:cb:02:e6 (2015-02-26 11:12:00.377420763 +0000 UTC)
Peers:
Peer 7a:91:bb:cb:02:e6 (number12) (v14) (UID 15673637195766780103)
Routes:
unicast:
7a:91:bb:cb:02:e6 -> 00:00:00:00:00:00
broadcast:
7a:91:bb:cb:02:e6 -> []
Reconnects:
172.17.2.123:6783 (next try at 2015-02-26 11:12:58.046571925 +0000 UTC)

Logs from existing peer:

weave 2015/02/26 11:11:59.755390 ->[172.17.2.133:42194] connection accepted
weave 2015/02/26 11:11:59.756041 ->[172.17.2.133:42194] completed handshake with 7a:91:bb:cb:02:e6
weave 2015/02/26 11:11:59.756138 ->[7a:91:bb:cb:02:e6]: connection shutting down due to error: Connection limit reached (10)
weave 2015/02/26 11:12:02.533778 ->[172.17.2.133:42196] connection accepted
weave 2015/02/26 11:12:02.534486 ->[172.17.2.133:42196] completed handshake with 7a:91:bb:cb:02:e6
weave 2015/02/26 11:12:02.534626 ->[7a:91:bb:cb:02:e6]: connection shutting down due to error: Connection limit reached (10)
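Note that the reason for the rejection appears only in the existing peer's log; the new peer just sees "connection reset by peer". A minimal sketch for spotting it, assuming the log message format shown above (the function name and regex are illustrative and not part of weave):

```python
import re
from typing import Iterable, List, Tuple

# Matches lines such as:
#   ->[7a:91:bb:cb:02:e6]: connection shutting down due to error: Connection limit reached (10)
LIMIT_REACHED = re.compile(
    r"->\[(?P<peer>[0-9a-f:]+)\]: connection shutting down due to error: "
    r"Connection limit reached \((?P<limit>\d+)\)"
)


def find_limit_rejections(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (rejected peer name, limit) for each matching log line."""
    hits = []
    for line in lines:
        match = LIMIT_REACHED.search(line)
        if match:
            hits.append((match.group("peer"), int(match.group("limit"))))
    return hits


sample = [
    "weave 2015/02/26 11:11:59.756041 ->[172.17.2.133:42194] completed handshake with 7a:91:bb:cb:02:e6",
    "weave 2015/02/26 11:11:59.756138 ->[7a:91:bb:cb:02:e6]: connection shutting down due to error: Connection limit reached (10)",
]
print(find_limit_rejections(sample))  # [('7a:91:bb:cb:02:e6', 10)]
```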
@bboreham bboreham added the bug label Feb 26, 2015
@rade
Member

rade commented Mar 6, 2015

As an interim improvement, any objections to raising the connection limit to 100? That's 10k connections in a fully connected network, which I should think modern network kit has no trouble handling.
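For reference, a quick check of that figure. The helper name is made up for illustration, and the calculation assumes the largest full mesh a limit allows is limit + 1 peers. The "10k" counts each connection once per end; the number of distinct links is half that.

```python
from typing import Tuple


def full_mesh_counts(conn_limit: int) -> Tuple[int, int, int]:
    """Return (peers, per-peer connection slots in use, distinct links)
    for the largest full mesh that fits under conn_limit."""
    peers = conn_limit + 1          # each peer connects to every other peer
    endpoints = peers * conn_limit  # every peer holds conn_limit connections
    links = endpoints // 2          # each link is shared by two peers
    return peers, endpoints, links


print(full_mesh_counts(100))  # (101, 10100, 5050)
print(full_mesh_counts(10))   # (11, 110, 55)
```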

Beyond that... some algorithm that decides whether to "bump" existing connections for new ones based on "connectivity":

  1. a connection to a new peer, i.e. one that no other peer is connected to, bumps any connection to a peer that is connected to other peer(s).
  2. a connection to a peer which provides the sole route to/from that peer for some other peer, bumps any connection for which that isn't the case.

and a few more like that. The objective is to maximise routability between peers and to minimise hop count. That should do for starters; we can get more clever later, taking into account latencies, bandwidth, congestion, etc, etc.
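A rough sketch of how rule 1 alone might look. This is only an illustration of the proposal, not anything weave implements; the function names are hypothetical, and the topology is assumed to be a mapping from peer name to the set of peers it is connected to.

```python
from typing import Dict, Optional, Set

Topology = Dict[str, Set[str]]


def is_new_peer(candidate: str, local: str, topology: Topology) -> bool:
    """True if no peer other than `local` reports a connection to `candidate`."""
    return not any(
        candidate in neighbours
        for peer, neighbours in topology.items()
        if peer not in (local, candidate)
    )


def connection_to_bump(local: str, candidate: str, topology: Topology) -> Optional[str]:
    """Rule 1 only: pick an existing connection of `local` to drop so that a
    brand-new `candidate` can be admitted, or return None to refuse it."""
    if not is_new_peer(candidate, local, topology):
        return None  # someone else already connects to it; rule 1 does not apply
    for peer in sorted(topology.get(local, set())):
        other_links = topology.get(peer, set()) - {local}
        if other_links:  # peer stays reachable through somebody else
            return peer
    return None


mesh = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
print(connection_to_bump("a", "d", mesh))  # b
```

Here `a` would drop its link to `b`, which stays reachable via `c`, to make room for the new peer `d`. Rule 2 (sole routes) would need a reachability check on top of this.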

@pixie79

pixie79 commented Mar 11, 2015

Sounds like a reasonable solution. If I can get weave working well for my test scenario, we could easily have a network of thousands.

@fermayo

fermayo commented Mar 19, 2015

+1 to allow more than 11 peers to join the network ;-)

@bboreham
Contributor Author

Note that there is a command-line parameter -connlimit which lets you override the default (I didn't know that when I filed this issue).
However, when testing with higher peer numbers we saw other issues - #445 - so leaving the default limit of 10 unchanged until we are confident they are resolved.

rade added a commit that referenced this issue Apr 2, 2015
I can get 29 weaves running on my laptop, and once all connections are
established, the avg cpu load is ~10%. 30 runs into difficulty because
the concurrent connection establishment does create a very high load,
resulting in some timeouts. But that wouldn't happen when running each
weave in a separate machine (with more CPU resources overall than my
single laptop).

Stop gap solution for #426.
@rade rade changed the title from "Weave number 12 is unable to join a network" to "Weave number 32 is unable to join a network" Apr 2, 2015
@rade rade changed the title from "Weave number 32 is unable to join a network" to "new peers cannot join when connection limit has been reached" Apr 12, 2015
@rade
Member

rade commented Apr 12, 2015

Recent changes have made weave much better behaved when there are large numbers of fully-connected peers. I can run 50 fully-connected weaves on my laptop, and could no doubt go much further if each weave was on a separate machine, as is typically the case.

Therefore in most deployments of >30 peers it should be possible to avoid the problems described in this issue by supplying an increased connlimit. I doubt one would be able to wire up hundreds of peers that way, but there are probably other issues with such large deployments anyway.

@inetfuture

Just hit this issue: new nodes could not join, and it took some time to debug because we were wondering whether our machines had network issues...

This limit really should be well documented, especially in site/troubleshooting.md

@llarsson

Please make it possible to set -connlimit in a way that is compatible with how the Kubernetes Add-on is configured; as it stands, that is needlessly hard to do.

@bboreham
Contributor Author

bboreham commented Jan 12, 2018

@llarsson it can be configured via an environment variable CONN_LIMIT, but I see this is not in the docs. (edit: fixed in d6ae1f8)

Since this variable existed before you posted your comment, perhaps I am missing your point. It may be better to open a new issue and explain there.

@llarsson

Since I did not perform a deep dive into the code, but rather just assumed that the docs would mention how to modify such a crucial value, I honestly believed based on your edit in the first post in this thread ("We've since increased the default limit to 30.") that the only way to modify the level would be to recompile and push my own image.

What happens in an auto-scaling scenario, when my once-correct connection limit value is no longer valid? Can I update my weave network in a rolling fashion with new CONN_LIMIT values and have everything keep working?

@bboreham
Contributor Author

Sorry this was missed from the docs. We should really set the default a bit higher - it's set very cautiously because we were interested to hear from users what sorts of sizes and topologies they had, but 20 million "pulls" later that's not happening.

Yes, rolling update is fine - there will be a brief break in communications on each node as the routing and policy rules are recreated, but this should be invisible because higher-level protocols (e.g. TCP) will retransmit.

@itskingori

@bboreham About this issue ...

> We should really set the default a bit higher - it's set very cautiously because we were interested to hear from users what sorts of sizes and topologies they had, ...

We had a cluster scale up to 33 nodes yesterday. 2 nodes could not connect to the rest of the cluster. We fixed it by setting the CONN_LIMIT to 100. We're not 💯% sure of the implications of this.

> However, when testing with higher peer numbers we saw other issues - #445 - so leaving the default limit of 10 unchanged until we are confident they are resolved.

Should we worry about this? 😅

@bboreham
Contributor Author

bboreham commented Feb 8, 2018

No, don't worry, that comment was nearly three years ago, and the bug was fixed.

Default limit will be raised to 100 by #3234.

@itskingori

@bboreham Perfect! Thanks. 🙌
