This repository was archived by the owner on Jun 20, 2024. It is now read-only.

Kubernetes weave container killed OOM #3650

Closed
chrisghill opened this issue Jun 13, 2019 · 6 comments

Comments

@chrisghill

What you expected to happen?

I expected the weave container to have stable memory usage. From some searching, it appears weave had a memory leak in a previous version (2.3?), but I'm on 2.5.1, so I'm wondering whether there are other sources of runaway memory usage.

What happened?

TL;DR: Memory usage of the weave container on one of our Kubernetes nodes grew slowly until the container was OOM-killed, causing network issues in the cluster.

Long version: We've been experiencing periodic network issues in our cluster, roughly every couple of weeks. Networking fails in parts of the cluster, and the cause has been unclear. I think it may be related to this. Pod metrics show that the weave container's memory grew slowly over the course of a few hours until it was OOM-killed at 200 MiB. After the crash the container was restarted, but the cluster did not appear to recover to a healthy state. I suspect the weave containers on other nodes failed to recognize that the node had recovered. Or perhaps the container restart didn't fully recover the node?

[Image: pod metrics graph]

You can see that even after the container restarted, network traffic didn't recover until a few hours later. It's unclear what caused the recovery, as we were attempting multiple things, none of them directly related to weave; mostly we were restarting other pods.
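For anyone wanting to catch this kind of slow growth before the kill, here is a minimal sketch (not part of weave or Kubernetes) that fits a straight line to memory samples and estimates how long until a limit is reached. The 200 MiB limit comes from this report; the function name `hours_until_limit` and the sample values are hypothetical and only illustrate the arithmetic.

```python
from typing import Optional, Sequence, Tuple


def hours_until_limit(
    samples: Sequence[Tuple[float, float]], limit_mib: float
) -> Optional[float]:
    """Estimate hours from the last sample until memory reaches limit_mib.

    samples holds (hours_since_start, memory_mib) pairs. A least-squares
    line is fitted through them. Returns None when there are too few
    samples or memory is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / var_t
    if slope <= 0:
        return None  # flat or shrinking usage never hits the limit
    intercept = mean_m - slope * mean_t
    t_limit = (limit_mib - intercept) / slope
    # Clamp to zero if the fitted line has already crossed the limit.
    return max(0.0, t_limit - samples[-1][0])


# Hypothetical samples, NOT taken from the graph in this issue.
samples = [(0.0, 80.0), (1.0, 100.0), (2.0, 120.0), (3.0, 140.0)]
print(hours_until_limit(samples, limit_mib=200.0))
```

A linear fit is the simplest possible model; real growth may be stepwise or accelerating, so treat the estimate as a rough early warning rather than a prediction.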

How to reproduce it?

Unknown

Anything else we need to know?

AWS, Kops v. 1.12, Kubernetes version 1.12.9, Weave container v. 2.5.1.
From our cluster spec, this is about all I can see referencing weave:

  networking:
    weave:
      mtu: 8912
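The `mtu` value above shows up as `mtu:8912` in the weave command-line options logged at startup, next to `ipalloc-range:100.96.0.0/11`. As a quick sanity check when reading the logs below, this small sketch uses Python's standard `ipaddress` module to confirm that the two addresses that keep appearing (100.122.0.0 and 100.106.0.0) both fall inside that allocation range. The helper name `in_alloc_range` is made up for illustration.

```python
import ipaddress

# Values copied from the weave startup log lines in this report.
IPALLOC_RANGE = ipaddress.ip_network("100.96.0.0/11")
LOGGED_ADDRESSES = ["100.122.0.0", "100.106.0.0"]


def in_alloc_range(addr: str, network=IPALLOC_RANGE) -> bool:
    """Return True if addr lies inside the weave IP allocation range."""
    return ipaddress.ip_address(addr) in network


for addr in LOGGED_ADDRESSES:
    print(addr, in_alloc_range(addr))
```

Both addresses being in range only rules out a misconfigured subnet; it says nothing about which peer should own each sub-range.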

Versions:

$ weave version - Container is version 2.5.1
$ docker version
Client:
 Version:           18.06.3-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        d7080c1
 Built:             Wed Feb 20 02:28:26 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.3-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       d7080c1
  Built:            Wed Feb 20 02:26:51 2019
  OS/Arch:          linux/amd64
  Experimental:     false

$ uname -a
Linux ip-172-20-77-63 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1 (2019-04-12) x86_64 GNU/Linux

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.9", GitCommit:"e09f5c40b55c91f681a46ee17f9bc447eeacee57", GitTreeState:"clean", BuildDate:"2019-05-27T16:08:57Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.9", GitCommit:"e09f5c40b55c91f681a46ee17f9bc447eeacee57", GitTreeState:"clean", BuildDate:"2019-05-27T15:58:45Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

Logs:

Failed container logs

INFO: 2019/06/12 03:27:22.542439 overlay_switch ->[ee:e4:3b:6c:fc:89(ip-172-20-120-236.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:27:35.250176 ->[172.20.115.28:48309|42:a4:3d:9d:c2:74(ip-172-20-115-28.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.106.0.0 v16954: incoming message says owner 3e:15:4e:5d:6e:ee v16959
INFO: 2019/06/12 03:27:35.250225 ->[172.20.115.28:48309|42:a4:3d:9d:c2:74(ip-172-20-115-28.us-west-2.compute.internal)]: connection deleted
INFO: 2019/06/12 03:27:35.254492 ->[172.20.115.28:6783] attempting connection
INFO: 2019/06/12 03:27:35.257751 ->[172.20.115.28:6783|42:a4:3d:9d:c2:74(ip-172-20-115-28.us-west-2.compute.internal)]: connection added
INFO: 2019/06/12 03:27:35.257733 overlay_switch ->[42:a4:3d:9d:c2:74(ip-172-20-115-28.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:27:35.257675 ->[172.20.115.28:6783|42:a4:3d:9d:c2:74(ip-172-20-115-28.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/06/12 03:27:35.259257 overlay_switch ->[42:a4:3d:9d:c2:74(ip-172-20-115-28.us-west-2.compute.internal)] using sleeve
INFO: 2019/06/12 03:27:35.259276 ->[172.20.115.28:6783|42:a4:3d:9d:c2:74(ip-172-20-115-28.us-west-2.compute.internal)]: connection fully established
INFO: 2019/06/12 03:27:35.259260 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/06/12 03:27:35.260272 sleeve ->[172.20.115.28:6783|42:a4:3d:9d:c2:74(ip-172-20-115-28.us-west-2.compute.internal)]: Effective MTU verified at 8939
INFO: 2019/06/12 03:27:35.259983 overlay_switch ->[42:a4:3d:9d:c2:74(ip-172-20-115-28.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:27:56.374438 ->[172.20.120.225:46433|22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.106.0.0 v16954: incoming message says owner 3e:15:4e:5d:6e:ee v16959
INFO: 2019/06/12 03:27:56.374481 ->[172.20.120.225:46433|22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)]: connection deleted
INFO: 2019/06/12 03:27:56.377088 ->[172.20.120.225:48039] connection accepted
INFO: 2019/06/12 03:27:56.377613 ->[172.20.120.225:6783] attempting connection
INFO: 2019/06/12 03:27:56.378233 ->[172.20.120.225:48039|22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/06/12 03:27:56.378289 overlay_switch ->[22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:27:56.378305 ->[172.20.120.225:48039|22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)]: connection added
INFO: 2019/06/12 03:27:56.379526 ->[172.20.120.225:48039|22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)]: connection fully established
INFO: 2019/06/12 03:27:56.380558 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/06/12 03:27:56.380639 ->[172.20.120.225:6783|22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/06/12 03:27:56.380690 overlay_switch ->[22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:27:56.380709 ->[172.20.120.225:48039|22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)]: connection deleted
INFO: 2019/06/12 03:27:56.380747 ->[172.20.120.225:48039|22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)]: connection shutting down due to error: Multiple connections to 22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal) added to c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)
INFO: 2019/06/12 03:27:56.380830 overlay_switch ->[22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)] sleeve write tcp4 172.20.77.63:6783->172.20.120.225:48039: use of closed network connection
INFO: 2019/06/12 03:27:56.380862 ->[172.20.120.225:6783|22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)]: connection added
INFO: 2019/06/12 03:27:56.384456 ->[172.20.120.225:6783|22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)]: connection fully established
INFO: 2019/06/12 03:27:56.882355 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/06/12 03:27:56.883463 sleeve ->[172.20.120.225:6783|22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)]: Effective MTU verified at 8939
INFO: 2019/06/12 03:28:00.004620 ->[172.20.92.152:62679|ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.106.0.0 v16954: incoming message says owner 3e:15:4e:5d:6e:ee v16959
INFO: 2019/06/12 03:28:00.004671 ->[172.20.92.152:62679|ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal)]: connection deleted
INFO: 2019/06/12 03:28:00.007161 ->[172.20.92.152:26187] connection accepted
INFO: 2019/06/12 03:28:00.007736 ->[172.20.92.152:26187|ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/06/12 03:28:00.007783 overlay_switch ->[ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:28:00.007864 ->[172.20.92.152:6783] attempting connection
INFO: 2019/06/12 03:28:00.007902 ->[172.20.92.152:26187|ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal)]: connection added
INFO: 2019/06/12 03:28:00.009198 ->[172.20.92.152:6783|ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/06/12 03:28:00.009250 overlay_switch ->[ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:28:00.009267 ->[172.20.92.152:26187|ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal)]: connection deleted
INFO: 2019/06/12 03:28:00.009322 ->[172.20.92.152:26187|ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal)]: connection shutting down due to error: Multiple connections to ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal) added to c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:00.009338 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/06/12 03:28:00.009405 ->[172.20.92.152:6783|ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal)]: connection added
INFO: 2019/06/12 03:28:00.010391 ->[172.20.92.152:6783|ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal)]: connection fully established
INFO: 2019/06/12 03:28:00.010584 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/06/12 03:28:00.011732 sleeve ->[172.20.92.152:6783|ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal)]: Effective MTU verified at 8939
Killed
DEBU: 2019/06/12 03:28:05.729401 [kube-peers] Checking peer "c2:95:ed:08:12:ac" against list &{[{12:7a:6e:42:85:13 ip-172-20-55-55.us-west-2.compute.internal} {12:13:13:11:a2:1b ip-172-20-80-164.us-west-2.compute.internal} {92:58:12:01:e1:a2 ip-172-20-108-122.us-west-2.compute.internal} {ba:3b:95:ed:6b:93 ip-172-20-92-152.us-west-2.compute.internal} {c6:fc:7b:d1:51:8c ip-172-20-71-78.us-west-2.compute.internal} {c2:95:ed:08:12:ac ip-172-20-77-63.us-west-2.compute.internal} {b6:e8:a9:0b:a9:6c ip-172-20-58-253.us-west-2.compute.internal} {42:a4:3d:9d:c2:74 ip-172-20-115-28.us-west-2.compute.internal} {96:f9:c0:e3:6c:b5 ip-172-20-59-203.us-west-2.compute.internal} {aa:15:9a:5e:26:2c ip-172-20-93-127.us-west-2.compute.internal} {b2:91:38:70:1c:b9 ip-172-20-48-245.us-west-2.compute.internal} {32:12:a4:e8:c9:8a ip-172-20-107-231.us-west-2.compute.internal} {ee:e4:3b:6c:fc:89 ip-172-20-120-236.us-west-2.compute.internal} {22:6c:27:f1:30:a6 ip-172-20-120-225.us-west-2.compute.internal}]}
INFO: 2019/06/12 03:28:05.812569 weave  2.5.1
INFO: 2019/06/12 03:28:05.812534 Command line options: map[expect-npc:true name:c2:95:ed:08:12:ac nickname:ip-172-20-77-63.us-west-2.compute.internal port:6783 conn-limit:100 docker-api: metrics-addr:0.0.0.0:6782 no-dns:true datapath:datapath host-root:/host http-addr:127.0.0.1:6784 mtu:8912 db-prefix:/weavedb/weave-net ipalloc-init:consensus=14 ipalloc-range:100.96.0.0/11]
INFO: 2019/06/12 03:28:05.988687 Re-exposing 100.122.0.0/11 on bridge "weave"
INFO: 2019/06/12 03:28:06.021622 Communication between peers is unencrypted.
INFO: 2019/06/12 03:28:06.021610 Bridge type is bridged_fastdp
INFO: 2019/06/12 03:28:06.104506 Launch detected - using supplied peer list: [172.20.107.231 172.20.108.122 172.20.115.28 172.20.120.225 172.20.120.236 172.20.48.245 172.20.55.55 172.20.58.253 172.20.59.203 172.20.71.78 172.20.77.63 172.20.80.164 172.20.92.152 172.20.93.127]
INFO: 2019/06/12 03:28:06.104010 Our name is c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:06.114787 weave bridge has address 100.122.0.0/11
INFO: 2019/06/12 03:28:06.114704 Checking for pre-existing addresses on weave bridge
INFO: 2019/06/12 03:28:06.121347 Found address 100.122.0.12/11 for ID _
INFO: 2019/06/12 03:28:06.121494 Found address 100.122.0.3/11 for ID _
INFO: 2019/06/12 03:28:06.120934 Found address 100.122.0.1/11 for ID _
INFO: 2019/06/12 03:28:06.121025 Found address 100.122.0.3/11 for ID _
INFO: 2019/06/12 03:28:06.121436 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.121283 Found address 100.122.0.12/11 for ID _
INFO: 2019/06/12 03:28:06.121958 Found address 100.122.0.10/11 for ID _
INFO: 2019/06/12 03:28:06.121175 Found address 100.122.0.1/11 for ID _
INFO: 2019/06/12 03:28:06.121867 Found address 100.122.0.7/11 for ID _
INFO: 2019/06/12 03:28:06.121715 Found address 100.122.0.4/11 for ID _
INFO: 2019/06/12 03:28:06.122681 Found address 100.122.0.26/11 for ID _
INFO: 2019/06/12 03:28:06.122874 Found address 100.122.0.16/11 for ID _
INFO: 2019/06/12 03:28:06.122205 Found address 100.122.0.4/11 for ID _
INFO: 2019/06/12 03:28:06.122273 Found address 100.122.0.21/11 for ID _
INFO: 2019/06/12 03:28:06.122141 Found address 100.122.0.19/11 for ID _
INFO: 2019/06/12 03:28:06.122935 Found address 100.122.0.19/11 for ID _
INFO: 2019/06/12 03:28:06.122340 Found address 100.122.0.22/11 for ID _
INFO: 2019/06/12 03:28:06.122811 Found address 100.122.0.10/11 for ID _
INFO: 2019/06/12 03:28:06.122408 Found address 100.122.0.23/11 for ID _
INFO: 2019/06/12 03:28:06.122542 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.122612 Found address 100.122.0.24/11 for ID _
INFO: 2019/06/12 03:28:06.122475 Found address 100.122.0.7/11 for ID _
INFO: 2019/06/12 03:28:06.122750 Found address 100.122.0.27/11 for ID _
INFO: 2019/06/12 03:28:06.122068 Found address 100.122.0.16/11 for ID _
INFO: 2019/06/12 03:28:06.123918 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.123407 Found address 100.122.0.21/11 for ID _
INFO: 2019/06/12 03:28:06.123253 Found address 100.122.0.21/11 for ID _
INFO: 2019/06/12 03:28:06.123144 Found address 100.122.0.23/11 for ID _
INFO: 2019/06/12 03:28:06.123701 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.123865 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.123085 Found address 100.122.0.23/11 for ID _
INFO: 2019/06/12 03:28:06.122996 Found address 100.122.0.21/11 for ID _
INFO: 2019/06/12 03:28:06.123527 Found address 100.122.0.22/11 for ID _
INFO: 2019/06/12 03:28:06.123197 Found address 100.122.0.23/11 for ID _
INFO: 2019/06/12 03:28:06.123359 Found address 100.122.0.23/11 for ID _
INFO: 2019/06/12 03:28:06.123309 Found address 100.122.0.21/11 for ID _
INFO: 2019/06/12 03:28:06.123645 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.123474 Found address 100.122.0.22/11 for ID _
INFO: 2019/06/12 03:28:06.123968 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.123581 Found address 100.122.0.22/11 for ID _
INFO: 2019/06/12 03:28:06.123812 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.123757 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.124686 Found address 100.122.0.27/11 for ID _
INFO: 2019/06/12 03:28:06.124125 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.124074 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.124330 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.124222 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.124797 Found address 100.122.0.26/11 for ID _
INFO: 2019/06/12 03:28:06.124280 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.124379 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.124966 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.124748 Found address 100.122.0.26/11 for ID _
INFO: 2019/06/12 03:28:06.124545 Found address 100.122.0.22/11 for ID _
INFO: 2019/06/12 03:28:06.124021 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.124902 Found address 100.122.0.26/11 for ID _
INFO: 2019/06/12 03:28:06.124487 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.124173 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.124611 Found address 100.122.0.24/11 for ID _
INFO: 2019/06/12 03:28:06.124848 Found address 100.122.0.26/11 for ID _
INFO: 2019/06/12 03:28:06.124429 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.125015 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.125717 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.125206 Found address 100.106.0.0/11 for ID _
INFO: 2019/06/12 03:28:06.125273 Found address 100.106.0.1/11 for ID _
INFO: 2019/06/12 03:28:06.125917 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.125482 Found address 100.106.0.1/11 for ID _
INFO: 2019/06/12 03:28:06.125667 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.125432 Found address 100.106.0.1/11 for ID _
INFO: 2019/06/12 03:28:06.125329 Found address 100.106.0.0/11 for ID _
INFO: 2019/06/12 03:28:06.125617 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.125868 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.125560 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.125130 Found address 100.122.0.27/11 for ID _
INFO: 2019/06/12 03:28:06.125076 Found address 100.122.0.6/11 for ID _
INFO: 2019/06/12 03:28:06.125769 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.125376 Found address 100.106.0.0/11 for ID _
INFO: 2019/06/12 03:28:06.125819 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.125964 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.126260 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.126404 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.126211 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.126014 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.126510 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.126065 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.126557 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.126114 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.126308 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.126807 [allocator c2:95:ed:08:12:ac] Initialising with persisted data
INFO: 2019/06/12 03:28:06.126356 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.126463 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.126161 Found address 100.122.0.5/11 for ID _
INFO: 2019/06/12 03:28:06.127684 ->[172.20.108.122:6783] attempting connection
INFO: 2019/06/12 03:28:06.127886 ->[172.20.77.63:6783|c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)]: connection shutting down due to error: cannot connect to ourself
INFO: 2019/06/12 03:28:06.127277 ->[172.20.107.231:6783] attempting connection
INFO: 2019/06/12 03:28:06.127600 ->[172.20.115.28:6783] attempting connection
INFO: 2019/06/12 03:28:06.127568 ->[172.20.120.236:6783] attempting connection
INFO: 2019/06/12 03:28:06.127492 ->[172.20.120.225:6783] attempting connection
INFO: 2019/06/12 03:28:06.127405 ->[172.20.48.245:6783] attempting connection
INFO: 2019/06/12 03:28:06.127813 ->[172.20.77.63:23883|c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)]: connection shutting down due to error: cannot connect to ourself
INFO: 2019/06/12 03:28:06.127332 ->[172.20.80.164:6783] attempting connection
INFO: 2019/06/12 03:28:06.127370 ->[172.20.55.55:6783] attempting connection
INFO: 2019/06/12 03:28:06.127540 ->[172.20.59.203:6783] attempting connection
INFO: 2019/06/12 03:28:06.127429 ->[172.20.58.253:6783] attempting connection
INFO: 2019/06/12 03:28:06.127119 Sniffing traffic on datapath (via ODP)
INFO: 2019/06/12 03:28:06.127711 ->[172.20.71.78:6783] attempting connection
INFO: 2019/06/12 03:28:06.127454 ->[172.20.77.63:23883] connection accepted
INFO: 2019/06/12 03:28:06.127787 ->[172.20.93.127:6783] attempting connection
INFO: 2019/06/12 03:28:06.127652 ->[172.20.92.152:6783] attempting connection
INFO: 2019/06/12 03:28:06.127303 ->[172.20.77.63:6783] attempting connection
INFO: 2019/06/12 03:28:06.128451 ->[172.20.71.78:6783|c6:fc:7b:d1:51:8c(ip-172-20-71-78.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/06/12 03:28:06.128589 ->[172.20.80.164:6783|12:13:13:11:a2:1b(ip-172-20-80-164.us-west-2.compute.internal)]: connection added (new peer)
INFO: 2019/06/12 03:28:06.128575 overlay_switch ->[c6:fc:7b:d1:51:8c(ip-172-20-71-78.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:28:06.128487 ->[172.20.80.164:6783|12:13:13:11:a2:1b(ip-172-20-80-164.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/06/12 03:28:06.128581 overlay_switch ->[aa:15:9a:5e:26:2c(ip-172-20-93-127.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:28:06.128564 overlay_switch ->[12:13:13:11:a2:1b(ip-172-20-80-164.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:28:06.128471 ->[172.20.93.127:6783|aa:15:9a:5e:26:2c(ip-172-20-93-127.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/06/12 03:28:06.129286 ->[172.20.93.127:6783|aa:15:9a:5e:26:2c(ip-172-20-93-127.us-west-2.compute.internal)]: connection added (new peer)
INFO: 2019/06/12 03:28:06.129920 ->[172.20.71.78:6783|c6:fc:7b:d1:51:8c(ip-172-20-71-78.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.106.0.0 v16954: incoming message says owner 3e:15:4e:5d:6e:ee v16959
INFO: 2019/06/12 03:28:06.129657 Listening for HTTP control messages on 127.0.0.1:6784
INFO: 2019/06/12 03:28:06.129156 ->[172.20.92.152:6783|ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/06/12 03:28:06.129091 ->[172.20.71.78:6783|c6:fc:7b:d1:51:8c(ip-172-20-71-78.us-west-2.compute.internal)]: connection added (new peer)
INFO: 2019/06/12 03:28:06.129313 overlay_switch ->[ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:28:06.130485 ->[172.20.59.203:6783|96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/06/12 03:28:06.130395 overlay_switch ->[b6:e8:a9:0b:a9:6c(ip-172-20-58-253.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:28:06.130052 Listening for metrics requests on 0.0.0.0:6782
INFO: 2019/06/12 03:28:06.130520 ->[172.20.92.152:6783|ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal)]: connection added (new peer)
INFO: 2019/06/12 03:28:06.130430 ->[172.20.48.245:6783|b2:91:38:70:1c:b9(ip-172-20-48-245.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/06/12 03:28:06.130908 ->[172.20.80.164:6783|12:13:13:11:a2:1b(ip-172-20-80-164.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.106.0.0 v16954: incoming message says owner 3e:15:4e:5d:6e:ee v16959
INFO: 2019/06/12 03:28:06.130535 overlay_switch ->[96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:28:06.130471 overlay_switch ->[b2:91:38:70:1c:b9(ip-172-20-48-245.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:28:06.130853 ->[172.20.58.253:6783|b6:e8:a9:0b:a9:6c(ip-172-20-58-253.us-west-2.compute.internal)]: connection added (new peer)
INFO: 2019/06/12 03:28:06.130346 ->[172.20.58.253:6783|b6:e8:a9:0b:a9:6c(ip-172-20-58-253.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/06/12 03:28:06.131248 ->[172.20.93.127:6783|aa:15:9a:5e:26:2c(ip-172-20-93-127.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.106.0.0 v16954: incoming message says owner 3e:15:4e:5d:6e:ee v16959
INFO: 2019/06/12 03:28:06.131761 ->[172.20.58.253:6783|b6:e8:a9:0b:a9:6c(ip-172-20-58-253.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.106.0.0 v16954: incoming message says owner 3e:15:4e:5d:6e:ee v16959
INFO: 2019/06/12 03:28:06.131555 overlay_switch ->[ee:e4:3b:6c:fc:89(ip-172-20-120-236.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:28:06.131033 ->[172.20.107.231:6783|32:12:a4:e8:c9:8a(ip-172-20-107-231.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/06/12 03:28:06.131553 overlay_switch ->[aa:15:9a:5e:26:2c(ip-172-20-93-127.us-west-2.compute.internal)] fastdp write tcp4 172.20.77.63:35797->172.20.93.127:6783: use of closed network connection
INFO: 2019/06/12 03:28:06.131159 ->[172.20.55.55:6783|12:7a:6e:42:85:13(ip-172-20-55-55.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/06/12 03:28:06.131922 ->[172.20.115.28:6783|42:a4:3d:9d:c2:74(ip-172-20-115-28.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/06/12 03:28:06.131726 overlay_switch ->[22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:28:06.131649 ->[172.20.120.225:6783|22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/06/12 03:28:06.131193 overlay_switch ->[12:7a:6e:42:85:13(ip-172-20-55-55.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:28:06.131475 ->[172.20.120.236:6783|ee:e4:3b:6c:fc:89(ip-172-20-120-236.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/06/12 03:28:06.131699 ->[172.20.59.203:6783|96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)]: connection added (new peer)
INFO: 2019/06/12 03:28:06.131262 ->[172.20.48.245:6783|b2:91:38:70:1c:b9(ip-172-20-48-245.us-west-2.compute.internal)]: connection added (new peer)
INFO: 2019/06/12 03:28:06.131807 overlay_switch ->[ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal)] fastdp write tcp4 172.20.77.63:14957->172.20.92.152:6783: use of closed network connection
INFO: 2019/06/12 03:28:06.131337 ->[172.20.92.152:6783|ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.106.0.0 v16954: incoming message says owner 3e:15:4e:5d:6e:ee v16959
INFO: 2019/06/12 03:28:06.131597 overlay_switch ->[ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal)] sleeve write tcp4 172.20.77.63:14957->172.20.92.152:6783: use of closed network connection
INFO: 2019/06/12 03:28:06.131069 overlay_switch ->[32:12:a4:e8:c9:8a(ip-172-20-107-231.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:28:06.131575 overlay_switch ->[aa:15:9a:5e:26:2c(ip-172-20-93-127.us-west-2.compute.internal)] using sleeve
INFO: 2019/06/12 03:28:06.132897 ->[172.20.80.164:6783|12:13:13:11:a2:1b(ip-172-20-80-164.us-west-2.compute.internal)]: connection deleted
INFO: 2019/06/12 03:28:06.132473 overlay_switch ->[12:13:13:11:a2:1b(ip-172-20-80-164.us-west-2.compute.internal)] fastdp write tcp4 172.20.77.63:54779->172.20.80.164:6783: use of closed network connection
INFO: 2019/06/12 03:28:06.131976 overlay_switch ->[42:a4:3d:9d:c2:74(ip-172-20-115-28.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:28:06.132814 overlay_switch ->[96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)] using sleeve
INFO: 2019/06/12 03:28:06.132198 ->[172.20.108.122:6783|92:58:12:01:e1:a2(ip-172-20-108-122.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/06/12 03:28:06.132410 overlay_switch ->[c6:fc:7b:d1:51:8c(ip-172-20-71-78.us-west-2.compute.internal)] using sleeve
INFO: 2019/06/12 03:28:06.132406 ->[172.20.107.231:6783|32:12:a4:e8:c9:8a(ip-172-20-107-231.us-west-2.compute.internal)]: connection added (new peer)
INFO: 2019/06/12 03:28:06.132593 ->[172.20.120.236:6783|ee:e4:3b:6c:fc:89(ip-172-20-120-236.us-west-2.compute.internal)]: connection added (new peer)
INFO: 2019/06/12 03:28:06.132487 overlay_switch ->[12:13:13:11:a2:1b(ip-172-20-80-164.us-west-2.compute.internal)] using sleeve
INFO: 2019/06/12 03:28:06.132511 ->[172.20.55.55:6783|12:7a:6e:42:85:13(ip-172-20-55-55.us-west-2.compute.internal)]: connection added (new peer)
INFO: 2019/06/12 03:28:06.132035 ->[172.20.48.245:6783|b2:91:38:70:1c:b9(ip-172-20-48-245.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.106.0.0 v16954: incoming message says owner 3e:15:4e:5d:6e:ee v16959
INFO: 2019/06/12 03:28:06.132650 ->[172.20.120.225:6783|22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)]: connection added (new peer)
INFO: 2019/06/12 03:28:06.132392 overlay_switch ->[c6:fc:7b:d1:51:8c(ip-172-20-71-78.us-west-2.compute.internal)] fastdp write tcp4 172.20.77.63:11035->172.20.71.78:6783: use of closed network connection
INFO: 2019/06/12 03:28:06.132716 overlay_switch ->[96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)] fastdp write tcp4 172.20.77.63:18605->172.20.59.203:6783: use of closed network connection
INFO: 2019/06/12 03:28:06.132020 ->[172.20.71.78:6783|c6:fc:7b:d1:51:8c(ip-172-20-71-78.us-west-2.compute.internal)]: connection deleted
INFO: 2019/06/12 03:28:06.132526 overlay_switch ->[b2:91:38:70:1c:b9(ip-172-20-48-245.us-west-2.compute.internal)] sleeve write tcp4 172.20.77.63:55919->172.20.48.245:6783: use of closed network connection
INFO: 2019/06/12 03:28:06.132355 overlay_switch ->[92:58:12:01:e1:a2(ip-172-20-108-122.us-west-2.compute.internal)] using fastdp
INFO: 2019/06/12 03:28:06.132363 ->[172.20.59.203:6783|96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.106.0.0 v16954: incoming message says owner 3e:15:4e:5d:6e:ee v16959
INFO: 2019/06/12 03:28:06.132818 ->[172.20.115.28:6783|42:a4:3d:9d:c2:74(ip-172-20-115-28.us-west-2.compute.internal)]: connection added (new peer)
INFO: 2019/06/12 03:28:06.133259 ->[172.20.107.231:6783|32:12:a4:e8:c9:8a(ip-172-20-107-231.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.106.0.0 v16954: incoming message says owner 3e:15:4e:5d:6e:ee v16959
INFO: 2019/06/12 03:28:06.133225 overlay_switch ->[96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)] sleeve write tcp4 172.20.77.63:18605->172.20.59.203:6783: use of closed network connection
INFO: 2019/06/12 03:28:06.133756 ->[172.20.92.152:6783|ba:3b:95:ed:6b:93(ip-172-20-92-152.us-west-2.compute.internal)]: connection deleted
INFO: 2019/06/12 03:28:06.133874 ->[172.20.120.236:6783|ee:e4:3b:6c:fc:89(ip-172-20-120-236.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.106.0.0 v16954: incoming message says owner 3e:15:4e:5d:6e:ee v16959
INFO: 2019/06/12 03:28:06.133106 overlay_switch ->[b6:e8:a9:0b:a9:6c(ip-172-20-58-253.us-west-2.compute.internal)] fastdp write tcp4 172.20.77.63:61087->172.20.58.253:6783: use of closed network connection
INFO: 2019/06/12 03:28:06.133124 overlay_switch ->[b6:e8:a9:0b:a9:6c(ip-172-20-58-253.us-west-2.compute.internal)] using sleeve
INFO: 2019/06/12 03:28:06.133608 ->[172.20.93.127:6783|aa:15:9a:5e:26:2c(ip-172-20-93-127.us-west-2.compute.internal)]: connection deleted
INFO: 2019/06/12 03:28:06.134029 ->[172.20.115.28:6783|42:a4:3d:9d:c2:74(ip-172-20-115-28.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.106.0.0 v16954: incoming message says owner 3e:15:4e:5d:6e:ee v16959
INFO: 2019/06/12 03:28:06.134124 overlay_switch ->[32:12:a4:e8:c9:8a(ip-172-20-107-231.us-west-2.compute.internal)] sleeve write tcp4 172.20.77.63:10841->172.20.107.231:6783: use of closed network connection
INFO: 2019/06/12 03:28:06.134314 ->[172.20.120.225:6783|22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.106.0.0 v16954: incoming message says owner 3e:15:4e:5d:6e:ee v16959
INFO: 2019/06/12 03:28:06.134732 overlay_switch ->[22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)] sleeve write tcp4 172.20.77.63:40187->172.20.120.225:6783: use of closed network connection
INFO: 2019/06/12 03:28:06.134268 overlay_switch ->[42:a4:3d:9d:c2:74(ip-172-20-115-28.us-west-2.compute.internal)] sleeve write tcp4 172.20.77.63:14727->172.20.115.28:6783: use of closed network connection
INFO: 2019/06/12 03:28:06.134522 ->[172.20.108.122:6783|92:58:12:01:e1:a2(ip-172-20-108-122.us-west-2.compute.internal)]: connection added (new peer)
INFO: 2019/06/12 03:28:06.134107 overlay_switch ->[32:12:a4:e8:c9:8a(ip-172-20-107-231.us-west-2.compute.internal)] using sleeve
INFO: 2019/06/12 03:28:06.134235 ->[172.20.58.253:6783|b6:e8:a9:0b:a9:6c(ip-172-20-58-253.us-west-2.compute.internal)]: connection deleted
INFO: 2019/06/12 03:28:06.134706 ->[172.20.48.245:6783|b2:91:38:70:1c:b9(ip-172-20-48-245.us-west-2.compute.internal)]: connection deleted
INFO: 2019/06/12 03:28:06.134097 overlay_switch ->[32:12:a4:e8:c9:8a(ip-172-20-107-231.us-west-2.compute.internal)] fastdp write tcp4 172.20.77.63:10841->172.20.107.231:6783: use of closed network connection
INFO: 2019/06/12 03:28:06.135800 ->[172.20.107.231:6783|32:12:a4:e8:c9:8a(ip-172-20-107-231.us-west-2.compute.internal)]: connection deleted
INFO: 2019/06/12 03:28:06.135618 ->[172.20.59.203:6783|96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)]: connection deleted
INFO: 2019/06/12 03:28:06.135767 ->[172.20.108.122:6783|92:58:12:01:e1:a2(ip-172-20-108-122.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.106.0.0 v16954: incoming message says owner 3e:15:4e:5d:6e:ee v16959
INFO: 2019/06/12 03:28:06.135905 ->[172.20.115.28:6783|42:a4:3d:9d:c2:74(ip-172-20-115-28.us-west-2.compute.internal)]: connection deleted
INFO: 2019/06/12 03:28:06.136055 ->[172.20.120.225:6783|22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)]: connection deleted
INFO: 2019/06/12 03:28:06.136178 ->[172.20.108.122:6783|92:58:12:01:e1:a2(ip-172-20-108-122.us-west-2.compute.internal)]: connection deleted
INFO: 2019/06/12 03:28:06.136004 ->[172.20.120.236:6783|ee:e4:3b:6c:fc:89(ip-172-20-120-236.us-west-2.compute.internal)]: connection deleted
INFO: 2019/06/12 03:28:06.139803 Removed unreachable peer 22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:06.139815 Removed unreachable peer aa:15:9a:5e:26:2c(ip-172-20-93-127.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:06.139800 Removed unreachable peer b2:91:38:70:1c:b9(ip-172-20-48-245.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:06.139818 Removed unreachable peer 96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:06.139832 Removed unreachable peer 92:58:12:01:e1:a2(ip-172-20-108-122.us-west-2.compute.internal)

That is a lot of logs, but I wanted to give context from before, during, and after the crash. You can see the `Killed` statement at around line 50.

Here are the logs from all of the other Weave containers shortly after the bad container crashed. You'll see they all start removing the crashed node (ip-172-20-77-63) as an unreachable peer.

Other weave containers

INFO: 2019/06/12 03:28:06.456690 Removed unreachable peer c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:06.455035 Removed unreachable peer c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:06.454723 Removed unreachable peer c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:06.454129 Removed unreachable peer c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:06.454564 Removed unreachable peer c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:06.454133 Removed unreachable peer c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:06.454124 Removed unreachable peer c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:06.453252 Removed unreachable peer c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:06.453326 Removed unreachable peer c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:06.453909 Removed unreachable peer c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:06.452621 Removed unreachable peer c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:06.451732 ->[172.20.77.63:6783|c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)]: connection deleted
INFO: 2019/06/12 03:28:06.451907 Removed unreachable peer c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:06.451554 ->[172.20.77.63:6783|c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)]: connection shutting down due to error: EOF
INFO: 2019/06/12 03:28:06.451576 overlay_switch ->[c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)] sleeve write tcp4 172.20.59.203:27951->172.20.77.63:6783: write: connection reset by peer
INFO: 2019/06/12 03:28:06.451520 ->[172.20.77.63:6783|c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)]: connection fully established
INFO: 2019/06/12 03:28:06.450480 ->[172.20.77.63:6783|c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/06/12 03:28:06.450544 ->[172.20.77.63:6783|c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)]: connection added (new peer)
INFO: 2019/06/12 03:28:06.450926 Removed unreachable peer c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:06.450526 overlay_switch ->[c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)] using fastdp
ERRO: 2019/06/12 03:28:06.448016 Captured frame from MAC (b6:e8:a9:0b:a9:6c) to (42:81:94:d1:a1:f4) associated with another peer b6:e8:a9:0b:a9:6c(ip-172-20-58-253.us-west-2.compute.internal)
ERRO: 2019/06/12 03:28:06.447047 Captured frame from MAC (b6:e8:a9:0b:a9:6c) to (42:81:94:d1:a1:f4) associated with another peer b6:e8:a9:0b:a9:6c(ip-172-20-58-253.us-west-2.compute.internal)
ERRO: 2019/06/12 03:28:06.447281 Captured frame from MAC (b6:e8:a9:0b:a9:6c) to (42:81:94:d1:a1:f4) associated with another peer b6:e8:a9:0b:a9:6c(ip-172-20-58-253.us-west-2.compute.internal)
INFO: 2019/06/12 03:28:06.447845 ->[172.20.77.63:6783] attempting connection
ERRO: 2019/06/12 03:28:06.447519 Captured frame from MAC (b6:e8:a9:0b:a9:6c) to (42:81:94:d1:a1:f4) associated with another peer b6:e8:a9:0b:a9:6c(ip-172-20-58-253.us-west-2.compute.internal)
ERRO: 2019/06/12 03:28:06.447642 Captured frame from MAC (b6:e8:a9:0b:a9:6c) to (42:81:94:d1:a1:f4) associated with another peer b6:e8:a9:0b:a9:6c(ip-172-20-58-253.us-west-2.compute.internal)
ERRO: 2019/06/12 03:28:06.446172 Captured frame from MAC (b6:e8:a9:0b:a9:6c) to (42:81:94:d1:a1:f4) associated with another peer b6:e8:a9:0b:a9:6c(ip-172-20-58-253.us-west-2.compute.internal)
ERRO: 2019/06/12 03:28:06.445530 Captured frame from MAC (b6:e8:a9:0b:a9:6c) to (42:81:94:d1:a1:f4) associated with another peer b6:e8:a9:0b:a9:6c(ip-172-20-58-253.us-west-2.compute.internal)
ERRO: 2019/06/12 03:28:06.444159 Captured frame from MAC (b6:e8:a9:0b:a9:6c) to (42:81:94:d1:a1:f4) associated with another peer b6:e8:a9:0b:a9:6c(ip-172-20-58-253.us-west-2.compute.internal)
ERRO: 2019/06/12 03:28:06.437974 Captured frame from MAC (96:f9:c0:e3:6c:b5) to (42:81:94:d1:a1:f4) associated with another peer 96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)
ERRO: 2019/06/12 03:28:06.437484 Captured frame from MAC (96:f9:c0:e3:6c:b5) to (42:81:94:d1:a1:f4) associated with another peer 96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)
ERRO: 2019/06/12 03:28:06.437294 Captured frame from MAC (96:f9:c0:e3:6c:b5) to (42:81:94:d1:a1:f4) associated with another peer 96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)
ERRO: 2019/06/12 03:28:06.437595 Captured frame from MAC (96:f9:c0:e3:6c:b5) to (42:81:94:d1:a1:f4) associated with another peer 96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)
ERRO: 2019/06/12 03:28:06.437232 Captured frame from MAC (96:f9:c0:e3:6c:b5) to (42:81:94:d1:a1:f4) associated with another peer 96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)
ERRO: 2019/06/12 03:28:06.436090 Captured frame from MAC (96:f9:c0:e3:6c:b5) to (42:81:94:d1:a1:f4) associated with another peer 96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)
ERRO: 2019/06/12 03:28:06.436952 Captured frame from MAC (96:f9:c0:e3:6c:b5) to (42:81:94:d1:a1:f4) associated with another peer 96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)
ERRO: 2019/06/12 03:28:06.434134 Captured frame from MAC (96:f9:c0:e3:6c:b5) to (42:81:94:d1:a1:f4) associated with another peer 96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)
ERRO: 2019/06/12 03:28:06.425822 Captured frame from MAC (8a:97:64:6f:72:78) to (c2:95:ed:08:12:ac) associated with another peer 96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)
ERRO: 2019/06/12 03:28:06.424199 Captured frame from MAC (8a:97:64:6f:72:78) to (c2:95:ed:08:12:ac) associated with another peer 96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)
...

Network:

$ ip route
default via 172.20.64.1 dev eth0 
100.96.0.0/11 dev weave proto kernel scope link src 100.122.0.0 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
172.20.64.0/19 dev eth0 proto kernel scope link src 172.20.77.63

$ ip -4 -o addr
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eth0    inet 172.20.77.63/19 brd 172.20.95.255 scope global eth0\       valid_lft forever preferred_lft forever
3: docker0    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0\       valid_lft forever preferred_lft forever
6: weave    inet 100.122.0.0/11 brd 100.127.255.255 scope global weave\       valid_lft forever preferred_lft forever
@bboreham
Contributor

For the "steady growth" portion, we do not expect that. Would you be able to grab a memory profile next time it does that?

We do expect some management data and buffers per connection, so depending on the number of nodes in your cluster 200MB may be too low.

Note that, as the memory usage gets close to 200MB, Linux will demand-page the weaver binary, which will make things extremely slow. See #3614

This line indicates a particular problem:

INFO: 2019/06/12 03:27:56.374438 ->[172.20.120.225:46433|22:6c:27:f1:30:a6(ip-172-20-120-225.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.106.0.0 v16954: incoming message says owner 3e:15:4e:5d:6e:ee v16959

Weave Net cannot tolerate inconsistent data like this, so connections will be broken continuously. In the current version there is no solution except to delete the bad data and restart weave containers.
The next release should heal itself better (see #3637).
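To gauge how widespread the conflict is, the "Received update for IP range I own" messages can be tallied from saved logs. This is only a minimal sketch in Python, assuming the log format quoted in this issue; the regex and the function name `summarize_ipam_conflicts` are illustrative and not part of Weave Net.

```python
import re
from collections import Counter

# Pattern for the conflict message quoted above.
# Assumption: derived from the log lines in this issue, not from a Weave Net API.
CONFLICT_RE = re.compile(
    r"Received update for IP range I own at (?P<range>\S+) v(?P<mine>\d+): "
    r"incoming message says owner (?P<owner>[0-9a-f:]{17}) v(?P<theirs>\d+)"
)


def summarize_ipam_conflicts(lines):
    """Count conflict messages per (range start, claimed owner) pair."""
    counts = Counter()
    for line in lines:
        match = CONFLICT_RE.search(line)
        if match:
            counts[(match.group("range"), match.group("owner"))] += 1
    return counts


# Two lines copied from the logs above: one conflict, one unrelated message.
sample = [
    "INFO: 2019/06/12 03:28:06.132363 ->[172.20.59.203:6783|96:f9:c0:e3:6c:b5"
    "(ip-172-20-59-203.us-west-2.compute.internal)]: connection shutting down "
    "due to error: Received update for IP range I own at 100.106.0.0 v16954: "
    "incoming message says owner 3e:15:4e:5d:6e:ee v16959",
    "INFO: 2019/06/12 03:28:06.133756 ->[172.20.92.152:6783|ba:3b:95:ed:6b:93"
    "(ip-172-20-92-152.us-west-2.compute.internal)]: connection deleted",
]

for (ip_range, owner), n in summarize_ipam_conflicts(sample).items():
    print(f"{ip_range} claimed by {owner}: {n} message(s)")
```

If every conflict names the same range and owner, as in the logs here, that points at a single piece of inconsistent data rather than many.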

> Here are the logs from all of the other Weave containers

I see one log. Can you clarify what I should see?

@chrisghill
Author

chrisghill commented Jun 14, 2019

@bboreham thanks for your response.

I'll do my best to grab a memory profile, but the challenge will be noticing the growth before the container is killed. Additionally, I've never taken a memory profile of a Docker container. Do you have any suggestions on how to do that? Should I exec into the pod? Is there a specific tool you suggest using?

Our Kubernetes cluster isn't particularly large. At most we have maybe 20-25 nodes, and usually closer to 10-15. I expect Weave should have no scaling issues at that size?

As for the other logs, I apologize. Those are consolidated logs pulled from Elasticsearch for all weave-net containers across the cluster. In the format I posted here, you lose the context that they come from different nodes/containers. I was just showing the rest of the network reacting to the OOM-killed container dying. I was unsure what data would be relevant to the bug, so I figured I'd give you as much as possible, but they might be totally irrelevant.
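The consolidated lines above carry no field naming the container that emitted them, and they are not in time order. A rough Python sketch for at least putting them back into chronological order is below. The line layout is assumed from the lines quoted in this issue, and `parse_weave_line` and `sort_chronologically` are hypothetical helper names.

```python
import re
from datetime import datetime

# Assumed layout, taken from the lines quoted in this issue:
#   LEVEL: YYYY/MM/DD HH:MM:SS.ffffff message
LINE_RE = re.compile(
    r"^(?P<level>[A-Z]{4}): "
    r"(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) "
    r"(?P<msg>.*)$"
)


def parse_weave_line(line):
    """Return (timestamp, level, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    ts = datetime.strptime(match.group("ts"), "%Y/%m/%d %H:%M:%S.%f")
    return ts, match.group("level"), match.group("msg")


def sort_chronologically(lines):
    """Drop unparseable lines (such as '...') and sort the rest by timestamp."""
    parsed = [p for p in map(parse_weave_line, lines) if p is not None]
    return sorted(parsed, key=lambda p: p[0])


# Lines copied from the consolidated dump above, in the order they were posted.
consolidated = [
    "INFO: 2019/06/12 03:28:06.456690 Removed unreachable peer "
    "c2:95:ed:08:12:ac(ip-172-20-77-63.us-west-2.compute.internal)",
    "...",
    "ERRO: 2019/06/12 03:28:06.424199 Captured frame from MAC (8a:97:64:6f:72:78) "
    "to (c2:95:ed:08:12:ac) associated with another peer "
    "96:f9:c0:e3:6c:b5(ip-172-20-59-203.us-west-2.compute.internal)",
]

for ts, level, msg in sort_chronologically(consolidated):
    print(ts.isoformat(), level, msg[:40])
```

Recovering which node logged each line would still need the source field from the log store, which is not present in the text posted here.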

@bboreham
Contributor

10-15 nodes should run fine within 200MB. Above 100 I'd expect issues.

To grab a memory profile, run this on the host: `curl http://127.0.0.1:6784/debug/pprof/heap > weave.mem`
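Since the hard part is catching the growth before the container is killed, one option is to save a profile on a schedule. The following is a sketch only, in Python, using the endpoint from the curl command above. The function name `save_heap_profile`, the file naming, and the `fetch` hook (there so the function can be exercised without a running weave container) are all assumptions for illustration.

```python
import time
import urllib.request
from pathlib import Path

# Endpoint quoted in the curl command above; it must be reachable from
# wherever this runs (the comment above says to run it on the host).
HEAP_URL = "http://127.0.0.1:6784/debug/pprof/heap"


def save_heap_profile(out_dir, url=HEAP_URL, fetch=None):
    """Fetch one heap profile and write it to a timestamped file in out_dir."""
    if fetch is None:
        # Default: plain HTTP GET, equivalent to the curl command.
        def fetch(u):
            with urllib.request.urlopen(u, timeout=30) as resp:
                return resp.read()

    data = fetch(url)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"weave-{time.strftime('%Y%m%dT%H%M%S')}.mem"
    path.write_bytes(data)
    return path


# Example use (not executed here): call once per invocation from cron,
# or from a loop with a sleep, e.g.
#   save_heap_profile("/var/tmp/weave-profiles")
```

Keeping the last several files gives a before-and-after view if the container does get killed. Assuming the endpoint serves a standard Go heap profile, the saved files should be readable with `go tool pprof`.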

The memory growth could be connected with the messages "error: Received update for IP range I own".

@itskingori

This seems related to my experience (see #3659 (comment)). We've increased the limit to 300MB because the weave container operates too close to 200MB. The interesting thing is that the container's memory usage dropped noticeably after increasing the limit 👇

Screenshot 2019-07-15 at 09 16 52

Waiting for another memory spike so that I can profile it.

@itskingori

I think it's important to note that the above profile is from weave 2.5.2, and we've just hit the same issue (without memory usage reaching 300MB), so it's not a memory issue (at least for 2.5.2). We've opted to roll back to 2.5.1 but keep the 300MB limit.

@bboreham
Contributor

We changed a few things in 2.6.0 to avoid OOMs - going to close this.

If you see steady growth again and can get a heap profile please re-open.
