[SCALING] weave-net pod CrashLoopBackOff #3593
Increasing memory requests and limits (from 200Mi in the kops manifest) prevents the pods from crashing. Interestingly, the memory profile shows consumption below 50MB.
It can be hard to catch the heap at its largest. If you set …
Thanks @bboreham for the suggestion. With …
So that says 50MB was the real minimum, and GC allowed the heap to grow to about 100MB before collecting.
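To make that arithmetic concrete: with the default GOGC=100 the Go runtime lets the heap grow to roughly twice the live size left after the previous collection before triggering the next GC, so a ~50MB live heap showing up as a ~100MB heap is expected. A minimal, self-contained sketch (not weave code) that prints the numbers involved:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Allocate roughly 50 MiB of live data to stand in for a long-lived working set.
	live := make([][]byte, 50)
	for i := range live {
		live[i] = make([]byte, 1<<20) // 1 MiB each
	}

	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	// HeapAlloc is roughly what a pprof heap profile accounts for.
	// NextGC is the heap size the runtime will allow before the next collection,
	// approximately live heap * (1 + GOGC/100), i.e. about 2x with the default GOGC=100.
	fmt.Printf("HeapAlloc: %d MiB\n", m.HeapAlloc>>20)
	fmt.Printf("NextGC:    %d MiB\n", m.NextGC>>20)
	fmt.Printf("HeapSys:   %d MiB (includes fragmentation and runtime overhead)\n", m.HeapSys>>20)

	runtime.KeepAlive(live)
}
```

Running it with a lower GC target (e.g. `GOGC=50 go run main.go`) shows NextGC tracking the live heap more tightly.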
Should have mentioned it. This sample was taken when I left the kops weave manifest defaults unmodified:

    resources:
      requests:
        cpu: 50m
        memory: 200Mi
      limits:
        memory: 200Mi
OK, it's plausible that overheads and fragmentation pushed it briefly beyond 200Mi. High memory usage in …
Based on the memory profile of the weave container in our cluster (see #3659 (comment)), we've opted to increase the memory limit to 300MB. We're using kops, by the way, so the default is 200MB. To note, we never got to 100 nodes, but we grew quickly from around 30 to 80 nodes, at which point lots of weave pods got OOM-killed.
@itskingori please share pprof heap output for any nodes where you observed that 200MB was not sufficient for a cluster of fewer than 100 nodes. I have launched 100-node clusters with kops numerous times for scaling tests and never faced any issue with the defaults. Note that this issue was reported for cluster sizes above 150 nodes; beyond 150 nodes the memory requirement increases and 200M is too low, resulting in OOM kills.
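For anyone else trying to supply this data, here is a rough sketch of pulling a heap profile from inside the weave container. It assumes the router exposes Go's standard net/http/pprof handlers; the address 127.0.0.1:6784 and the output name `weave-heap.pprof` below are assumptions for illustration, not something confirmed in this thread:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Assumed endpoint: Go's net/http/pprof heap handler on weave's local HTTP port.
	url := "http://127.0.0.1:6784/debug/pprof/heap"

	resp, err := http.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	out, err := os.Create("weave-heap.pprof")
	if err != nil {
		fmt.Fprintln(os.Stderr, "create failed:", err)
		os.Exit(1)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		os.Exit(1)
	}

	// Inspect the result with: go tool pprof weave-heap.pprof
	fmt.Println("wrote weave-heap.pprof")
}
```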
Unfortunately we haven't scaled again to 80 ... 👇 I ran this on a pod at 144MB right now and got weave-net-7mnzr.mem.zip ...
Noted. Would node size cause some variance, i.e. maybe because we run many pods per node?
@murali-reddy Just nudging you on this. I'm wondering if node type and the number of pods on the node would introduce some variance to your tests? We're running 30GB/60GB nodes and we have quite a number of pods running on them. I woke up to our production cluster at 64 nodes and you can see the memory edging towards 200MB (which is why we increased to 300MB). It didn't get to 80, which 💣-ed us last time (see #3659 (comment)). I think at 80 we'd be at, if not more than, 200MB.
There's no particular reason for it to be sensitive to node size. Also, please state the version of Weave Net in use, which is essential to interpret the profile. Even better, please open a new issue, which will prompt you to enter such information.
@bboreham Noted. That said, I've spotted a rogue weave pod (see …). cc: @murali-reddy The version:

    /home/weave # ./weave --local version
    weave 2.5.1
@bboreham This might be interesting … looking at the logs you can see the same pod has a higher occurrence of these errors … Something is definitely wrong with …
Colleague of @itskingori here. 👋 Thought I'd share this set of logs too, as it's related to many of the other issues that have been popping up. Adding the latest heap again: weave-net-7mnzr.mem.zip
@bboreham We've tried to capture as much information about the state as we can, so that you have as much as possible to go on … Here's the latest memory dump just in case: weave-net-7mnzr.mem.zip. We're up to 216MB …
Thanks @itskingori and @zacblazic for the data. I suspect there is a mismatch between the external view of the process's memory consumption (RSS/resident size) and the Go profiler's view of the heap. Even with the new pprof heap output it is only consuming 38MB.
I am no expert on the Go GC, but setting GOGC to a lower value than the default might help. Need to check whether that helps. Also, going by the memory usage growth of …
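To illustrate the mismatch being described: the kubelet and the OOM killer act on the process's RSS, while a pprof heap profile only accounts for live Go heap objects; GC headroom, runtime metadata, fragmentation and non-heap allocations make up the difference. A small standalone sketch (not weave code) that prints both views, and shows the knob GOGC controls (debug.SetGCPercent(50) is the in-process equivalent of running the container with GOGC=50):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"runtime/debug"
	"strings"
)

// rssKiB reads VmRSS from /proc/self/status (Linux only).
func rssKiB() string {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return "unknown"
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		if strings.HasPrefix(s.Text(), "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(s.Text(), "VmRSS:"))
		}
	}
	return "unknown"
}

func main() {
	// Lowering the GC target from the default 100 makes the runtime collect
	// sooner, trading CPU for a smaller peak heap. Setting the GOGC environment
	// variable on the container has the same effect without a code change.
	debug.SetGCPercent(50)

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("Go heap (what pprof sees):      %d MiB\n", m.HeapAlloc>>20)
	fmt.Printf("Sys (requested from the OS):    %d MiB\n", m.Sys>>20)
	fmt.Printf("RSS (what the OOM killer sees): %s\n", rssKiB())
}
```

Whether lowering GOGC actually helps here would still need to be verified against a real weave pod under load, as noted above.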
Yes indeed. But it gives us time to spot it and remedy it. I discovered this at 201MB, growing slowly, so with the old limit it would have gotten OOM'd already. It also seems that growing memory is a symptom and not the cause, in that when a weave pod is in this state, its memory keeps growing and its CPU also increases just a little bit. Also, I should have pointed out that the peer list seems to have a missing node, i.e. it has 41 entries while the rest have 42. I don't know if that means anything, but I suspect the counts should be the same.
There were many inefficiencies fixed in 2.6, and also a longstanding leak fixed in 2.6.5, so I'll close this.
What you expected to happen?
On an empty cluster with no workload (dataplane) traffic, just the weave-net control-plane traffic, weave-net pods should scale to hundreds and even thousands of nodes.
What happened?
While running scaling tests, one of the symptoms noticed was that some of the weave-net pods go into CrashLoopBackOff. This can happen on any cluster beyond 100 nodes (with CONN_LIMIT set to a larger value than the default of 100).
Some pods were explicitly marked as OOM-killed; a separate issue will be opened for that.
How to reproduce it?
Launch a cluster with more than 100 nodes (with CONN_LIMIT set accordingly).
Anything else we need to know?
Versions:
Logs:
Weave Net logs indicate no error; there is an abrupt "Killed" message. Kubelet logs (and pod describe) do not indicate the reason.
Memory and CPU profiles
mem-profile-150nodes.pdf
cpu-profile-150nodes.pdf