
[SCALING] weave-net pod CrashLoopBackOff #3593

Closed
murali-reddy opened this issue Feb 5, 2019 · 19 comments

@murali-reddy
Contributor

What you expected to happen?

On an empty cluster with no workload (dataplane) traffic, just Weave Net control-plane traffic, weave-net pods should scale to hundreds or even thousands of nodes.

What happened?

While running scaling tests, one symptom noticed was that some of the weave-net pods go into CrashLoopBackOff. This can happen on any cluster beyond 100 nodes (with CONN_LIMIT set to a value larger than the default of 100).

Some pods were explicitly marked as OOM-killed; a separate issue will be opened for that.

How to reproduce it?

Launch a cluster with more than 100 nodes (with CONN_LIMIT set accordingly).
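
For example, on a kops-style install where weave-net runs as a DaemonSet in kube-system, something like this should work (a rough sketch; CONN_LIMIT is read from the container environment, as the kubelet log below confirms, and the container name weave is assumed):

# hypothetical example: raise the connection limit on the weave container
kubectl -n kube-system set env daemonset/weave-net -c weave CONN_LIMIT=500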

Anything else we need to know?

Versions:

$ weave version
2.5.1

Logs:

Weave Net logs indicate no error; there is an abrupt "Killed" message:

INFO: 2019/02/05 11:56:11.355798 ->[172.20.63.172:6783|2a:12:2e:32:dc:3d(ip-172-20-63-172.us-west-2.compute.internal)]: connection added (new peer)
DEBU: 2019/02/05 11:56:11.355898 InvalidateRoutes
INFO: 2019/02/05 11:56:11.356176 ->[172.20.52.215:6783|42:69:b4:33:e1:31(ip-172-20-52-215.us-west-2.compute.internal)]: connection added (new peer)
DEBU: 2019/02/05 11:56:11.356276 InvalidateRoutes
INFO: 2019/02/05 11:56:11.356585 ->[172.20.52.143:6783|ae:d4:86:29:75:55(ip-172-20-52-143.us-west-2.compute.internal)]: connection added (new peer)
DEBU: 2019/02/05 11:56:11.356681 InvalidateRoutes
INFO: 2019/02/05 11:56:11.357009 ->[172.20.63.1:6783|76:1e:52:bf:fc:72(ip-172-20-63-1.us-west-2.compute.internal)]: connection added (new peer)
DEBU: 2019/02/05 11:56:11.357114 InvalidateRoutes
INFO: 2019/02/05 11:56:11.357449 ->[172.20.34.183:6783|32:d1:37:f3:ac:da(ip-172-20-34-183.us-west-2.compute.internal)]: connection added (new peer)
DEBU: 2019/02/05 11:56:11.357552 InvalidateRoutes
INFO: 2019/02/05 11:56:11.357918 ->[172.20.59.216:6783|b6:63:b7:7b:6c:f8(ip-172-20-59-216.us-west-2.compute.internal)]: connection added (new peer)
DEBU: 2019/02/05 11:56:11.358013 InvalidateRoutes
INFO: 2019/02/05 11:56:11.358412 ->[172.20.68.73:6783|42:11:98:23:d0:0f(ip-172-20-68-73.us-west-2.compute.internal)]: connection added (new peer)
DEBU: 2019/02/05 11:56:11.358510 InvalidateRoutes
INFO: 2019/02/05 11:56:11.358952 ->[172.20.48.161:6783|fe:40:8b:5c:ee:a5(ip-172-20-48-161.us-west-2.compute.internal)]: connection added (new peer)
DEBU: 2019/02/05 11:56:11.359053 InvalidateRoutes
INFO: 2019/02/05 11:56:11.359548 ->[172.20.86.83:6783|a2:db:47:21:2f:8e(ip-172-20-86-83.us-west-2.compute.internal)]: connection added (new peer)
DEBU: 2019/02/05 11:56:11.359655 InvalidateRoutes
INFO: 2019/02/05 11:56:11.360132 ->[172.20.33.98:59573|1a:5d:81:03:16:cf(ip-172-20-33-98.us-west-2.compute.internal)]: connection added (new peer)
DEBU: 2019/02/05 11:56:11.360237 InvalidateRoutes
DEBU: 2019/02/05 11:56:11.365999 sleeve ->[<nil>|b2:65:d6:64:fd:1e(ip-172-20-49-60.us-west-2.compute.internal)]: handleHeartbeat
DEBU: 2019/02/05 11:56:11.366069 sleeve ->[172.20.49.60:6783|b2:65:d6:64:fd:1e(ip-172-20-49-60.us-west-2.compute.internal)]: sendHeartbeat
DEBU: 2019/02/05 11:56:11.376753 sleeve ->[172.20.61.199:6783|66:7a:03:79:2a:bb(ip-172-20-61-199.us-west-2.compute.internal)]: handleHeartbeat
DEBU: 2019/02/05 11:56:11.386788 sleeve ->[172.20.45.58:6783|b2:f4:b4:79:61:1d(ip-172-20-45-58.us-west-2.compute.internal)]: handleHeartbeat
DEBU: 2019/02/05 11:56:11.390747 sleeve ->[172.20.78.159:6783|6e:45:c6:53:fb:b7(ip-172-20-78-159.us-west-2.compute.internal)]: handleHeartbeat
DEBU: 2019/02/05 11:56:11.398153 fastdp ->[172.20.50.203:6784|32:d3:1f:07:17:c3(ip-172-20-50-203.us-west-2.compute.internal)]: sending Heartbeat to peer
DEBU: 2019/02/05 11:56:11.403672 sleeve ->[172.20.93.143:6783|96:5f:2d:b6:91:8b(ip-172-20-93-143.us-west-2.compute.internal)]: handleHeartbeat
DEBU: 2019/02/05 11:56:11.403759 sleeve ->[172.20.93.20:6783|ce:39:f3:49:e2:34(ip-172-20-93-20.us-west-2.compute.internal)]: sendHeartbeat
DEBU: 2019/02/05 11:56:11.403827 sleeve ->[172.20.40.201:6783|0a:f9:a1:7a:2b:14(ip-172-20-40-201.us-west-2.compute.internal)]: handleHeartbeat
DEBU: 2019/02/05 11:56:11.403962 sleeve ->[172.20.57.74:6783|36:29:96:a5:2c:60(ip-172-20-57-74.us-west-2.compute.internal)]: sendHeartbeat
DEBU: 2019/02/05 11:56:11.404036 sleeve ->[172.20.45.80:6783|aa:d9:60:09:e7:57(ip-172-20-45-80.us-west-2.compute.internal)]: sendHeartbeat
DEBU: 2019/02/05 11:56:11.404091 sleeve ->[172.20.52.177:6783|b6:69:7a:01:26:23(ip-172-20-52-177.us-west-2.compute.internal)]: handleHeartbeat
DEBU: 2019/02/05 11:56:11.404470 sleeve ->[172.20.40.46:6783|fa:9f:91:cc:0a:93(ip-172-20-40-46.us-west-2.compute.internal)]: sendHeartbeat
DEBU: 2019/02/05 11:56:11.404789 fastdp ->[172.20.78.105:6784|ea:0e:47:ec:86:1e(ip-172-20-78-105.us-west-2.compute.internal)]: sending Heartbeat to peer
DEBU: 2019/02/05 11:56:11.404891 sleeve ->[172.20.78.105:6783|ea:0e:47:ec:86:1e(ip-172-20-78-105.us-west-2.compute.internal)]: sendHeartbeat
DEBU: 2019/02/05 11:56:11.404943 fastdp ->[172.20.50.54:6784|a2:28:38:27:df:3d(ip-172-20-50-54.us-west-2.compute.internal)]: sending Heartbeat to peer
DEBU: 2019/02/05 11:56:11.405069 sleeve ->[172.20.50.54:6783|a2:28:38:27:df:3d(ip-172-20-50-54.us-west-2.compute.internal)]: sendHeartbeat
DEBU: 2019/02/05 11:56:11.405147 fastdp ->[172.20.42.123:6784|2a:f5:97:de:ad:44(ip-172-20-42-123.us-west-2.compute.internal)]: sending Heartbeat to peer
DEBU: 2019/02/05 11:56:11.405220 sleeve ->[172.20.42.123:6783|2a:f5:97:de:ad:44(ip-172-20-42-123.us-west-2.compute.internal)]: sendHeartbeat
DEBU: 2019/02/05 11:56:11.405337 sleeve ->[172.20.50.203:6783|32:d3:1f:07:17:c3(ip-172-20-50-203.us-west-2.compute.internal)]: sendHeartbeat
DEBU: 2019/02/05 11:56:11.405404 fastdp ->[172.20.57.135:6784|8a:84:6d:61:9b:d8(ip-172-20-57-135.us-west-2.compute.internal)]: sending Heartbeat to peer
DEBU: 2019/02/05 11:56:11.405482 fastdp ->[172.20.50.127:6784|0e:28:03:89:b6:aa(ip-172-20-50-127.us-west-2.compute.internal)]: sending Heartbeat to peer
DEBU: 2019/02/05 11:56:11.405614 fastdp ->[172.20.43.159:6784|8a:f2:98:b9:02:c0(ip-172-20-43-159.us-west-2.compute.internal)]: sending Heartbeat to peer
DEBU: 2019/02/05 11:56:11.405707 fastdp ->[172.20.40.46:6784|fa:9f:91:cc:0a:93(ip-172-20-40-46.us-west-2.compute.internal)]: sending Heartbeat to peer
Killed

Kubelet logs (and pod describe) do not indicate the reason:

Feb 05 11:30:19 ip-172-20-95-162 kubelet[1211]: I0205 11:30:19.568462    1211 kube_docker_client.go:348] Stop pulling image "muralireddy/weave-kube:profiling": "Status: Image is up to date for muralireddy/weave-kube:profiling"
Feb 05 11:30:20 ip-172-20-95-162 kubelet[1211]: I0205 11:30:20.248610    1211 kubelet.go:1910] SyncLoop (PLEG): "weave-net-pxmdp_kube-system(5bff3562-2939-11e9-9890-061daf14c9b6)", event: &pleg.PodLifecycleEvent{ID:"5bff3562-2939-11e9-9890-061daf14c9b6", Type:"ContainerStarted", Data:"4721f3e9fb908d26fd524576a63a45528a90fe3ec95fad4861959513312e793a"}
Feb 05 11:30:26 ip-172-20-95-162 kubelet[1211]: I0205 11:30:26.273587    1211 kubelet.go:1910] SyncLoop (PLEG): "weave-net-pxmdp_kube-system(5bff3562-2939-11e9-9890-061daf14c9b6)", event: &pleg.PodLifecycleEvent{ID:"5bff3562-2939-11e9-9890-061daf14c9b6", Type:"ContainerDied", Data:"4721f3e9fb908d26fd524576a63a45528a90fe3ec95fad4861959513312e793a"}
Feb 05 11:30:26 ip-172-20-95-162 kubelet[1211]: I0205 11:30:26.574396    1211 kuberuntime_manager.go:513] Container {Name:weave Image:muralireddy/weave-kube:profiling Command:[/home/weave/launch.sh] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[{Name:HOSTNAME Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:spec.nodeName,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {Name:IPALLOC_RANGE Value:100.96.0.0/11 ValueFrom:nil} {Name:WEAVE_MTU Value:8912 ValueFrom:nil} {Name:CONN_LIMIT Value:500 ValueFrom:nil}] Resources:{Limits:map[memory:{i:{value:209715200 scale:0} d:{Dec:<nil>} s: Format:BinarySI}] Requests:map[cpu:{i:{value:50 scale:-3} d:{Dec:<nil>} s:50m Format:DecimalSI} memory:{i:{value:209715200 scale:0} d:{Dec:<nil>} s: Format:BinarySI}]} VolumeMounts:[{Name:weavedb ReadOnly:false MountPath:/weavedb SubPath: MountPropagation:<nil>} {Name:cni-bin ReadOnly:false MountPath:/host/opt SubPath: MountPropagation:<nil>} {Name:cni-bin2 ReadOnly:false MountPath:/host/home SubPath: MountPropagation:<nil>} {Name:cni-conf ReadOnly:false MountPath:/host/etc SubPath: MountPropagation:<nil>} {Name:dbus ReadOnly:false MountPath:/host/var/lib/dbus SubPath: MountPropagation:<nil>} {Name:lib-modules ReadOnly:false MountPath:/lib/modules SubPath: MountPropagation:<nil>} {Name:xtables-lock ReadOnly:false MountPath:/run/xtables.lock SubPath: MountPropagation:<nil>} {Name:weave-net-token-d2nn7 ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/status,Port:6784,Host:127.0.0.1,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:30,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:Always SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,Allo
Feb 05 11:30:26 ip-172-20-95-162 kubelet[1211]: I0205 11:30:26.574567    1211 kuberuntime_manager.go:757] checking backoff for container "weave" in pod "weave-net-pxmdp_kube-system(5bff3562-2939-11e9-9890-061daf14c9b6)"
Feb 05 11:30:26 ip-172-20-95-162 kubelet[1211]: I0205 11:30:26.574685    1211 kuberuntime_manager.go:767] Back-off 10s restarting failed container=weave pod=weave-net-pxmdp_kube-system(5bff3562-2939-11e9-9890-061daf14c9b6)
Feb 05 11:30:26 ip-172-20-95-162 kubelet[1211]: E0205 11:30:26.574746    1211 pod_workers.go:186] Error syncing pod 5bff3562-2939-11e9-9890-061daf14c9b6 ("weave-net-pxmdp_kube-system(5bff3562-2939-11e9-9890-061daf14c9b6)"), skipping: failed to "StartContainer" for "weave" with CrashLoopBackOff: "Back-off 10s restarting failed container=weave pod=weave-net-pxmdp_kube-system(5bff3562-2939-11e9-9890-061daf14c9b6)"

Memory and CPU profiles

mem-profile-150nodes.pdf
cpu-profile-150nodes.pdf

@murali-reddy murali-reddy self-assigned this Feb 5, 2019
@murali-reddy
Contributor Author

Increasing memory requests and limits (from 200Mi in the kops manifest) prevents the pods from crashing. Interestingly, the memory profile shows consumption below 50MB:

   16.49MB 30.82% 30.82%    16.49MB 30.82%  github.com/weaveworks/weave/router.newSleeveCrypto
   10.64MB 19.88% 50.70%    10.64MB 19.88%  github.com/weaveworks/weave/vendor/github.com/google/gopacket.(*serializeBuffer).PrependBytes
    7.50MB 14.02% 64.72%     7.50MB 14.02%  fmt.Sprintf
    6.50MB 12.15% 76.87%     6.50MB 12.15%  github.com/weaveworks/weave/vendor/github.com/google/gopacket/layers.errorFunc
    4.52MB  8.45% 85.31%     4.52MB  8.45%  github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.makeConnsMap
    1.16MB  2.16% 87.48%     1.16MB  2.16%  github.com/weaveworks/weave/vendor/github.com/weaveworks/common/signals.(*Handler).Loop
       1MB  1.87% 89.34%    17.49MB 32.69%  github.com/weaveworks/weave/router.(*SleeveOverlay).PrepareConnection
       1MB  1.87% 91.21%        1MB  1.87%  encoding/gob.decString
    0.66MB  1.23% 92.44%     0.66MB  1.23%  github.com/weaveworks/weave/vendor/github.com/modern-go/reflect2.loadGo17Types
    0.53MB  0.99% 93.44%     0.53MB  0.99%  github.com/weaveworks/weave/vendor/github.com/weaveworks/go-odp/odp.OpenNetlinkSocket

@bboreham
Contributor

bboreham commented Feb 5, 2019

shows consumption below 50MB

It can be hard to catch the heap at its largest. If you set GODEBUG=gctrace=1 in the environment you should get extra lines in the log which show the heap size each time a full GC ran.
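
Something like this should do it on a kops-managed cluster (an untested sketch, assuming the weave-net DaemonSet in kube-system, a container named weave, and that the router process inherits the container environment; the gctrace lines go to stderr, so they show up in the pod logs):

# hypothetical example: enable GC trace output on the weave container
kubectl -n kube-system set env daemonset/weave-net -c weave GODEBUG=gctrace=1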

@murali-reddy
Contributor Author

Thanks @bboreham for the suggestion. With GODEBUG=gctrace=1 I captured the GC activity below when the cluster was at 150 nodes.

gc 208 @532.209s 0%: 0.014+43+0.069 ms clock, 0.029+1.8/23/15+0.13 ms cpu, 103->104->52 MB, 106 MB goal, 2 P
gc 209 @532.968s 0%: 0.008+36+0.068 ms clock, 0.017+2.6/16/17+0.13 ms cpu, 102->102->51 MB, 104 MB goal, 2 P
gc 210 @533.712s 0%: 0.033+31+0.074 ms clock, 0.067+27/17/0+0.14 ms cpu, 101->103->54 MB, 103 MB goal, 2 P
gc 211 @534.084s 0%: 0.009+35+0.075 ms clock, 0.019+2.1/14/25+0.15 ms cpu, 104->104->52 MB, 108 MB goal, 2 P
gc 212 @534.744s 0%: 0.020+30+0.075 ms clock, 0.041+2.9/16/21+0.15 ms cpu, 101->102->52 MB, 104 MB goal, 2 P
gc 213 @534.955s 0%: 0.022+55+0.15 ms clock, 0.044+7.8/29/6.5+0.30 ms cpu, 102->102->52 MB, 104 MB goal, 2 P
gc 214 @535.646s 0%: 0.016+41+0.092 ms clock, 0.032+11/21/10+0.18 ms cpu, 103->104->53 MB, 105 MB goal, 2 P
gc 215 @536.065s 0%: 0.039+35+0.078 ms clock, 0.078+2.1/18/18+0.15 ms cpu, 104->105->52 MB, 107 MB goal, 2 P
gc 216 @536.967s 0%: 0.012+33+0.079 ms clock, 0.024+2.4/17/19+0.15 ms cpu, 103->103->52 MB, 105 MB goal, 2 P
gc 217 @537.788s 0%: 0.029+36+0.052 ms clock, 0.058+3.0/21/16+0.10 ms cpu, 103->103->52 MB, 105 MB goal, 2 P
gc 52 @531.320s 0%: 0.007+3.7+0.018 ms clock, 0.014+0.44/0.46/4.7+0.037 ms cpu, 6->6->3 MB, 7 MB goal, 2 P
gc 218 @538.540s 0%: 0.010+28+0.080 ms clock, 0.020+8.9/13/19+0.16 ms cpu, 103->104->54 MB, 105 MB goal, 2 P
scvg0: inuse: 68, idle: 68, sys: 137, released: 0, consumed: 137 (MB)
scvg0: inuse: 4, idle: 1, sys: 6, released: 0, consumed: 6 (MB)
scvg1: inuse: 76, idle: 59, sys: 136, released: 0, consumed: 136 (MB)
scvg1: inuse: 5, idle: 1, sys: 6, released: 0, consumed: 6 (MB)
scvg2: 4 MB released
scvg2: inuse: 79, idle: 56, sys: 135, released: 4, consumed: 130 (MB)
scvg2: inuse: 7, idle: 0, sys: 8, released: 0, consumed: 8 (MB)

@bboreham
Contributor

bboreham commented Feb 6, 2019

So that says 50MB was the real minimum and GC allowed the heap to grow to about 100MB before collecting.
I think you said offline the container limit was 400MB, so we have a mystery why it should OOM at 100.

@murali-reddy
Contributor Author

I should have mentioned it. This sample was taken when I left the kops weave manifest defaults unmodified:

          resources:
            requests:
              cpu: 50m
              memory: 200Mi
            limits:
              memory: 200Mi

@bboreham
Contributor

bboreham commented Feb 6, 2019

OK, it's plausible that overheads and fragmentation pushed it briefly beyond 200Mi.
We can document that 200Mi is too low for a 150-node cluster, maybe file an issue on the Kops repo to add to their docs (or parameterise that number).

High memory usage in newSleeveCrypto() seems to derive from buffers like this - the code is optimised to avoid allocations, whereas it would be cheaper in memory if we only allocated what we needed. Maybe we can deallocate that structure once we have set up a fastdp flow?

gopacket.PrependBytes seems likely to be involved in actually writing out packets; again we should not need it when fastdp is in operation.

@itskingori

Increasing memory requests and limits (from 200Mi in the kops manifest) prevents the pods from crashing. Interestingly, the memory profile shows consumption below 50MB

Based on the memory profile of the weave container in our cluster (see #3659 (comment)), we've opted to increase the memory limit to 300MB. We're using kops, by the way, so the default is 200MB. To note, we never got to 100 nodes, but we grew quickly from around 30 to 80, at which point lots of weave pods got OOM'd.
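
For anyone wanting to do the same without editing the manifest by hand, a rough sketch (assuming the kops-installed weave-net DaemonSet and a container named weave; kops may reconcile managed addons back to their defaults, so changing this through the cluster spec, if your kops version supports it, is probably the more durable route):

# hypothetical example: raise the weave container's memory request and limit
kubectl -n kube-system set resources daemonset/weave-net -c weave --requests=memory=300Mi --limits=memory=300Mi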

@murali-reddy
Contributor Author

@itskingori please share the pprof heap output for any nodes where you observed that 200MB was not sufficient for a cluster of < 100 nodes. I launched clusters with 100 nodes with kops numerous times for scaling tests and never faced any issue with the defaults.

Note that this issue is reported for cluster sizes > 150; beyond 150 nodes the memory requirement increases, and 200M is too low, resulting in OOM.

@itskingori

... please share the pprof heap output for any nodes where you observed that 200MB was not sufficient for a cluster of < 100 nodes. I launched clusters with 100 nodes with kops numerous times for scaling tests and never faced any issue with the defaults.

Unfortunately we haven't scaled again to 80 ... 👇

[screenshot taken 2019-07-15 16:42:56]

I ran this on a pod at 144MB right now and got weave-net-7mnzr.mem.zip ...

kubectl exec -n kube-system weave-net-7mnzr -- curl http://127.0.0.1:6784/debug/pprof/heap > weave-net-7mnzr.mem
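
To read the dump locally, something along these lines should work (a sketch; recent Go toolchains can symbolise a heap profile without needing the weave binary alongside it):

# print the largest in-use allocations from the captured heap profile
go tool pprof -top weave-net-7mnzr.mem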

@itskingori

Note that this issue is reported for cluster sizes > 150; beyond 150 nodes the memory requirement increases, and 200M is too low, resulting in OOM

Noted. Would node size cause some variance, i.e. maybe because we run many pods per node?

@itskingori

Note that this issue is reported for cluster sizes > 150; beyond 150 nodes the memory requirement increases, and 200M is too low, resulting in OOM

Noted. Would node size cause some variance, i.e. maybe because we run many pods per node?

@murali-reddy Just nudging you on this. I'm wondering if node-type and number of pods on the node would introduce some variance to your tests? We're running 30GB/60GB nodes and we have quite a number of pods running on them.

I woke up to our production cluster at 64 nodes, and you can see the memory edging towards 200MB (which is why we increased to 300MB). It didn't get to 80, which 💣-ed us last time (see #3659 (comment)). I think at 80 we'd be at, if not more than, 200MB.

[screenshot taken 2019-07-17 10:18:10]

@bboreham
Contributor

There's no particular reason for it to be sensitive to node size.
If you could grab another heap profile this might help to advance our knowledge.
The previous profile you attached in this issue was using 29MB of Go heap (so the 144MB figure quoted is unexplained - could be that the memory use had shrunk before you took the profile).

Also, please state the version of Weave Net in use, which is essential to interpret the profile. Even better, please open a new issue, which will prompt you to enter such information.

@itskingori

itskingori commented Jul 17, 2019

@bboreham Noted. That said, I've spotted a rogue weave pod (see weave-net-7mnzr below). You can see it's hit 200MB with just 38 nodes. Got the heap: weave-net-7mnzr.mem.zip.

cc: @murali-reddy

[screenshot taken 2019-07-17 16:16:08]

[screenshot taken 2019-07-17 16:19:58]

The version:

/home/weave # ./weave --local version
weave 2.5.1

@itskingori

@bboreham This might be interesting ... looking at the logs you can see the same pod has a higher occurrence of these errors ...

[screenshot taken 2019-07-17 16:46:00]

Something definitely is wrong with weave-net-7mnzr.

@zacblazic

zacblazic commented Jul 17, 2019

Colleague of @itskingori here. 👋 Thought I'd share this set of logs too, as it's related to many of the other issues that have been popping up:

[screenshot: log excerpt]

Adding the latest heap again: weave-net-7mnzr.mem.zip

@itskingori

itskingori commented Jul 17, 2019

@bboreham We've tried to capture as much information about the state as we can, so that you have as much as possible to go on ...

Here's the latest memory dump, just in case: weave-net-7mnzr.mem.zip. We're up to 216MB ...

[screenshot taken 2019-07-17 17:25:03]

@murali-reddy
Contributor Author

Thanks @itskingori and @zacblazic for the data.

I suspect there is a mismatch between the external view of the process's memory consumption (RSS/resident size) and the Go profiler's view of the heap. Even with the new pprof heap output, it is only consuming 38MB.

Type: inuse_space
Time: Jul 17, 2019 at 7:53pm (IST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5
Showing nodes accounting for 27.11MB, 70.53% of 38.43MB total
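
One way to compare the profiler's view with the container's RSS might be the text form of the same endpoint, which appends a runtime.MemStats dump (Sys, HeapAlloc, HeapIdle, HeapReleased and friends). A sketch, assuming the standard net/http/pprof handler is what the earlier curl was hitting on port 6784:

# dump the runtime.MemStats lines alongside the heap profile
kubectl -n kube-system exec weave-net-7mnzr -- curl -s 'http://127.0.0.1:6784/debug/pprof/heap?debug=1' | grep '^#'

Roughly, the difference between Sys and HeapAlloc there (idle spans, fragmentation, runtime structures) is the part of the process footprint that the inuse_space view above does not account for.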

I am no expert on the Go GC, but setting GOGC to a value lower than the default might help. Need to check if that helps.
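
A sketch of how one might try that (same DaemonSet assumptions as above; GOGC=50 makes the runtime collect once the heap grows 50% over the live set instead of the default 100%, trading some extra GC CPU for a lower steady-state heap):

# hypothetical example: make the Go GC more aggressive for the weave container
kubectl -n kube-system set env daemonset/weave-net -c weave GOGC=50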

Also, going by the memory usage growth of weave-net-7mnzr, even an increased memory limit would eventually be hit. Since the connection retry and shutdown resulting from IPAM conflicts is not recoverable, perhaps there is no point retrying.

@itskingori

Also, going by the memory usage growth of weave-net-7mnzr, even an increased memory limit would eventually be hit.

Yes indeed. But it gives us time to spot it and remedy it. I discovered this at 201MB and growing slowly, so with the old limit it would have been OOM'd already.

It also seems that growing memory is a symptom and not the cause, in that when a weave pod is in this state, its memory keeps growing and its CPU also increases just a little bit.

Also, to note, I should have pointed out that the peer list seems to have a missing node, i.e. it has 41 entries while the rest have 42. I don't know if that means anything, but I suspect the counts should be the same.

@bboreham
Contributor

bboreham commented Jul 7, 2020

There were many inefficiencies fixed in 2.6, and also a longstanding leak fixed in 2.6.5, so I'll close this.

@bboreham bboreham closed this as completed Jul 7, 2020