What you expected to happen?
We expected the Weave overlay network inside a 160+ node Kubernetes cluster to remain stable and not suddenly start dropping packets, even if 30-40 Kubernetes nodes, and hence the Weave pods running on them, are under resource constraints. Starved Weave pods on some nodes shouldn't impact the whole cluster.
What happened?
On a 167-node cluster, Weave suddenly gave in. On most of the nodes, we observed more than 160 pending peer connections. Pods constantly reported a high number of failed connections, and the number of terminated connections shot up. CPU usage spiked to 5-10x the usual level within a minute, and memory usage increased by ~10-20%. This caused the AWS ELBs to mark all the nodes as unhealthy, thereby blocking all ingress traffic to the cluster. On further debugging, we found that the health-check TCP SYN packets from the ELBs arriving on the underlay network could not be routed onto the overlay network because of a high number of unanswered ARP requests.
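For context, this is roughly how we checked the connection and ARP state on an affected node; the pod name is a placeholder, and the in-container path to the weave script may differ by version:

```
# Count pending/failed peer connections as seen by the weave router on one node
# (pick a pod with: kubectl -n kube-system get pods -l name=weave-net -o wide)
kubectl -n kube-system exec weave-net-xxxxx -c weave -- \
  /home/weave/weave --local status connections | grep -c pending
kubectl -n kube-system exec weave-net-xxxxx -c weave -- \
  /home/weave/weave --local status connections | grep -c failed

# On the node itself, look for unanswered ARP requests (incomplete neighbour entries)
ip neigh show | grep -c INCOMPLETE
```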
Weave metric graphs during this exact event:
There was no sign of high resource usage for the Weave pods just before the incident. Here are the resource graphs for two randomly selected Weave pods when the incident happened:
Note: at the time of the incident, Weave had a memory limit of 520Mi, a CPU request of 50m, and no CPU limit.
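A quick way to confirm what the DaemonSet had configured (a sketch, assuming the standard weave-net DaemonSet in kube-system with a container named weave):

```
kubectl -n kube-system get daemonset weave-net \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="weave")].resources}'
# At the time of the incident this showed roughly:
#   {"limits":{"memory":"520Mi"},"requests":{"cpu":"50m"}}
```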
What we did to resolve it and our conclusions:
1) Updated Weave to 2.5.1.
2) Increased CONN_LIMIT to 600.
3) Increased the Weave DaemonSet's CPU and memory requests.
This triggered a rolling update of the DaemonSet, which seemed to help. At that point, because of the increased requests, Weave could not come back up on 12 Kubernetes nodes that had insufficient resources. All the rest of the Weave pods seemed to have calmed down, but they still had a slightly elevated error rate. Once we got the 12 pods back up, Weave became happy and settled down.
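Sketched below as equivalent kubectl commands (a rough sketch, not the literal commands we ran; CONN_LIMIT is the env var the weave-kube DaemonSet reads for the peer connection limit, as far as we understand, and the new request values are illustrative):

```
# 1) Upgrade the Weave images to 2.5.1 (this rolls the DaemonSet)
kubectl -n kube-system set image daemonset/weave-net \
  weave=weaveworks/weave-kube:2.5.1 weave-npc=weaveworks/weave-npc:2.5.1

# 2) Raise the peer connection limit
kubectl -n kube-system set env daemonset/weave-net -c weave CONN_LIMIT=600

# 3) Raise CPU/memory requests for the weave container
#    (values here are illustrative, not the exact numbers we used)
kubectl -n kube-system set resources daemonset/weave-net -c weave \
  --requests=cpu=200m,memory=400Mi
```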
We do not think the Weave upgrade or increasing CONN_LIMIT helped us.
How to reproduce it?
We didn't see any particular metric increasing in the lead-up to this incident, so we are not sure how it happened or how it can be reproduced. Since increasing the resources allocated to the Weave DaemonSet helped with the situation, the issue might be triggered on a 160+ node cluster where the Weave pods on roughly a quarter of the nodes are under resource constraints.
As part of the immediate recovery steps in the original incident, we intentionally didn't let Weave schedule on 12 nodes.
Later (after 2-3 days), to restore Weave completely, we evicted other applications and let the 12 Weave pods join the mesh. Going from 155 to 167 Weave pods triggered application errors: applications started reporting connectivity errors to services within the cluster.
To confirm it was not a coincidence, we bounced the 12 Weave pods to try to recreate that blip. This impacted the whole overlay mesh, and the original issue was reproduced, even though Weave now had a good amount of CPU/memory allocated to it. At this point, increasing resources didn't help. To recover:
1) We brought the cluster size back down to 155. This didn't solve the problem completely; many nodes still had connectivity issues.
2) We had to clear out the Weave DB on around 13-14 unreachable/misbehaving Weave pods and restart them; a simple restart didn't work (see the sketch after this list).
These two steps helped us restore the health of the cluster.
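For reference, a sketch of the DB-clearing step for one misbehaving node (assuming the standard hostPath mount at /var/lib/weave; the DB file name may differ by Weave version):

```
# On the affected node: remove Weave's persisted peer/IPAM state
sudo rm /var/lib/weave/weave-netdata.db

# Then delete the weave pod on that node so the DaemonSet recreates it with a clean DB
kubectl -n kube-system delete pod -l name=weave-net \
  --field-selector spec.nodeName=<node-name>
```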
The steady memory growth part of this issue matches #3807, which is fixed now.
2.6.0 had several fixes to make it much better at handling nodes joining/leaving repeatedly.