Pods sometimes have no network connectivity for up to two minutes on startup #9706
Comments
The WorkloadEndpoint resource is backed by the Pod resource (so that's why it already exists). For Felix to see a WEP, the pod must have either the PodIP(s) field filled in or a Calico annotation for the pod IPs, and Felix must get the event. Yes, both the CNI plugin and Felix will try to set up the routes, but if Felix doesn't hear about the WEP soon enough it will remove the route again.

So the problem could be that the PodIPs field and annotation are missing, or it could be that the event is not making it through from CNI plugin to API server to Typha to Felix in a timely fashion. Kubelet can be slow to write back the podIPs, which is why we also have the annotation: the CNI plugin patches the Pod resource to add the annotation in order to get the IP to Felix ASAP. Worth checking if something is clobbering/rejecting that annotation.

50 nodes is a small cluster relative to the number of Typha instances we'll run; you'll have 3 Typha instances for HA and each one can handle hundreds of nodes. The problem could be with any link in that chain (CNI plugin to API server to Typha to Felix).
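A rough way to check both of those paths for a pod that came up without connectivity; this is a sketch, the pod/namespace names are placeholders, and the cni.projectcalico.org annotation keys are what I'd expect the Calico CNI plugin to set, so verify them against your installed version:

```bash
POD=cni-test-1   # placeholder pod name
NS=default       # placeholder namespace

# 1) The podIPs status field that kubelet writes back (this can lag):
kubectl get pod "$POD" -n "$NS" -o jsonpath='{.status.podIPs}{"\n"}'

# 2) The annotation the CNI plugin patches onto the Pod so Felix gets the IP sooner:
kubectl get pod "$POD" -n "$NS" -o yaml | grep 'cni.projectcalico.org/podIP'

# If neither is present shortly after the CNI ADD completes, something may be
# clobbering or rejecting the patch (e.g. an admission webhook), or kubelet is
# slow to update the pod status.
```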
In v3.28 we added a feature that allows the CNI plugin to wait for Felix to receive the WEP and program the policy; that won't speed it up, but it makes sure the binary inside the pod doesn't start before the networking is ready.
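Since that behaviour needs v3.28+, a hedged sketch for confirming which Calico version the cluster is actually running; the namespace may be kube-system or calico-system depending on how Calico was installed, and the second query assumes the Calico CRDs are present:

```bash
# Image tag of the calico-node daemonset usually reflects the Calico version:
kubectl get daemonset calico-node -n calico-system \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

# If the Calico CRDs are installed, the reported version is also recorded here:
kubectl get clusterinformations.crd.projectcalico.org default \
  -o jsonpath='{.spec.calicoVersion}{"\n"}'
```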
@bradbehle any progress?
We did look into the items mentioned by @fasaxc:
We did discover that there were more pods than we initially thought being created during the times when we saw this problem, so we scheduled some of those for a different time, and that mitigated the problem. We also noticed that the cluster master pods (apiserver/scheduler/controller-manager) had resource spikes during this time, so our best guess is that the apiserver might have just been slow to notify Typha of the new pods (as mentioned earlier in this issue: "it could be that the event is not making it through from CNI plugin to API server to Typha to Felix in a timely fashion"). We are looking at using the 3.28 feature that was mentioned to ensure that pods don't start until networking for the pod is fully functional.
Looks like it happened again a few days ago. Same thing: during the time in question, one of the three etcd pods and two of the three master pods (running the kube apiserver, controller manager, and scheduler) had a spike in memory usage during the 30-minute window when there was a significant delay between pods starting up and getting networking. So it was likely caused by an increase in the amount of master activity (pods being scheduled and/or apiserver calls being made), which I suppose could result in the apiserver being slow to notify calico-typha. And I suppose it is possible, as mentioned before, that if many pods were being scheduled/run at once on the same nodes that Typha is running on, calico-typha could have been resource-starved during this time; we can look into that as well.
To further drill down on the Calico side, you could enable Prometheus metrics for Typha and Felix. Typha tracks internal latency and ping/pong latency to its clients. It might also be worth correlating the problem with the Typha logs: if you see Typha saying that it had to do a full list of workload endpoints around that time, that could be a thread to pull on. It means Typha lost its watch and needed to list all pods, which is expensive.
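A hedged sketch of enabling those metrics. The field and env-var names follow the usual Calico patterns (FelixConfiguration prometheusMetricsEnabled; TYPHA_PROMETHEUSMETRICS* env vars) but should be checked against the docs for your version; operator-managed installs may want these set via the Installation resource instead, the namespace is assumed to be calico-system, and the FelixConfiguration patch assumes kubectl can see that resource (otherwise use calicoctl):

```bash
# Felix metrics (served on port 9091 by default):
kubectl patch felixconfiguration default --type merge \
  -p '{"spec":{"prometheusMetricsEnabled":true}}'

# Typha metrics:
kubectl set env deployment/calico-typha -n calico-system \
  TYPHA_PROMETHEUSMETRICSENABLED=true TYPHA_PROMETHEUSMETRICSPORT=9093

# Scrape Typha and eyeball latency-related series:
kubectl port-forward -n calico-system deploy/calico-typha 9093:9093 &
curl -s localhost:9093/metrics | grep -i latency | head
```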
Expected Behavior
Pods should always have network connectivity either immediately or very soon after they start up
Current Behavior
Pods sometimes have no network connectivity for up to two minutes on startup
Possible Solution
Steps to Reproduce (for bugs)
Context
This is causing the workload that runs in these pods to fail intermittently and is causing significant problems for work that depends on these pods. We are trying a few workarounds, but we would like to get to the root cause soon so we can address the real issue, whatever it is.
Your Environment
Short Summary:
At times in our 50-node cluster, pods will start up without outbound network connectivity for anywhere from 1 second to 2 minutes. The pod will have a pod IP and be in Running and Ready state, but all outbound connection attempts will fail even though there are no network policies applied to the pod. After some amount of time the pod becomes able to connect outbound. I think I have tracked this down to a delay in calico-node being notified (or noticing?) that the pod exists, and therefore not adding the iptables rules in the filter table that are needed to keep the outbound traffic from being dropped. If this is actually the case, I would like to understand why this delay is occurring, what we can do about it, and whether Calico has any SLA or guideline as to whether a pod not having network connectivity when it first starts up is "normal", or something that could be fixed.
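For reference, a rough sketch of how the "Running but no egress" window can be measured: a throwaway pod whose only job is to time how long outbound connectivity takes after its container starts. The image and target address are placeholders; any always-reachable TCP endpoint outside the cluster works.

```bash
kubectl run egress-probe --restart=Never --image=busybox:1.36 -- /bin/sh -c '
  start=$(date +%s)
  # Retry a TCP connect to an external endpoint until it succeeds.
  until nc -z -w 1 1.1.1.1 53; do sleep 0.5; done
  echo "outbound connectivity after $(( $(date +%s) - start ))s"
'
kubectl logs -f egress-probe
```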
More Details:
If I'm understanding correctly, the following steps occur when a pod is first created. Please correct anything I have wrong here, or add anything relevant that I missed.
Kubelet calls the container runtime (our cluster is using Ubuntu 20 nodes and containerd), and the runtime calls the CNI plugin (Calico in this case) to create the pod networking.
Calico CNI does the following:
- It checks for an existing WorkloadEndpoint and logs that it found one:
  Calico CNI found existing endpoint: &{{WorkloadEndpoint projectcalico.org/v3} {10.208.108.172-k8s-cni--test--1-eth0 default fa565576-4257-4ca4-b3b6-89f83ccbd7fb 274584 0 2025-01-12 04:03:27 +0000 UTC <nil> <nil> map[projectcalico.org/namespace:default projectcalico.org/orchestrator:k8s projectcalico.org/serviceaccount:default] map[] [] [] []} {k8s 10.208.108.172 cni-test-1 eth0 default [] [] [kns.default ksa.default.default] calide70d6647b3 [] []}} ContainerID="b1fb3be2b384588c55d586f4c49cd581166692bb1ddc8fff12cdc11ad2104fdc" Namespace="default" Pod="cni-test-1" WorkloadEndpoint="10.208.108.172-k8s-cni--test--1-"
  I don't really understand how it "found" this existing WorkloadEndpoint, though. Did something create it before the CNI was called? I don't think it matters for this issue, I'm just curious about it.
- Calico IPAM assigns the pod IP:
  Jan 12 04:03:28 2025-01-12 04:03:28.630 [INFO][87213] ipam_plugin.go 286: Calico CNI IPAM assigned addresses IPv4=[172.30.215.213/26] IPv6=[] ContainerID="b1fb3be2b384588c55d586f4c49cd581166692bb1ddc8fff12cdc11ad2104fdc" HandleID="k8s-pod-network.b1fb3be2b384588c55d586f4c49cd581166692bb1ddc8fff12cdc11ad2104fdc" Workload="10.208.108.172-k8s-cni--test--1-eth0"
- It creates the pod's network namespace and veth pair and adds the host route to the caliXXXXXXXXXXX interface (I'm pretty sure it is the CNI that creates that route and network namespace, but not 100% certain).
The calico-node pod does the following (a sketch for pulling the corresponding log lines out of calico-node follows this list):
1. Felix notices when the caliXXXXXXXXXXX interface is created, as well as when it transitions to "up" state. This is based on the log line in calico-node: [INFO][68] felix/endpoint_mgr.go 374: Workload interface came up, marking for reconfiguration. ifaceName="caliXXXXXXXXXXX". It also says Applying /proc/sys configuration to interface. ifaceName="caliXXXXXXXXXXX", but I'm not sure what that means.
2. If Felix does not (yet) know that a workload endpoint for caliXXXXXXXXXXX exists, it will track it, and if it isn't notified about this pod within 10 seconds (or so), it will delete the interface, or at least I think it will based on this line: [INFO][68] felix/route_table.go 902: Remove old route dest=172.30.44.193/32 ifaceName="caliXXXXXXXXXXX" ifaceRegex="^cali.*" ipVersion=0x4 routeProblems=[]string{"unexpected route"} tableIndex=254
3. Once Felix is notified about the pod, it adds the cali-tw-caliXXXXXXXXXXX and cali-fw-caliXXXXXXXXXXX chains, and if there are network policies that apply to this pod then it also adds iptables rules to these chains to enforce the network policies. I think it also adds the caliXXXXXXXXXXX interface and route back if those aren't already there (especially if it deleted them in step 2 above). This is based on the calico-node log line: [INFO][68] felix/int_dataplane.go 1956: Received *proto.WorkloadEndpointUpdate update from calculation graph ...
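A rough way to correlate the steps above for one specific pod is to grep the calico-node logs on the pod's node for the interface name and the three Felix messages quoted in the list. IFACE and NODE are placeholders, the daemonset namespace may be kube-system or calico-system depending on the install, and the WorkloadEndpointUpdate line names the workload endpoint rather than the interface, so it needs a quick manual match.

```bash
IFACE=caliXXXXXXXXXXX   # placeholder interface name from "ip route" or the CNI logs
NODE=10.208.108.172     # placeholder node name

# Find the calico-node pod running on that node:
CALICO_POD=$(kubectl get pod -n calico-system -l k8s-app=calico-node \
  --field-selector spec.nodeName="$NODE" -o name | head -n1)

# Pull out the interface-up, route-removal and WorkloadEndpointUpdate messages:
kubectl logs -n calico-system "$CALICO_POD" --timestamps \
  | grep -E "Workload interface came up|Remove old route|Received \*proto\.WorkloadEndpointUpdate" \
  | grep -E "$IFACE|WorkloadEndpointUpdate"
```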
From what I'm seeing in /var/log/containerd.log and the calico-node pod logs, when we see a pod come up without networking it is because calico-node is sometimes not getting the WorkloadEndpointUpdate notification until a minute or more after the pod is created (or at least I'm not seeing the "Received *proto.WorkloadEndpointUpdate update from calculation graph" log message for the pod until a minute or more after the pod is created and the networking is set up by the Calico CNI).
I did a test where I shut down the calico-typha pods and then created a pod (so the Calico CNI would still do its thing and create the network namespace, veth pair, route, and pod IP), but calico-node would not be notified. In this case the pod came up in Running and Ready state but, as expected, did not have outbound network connectivity. I checked the iptables rules in the filter table, and it looks like the traffic was being dropped: the pod's interface started with calid..., so if I understand correctly the traffic is evaluated by the cali-from-wl-dispatch-d chain, and since there was no cali-fw-... chain that exactly matched the cali... interface name, the traffic was dropped by the "Unknown interface" rule. I think this makes sense: since calico-node wasn't notified about the pod, it did not set up the chains needed to allow traffic to/from this pod.
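For what it's worth, this is the kind of check I ran on the node (as root) to see whether Felix had programmed the per-workload chains yet; a sketch, with the interface name as a placeholder:

```bash
IFACE=caliXXXXXXXXXXX   # placeholder interface name

# Dispatch chains that hand traffic off to per-workload chains:
iptables-save -t filter | grep -E "cali-from-wl-dispatch|cali-to-wl-dispatch" | head

# Per-workload chains; these only appear once Felix has processed the
# WorkloadEndpointUpdate for the pod:
iptables-save -t filter | grep -E "cali-(fw|tw)-$IFACE"
```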
We have some monitoring on this cluster. The times when this problem is worst are somewhat correlated with starting a greater-than-normal number of pods, and sometimes with a small increase in CPU and/or memory usage on the cluster nodes, but nothing we would consider out of the ordinary. The nodes do not get anywhere near their CPU/memory limits from what we can tell, and we would not expect a small increase in the number of pods created per minute to cause these sometimes large delays (up to a minute or two) in pod networking being set up. Our control-plane monitoring does show an increase in memory usage of one of the three apiserver pods and one of the three etcd pods during the time when this problem occurs, but I wouldn't necessarily expect that to cause a one-to-two-minute delay in notifications about new pods or about CRD updates.
We checked the calico-typha pod logs during the time these delays occurred but did not see anything out of the ordinary, although there is not much logging in those calico-typha pods in general, so it's hard to tell much from them anyway.
We are looking for advice about where to go from here to resolve this. If this is a known issue with a solution (such as tuning calico-typha parameters), please let us know. Otherwise let us know what we should do to troubleshoot this further (increased logging levels, additional components to check, specific performance metrics to check, ...)
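One concrete "increase logging" option we could try in the meantime: temporarily bump Felix's log level. This is a sketch; logSeverityScreen is a standard FelixConfiguration setting, but it is verbose, and the command assumes kubectl can see the FelixConfiguration resource (otherwise calicoctl would be needed).

```bash
# Turn Felix debug logging on:
kubectl patch felixconfiguration default --type merge \
  -p '{"spec":{"logSeverityScreen":"Debug"}}'

# Revert once enough data has been captured:
kubectl patch felixconfiguration default --type merge \
  -p '{"spec":{"logSeverityScreen":"Info"}}'
```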