Pods sometimes have no network connectivity for up to two minutes on startup #9706
Comments
The WorkloadEndpoint resource is backed by the Pod resource (so that's why it already exists). For Felix to see a WEP, the pod must have either the PodIP(s) field filled in or a Calico annotation for the pod IPs, and Felix must get the event. Yes, both the CNI plugin and Felix will try to set up the routes, but if Felix doesn't hear about the WEP soon enough it will remove the route again.

So the problem could be that the PodIPs field and annotation are missing, or it could be that the event is not making it through from CNI plugin to API server to Typha to Felix in a timely fashion. Kubelet can be slow to write back the podIPs, which is why we also have the annotation: the CNI plugin patches the Pod resource to add the annotation in order to get the IP to Felix ASAP. Worth checking if something is clobbering/rejecting that annotation.

50 nodes is a small cluster relative to the number of Typha instances we'll run; you'll have 3 Typha instances for HA and each one can handle hundreds of nodes. The problem could be with any link in that chain (CNI plugin to API server to Typha to Felix).
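A rough way to check both of those paths for a pod that came up without connectivity; this is a sketch, the pod/namespace names are placeholders, and the cni.projectcalico.org annotation keys are what I'd expect the Calico CNI plugin to set, so verify them against your installed version:

```bash
POD=cni-test-1   # placeholder pod name
NS=default       # placeholder namespace

# 1) The podIPs status field that kubelet writes back (this can lag):
kubectl get pod "$POD" -n "$NS" -o jsonpath='{.status.podIPs}{"\n"}'

# 2) The annotation the CNI plugin patches onto the Pod so Felix gets the IP sooner:
kubectl get pod "$POD" -n "$NS" -o yaml | grep 'cni.projectcalico.org/podIP'

# If neither is present shortly after the CNI ADD completes, something may be
# clobbering or rejecting the patch (e.g. an admission webhook), or kubelet is
# slow to update the pod status.
```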
In v3.28 we added a feature that allows the CNI plugin to wait for Felix to receive the WEP and program the policy; that won't speed it up, but it makes sure the binary inside the pod doesn't start before the networking is ready.
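Since that behaviour needs v3.28+, a hedged sketch for confirming which Calico version the cluster is actually running; the namespace may be kube-system or calico-system depending on how Calico was installed, and the second query assumes the Calico CRDs are present:

```bash
# Image tag of the calico-node daemonset usually reflects the Calico version:
kubectl get daemonset calico-node -n calico-system \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

# If the Calico CRDs are installed, the reported version is also recorded here:
kubectl get clusterinformations.crd.projectcalico.org default \
  -o jsonpath='{.spec.calicoVersion}{"\n"}'
```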
@bradbehle any progress?
We did look into the items mentioned by @fasaxc:
We did discover that there were more pods than we initially thought being created during the times when we saw this problem, so we scheduled some of those for a different time, and that mitigated the problem. We also noticed that the cluster master pods (apiserver/scheduler/controller-manager) had resource spikes during this time, so our best guess is that the apiserver might have just been slow to notify Typha of the new pods (as mentioned earlier in this issue: "it could be that the event is not making it through from CNI plugin to API server to Typha to Felix in a timely fashion"). We are looking at using the 3.28 feature that was mentioned to ensure that pods don't start until networking for the pod is fully functional.
Looks like it happened again a few days ago. Same thing: during the time in question, one of the three etcd pods and two of the three master pods (running the kube apiserver, controller manager, and scheduler) had a spike in memory usage during the 30-minute window when there was a significant delay between pods starting up and getting networking. So it was likely caused by an increase in the amount of master activity (pods being scheduled and/or apiserver calls being made), which I suppose could result in the apiserver being slow to notify calico-typha. And I suppose it is possible, as mentioned before, that if many pods were being scheduled/run at once on the same nodes that Typha is running on, calico-typha could have been resource-starved during this time; we can look into that as well.
To further drill down on the Calico side, you could enable Prometheus metrics for Typha and Felix. Typha tracks internal latency and ping/pong latency to its clients. It might also be worth correlating the problem with the Typha logs: if you see Typha saying that it had to do a full list of workload endpoints around that time, that could be a thread to pull on. It means Typha lost its watch and needed to list all pods, which is expensive.
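A hedged sketch of enabling those metrics. The field and env-var names follow the usual Calico patterns (FelixConfiguration prometheusMetricsEnabled; TYPHA_PROMETHEUSMETRICS* env vars) but should be checked against the docs for your version; operator-managed installs may want these set via the Installation resource instead, the namespace is assumed to be calico-system, and the FelixConfiguration patch assumes kubectl can see that resource (otherwise use calicoctl):

```bash
# Felix metrics (served on port 9091 by default):
kubectl patch felixconfiguration default --type merge \
  -p '{"spec":{"prometheusMetricsEnabled":true}}'

# Typha metrics:
kubectl set env deployment/calico-typha -n calico-system \
  TYPHA_PROMETHEUSMETRICSENABLED=true TYPHA_PROMETHEUSMETRICSPORT=9093

# Scrape Typha and eyeball latency-related series:
kubectl port-forward -n calico-system deploy/calico-typha 9093:9093 &
curl -s localhost:9093/metrics | grep -i latency | head
```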
Expected Behavior
Pods should always have network connectivity either immediately or very soon after they start up
Current Behavior
Pods sometimes have no network connectivity for up to two minutes on startup
Possible Solution
Steps to Reproduce (for bugs)
Context
This is causing the workload that runs in these pods to fail intermittently and is causing significant problems for work that depends on these pods. We are trying a few workarounds, but we would like to get to the root cause soon so we can address the real issue, whatever it is.
Your Environment
Short Summary:
At times in our 50-node cluster, pods will start up without outbound network connectivity for anywhere from 1 second to 2 minutes. The pod will have a pod IP and be in Running and Ready state, but all outbound connection attempts will fail even though there are no network policies applied to the pod. After some amount of time the pod becomes able to connect outbound. I think I have tracked this down to a delay in calico-node being notified (or noticing?) that the pod exists, and therefore not adding the iptables rules in the filter table that are needed to keep the outbound traffic from being dropped. If this is actually the case, I would like to understand why this delay is occurring, what we can do about it, and whether Calico has any SLA or guideline as to whether a pod not having network connectivity when it first starts up is "normal", or something that could be fixed.
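For reference, a rough sketch of how the "Running but no egress" window can be measured: a throwaway pod whose only job is to time how long outbound connectivity takes after its container starts. The image and target address are placeholders; any always-reachable TCP endpoint outside the cluster works.

```bash
kubectl run egress-probe --restart=Never --image=busybox:1.36 -- /bin/sh -c '
  start=$(date +%s)
  # Retry a TCP connect to an external endpoint until it succeeds.
  until nc -z -w 1 1.1.1.1 53; do sleep 0.5; done
  echo "outbound connectivity after $(( $(date +%s) - start ))s"
'
kubectl logs -f egress-probe
```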
More Details:
If I'm understanding correctly, the following steps occur when a pod is first created. Please correct anything I have wrong here, or add anything relevant that I missed.
Kubelet calls the container runtime (our cluster is using Ubuntu 20 nodes and containerd), and the runtime calls the CNI plugin (Calico in this case) to create the pod networking.
Calico CNI does the following:
- It checks for an existing WorkloadEndpoint and logs that it found one:
  Calico CNI found existing endpoint: &{{WorkloadEndpoint projectcalico.org/v3} {10.208.108.172-k8s-cni--test--1-eth0 default fa565576-4257-4ca4-b3b6-89f83ccbd7fb 274584 0 2025-01-12 04:03:27 +0000 UTC <nil> <nil> map[projectcalico.org/namespace:default projectcalico.org/orchestrator:k8s projectcalico.org/serviceaccount:default] map[] [] [] []} {k8s 10.208.108.172 cni-test-1 eth0 default [] [] [kns.default ksa.default.default] calide70d6647b3 [] []}} ContainerID="b1fb3be2b384588c55d586f4c49cd581166692bb1ddc8fff12cdc11ad2104fdc" Namespace="default" Pod="cni-test-1" WorkloadEndpoint="10.208.108.172-k8s-cni--test--1-"
  I don't really understand how it "found" this existing WorkloadEndpoint, though. Did something create it before the CNI was called? I don't think it matters for this issue, I'm just curious about it.
- Calico IPAM assigns the pod IP:
  Jan 12 04:03:28 2025-01-12 04:03:28.630 [INFO][87213] ipam_plugin.go 286: Calico CNI IPAM assigned addresses IPv4=[172.30.215.213/26] IPv6=[] ContainerID="b1fb3be2b384588c55d586f4c49cd581166692bb1ddc8fff12cdc11ad2104fdc" HandleID="k8s-pod-network.b1fb3be2b384588c55d586f4c49cd581166692bb1ddc8fff12cdc11ad2104fdc" Workload="10.208.108.172-k8s-cni--test--1-eth0"
- It creates the pod's network namespace and veth pair and adds the host route to the caliXXXXXXXXXXX interface (I'm pretty sure it is the CNI that creates that route and network namespace, but not 100% certain).
The calico-node pod does the following (a sketch for pulling the corresponding log lines out of calico-node follows this list):
1. Felix notices when the caliXXXXXXXXXXX interface is created, as well as when it transitions to "up" state. This is based on the log line in calico-node: [INFO][68] felix/endpoint_mgr.go 374: Workload interface came up, marking for reconfiguration. ifaceName="caliXXXXXXXXXXX". It also says Applying /proc/sys configuration to interface. ifaceName="caliXXXXXXXXXXX", but I'm not sure what that means.
2. If Felix does not (yet) know that a workload endpoint for caliXXXXXXXXXXX exists, it will track it, and if it isn't notified about this pod within 10 seconds (or so), it will delete the interface, or at least I think it will based on this line: [INFO][68] felix/route_table.go 902: Remove old route dest=172.30.44.193/32 ifaceName="caliXXXXXXXXXXX" ifaceRegex="^cali.*" ipVersion=0x4 routeProblems=[]string{"unexpected route"} tableIndex=254
3. Once Felix is notified about the pod, it adds the cali-tw-caliXXXXXXXXXXX and cali-fw-caliXXXXXXXXXXX chains, and if there are network policies that apply to this pod then it also adds iptables rules to these chains to enforce the network policies. I think it also adds the caliXXXXXXXXXXX interface and route back if those aren't already there (especially if it deleted them in step 2 above). This is based on the calico-node log line: [INFO][68] felix/int_dataplane.go 1956: Received *proto.WorkloadEndpointUpdate update from calculation graph ...
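A rough way to correlate the steps above for one specific pod is to grep the calico-node logs on the pod's node for the interface name and the three Felix messages quoted in the list. IFACE and NODE are placeholders, the daemonset namespace may be kube-system or calico-system depending on the install, and the WorkloadEndpointUpdate line names the workload endpoint rather than the interface, so it needs a quick manual match.

```bash
IFACE=caliXXXXXXXXXXX   # placeholder interface name from "ip route" or the CNI logs
NODE=10.208.108.172     # placeholder node name

# Find the calico-node pod running on that node:
CALICO_POD=$(kubectl get pod -n calico-system -l k8s-app=calico-node \
  --field-selector spec.nodeName="$NODE" -o name | head -n1)

# Pull out the interface-up, route-removal and WorkloadEndpointUpdate messages:
kubectl logs -n calico-system "$CALICO_POD" --timestamps \
  | grep -E "Workload interface came up|Remove old route|Received \*proto\.WorkloadEndpointUpdate" \
  | grep -E "$IFACE|WorkloadEndpointUpdate"
```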
From what I'm seeing in /var/log/containerd.log and the calico-node pod logs, when we see a pod come up without networking it is because calico-node is sometimes not getting the WorkloadEndpointUpdate notification until a minute or more after the pod is created (or at least I'm not seeing the "Received *proto.WorkloadEndpointUpdate update from calculation graph" log message for the pod until a minute or more after the pod is created and the networking is set up by the Calico CNI).
I did a test where I shut down the calico-typha pods and then created a pod (so the Calico CNI would still do its thing and create the network namespace, veth pair, route, and pod IP), but calico-node would not be notified. In this case the pod came up in Running and Ready state but, as expected, did not have outbound network connectivity. I checked the iptables rules in the filter table, and it looks like the traffic was being dropped: the pod's interface started with calid..., so if I understand correctly the traffic is evaluated by the cali-from-wl-dispatch-d chain, and since there was no cali-fw-... chain that exactly matched the cali... interface name, the traffic was dropped by the "Unknown interface" rule. I think this makes sense: since calico-node wasn't notified about the pod, it did not set up the chains needed to allow traffic to/from this pod.
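For what it's worth, this is the kind of check I ran on the node (as root) to see whether Felix had programmed the per-workload chains yet; a sketch, with the interface name as a placeholder:

```bash
IFACE=caliXXXXXXXXXXX   # placeholder interface name

# Dispatch chains that hand traffic off to per-workload chains:
iptables-save -t filter | grep -E "cali-from-wl-dispatch|cali-to-wl-dispatch" | head

# Per-workload chains; these only appear once Felix has processed the
# WorkloadEndpointUpdate for the pod:
iptables-save -t filter | grep -E "cali-(fw|tw)-$IFACE"
```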
We have some monitoring on this cluster. The times when this problem is worst are somewhat correlated with starting a greater-than-normal number of pods, and sometimes with a small increase in CPU and/or memory usage on the cluster nodes, but nothing we would consider out of the ordinary. The nodes do not get anywhere near their CPU/memory limits from what we can tell, and we would not expect a small increase in the number of pods created per minute to cause these sometimes large delays (up to a minute or two) in pod networking being set up. Our control-plane monitoring does show an increase in memory usage of one of the three apiserver pods and one of the three etcd pods during the time when this problem occurs, but I wouldn't necessarily expect that to cause a one-to-two-minute delay in notifications about new pods or about CRD updates.
We checked the calico-typha pod logs during the time these delays occurred but did not see anything out of the ordinary, although there is not much logging in those calico-typha pods in general, so it's hard to tell much from them anyway.
We are looking for advice about where to go from here to resolve this. If this is a known issue with a solution (such as tuning calico-typha parameters), please let us know. Otherwise let us know what we should do to troubleshoot this further (increased logging levels, additional components to check, specific performance metrics to check, ...)
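One concrete "increase logging" option we could try in the meantime: temporarily bump Felix's log level. This is a sketch; logSeverityScreen is a standard FelixConfiguration setting, but it is verbose, and the command assumes kubectl can see the FelixConfiguration resource (otherwise calicoctl would be needed).

```bash
# Turn Felix debug logging on:
kubectl patch felixconfiguration default --type merge \
  -p '{"spec":{"logSeverityScreen":"Debug"}}'

# Revert once enough data has been captured:
kubectl patch felixconfiguration default --type merge \
  -p '{"spec":{"logSeverityScreen":"Info"}}'
```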