High CPU load due to XDP program in iptables #8856

Closed
dzacball opened this issue May 23, 2024 · 31 comments

@dzacball

dzacball commented May 23, 2024

Expected Behavior

Calico uses a reasonable amount of CPU, and there are no XDP/BPF-related error logs in calico-node.

Current Behavior

Calico is using about 10x its usual CPU. In the calico-node logs, I can see messages like:

[INFO][60] felix/int_dataplane.go 2154: Applying XDP actions did not succeed, disabling XDP error=failed to resync: failed to load BPF program (/usr/lib/calico/bpf/filter.o): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error loading BTF: Invalid argument(22)

Possible Solution

If I rebuild calico-node using an up-to-date version of bpftool (I tried v7.4 and it worked), the issue is gone.

Steps to Reproduce (for bugs)

I'm not sure how to reproduce it; it happens in our environment, and it was introduced by the updates to the map definitions in #8610.

Context

We cannot use the affected versions of Calico due to the high CPU load (and of course we aren't sure what other issues could arise).

Your Environment

Calico v3.27.3 and/or v3.28.0.
K8s 1.28 + Ubuntu 20.04.6 workers with 5.4.0-177-generic kernel.

More details

I could confirm that this issue was introduced by PR #8610. It is present in Calico v3.27.3 and v3.28.0. If I revert the PR and build my own calico-node image on top of v3.28.0, the issue is gone. Also, if I rebuild calico-node with an up-to-date version of bpftool (I tried v7.4 and it worked), the issue is gone.

Related slack thread: https://calicousers.slack.com/archives/CPTH1KS00/p1713552425270619

@dzacball
Author

Just to make sure it's not specific to a combination of HEPs/GNPs/NetworkSets, I created a cluster with v3.27.3 and deleted all related resources. After that, I restarted calico-node, and the issue is still present:

$ k get networksets.crd.projectcalico.org -A
No resources found

$ k get hostendpoints.crd.projectcalico.org
No resources found

$ k get globalnetworkpolicies.crd.projectcalico.org
No resources found

$ k get networkpolicies.crd.projectcalico.org -A
No resources found

$ k get networkpolicies.networking.k8s.io -A
No resources found

$ k rollout restart daemonset -n kube-system calico-node
daemonset.apps/calico-node restarted

$ k get po -n kube-system | grep calico
calico-kube-controllers-5b784b4785-7898t             1/1     Running   0          91m
calico-node-gk7ft                                    1/1     Running   0          78s
calico-node-ngz97                                    1/1     Running   0          2m59s
calico-typha-5667cc7f45-d7lds                        0/1     Pending   0          91m
calico-typha-5667cc7f45-kp9c4                        1/1     Running   0          91m
calico-typha-5667cc7f45-l5pxg                        1/1     Running   0          91m

$ k top po -n kube-system | grep calico
calico-kube-controllers-5b784b4785-7898t             3m           16Mi            
calico-node-gk7ft                                    285m         125Mi          <-- CPU should be much lower, about 1/10th of this
calico-node-ngz97                                    247m         132Mi          <-- CPU should be much lower, about 1/10th of this 
calico-typha-5667cc7f45-kp9c4                        3m           20Mi            
calico-typha-5667cc7f45-l5pxg                        2m           21Mi            

$ k logs -n kube-system calico-node-ngz97 --since 1s
Defaulted container "calico-node" out of: calico-node, install-cni (init)
2024-05-24 08:24:24.200 [INFO][59] felix/xdp_state.go 200: Retrying after an XDP update failure...
2024-05-24 08:24:24.244 [INFO][59] felix/xdp_state.go 200: Retrying after an XDP update failure...
2024-05-24 08:24:24.289 [INFO][59] felix/xdp_state.go 200: Retrying after an XDP update failure...
2024-05-24 08:24:24.332 [INFO][59] felix/xdp_state.go 200: Retrying after an XDP update failure...
2024-05-24 08:24:24.372 [INFO][59] felix/xdp_state.go 200: Retrying after an XDP update failure...
2024-05-24 08:24:24.428 [INFO][59] felix/int_dataplane.go 2154: Applying XDP actions did not succeed, disabling XDP error=failed to resync: failed to load BPF program (/usr/lib/calico/bpf/filter.o): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error loading BTF: Invalid argument(22)
libbpf: magic: 0xeb9f
version: 1
flags: 0x0
...
...
...

Worker env:

# cat /etc/os-release  | grep VERSION
VERSION="20.04.6 LTS (Focal Fossa)"
VERSION_ID="20.04"
VERSION_CODENAME=focal

# uname -r
5.4.0-182-generic

@spacegaucho

spacegaucho commented May 24, 2024

Could you check your syslog for any networkd-dispatcher, NetworkManager, ethtool, or systemd-udevd flooding? I just experienced something similar after upgrading from 3.25 to 3.28: my master nodes didn't show any particular Kubernetes process loading up the node (at a glance in k9s), but the node itself was at 10x the usual load (top). I don't think my problem is BPF related.
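
For example, a quick way to spot that kind of flooding (a sketch, assuming syslog is at the standard /var/log/syslog location):

# grep -cE 'networkd-dispatcher|NetworkManager|ethtool|systemd-udevd' /var/log/syslog
# grep -E 'networkd-dispatcher|systemd-udevd' /var/log/syslog | tail -n 20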

This apparently only affected my on-premise architecture.

Specs

focal
systemd 245 (245.4-4ubuntu3.23)
kernel: 5.4.0-152-generic
k8s: v1.25.12

Syslog

May 24 13:23:08 master networkd-dispatcher[688]: ERROR:Unknown interface index 1419614 seen even after reload
May 24 13:23:08 master networkd-dispatcher[688]: WARNING:Unknown index 1419616 seen, reloading interface list
May 24 13:23:08 master systemd-udevd[701481]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
[...]
May 24 13:23:08 master systemd-udevd[701481]: ethtool: could not get ethtool features for calico_tmp_B
May 24 13:23:08 master systemd-udevd[701481]: Could not set offload features of calico_tmp_B: No such device

@dzacball
Author

@spacegaucho similar stuff here:

# cat /var/log/syslog | tail -n 10
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 networkd-dispatcher[785]: WARNING:Unknown index 450967 seen, reloading interface list
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 systemd-udevd[10804]: calico_tmp_B: Failed to get link config: No buffer space available
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 systemd-udevd[10802]: calico_tmp_A: Failed to get link config: No buffer space available
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 networkd-dispatcher[785]: WARNING:Unknown index 450969 seen, reloading interface list
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 systemd-udevd[10804]: calico_tmp_B: Failed to get link config: No buffer space available
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 systemd-udevd[10802]: calico_tmp_A: Failed to get link config: No buffer space available
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 systemd-networkd[679]: calico_tmp_B: Failed to wait for the interface to be initialized: No such device
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 networkd-dispatcher[785]: WARNING:Unknown index 450971 seen, reloading interface list
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 systemd-udevd[10802]: calico_tmp_A: Failed to get link config: No buffer space available
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 systemd-udevd[10804]: calico_tmp_B: Failed to get link config: No buffer space available

Could you check your calico-node logs to see if there's anything like the logs below?

[INFO][60] felix/int_dataplane.go 2154: Applying XDP actions did not succeed, disabling XDP error=failed to resync: failed to load BPF program (/usr/lib/calico/bpf/filter.o): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error loading BTF: Invalid argument(22)

@spacegaucho

spacegaucho commented May 24, 2024

@dzacball looks like we are onto something here.

No, there's nothing referencing the events you mentioned:

k stern -n kube-system calico-node-* --since=48h --no-follow | grep 'XDP actions did not succeed'
+ calico-node-zlbwb › calico-node
+ calico-node-gm87s › calico-node
+ calico-node-4fp5p › calico-node
+ calico-node-vkh8q › calico-node
+ calico-node-pphxf › calico-node
+ calico-node-cdp9k › calico-node
+ calico-node-8smsj › calico-node
+ calico-node-nfww9 › calico-node
+ calico-node-lptlx › calico-node
- calico-node-cdp9k › calico-node
- calico-node-4fp5p › calico-node
- calico-node-vkh8q › calico-node
- calico-node-nfww9 › calico-node
- calico-node-zlbwb › calico-node
- calico-node-8smsj › calico-node
- calico-node-lptlx › calico-node
- calico-node-gm87s › calico-node
- calico-node-pphxf › calico-node

But the issue persists. I had to roll back to 3.25, and that "fixed" the issue. I'm browsing around for any additional information.

@mazdakn
Member

mazdakn commented May 28, 2024

@spacegaucho can you check whether the issue goes away after upgrading bpftool?

@spacegaucho

Sure! @mazdakn how should I go about that? My calico-node pods currently run the following:

bpftool --version
bpftool v5.3.0
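
For reference, one way to check this in a running pod (a sketch, assuming the DaemonSet is named calico-node in the kube-system namespace):

$ kubectl exec -n kube-system ds/calico-node -c calico-node -- bpftool version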

Should I test a v3.29.0-0.dev image? Or would I need to update the binary manually inside the pods? Thanks.

@mazdakn
Member

mazdakn commented May 29, 2024

@spacegaucho Thanks for being willing to test it. I'll provide you with an image with the updated bpftool.

@xloypaypa

I got a similar issue here.
I'm not sure about the root cause, but once I upgraded my kernel from 5.4.0-182-generic to 5.15.0-107-generic, the issue disappeared.

Referring to the kernel versions here: https://ubuntu.com/security/livepatch/docs/livepatch/reference/kernels
What I did was install the HWE kernel (I'm using Ubuntu 20.04 LTS).

So I'm not sure whether this issue is related to a specific kernel version or is actually related to the HWE kernel.
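
In case it helps anyone else, here is a sketch of installing the HWE kernel on Ubuntu 20.04, which is what I did (a reboot is required):

$ sudo apt update
$ sudo apt install --install-recommends linux-generic-hwe-20.04
$ sudo reboot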

@mazdakn changed the title from "High CPU load and BPF related error logs" to "High CPU load due to XDP program in iptables" on Jun 3, 2024
@caseydavenport
Member

xref #8833

@mazdakn
Member

mazdakn commented Jun 5, 2024

@spacegaucho here is an image with the updated bpftool (v7.4). Please give it a try, and let me know if it works for you.

mazdakrn/calico-node:latest

This is based on the master branch, but it should be OK to test in a v3.28 cluster.
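
One quick way to try it in a test cluster (a sketch, assuming a manifest-based install with the calico-node DaemonSet in kube-system; the Tigera operator would revert a manual image change, so adjust accordingly if you use it):

$ kubectl set image daemonset/calico-node -n kube-system calico-node=mazdakrn/calico-node:latest
$ kubectl rollout status daemonset/calico-node -n kube-system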

@mazdakn
Member

mazdakn commented Jun 5, 2024

@dzacball @spacegaucho can you also disable XDP mode in iptables by setting xdpEnabled to false in the FelixConfiguration resource? This should prevent Felix from trying to load the iptables XDP objects.

Ref: https://docs.tigera.io/calico/latest/reference/resources/felixconfig
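
For example, with calicoctl (a sketch, assuming the default FelixConfiguration; kubectl can patch the same resource if the Calico API server is installed):

$ calicoctl patch felixconfiguration default --patch '{"spec":{"xdpEnabled":false}}'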

@dzacball
Author

dzacball commented Jun 6, 2024

@mazdakn I can verify that if I set xdpEnabled to false, the issue is gone.

@mazdakn
Member

mazdakn commented Jun 7, 2024

@dzacball can you also try the image I mentioned above in one of your test clusters?

@dzacball
Author

@mazdakn I tested your image - it works, the issue is gone. (As I already mentioned, I also did some tests a few weeks back with a self-built image using v3.28.0 + the latest bpftool, which worked as well.)

@mazdakn
Member

mazdakn commented Jun 11, 2024

Thanks @dzacball for testing.

@spacegaucho

Sorry for not replying earlier, will try it ASAP.

> @dzacball @spacegaucho can you also disable XDP mode in iptables by setting xdpEnabled to false in felixconfiguration resource. This should prevent felix trying to load iptables XDP objects.
>
> Ref: https://docs.tigera.io/calico/latest/reference/resources/felixconfig

@spacegaucho

spacegaucho commented Jun 13, 2024

> @spacegaucho here is an image with the updated bpftool (v7.4). Please give it a try, and let me know if it works for you.
>
> mazdakrn/calico-node:latest
>
> This is based on master branch, but should be OK to test it in a v3.28 cluster.

Can confirm this fixed the issue for me as well.

  • Started with 3.25 and checked that there were no matches for the "XDP actions did not succeed" string, as I tested in a previous post.
  • Upgraded to 3.28 (using this upgrade path) and verified that the string was present and flooding the logs.
  • Modified the upgrade.yaml file to include the image provided by @mazdakn, applied it, and verified that after the image was set in place and the pods restarted, the log flooding stopped.

@mazdakn
Member

mazdakn commented Jun 13, 2024

@spacegaucho thanks for testing and confirming. Did you also manage to set xdpEnabled to false and see if the issue goes away?

@spacegaucho

> xdpEnabled

No, sorry. Should I test that directly with vanilla 3.28?

@mazdakn
Member

mazdakn commented Jun 14, 2024

Yes, with the vanilla 3.28 image.

@isugimpy

I notice that #8880 is flagged for 3.29; is it possible this will also get a 3.28 backport? I was about to go down the rabbit hole of switching to the eBPF dataplane and exploring XDP, but I don't want to run into this issue.

@mazdakn
Member

mazdakn commented Jun 18, 2024

@isugimpy we will definitely backport it to v3.28, and also to v3.27.
However, this issue is not relevant to the eBPF dataplane. The issue discussed here is about the XDP program used by the iptables dataplane.
The eBPF dataplane is NOT affected.
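
For reference, one way to check which dataplane a cluster is using (a sketch, assuming the default FelixConfiguration): if bpfEnabled is unset or false, the cluster is on the iptables dataplane, which is the one affected here.

$ calicoctl get felixconfiguration default -o yaml | grep bpfEnabled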

@isugimpy

Oh, brilliant. I didn't catch that this was unique to iptables! Thank you!

@dzacball
Author

@mazdakn Do you have an ETA for the upcoming 3.27 and 3.28 releases that will contain the fix PR?

@mazdakn
Member

mazdakn commented Jun 25, 2024

@dzacball v3.27.4 is expected to be released in early July (probably the first week). There is no ETA for a 3.28 patch release yet.

@dzacball
Author

@mazdakn Thank you.

@mazdakn
Member

mazdakn commented Jun 26, 2024

Closing since the fixes (including backports to v3.27 and v3.28) are now merged and will be available in the next patch releases.

@mazdakn closed this as completed on Jun 26, 2024
@mazdakn
Member

mazdakn commented Jul 8, 2024

@dzacball 3.28.1 is planned to be released in the second half of July.

@dzacball
Author

dzacball commented Jul 9, 2024

Thanks for the heads-up.

@mihivagyok
Contributor

@mazdakn Do you have any update on when the release will arrive? Thank you!

@mazdakn
Member

mazdakn commented Jul 24, 2024

@mihivagyok we are in the process of releasing 3.28.1 at the moment. It will most likely be released early next week.
