
enabling eBPF Recorder on AKS crashes SPOD containers #795

Closed
tshaiman opened this issue Jan 29, 2022 · 27 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@tshaiman

Following @saschagrunert's excellent tutorial here, I ran:

kubectl patch spod spod --type=merge -p '{"spec":{"enableBpfRecorder":true}}'

which eventually led to the following output in the bpf-recorder container:

I0129 19:09:14.865625 27546 logr.go:252] "msg"="Set logging verbosity to 1"
I0129 19:09:14.865684 27546 logr.go:252] "msg"="Profiling support enabled: false"
I0129 19:09:14.865733 27546 logr.go:252] setup "msg"="starting component: bpf-recorder" "buildDate"="1980-01-01T00:00:00Z" "compiler"="gc" "gitCommit"="unknown" "gitTreeState"="clean" "goVersion"="go1.17.3" "libbpf"="0.5.0" "libseccomp"="2.5.2" "platform"="linux/amd64" "version"="0.5.0-dev"
I0129 19:09:14.865789 27546 bpfrecorder.go:106] bpf-recorder "msg"="Setting up caches with expiry of 1h0m0s"
I0129 19:09:14.865820 27546 bpfrecorder.go:123] bpf-recorder "msg"="Starting log-enricher on node: aks-primary-29748022-vmss000002"
I0129 19:09:14.866518 27546 bpfrecorder.go:154] bpf-recorder "msg"="Connecting to metrics server"
I0129 19:09:14.867108 27546 bpfrecorder.go:170] bpf-recorder "msg"="Got system mount namespace: 4026531840"
I0129 19:09:14.867126 27546 bpfrecorder.go:172] bpf-recorder "msg"="Doing BPF load/unload self-test"
I0129 19:09:14.867139 27546 bpfrecorder.go:371] bpf-recorder "msg"="Loading bpf module"
I0129 19:09:14.867162 27546 bpfrecorder.go:440] bpf-recorder "msg"="Using system btf file"
I0129 19:09:14.867382 27546 bpfrecorder.go:391] bpf-recorder "msg"="Loading bpf object from module"
libbpf: map 'events': failed to create: Invalid argument(-22)
libbpf: failed to load object 'recorder.bpf.o'
E0129 19:09:14.871501 27546 logr.go:270] setup "msg"="running security-profiles-operator" "error"="load self-test: load bpf object: failed to load BPF object"

  • Cloud provider or hardware configuration: Azure AKS version 1.21.7
  • OS: Linux
  • Kernel (e.g. uname -a): 5.4.0-1067-azure
  • Others: containerd://1.4.9+azure

kubectl get nodes -o wide

NAME                              STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
aks-primary-29748022-vmss000000   Ready    agent   27h   v1.21.7   10.240.0.4    <none>        Ubuntu 18.04.6 LTS   5.4.0-1067-azure   containerd://1.4.9+azure
aks-primary-29748022-vmss000001   Ready    agent   27h   v1.21.7   10.240.0.5    <none>        Ubuntu 18.04.6 LTS   5.4.0-1067-azure   containerd://1.4.9+azure
aks-primary-29748022-vmss000002   Ready    agent   27h   v1.21.7   10.240.0.6    <none>        Ubuntu 18.04.6 LTS   5.4.0-1067-azure   containerd://1.4.9+azure
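A quick way to decode the `(-22)` in the libbpf error above: it is a negated errno value. This sketch uses Python's standard `errno` module (the `errno(1)` utility from moreutils does the same job):

```shell
# "failed to create: Invalid argument(-22)" carries -EINVAL from the
# bpf() syscall; look up errno 22 with Python's stdlib:
python3 -c "import errno, os; print(errno.errorcode[22], '=', os.strerror(22))"
```

This prints `EINVAL = Invalid argument`, i.e. the kernel rejected one of the map-creation parameters rather than hitting a permission or memory-limit problem.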
@tshaiman tshaiman added the kind/bug Categorizes issue or PR as related to a bug. label Jan 29, 2022
@saschagrunert
Member

I can reproduce it and we probably should update libbpf and the vendored btf to see if that fixes the issue.

@saschagrunert saschagrunert self-assigned this Jan 31, 2022
@saschagrunert
Member

Did a test with #796 and it does not work, because:

  • we usually fall back to the in-memory BTF (which is now available within that patch) if no vmlinux is exposed. The vmlinux file is available on the Azure node, so in theory it should work with that file.
  • forcing it to use the in-memory BTF fails with the same error

That's odd; I'm not sure whether the kernel configuration of the Azure nodes is correct to support our BPF application.

@tshaiman
Author

tshaiman commented Jan 31, 2022 via email

@saschagrunert
Member

@tshaiman can you share the configuration flags with which the kernel was built? Ubuntu 18.04 does not expose /sys/kernel/btf/vmlinux by default.

@saschagrunert
Member

saschagrunert commented Jan 31, 2022

When trying the libbpf-bootstrap demo application: https://github.com/libbpf/libbpf-bootstrap/blob/master/examples/c/bootstrap.c

I get the same error on an Azure node (I made the failure on the RLIMIT_MEMLOCK increase non-fatal):

root@aks-agentpool-41851968-vmss000001:~/libbpf-bootstrap/examples/c# ./bootstrap
Failed to increase RLIMIT_MEMLOCK limit!
libbpf: map 'rb': failed to create: Invalid argument(-22)
libbpf: failed to load object 'bootstrap_bpf'
libbpf: failed to load BPF skeleton 'bootstrap_bpf': -22
Failed to load and verify BPF skeleton
# uname -a
Linux aks-agentpool-41851968-vmss000001 5.4.0-1067-azure #70~18.04.1-Ubuntu SMP Thu Jan 13 19:46:01 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

@tshaiman
Author

tshaiman commented Feb 1, 2022

@saschagrunert : I don't have insights on how the kernel was built as I'm not part of the AKS team.

@saschagrunert
Member

@saschagrunert : I don't have insights on how the kernel was built as I'm not part of the AKS team.

Maybe we can open an issue in their tracker to describe the problem there?

@tshaiman
Author

tshaiman commented Feb 4, 2022

@saschagrunert : done
Azure/AKS#2768

@saschagrunert
Member

Still the same with the latest Azure deployment.

@tshaiman
Author

tshaiman commented Feb 17, 2022

Correct, I still see the bug: Azure/AKS#2768 is still open. I sent a ping/reminder on the ticket.

@tshaiman
Author

tshaiman commented Mar 5, 2022

Still pending on AKS, I have reminded them many times.
PS: it could be related to
Azure/AKS#2827

@holyspectral

Can reproduce this issue on GKE COS nodes too. Error logs:

Found 6 pods, using pod/spod-sl7n4
I0315 20:32:49.511742  167526 logr.go:252]  "msg"="Set logging verbosity to 0"  
I0315 20:32:49.511798  167526 logr.go:252]  "msg"="Profiling support enabled: false"  
I0315 20:32:49.511881  167526 logr.go:252] setup "msg"="starting component: bpf-recorder"  "buildDate"="1980-01-01T00:00:00Z" "compiler"="gc" "gitCommit"="67f1c871de542881ea397058874fc020c604198e" "gitTreeState"="dirty" "goVersion"="go1.17.6" "libbpf"="0.6.1" "libseccomp"="2.5.3" "platform"="linux/amd64" "version"="0.4.2-dev"
I0315 20:32:49.511934  167526 bpfrecorder.go:105] bpf-recorder "msg"="Setting up caches with expiry of 1h0m0s"  
I0315 20:32:49.511957  167526 bpfrecorder.go:122] bpf-recorder "msg"="Starting log-enricher on node: gke-sam-cluster-2-pool-2-0f1e4876-rr52"  
I0315 20:32:49.512901  167526 bpfrecorder.go:153] bpf-recorder "msg"="Connecting to metrics server"  
I0315 20:32:49.513778  167526 bpfrecorder.go:169] bpf-recorder "msg"="Got system mount namespace: 4026531840"  
I0315 20:32:49.513798  167526 bpfrecorder.go:171] bpf-recorder "msg"="Doing BPF load/unload self-test"  
I0315 20:32:49.513815  167526 bpfrecorder.go:370] bpf-recorder "msg"="Loading bpf module"  
I0315 20:32:49.513839  167526 bpfrecorder.go:439] bpf-recorder "msg"="Using system btf file"  
I0315 20:32:49.514097  167526 bpfrecorder.go:390] bpf-recorder "msg"="Loading bpf object from module"  
libbpf: map 'events': failed to create: Invalid argument(-22)
libbpf: failed to load object 'recorder.bpf.o'
E0315 20:32:49.520079  167526 logr.go:270] setup "msg"="running security-profiles-operator" "error"="load self-test: load bpf object: failed to load BPF object"  

@holyspectral

Not related to GKE, but maybe BTF Hub can help with the AKS case?

@saschagrunert
Member

Not related to GKE, but maybe BTF Hub can help with the AKS case?

AKS already exposes /sys/kernel/btf/vmlinux which should contain the correct BTF information. I think I tried manually using the internally provided BTF, but this had the same effect.

@brness
Contributor

brness commented Apr 12, 2022

I have the same issue after deploying SPO in my local cluster; is this considered a kernel problem?
My OS is CentOS with kernel:
Linux k8s-master-node-1 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

@alexeldeib

alexeldeib commented Apr 18, 2022

AKS doesn't do anything special for our kernels. They are based on Azure marketplace Ubuntu 18.04 images. Have you tried to reproduce this on a vanilla Azure (non-AKS) VM?

FWIW, I tried this once like a year or so ago and hit similar issues I wasn't able to resolve. I ended up not using BTF sadly. My suspicion is it could be something to do with 18.04 or how they backported kernel fixes and a version like 20.04 originally based on 5.x+ could work out of the box (i.e., something is wonky between 4.15 + 18.04 and 5.4 + 18.04 because 4.15 didn't support BTF, but later kernels did). But TBH, I am not an expert in BTF, and I don't think we are doing anything special here, so I'm not sure where to investigate.

@alexeldeib

@tshaiman can you share the configuration flags with which the kernel was built? Ubuntu 18.04 does not expose /sys/kernel/btf/vmlinux by default.

@saschagrunert in case it's helpful, attached the kconfig from a running AKS node. Notably I do see CONFIG_DEBUG_INFO_BTF=y which is interesting (don't think that used to be the case in original 18.04, possibly came with kernel bump in 18.04.5 or whatever latest patch is?).

here's a snippet of the config grepping for bpf/btf flags to save you some time (admittedly 1067 vs 1074 but they are basically the same)

# cat /boot/config-5.4.0-1074-azure | grep "B[T|P]F"
CONFIG_CGROUP_BPF=y
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_UNPRIV_DEFAULT_OFF=y
CONFIG_IPV6_SEG6_BPF=y
CONFIG_NETFILTER_XT_MATCH_BPF=m
CONFIG_BPFILTER=y
CONFIG_BPFILTER_UMH=m
CONFIG_NET_CLS_BPF=m
CONFIG_NET_ACT_BPF=m
CONFIG_BPF_JIT=y
CONFIG_BPF_STREAM_PARSER=y
CONFIG_LWTUNNEL_BPF=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_DEBUG_INFO_BTF=y
CONFIG_BPF_EVENTS=y
CONFIG_BPF_KPROBE_OVERRIDE=y

config-5.4.0-1067-azure.txt
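Since `CONFIG_DEBUG_INFO_BTF=y` in the kconfig above presumably explains why these nodes expose /sys/kernel/btf/vmlinux, a quick sanity check on any node is whether that file is readable; note that BTF being present says nothing about which BPF map types the kernel supports. A minimal sketch:

```shell
# Check whether the running kernel exposes BTF type information
# (the file libbpf reads when a "system btf file" is used):
if [ -r /sys/kernel/btf/vmlinux ]; then
    echo "BTF exposed"
else
    echo "no BTF exposed"
fi
```

On the AKS 5.4.0-azure kernels in this thread the file is present, which is why the failure here cannot be a missing-BTF problem.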

@tshaiman
Author

@alexeldeib: thanks a lot for assisting in getting those configs. My 2 cents here: attached is the latest kernel config, 5.4.0-1074, from my running AKS node.
config-5.4.0-1074-azure.txt

@alexeldeib

alexeldeib commented Apr 18, 2022

@tshaiman
Author

That indeed seems to be the root cause, well done @alexeldeib!
@saschagrunert: do you think an alternative to ring buffer maps can be used on kernels < 5.8? It might assist other developers who hit the same compatibility issues.

@saschagrunert
Member

@tshaiman maybe, I'll see what we can do here.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 18, 2022
@B3ns44d

B3ns44d commented Jul 20, 2022

@saschagrunert I'm facing the same issue on a k3s cluster

spod-f7bzv 2/3 CrashLoopBackOff 38 (4m42s ago) 134m

$ k logs spod-f7bzv -c bpf-recorder

I0720 16:07:53.028594   59631 bpfrecorder.go:105] bpf-recorder "msg"="Setting up caches with expiry of 1h0m0s"
I0720 16:07:53.028604   59631 bpfrecorder.go:122] bpf-recorder "msg"="Starting log-enricher on node: setcisedtp0013.hosting.cegedim.cloud"
I0720 16:07:53.029242   59631 bpfrecorder.go:153] bpf-recorder "msg"="Connecting to metrics server"
I0720 16:07:53.030180   59631 bpfrecorder.go:173] bpf-recorder "msg"="Got system mount namespace: 4026531840"
I0720 16:07:53.030190   59631 bpfrecorder.go:175] bpf-recorder "msg"="Doing BPF load/unload self-test"
I0720 16:07:53.030195   59631 bpfrecorder.go:374] bpf-recorder "msg"="Loading bpf module"
I0720 16:07:53.030211   59631 bpfrecorder.go:443] bpf-recorder "msg"="Using system btf file"
I0720 16:07:53.030592   59631 bpfrecorder.go:394] bpf-recorder "msg"="Loading bpf object from module"
libbpf: map 'events': failed to create: Invalid argument(-22)
libbpf: failed to load object 'recorder.bpf.o'
E0720 16:07:53.034027   59631 logr.go:279] setup "msg"="running security-profiles-operator" "error"="load self-test: load bpf object: failed to load BPF object"

Environment:

NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

uname -a:

Linux -- 5.4.0-110-generic #124-Ubuntu SMP Thu Apr 14 19:46:19 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

@saschagrunert
Member

@B3ns44d I think we require kernel 5.8 for that to work :-/
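The 5.8 floor comes from BPF_MAP_TYPE_RINGBUF (the map type behind the recorder's `events` map, per the discussion above) landing in Linux 5.8. A small shell sketch for checking a node's kernel against that floor; the two hard-coded version strings are the failing and working kernels from this thread:

```shell
# Return success iff the given kernel release supports BPF ring buffer
# maps (BPF_MAP_TYPE_RINGBUF was introduced in Linux 5.8).
ringbuf_ok() {
    # $1 is a release string such as "5.4.0-1067-azure"; strip the
    # distro suffix and version-compare against 5.8 via sort -V.
    base="${1%%-*}"
    [ "$(printf '%s\n%s\n' 5.8 "$base" | sort -V | head -n1)" = "5.8" ]
}

ringbuf_ok "5.4.0-1067-azure"  && echo supported || echo unsupported
ringbuf_ok "5.13.0-41-generic" && echo supported || echo unsupported
ringbuf_ok "$(uname -r)"       && echo "this node: supported" || echo "this node: unsupported"
```

The first call prints `unsupported` and the second `supported`, matching what both reporters observed.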

@B3ns44d

B3ns44d commented Jul 21, 2022

@saschagrunert ohhh, I didn't know that; it now functions properly after upgrading to 5.13.0-41.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 20, 2022
@JAORMX
Contributor

JAORMX commented Aug 20, 2022

This seems to have been answered with the kernel version comment. Closing.

@JAORMX JAORMX closed this as completed Aug 20, 2022
9 participants