High rate of dropped events #558
Comments
Do you have any data on how many containers you run per node on average, and also on how quickly containers churn? We have a fix going in for how we pull container metadata, which we believe is causing the dropped events.
It's around 16 to 20 containers (including kube-system ones) on each node, and most of them seem to be long lived (most have the same age as the node). There's one pod that has some restarts. I'm attaching a file with more detailed data from kubectl:
Just an update on this: after fixing the k8s metadata issue (#562), I was able to reduce the drop rate to almost zero on the heavily CPU-loaded nodes (0.0001% drop rate). Was Falco's inability to reach https://kubernetes.default the root cause of all these drops?
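For reference, this is a minimal sketch of the kind of reachability check I mean, run with `kubectl exec` from inside the Falco pod; the host name, port, and timeout are illustrative assumptions, not Falco settings:

```python
# Minimal sketch: check whether the in-cluster Kubernetes API is resolvable and
# reachable from inside the Falco pod. Host, port, and timeout are assumptions.
import socket
import ssl
import http.client

API_HOST = "kubernetes.default"   # in-cluster service name (assumed)
API_PORT = 443
TIMEOUT = 5                       # seconds; arbitrary value for illustration

try:
    # DNS resolution: if this hangs or fails, metadata fetches against the API
    # server would block in the same way the thread describes.
    addrs = socket.getaddrinfo(API_HOST, API_PORT, proto=socket.IPPROTO_TCP)
    print("DNS OK:", sorted({a[4][0] for a in addrs}))

    # TLS connect plus a plain HTTP request; even a 401/403 response proves the
    # API server is reachable. Cert validation is skipped for this test only.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    conn = http.client.HTTPSConnection(API_HOST, API_PORT, timeout=TIMEOUT, context=ctx)
    conn.request("GET", "/version")
    resp = conn.getresponse()
    print("HTTP", resp.status, resp.reason)
except Exception as exc:
    print("kubernetes.default is NOT reachable:", exc)
```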
Have you seen any further improvement with the newer Falco builds? We've further fixed pulling container info, as well as alerting on drops. And yes, I am sure the DNS lookup was blocking somewhere, which would have caused this.
I haven't noticed much difference. The DNS fix was what really pulled down the number of drops.
I'm running Falco (latest from 3 weeks ago) as a daemonset in one of our staging environments in GKE (1.11.6-gke.6). It's a pool of around 20 nodes using Ubuntu. I'm using the BPF probe.
I'm experiencing a very high drop rate of events, around 72%, on nodes with a CPU load of around 8. On nodes where the CPU load is lower (below 1), the drop rate is almost nonexistent or even zero.
I also tested in a less busy GKE + COS environment, with BPF enabled, and still got around a 12% drop rate on a node with a CPU load of 1.2.
I'm attaching two PDFs with the detailed data that I measured, showing CPU load, memory, disk, pod restarts, drop rate, etc.
I also tested using the kernel module probe on the Ubuntu node pool, and the drop rate was actually higher (around 74% on the busier nodes), so I'm not sure it's entirely related to BPF.
Falco Dropped syscalls - GKE + Ubuntu + BPF.pdf
Falco Dropped syscalls - GKE + COS + BPF.pdf
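The drop-rate percentages above are simply the ratio of dropped events to total events per sampling interval; a minimal sketch of that calculation (the sample counts are made-up placeholders, not values from the attached PDFs):

```python
# Minimal sketch of the drop-rate calculation behind the percentages above.
# Each sample is (dropped_events, total_events) for one scrape interval; the
# numbers are illustrative placeholders, not values from the attached PDFs.
samples = [
    (7_200_000, 10_000_000),  # busy Ubuntu node, CPU load ~8  -> ~72%
    (120_000, 1_000_000),     # COS node, CPU load ~1.2        -> ~12%
    (1, 1_000_000),           # quiet node                     -> ~0.0001%
]

def drop_rate_pct(dropped: int, total: int) -> float:
    """Percentage of captured events that were dropped in the interval."""
    return 100.0 * dropped / total if total else 0.0

for dropped, total in samples:
    print(f"{drop_rate_pct(dropped, total):.4f}% dropped ({dropped}/{total})")
```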