Lots of "syscall event drop" since 0.32.1 #2129
That's interesting, @wdoekes, thank you for pointing this out! Just 2 questions:
There is definitely something different with 0.32.1.

Cluster # 1 (io):
^- 0.32.0: just one notice

Cluster # 2 (mpr):
^- 0.32.0: silence

Cluster # 3 (mac):
^- 0.32.1: we have 2018 lines of "[Ss]yscall event drop"

All of these were started at around 15:00 yesterday and I'm looking at them 18 hours later. Yesterday, when they were all running 0.32.1, all of them were showing many, many of those log lines. Today, only the one still running 0.32.1 exhibits excessive amounts of them. (I'll follow up with answers to your questions.)
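For reference, a minimal sketch of how per-cluster counts like these can be reproduced; the namespace and label selector are assumptions about a typical Helm-based install:

```bash
# Count "syscall event drop" lines per Falco pod (namespace and label are assumptions).
for pod in $(kubectl -n falco get pods -l app=falco -o name); do
  printf '%s: ' "$pod"
  kubectl -n falco logs "$pod" | grep -cE '[Ss]yscall event drop'
done
```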
Agree with @Andreagit97's question here; are you able to try with kmod?
(Also seen in the logs above.) All of them are Ubuntu/Focal generic kernels, e.g.:
Running them now.

Cluster # 1 (io, running Falco 0.32.1, dkms):
There on 0.32.1 with DKMS, we already have plenty after about 15 minutes.

Cluster # 3 (mac, running Falco 0.32.0, dkms):
Just 2 log lines with Falco 0.32.0 with DKMS, after about 20 minutes.

I think we can rule out the choice of module type.
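As an aside, a rough sketch of how the probe type is usually switched when Falco is deployed with the Helm chart; the ebpf.enabled key is an assumption and depends on the chart version in use:

```bash
# Toggle between the eBPF probe and the kernel module (dkms) build.
# The value name is assumed; newer chart versions use different keys.
helm upgrade falco falcosecurity/falco -n falco --set ebpf.enabled=true   # eBPF probe
helm upgrade falco falcosecurity/falco -n falco --set ebpf.enabled=false  # kernel module (dkms)
```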
Thank you @wdoekes for reporting this! These data are super useful! I suspect that for some reason the Falco 0.32.1 userspace is slower than Falco 0.32.0. Just one last thing 🙏 it would be great if we could understand which type of drops these are (e.g. whether they are drops due to full buffers or due to wrong instrumentation). To get this information you can simply search in your logs for something like
That way we could understand what is going on 🤔
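Presumably something along these lines is meant; the exact strings are an assumption, based on the terms mentioned in the next reply:

```bash
# Look for the detailed drop statistics (full buffers vs. wrong instrumentation)
# in the Falco pod logs; the namespace is an assumption.
kubectl -n falco logs ds/falco | grep -E 'Critical|n_drops'
```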
Could you also try the difference between the 2 versions (
I'm sorry, I don't have any lines that contain "Critical" or "n_drops". Helm chart log levels are set to debug:
I've dropped everything after
(i.e. the stuff below)
Now k8s.pod.name is null, for obvious reasons. But I have not seen an event drop in the 4 minutes since I started. Let me run this for a while (with 0.32.1 and eBPF), and I'll get back to you...
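As a side note, one quick way to double-check which extra arguments the DaemonSet currently passes to the falco binary (namespace and object name are assumptions about the install):

```bash
# Print the container args of the Falco DaemonSet.
kubectl -n falco get ds falco -o jsonpath='{.spec.template.spec.containers[0].args}'
```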
Ok. Ran without:
for 20 minutes, and had 0 "event drop" lines in the logs. Restarted with the above arguments, and instantly got 45 log entries:
(etc.) Looks like the
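For completeness, a sketch of how the two runs can be compared numerically; the 20-minute window matches the test above and the namespace is an assumption:

```bash
# Count drop notifications emitted during the last 20 minutes.
kubectl -n falco logs ds/falco --since=20m | grep -c 'event drop'
```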
Yeah, it seems we have found our culprit! The funny thing is that this is due to a bug fix that you can see here. In a nutshell, in Falco
Ah. Good to know. Thanks for the feedback 👍
Since we seem to have solved the mystery, I'd suggest linking this to #1403 (I already asked to add this issue to that list) for better traceability, and closing this one. What does everyone think?
Agree 👍
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Provide feedback via https://github.com/falcosecurity/community. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. Provide feedback via https://github.com/falcosecurity/community. /lifecycle rotten
👍
Describe the bug
Since Falco 0.32.1 I'm getting many of these:
Falco internal: syscall event drop. 1538124 system calls dropped in last second.
Syscall event drop but token bucket depleted, skipping actions
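For context, the second line comes from Falco's drop-notification rate limiter, which is configured under syscall_event_drops in falco.yaml. A minimal sketch for checking the active settings, assuming a standard install in a falco namespace:

```bash
# Show the rate limiter settings (actions, rate, max_burst) for drop notifications.
kubectl -n falco exec ds/falco -- grep -A6 'syscall_event_drops' /etc/falco/falco.yaml
```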
This was not the case with Falco 0.32.0 as far as I can see.
So one of two things happened:
Falco 0.32.0 behaviour
Falco 0.32.1 behaviour
How to reproduce it
Not sure yet.
Next steps
I've reverted the first of my test environments to 0.32.0. If the errors stay gone, I must conclude that 0.32.1 is to blame and not my changed rules.
I'll be back to confirm.
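For what it's worth, a rough sketch of how such a revert is typically done with the Helm chart; the release name and the image.tag value are assumptions:

```bash
# Pin the Falco image back to 0.32.0 in the test environment.
helm upgrade falco falcosecurity/falco -n falco --set image.tag=0.32.0
```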
Environment
Additional context
These errors popped up on both master and worker k8s nodes AND on different clusters at once. So I would rule out that this is due to some external performance spike. The only things they have in common are the rules and the Falco version.
Cheers,
Walter Doekes
OSSO B.V.