[UMBRELLA] Dropped events #1403
Greetings! In our environment, Falco was reporting a lot of non-actionable weird events. For example:
"rule": "Non sudo setuid",
"output_fields": {
"proc.pname": null,
"proc.cmdline": "<NA>",
"user.uid": 4294967295,
"k8s.pod.name": null,
"evt.time": 1600013430060232000,
"evt.arg.uid": "root",
"container.id": "host",
"k8s.ns.name": null,
"user.name": null,
"container.image.repository": null
}
This event was not telling us anything useful. We started an investigation with @leogr, and we rapidly made the connection with dropped system calls. To really see the problem, we adjusted the Falco configuration for syscall event drops and started seeing alerts like this one:
"rule": "Falco internal: syscall event drop",
"output_fields": {
"n_evts": "1113206",
"n_drops_bug": "0",
"n_drops_pf": "0",
"ebpf_enabled": "0",
"n_drops": "621240",
"n_drops_buffer": "621240"
}
Our average of dropped system calls was over 1,000,000 every minute.
Our first experiment was to get rid of all Falco rules that are in the rules file but disabled. It did not fix our problem.
Our second experiment was to get rid of the Kubernetes integration (by removing the -K, -k, and -pk options). Falco's performance improved, but only temporarily. After a few hours, Falco was dropping thousands of system calls again.
Then we looked at our outputs. Our program output was a simple bash script calling an external tool, so we went back to the drawing board and replaced it.
A permanent fix for these issues would be to make all Falco outputs non-blocking. Falco should not block itself while trying to report security events to us 😄
As for the drops we still have, they might be caused by increases in the number of system calls to process due to activities in our environment. We might need to dig deeper, gather statistics on the number of system calls processed by Falco and on the peaks, and pinpoint what is generating those peaks. At least the weird non-actionable events are gone! 🎉
That's all folks! |
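As an aside, for reference: the kind of program output described above is configured in falco.yaml roughly as follows. This is only a sketch; the script path is a hypothetical placeholder, not the one we used. With keep_alive set to false, Falco spawns the program for every single alert, so a slow script can stall the event loop.

```yaml
# falco.yaml -- program output (sketch; script path is a made-up placeholder)
program_output:
  enabled: true
  keep_alive: false                      # false: the program is spawned once per alert
  program: "/opt/scripts/forward-alerts.sh"
```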
Hey @JPLachance, thank you so much for reporting this! 🥳 |
Really nice report. For people who might be interested, |
In EKS 1.18, I figured out from the system log what I thought to be associated with the 3-4 drops every 20 seconds: a log generated by aws-vpc-cni-plugin. However, I can't find any misconfiguration in aws-vpc-cni-plugin or EKS. So I decided to turn off notifications for syscall drops. |
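Turning those notifications off comes down to the syscall_event_drops section of falco.yaml; a minimal sketch (key names from falco.yaml, exact defaults depend on the Falco version):

```yaml
# falco.yaml -- stop alerting/logging on dropped syscall events
syscall_event_drops:
  actions:
    - ignore    # take no action when drops are detected
```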
This is not necessary and also recommended in falcosecurity/falco#1403.
This no longer seems to hold true when running Falco with the |
that's a good point, thanks! |
Why this issue?
When Falco is running, a producer (a.k.a. the driver) continuously forwards events to a consumer (the Falco userspace program), with a buffer that sits in the middle. When, for any reason, the consumer is not able to consume the incoming events, an event drop occurs.
Starting from v0.15.0, Falco introduced a mechanism to detect dropped events and take actions, as explained in the official documentation. However, event drops are still an issue, as reported by many users.
Since the problem depends on many factors and can be hard to analyze and understand, this issue aims to give users an overview of it and to collect the knowledge acquired until now.
Please note that this document does not try to directly solve the problem; also consider that some assumptions might be wrong (feel free to refute them!).
N.B.
At the time of writing, the outcome of this issue was not clear, so I just chose to label it as "documentation".
Event drop alerts
Currently, when dropped events are detected, Falco will print out some statistics that can give users some information about the kind of drops that happened.
An example of an alert regarding dropped events:
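(Illustrative reconstruction in JSON output format, reusing the drop counters reported earlier in this thread; the exact set of n_drops_* fields depends on the Falco version.)

```json
{
  "rule": "Falco internal: syscall event drop",
  "output_fields": {
    "ebpf_enabled": "0",
    "n_evts": "1113206",
    "n_drops": "621240",
    "n_drops_buffer": "621240",
    "n_drops_bug": "0",
    "n_drops_pf": "0"
  }
}
```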
Note that the statistics reported above are relative to the timeframe specified (the last second) and not cumulative. Furthermore, note that an event represents just a single syscall received by the driver, regardless of whether a rule is triggered or not.
So, what do those values mean?
- `ebpf_enabled` indicates whether the driver is the eBPF probe (=1) or the kernel module (=0)
- `n_drops` is the sum of the other `n_drops_*` fields (see the section below) and represents the total number of dropped events
- `n_evts` is the number of events that the driver should send according to its configuration. It also includes `n_drops`, since `n_drops` is the number of events that the driver should have sent to userspace but was not able to send for various reasons.

Also note that in extreme cases, drop alerts may be rate-limited, so consider incrementing those values in the configuration file, for example:
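A sketch of the relevant falco.yaml keys (the default values differ between Falco versions, so check the comments in your own config file):

```yaml
# falco.yaml -- relax the rate limiter applied to drop alerts
syscall_event_drops:
  actions:
    - log
    - alert
  rate: 1         # token-bucket refill rate (drop alerts allowed per second, on average)
  max_burst: 10   # maximum number of drop alerts emitted in a burst
```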
Kinds of drops
As you can notice, not all drops are the same. Below is an explanation of each kind, ordered from the least frequent to the most frequent.
- `n_drops_bug` is the number of dropped events caused by an invalid condition in the kernel instrumentation; basically, something went wrong. AFAIK, only the eBPF probe can generate this kind of drop, and luckily there are no reports of this problem.
- `n_drops_pf` (where `pf` stands for page fault) is the number of dropped events caused by invalid memory access; it happens whenever the memory page (referenced by the syscall) disappeared before the driver was able to collect information about the event. We noticed that it happens rarely, sometimes on GKE, and that it is related to a process that is continuously crashing (see "n_drops_pf=1 about every hour on GKE" #1309).
- `n_drops_buffer` is the number of dropped events caused by a full buffer (the buffer that sits between the producer and the consumer). It is the most frequent kind, and it is related to performance. There are also different categories of buffer drops that indicate which syscall triggered them (e.g. `n_drops_buffer_clone_fork_exit`, `n_drops_buffer_connect_enter`, ...).

Performance-related drops (n_drops_buffer)
We experience this kind of event dropping when the consumer is blocked for a while (note that the task that consumes events is single-threaded). That is strictly related to performance and can happen for several reasons. We also added a benchmark command in the event-generator to experiment with this problem (see falcosecurity/event-generator#36 for more details).
Possible causes:
Limited CPU resource
The consumer hits the maximum CPU resources allocated for it and gets blocked for a while. For example, the official Helm chart comes with a 200m CPU hard limit that may cause this problem.
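With the official Helm chart, that limit can be raised through the standard resources values; a sketch (the exact defaults and value layout depend on the chart version):

```yaml
# values.yaml (Falco Helm chart) -- give the Falco pod more CPU headroom
resources:
  requests:
    cpu: 200m
    memory: 512Mi
  limits:
    cpu: "1"       # instead of the 200m hard limit mentioned above
    memory: 1Gi
```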
Large/complex ruleset (high CPU usage)
The larger and more complex the ruleset, the more CPU will be needed. At some point, either with or without resource limitation, high CPU usage can produce event dropping.
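One way to reduce the ruleset cost is to disable rules you do not need by overriding them in a local rules file; a sketch using the "Non sudo setuid" rule that appeared earlier in this thread:

```yaml
# falco_rules.local.yaml -- disable a rule shipped with the default ruleset
- rule: Non sudo setuid
  enabled: false
```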
Fetching metadata from external components (I/O blocking)
In some cases, fetching metadata (e.g., container information, k8s metadata) from an external component can be a blocking operation.
For example, the help text of the `--disable-cri-async` flag is quite explanatory about that. Another option that might cause this problem is the Kubernetes metadata fetching.
Slow responses from the Kubernetes API server could cause this problem too.
Blocking output (I/O blocking)
Falco's output mechanism can also have an impact and might block event processing for a while, producing drops.
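Two falco.yaml knobs are related to this; a sketch, not a recommendation, and their availability and defaults depend on the Falco version:

```yaml
# falco.yaml -- output-related settings
buffered_outputs: false   # whether output channels buffer their writes
output_timeout: 2000      # milliseconds an output channel may take before being considered blocked
```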
The buffer size
If you are not able to solve your drop issues, you can always increase the syscall buffer size (the shared buffer between userspace and kernel that contains all collected data). You can find more info on how to change its size in the Falco configuration file.
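In recent Falco versions the size is selected through the syscall_buf_size_preset key; a sketch (the preset-to-size mapping is documented in the comments of falco.yaml itself):

```yaml
# falco.yaml -- enlarge the shared syscall buffer
syscall_buf_size_preset: 6   # default is 4; higher presets allocate larger buffers
```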
Debugging
When debugging, the first thing to consider is that multiple causes may occur simultaneously. It is worth excluding every single cause, one by one.
Once the `n_drops_bug` and `n_drops_pf` cases are excluded, a handy checklist for the performance-related drops (i.e. `n_drops_buffer`) is:
- remove the `-A` option, if any
- remove the `--disable-cri-async` option, if any
- remove the `-U` option, if any
- use `-k https://$(KUBERNETES_SERVICE_HOST)` (instead of `-k https://kubernetes.default`, see this comment)
- set `webserver.enabled` to false in the config file and remove any other related configuration
- disable the K8s support (by removing the `-K`, `-k`, and `-pk` options)
- keep only `stdout_output` enabled (event drop alerts still show up)

Finally, some useful links that could help with debugging:
Interesting issues about drops
Falco on GKE - dropped syscall events #669,
Falco on GKE - dropped syscall events #669 (comment),
Falco on GKE - dropped syscall events #669 (comment),
Frequent and noisy syscall event drops when running falco 0.17.1 helm chart #961,
Frequent syscall event drops #1382,
Frequent syscall event drops when falco runs as a binary #615,
Syscall events dropped exceed number of syscall events #1231,
High rate of dropped events #558
Issues related to n_drops_pf
Many dropped system calls events due to page faults #917,
It appears that we are dropping syscall information? #770,
Falco on GKE - dropped syscall events #669 (comment)
Some threads on our Slack channel
https://kubernetes.slack.com/archives/CMWH3EH32/p1592904527372800, https://kubernetes.slack.com/archives/CMWH3EH32/p1599741275086000
Drop related to the K8s support (-K, -k, and -pk options)
Lots of "syscall event drop" since 0.32.1 #2129