Falco on GKE - dropped syscall events #669
Comments
Thanks for opening @caquino - I need some time to look at this because I need access to a GCP account to run tests.
FWIW, I'm getting the same on a 3-node (1 CPU / 2 GB RAM each) cluster at DigitalOcean Kubernetes. There are a bunch of additional things already deployed:
Do these errors usually mean the cluster's capacity is reached?
In my case the cluster has 3 n1-standard-2 (2 vCPUs, 7.5 GB memory) nodes; other than what comes pre-configured from Google, it is running only a Deployment with nginx and PHP. This is not a production cluster and has no traffic yet, so I would not expect this to be a capacity issue.
I am also having the same issue ("syscall event drop" logs several times an hour) in my almost completely unused GKE cluster with 3 nodes. Re: the fluentd log tampering alerts, I created #684 to address it, and I describe the fix that worked to stop those from flowing in.
Continuously seeing this after installing k8s-with-rbac/falco-daemonset-configmap.yaml in a 3-node GKE environment: "Falco internal: syscall event drop. 7 system calls dropped in last second."
Occurs in:
Not yet checked:
Just to be sure: what configuration are you using? The default one?
@leodido I've just installed Falco on GKE using: helm install --name falco stable/falco --set ebpf.enabled=true. I'm having exactly the same errors.
I changed syscall_event_drops: same result. Maybe I should have decreased it instead? I'm using falco-daemonset-configmap.yaml with the following changes:
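For reference, the drop-handling knobs live under the syscall_event_drops key in falco.yaml. A minimal sketch of that block (the values shown are illustrative, not necessarily the defaults shipped with your version):

```yaml
syscall_event_drops:
  actions:
    - log        # write a message to the Falco log when drops are detected
    - alert      # also emit a Falco alert about the drops
  rate: 0.03333  # how many drop actions per second the rate limiter allows
  max_burst: 10  # burst allowance of the same token bucket
```

Note that these settings only control how Falco reacts to detected drops; they do not change how many events get dropped.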
Same here, I'm using the default configuration.
Has anyone found a workaround for this issue?
I see the same behaviour with Falco 0.17 on a five-node GKE cluster. CPU load on the nodes is between 18% and 35%; Falco is using 1-5%.
Seeing this in Azure too, with Falco using abysmal amounts of CPU while running the default settings.
I'm also seeing 1
I have the same issue. As a workaround I have just set the action to "ignore" until this feature works with GKE.
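For anyone applying the same workaround, the change would look roughly like this in falco.yaml (it only silences the drop notifications; the underlying drops still happen):

```yaml
syscall_event_drops:
  actions:
    - ignore   # take no action when dropped events are detected
```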
I'm seeing this when installed as a binary on the host OS in AWS EC2 running CentOS 7 w/ kernel 5.3 mainline.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Could we keep this open until it has been resolved?
@shane-lawrence I agree we want to keep this open; I was coming here to un-stale this but you already did :)
I abandoned Falco due to this issue.
Guys, any progress/updates? Thanks!
For everyone looking at this: once inputs are consumed by the Falco driver, they are sent immediately to the engine, which processes them. For that reason, we had to implement a mechanism called a "Token Bucket", which is basically a rate limiter for the engine and is responsible for the drops. This kind of system is limited: on machines that have a lot of activity in kernel space (and therefore fill the ring buffer quickly), Falco drops events. After discussions in many office hours and repo planning calls, we decided to redesign the inputs to achieve two goals:
You can find a diagram with an explanation here. Long story short: it's likely that this issue will persist until we implement the new input API and then release 1.0. We don't have an official release date yet, but IIRC many community members wanted to set that for March 2020. We are grateful for everyone's help and are trying to do our best to make Falco more and more reliable for all kinds of workloads. To stay updated and have a voice in the process, please join our weekly calls.
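For readers unfamiliar with the mechanism mentioned above: a token bucket is a simple rate limiter. The sketch below is only an illustration of the idea, not Falco's actual implementation:

```cpp
#include <algorithm>
#include <chrono>

// Minimal token-bucket rate limiter (illustrative sketch, not Falco's code).
// Tokens refill at "rate" per second up to "max_burst"; each event claims one
// token, and when the bucket is empty the event is rate-limited.
class token_bucket {
public:
    token_bucket(double rate, double max_burst)
        : m_rate(rate), m_max_tokens(max_burst), m_tokens(max_burst),
          m_last(std::chrono::steady_clock::now()) {}

    bool claim() {
        auto now = std::chrono::steady_clock::now();
        double elapsed = std::chrono::duration<double>(now - m_last).count();
        m_last = now;
        // Refill proportionally to the elapsed time, capped at the burst size.
        m_tokens = std::min(m_max_tokens, m_tokens + elapsed * m_rate);
        if (m_tokens < 1.0) {
            return false;  // no token available: throttle this event
        }
        m_tokens -= 1.0;
        return true;
    }

private:
    double m_rate;
    double m_max_tokens;
    double m_tokens;
    std::chrono::steady_clock::time_point m_last;
};
```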
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. Provide feedback via https://github.com/falcosecurity/community. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. Provide feedback via https://github.com/falcosecurity/community.
@poiana: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I've abandoned Falco (because it never worked for me on gcloud). No idea if this is still an issue.
I've just deployed Falco today using the latest Helm chart and am still running into this. Is the only option to set the syscall drop action to ignore until version 1.0.0 is released? With so many dropped calls, can I trust that Falco is even accurately detecting the events I am using it to monitor?
+1 to @bsord's question. Commenting to follow along on the discussion.
+1, still hitting this issue after 2 years.
I'm hitting this too, a lot. I see numbers like 10K, 150K and even 458K "syscall events dropped in the last second". Any one of these events may be a potential security risk, so to me this sounds like a serious issue with Falco. Will there be a fix for this? I see great potential for Falco in our security environment, but I cannot explain this to my customers, especially since I can't even give them a percentage of what has been investigated by Falco and what has not.
Hey folks, as you can understand, it's hard to give answers without being able to reproduce the problem. Moreover, syscall event drops may be caused by different factors. In my personal experience it usually happens when Falco doesn't have enough resources (CPU) to process the event stream; sometimes it can happen due to a misconfiguration. I created #1403, which includes a handy checklist for debugging purposes. It seemed to me that the most recent Falco versions fixed the majority of the issues; however, judging by your comments, something is likely still not working on GKE. I'm happy to help with that, but I also need your help investigating and understanding what's going on. For instance, for debugging these kinds of issues, we usually need:
Could someone create a full report and share it (privately would be fine too)?
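To give an idea of what such a report could contain, something along these lines is usually enough (commands are illustrative; adjust the namespace and labels to your install):

```sh
# Versions and cluster shape
falco --version
kubectl get nodes -o wide                         # node sizes and kernel versions

# Falco's own logs, including the drop messages and their frequency
kubectl logs -n falco -l app=falco --tail=500

# Resources actually consumed by Falco (requires metrics-server)
kubectl top pods -n falco

# The configuration that is really deployed
kubectl get configmap falco -n falco -o yaml
```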
/reopen
@leogr: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
UPDATE: Our Falco was doing synchronous calls to an outside alerting system. These calls took 100+ ms, which stalled the main event-processing thread of Falco. That was the main cause of the thousands of missed events; I had not realized that the Falco alerting requests were made on the same thread as the event processing. So I'll be adding falcosidekick and an external MQ now, which should fix most of these dropped events. Thanks for your hints!
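For anyone following the same route: offloading outputs to falcosidekick is usually done by pointing Falco's HTTP output at it. A rough sketch of the falco.yaml side (the service name and port assume a standard falcosidekick deployment in the same namespace):

```yaml
json_output: true
json_include_output_property: true
http_output:
  enabled: true
  url: "http://falcosidekick:2801/"   # falcosidekick's default listen port
```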
Hey @eelkonio, you're welcome! I'm happy to have been helpful :) I also want to give you more context: since Falco 0.27.0, alert processing has been offloaded to another thread (see #1451).
Hi leogr, I used Falco 0.29.1, so that other thread should have been present. However, there may be limits to how many messages per second this thread can handle before it starts stalling? As I said, I saw 458K and even 600K+ dropped events per second a few times. That may also have been caused by the CPU limits on the container that runs both/all these threads; I will remove those too to see if it makes any difference. Thanks again - love the product and hope to get it working fine soon!
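If the chart is managing those limits, they are the standard pod resource settings; an illustrative values fragment for relaxing them could look like this (the exact keys depend on the chart version in use):

```yaml
# values.yaml fragment for the Falco chart (illustrative)
resources:
  requests:
    cpu: 200m
    memory: 512Mi
  limits:
    memory: 1024Mi   # omitting the CPU limit avoids throttling the event-processing threads
```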
There's no hard limit; it's just a matter of how many resources Falco can use. Furthermore, the output mechanism is still sequential: for example, if just one call blocks indefinitely, all subsequent calls will be stalled (Falco will try to emit a warning if an output consumer blocks for more than 2 seconds, see here). In such a situation, Falco cannot operate and starts to drop events. For this reason, using a responsive consumer (like falcosidekick) is still beneficial.
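The two-second warning mentioned above is tied to a configurable deadline; recent falco.yaml versions expose it as output_timeout (in milliseconds). A minimal sketch, assuming the default value:

```yaml
# Deadline for a single output call; if a consumer blocks longer, Falco logs a warning.
output_timeout: 2000
```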
I am currently running 0.29.1 installed via Helm in GKE (1.20.9-gke.1001) and only logging to stdout; I'm seeing 1000's of
Just trying to guess a possible issue. Is the BPF JIT compiler enabled?
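For anyone wanting to verify this on their nodes, the JIT state is exposed via sysctl (this is a generic Linux check, not Falco-specific):

```sh
# 1 = JIT enabled, 0 = interpreter only
sysctl net.core.bpf_jit_enable

# Enable it on the running kernel (persist via a drop-in under /etc/sysctl.d/)
sudo sysctl -w net.core.bpf_jit_enable=1
```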
I have
If you are using Falco with the K8s support enabled, could you please try the latest release (Falco 0.30.0)? It comes with several fixes that reduce resource consumption when fetching metadata from the K8s API server, which may indirectly alleviate the event dropping problem. I'm not sure that is the case, but any testing and feedback are very useful for us.
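If the install came from the official falcosecurity charts repository, upgrading is usually along these lines (installs that still use the deprecated stable/falco chart may need a reinstall instead):

```sh
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
helm upgrade falco falcosecurity/falco --reuse-values
```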
hey @bsda
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. Provide feedback via https://github.com/falcosecurity/community.
@poiana: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What happened: Falco is constantly dropping events on a really small GKE cluster (3 nodes, 9 containers excluding Google services containers).
What you expected to happen: I would expect that, for a cluster of this size, Falco would have no issues handling its events.
How to reproduce it (as minimally and precisely as possible): Based on my tests, just deploying a simple cluster on GKE with the monitoring and metrics extensions enabled is enough to cause drops.
Anything else we need to know?:
Environment:
- Falco version (falco --version): 0.15.3
- OS (cat /etc/os-release): COS
- Kernel (uname -a): 4.14.119+

The logs are flooded with "syscall event drop"; the only other event showing up is the following:
Which I assume is generating enough events to cause the syscall drops. Is there anything I can do? Filter out these events or this process?
I've checked the other similar issue, and I'm using KUBERNETES_SERVICE_HOST as described there.
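For context, that variable is normally consumed through the DaemonSet's container args to point Falco at the Kubernetes API server. A rough sketch of how it is typically wired (the exact manifest in the repo may differ slightly):

```yaml
# Fragment of a Falco DaemonSet container spec (illustrative)
args:
  - /usr/bin/falco
  - -K                                                        # bearer token for API authentication
  - /var/run/secrets/kubernetes.io/serviceaccount/token
  - -k                                                        # API server URL built from the env var
  - https://$(KUBERNETES_SERVICE_HOST)
  - -pk                                                       # format output for Kubernetes
```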
Log snippet: