Falco on GKE - dropped syscall events #669

Closed
caquino opened this issue Jun 13, 2019 · 93 comments
Comments

@caquino

caquino commented Jun 13, 2019

What happened: Falco is constantly dropping events on a really small GKE cluster (3 nodes) running 9 containers (excluding Google services containers).

What you expected to happen: I would expect Falco to have no issues handling the events for a cluster of this size.

How to reproduce it (as minimally and precisely as possible): Based on my tests, just by deploying a simple cluster on GKE with monitoring and metrics extensions enabled, it is enough to cause drops.

Anything else we need to know?:

Environment:

  • Falco version (use falco --version): 0.15.3
  • System info
{
  "machine": "x86_64",
  "nodename": "gke-ai-cloud-us-central1-default-pool-b3942f02-7mjv",
  "release": "4.14.119+",
  "sysname": "Linux",
  "version": "#1 SMP Wed May 15 17:44:01 PDT 2019"
}
  • Cloud provider or hardware configuration: Google Cloud
  • OS (e.g: cat /etc/os-release): COS
  • Kernel (e.g. uname -a): 4.14.119+
  • Install tools (e.g. in kubernetes, rpm, deb, from source): kubernetes
  • Others:
    The logs are flooded with syscall event drop, the only other event showing up is the following:
12:50:28.988592573: Warning Log files were tampered (user=root command=pos_writer.rb:* -Eascii-8bit:ascii-8bit /usr/sbin/google-fluentd --under-supervisor file=/var/log/gcp-journald-kubelet.pos) k8s.ns=kube-system k8s.pod=fluentd-gcp-v3.1.1-6xbdr container=e84a9479a965 k8s.ns=kube-system k8s.pod=fluentd-gcp-v3.1.1-6xbdr container=e84a9479a965

I assume this is generating enough events to cause the syscall drops. Is there anything I can do, such as filtering out these events or this process?

I've checked the other similar issue, and I'm using KUBERNETES_SERVICE_HOST as described in it.

Log snippet:

13:03:18.688024362: Warning Log files were tampered (user=root command=pos_writer.rb:* -Eascii-8bit:ascii-8bit /usr/sbin/google-fluentd --under-supervisor file=/var/log/gcp-journald-kubelet.pos) k8s.ns=kube-system k8s.pod=fluentd-gcp-v3.1.1-6fmpm container=9b27b6fc8460 k8s.ns=kube-system k8s.pod=fluentd-gcp-v3.1.1-6fmpm container=9b27b6fc8460
13:03:34.485317730: Critical Falco internal: syscall event drop. 11 system calls dropped in last second.(ebpf_enabled=1 n_drops=11 n_drops_buffer=11 n_drops_bug=0 n_drops_pf=0 n_evts=92933)
13:04:20.686678200: Critical Falco internal: syscall event drop. 4 system calls dropped in last second.(ebpf_enabled=1 n_drops=4 n_drops_buffer=4 n_drops_bug=0 n_drops_pf=0 n_evts=8090)
13:04:24.654208479: Critical Falco internal: syscall event drop. 4 system calls dropped in last second.(ebpf_enabled=1 n_drops=4 n_drops_buffer=4 n_drops_bug=0 n_drops_pf=0 n_evts=11527)
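
The counters in these drop lines can be turned into a rough drop percentage. The following is a minimal sketch, not part of Falco: the function names are hypothetical, and it assumes that `n_evts` counts the events seen in the same interval and does not include the dropped ones.

```python
import re

# Matches the key=value counters inside the parentheses of a drop alert.
COUNTER_RE = re.compile(r"(\w+)=(\d+)")


def parse_drop_line(line):
    """Return the counters from one 'syscall event drop' line, or None."""
    if "syscall event drop" not in line:
        return None
    start = line.rfind("(")
    if start == -1:
        return None
    return {key: int(value) for key, value in COUNTER_RE.findall(line[start:])}


def drop_percentage(counters):
    """Dropped events as a percentage of seen plus dropped events (assumption)."""
    total = counters["n_evts"] + counters["n_drops"]
    return 100.0 * counters["n_drops"] / total if total else 0.0


sample = (
    "13:03:34.485317730: Critical Falco internal: syscall event drop. "
    "11 system calls dropped in last second.(ebpf_enabled=1 n_drops=11 "
    "n_drops_buffer=11 n_drops_bug=0 n_drops_pf=0 n_evts=92933)"
)
counters = parse_drop_line(sample)
print(counters["n_drops"], counters["n_evts"], f"{drop_percentage(counters):.4f}%")
```

Lines that are not drop alerts (for example the "Log files were tampered" warning) return `None`, so the parser can be run over a whole log.
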
@fntlnz
Contributor

fntlnz commented Jun 13, 2019

Thanks for opening @caquino - need some time to look at this because I need to get to a gcp account to do tests.

@fntlnz
Contributor

fntlnz commented Jun 13, 2019

/assign @fntlnz
/assign @leodido

@michiels

FWIW, getting the same on a 3-server (1cpu2Gbram) cluster at DigitalOcean Kubernetes. There are a bunch of additional things already deployed:

  • Traefik ingress controller
  • A rails app with a web and background process
  • Logspout logging pipeline.

Do these errors usually mean the cluster's capacity is reached?

@caquino
Author

caquino commented Jun 14, 2019

In my case the cluster has 3 n1-standard-2 (2 vCPUs, 7.5 GB memory) nodes. Other than what Google pre-configures on the cluster, it is running only a Deployment with nginx and php.

This is not a production cluster and it has no traffic on it yet, so I would not expect this to be a capacity issue.

@nuala33

nuala33 commented Jun 21, 2019

I am also having the same issue ("syscall event drop" logs several times an hour) in my almost completely unused GKE cluster with 3 nodes.

Re: the fluentd log tampering alerts, I created #684 to address it, and I describe the fix that worked to stop those from flowing in.

@kbrown

kbrown commented Jul 10, 2019

Continuously seeing this after installing k8s-with-rbac/falco-daemonset-configmap.yaml in a 3 node gke environment:

Falco internal: syscall event drop. 7 system calls dropped in last second.
Falco internal: syscall event drop. 15 system calls dropped in last second.

Occurs in:
falcosecurity/falco:dev
falcosecurity/falco:latest

Not yet checked:
falcosecurity/falco:15.0.x (container won't start with config from dev branch)

@leodido
Member

leodido commented Jul 11, 2019

Just to be sure: what syscall_event_drops.rate and syscall_event_drops.max_burst config are you folks using?

The default one?

@metalsong

@leodido I've just installed Falco on GKE using:

helm install --name falco stable/falco --set ebpf.enabled=true

Having exactly the same errors.

@kbrown

kbrown commented Jul 11, 2019

@leodido

  1. I have used the defaults,
  2. then tried the increased limits:

syscall_event_drops:
  actions:
    - log
    - alert
  rate: 1
  max_burst: 1000

Same result. Maybe I should have decreased?

Using falco-daemonset-configmap.yaml with the following changes:

      env:
      - name: SYSDIG_BPF_PROBE
        value: ""
      - name: KBUILD_EXTRA_CPPFLAGS
        value: -DCOS_73_WORKAROUND

@caquino
Author

caquino commented Jul 11, 2019

Same here, I'm using the default configuration.

@DannyPat44

Has anyone found a workaround for this issue?

@qbast

qbast commented Aug 5, 2019

I see the same behaviour with Falco 0.17 on a five-node GKE cluster. CPU load on the nodes is between 18% and 35%. Falco is using 1-5%.

@Aaron-ML

Seeing this in Azure with falco using abysmal amounts of CPU and using the set defaults

@bgeesaman
Contributor

bgeesaman commented Aug 13, 2019

I'm also seeing 1 n_drops per node, every 60 to 61 mins on my COS 4.14.127+ v1.12.8-gke.10 GKE cluster (8CPUs/30+GB RAM nodes). Running falcosecurity/falco:latest.

@arthurk

arthurk commented Sep 9, 2019

I have the same issue. As a workaround I have just set the action to "ignore" until this feature works with GKE.

@cvernooy23

I'm seeing this when installed as a binary on the host OS in AWS EC2 running CentOS 7 w/ kernel 5.3 mainline.

@stale

stale bot commented Nov 25, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Nov 25, 2019
@shane-lawrence
Contributor

Could we keep this open until it has been resolved?

@stale stale bot removed the wontfix label Nov 25, 2019
@fntlnz
Contributor

fntlnz commented Nov 25, 2019

@shane-lawrence I agree we want to keep this open. I was coming here to un-stale this but you already did :)

@kbrown

kbrown commented Nov 26, 2019 via email

@popaaaandrei

Guys, any progress/updates? Thanks!

@fntlnz
Contributor

fntlnz commented Jan 4, 2020

For everyone looking at this:
The reason why Falco drops events is that, right now, it has no way to "offload" events coming from kernel space while receiving them in userspace.

In other words, once inputs are consumed from the Falco driver, they are sent immediately to the engine, which processes them.

For that reason, we had to implement an artifact called "Token Bucket", which is basically a rate limiter for the engine and is responsible for the drops.

Now, this kind of system is limited because on machines that have a lot of activity in kernel space (and fill the ring buffer fast), Falco drops events.
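
For readers unfamiliar with the term, here is a minimal sketch of a generic token bucket. It is not Falco's implementation; the class and method names are hypothetical, and `rate` and `max_burst` are only meant to mirror the meaning of the `syscall_event_drops` settings discussed earlier in this thread (tokens refilled per second, and the maximum number that can be saved up).

```python
class TokenBucket:
    """Allow up to `max_burst` actions at once, refilled at `rate` tokens per second."""

    def __init__(self, rate, max_burst, now=0.0):
        self.rate = rate
        self.max_burst = max_burst
        self.tokens = float(max_burst)  # the bucket starts full
        self.last = now                 # time of the previous claim, in seconds

    def claim(self, now):
        """Take one token if available; return True when the action is allowed."""
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, never exceeding the burst capacity.
        self.tokens = min(self.max_burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=1, max_burst=3)
burst = [bucket.claim(now=0.0) for _ in range(5)]  # five claims at the same instant
later = bucket.claim(now=1.0)                      # one second later, one token is back
print(burst, later)
```

Time is passed in explicitly so the behaviour is easy to reason about: a burst larger than `max_burst` is cut off, and further actions are allowed only as fast as `rate` refills the bucket.
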

After discussions in many many office hours and repo planning calls, we decided to redesign the Inputs to achieve two goals:

  • Have an input streaming interface (that offloads messages to a queue)
  • Implement inputs as a gRPC client - means that the inputs are not part of the Falco engine itself but a separate service

You can find a Diagram with an explanation here.

Long story short: It's likely that this issue will persist until we implement the new input API and then release 1.0 - We don't have an official release date yet but IIRC many community members wanted to set that for March 2020.

We are grateful for everyone's help and we are trying to do our best to make Falco more and more reliable for all kinds of workloads. To stay updated and have a voice in the process please join our weekly calls.

@stale

stale bot commented Mar 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@poiana
Contributor

poiana commented Jan 13, 2021

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

@poiana
Contributor

poiana commented Feb 13, 2021

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

@poiana
Contributor

poiana commented Feb 13, 2021

@poiana: Closing this issue.


@poiana poiana closed this as completed Feb 13, 2021
@kbrown

kbrown commented Feb 14, 2021

I've abandoned Falco (because it never worked for me on gcloud). No idea if this is still an issue.

@bsord

bsord commented Aug 1, 2021

I've just deployed Falco today using the latest helm chart and I'm still running into this. Is the only option to set the syscall drops to ignore until version 1.0.0 is released? With so many dropped calls, can I trust that Falco is even accurately detecting the events I am using it to monitor?

@abroglesc

+1 to @bsord 's question. Commenting to follow along on discussion.

@bsda

bsda commented Sep 7, 2021

+1, still hitting this issue after 2 years.

@eelkonio

I'm hitting this too. A lot.

I see numbers like 10K, 150K and even 458K "syscall events dropped in the last second". Any one of these events may be a potential security risk, so to me this sounds like a serious issue with Falco.

Will there be a fix for this? I see great potential for Falco in our security environment, but I cannot explain this to my customers, especially since I can't even give them a percentage of what has and has not been investigated by Falco.

@leogr
Member

leogr commented Sep 17, 2021

Hey folks,

As you can understand, it's hard to give answers without being able to reproduce the problem. Moreover, syscall event drops may be caused by different factors. In my personal experience, it usually happens when Falco doesn't have enough resources (CPU) to process the event stream. Sometimes this can happen due to a misconfiguration. I created #1403, which includes a handy checklist for debugging purposes.

It seemed to me that the most recent Falco versions fixed the majority of the issues. However, by seeing your comments, likely something is still not working on GKE. I'm happy to help you with that, but I also need your help investigating and understanding what's going on.

For instance, for debugging these kinds of issues, we usually need:

  • confirmation that after trying each item in the [UMBRELLA] Dropped events #1403's checklist, the problem persists
  • all version numbers (including GKE ones)
  • machine info
  • the Falco configuration you are using
  • manifests you used to deploy Falco
  • a few log lines reporting the "syscall events dropped" notice
  • any performance metrics you can provide

Could someone create a full report and share it (privately would be fine too)?

@leogr
Member

leogr commented Sep 17, 2021

/reopen

@poiana
Contributor

poiana commented Sep 17, 2021

@leogr: Reopened this issue.


@poiana poiana reopened this Sep 17, 2021
@eelkonio

UPDATE:
I found the problem via the links you provided, leogr.

Our Falco was doing synchronous calls to an outside alerting system. These calls took about 100+ms which stopped the main event-processing thread of Falco. That was the main cause of the thousands of missed events. I did not assume that the Falco alerting requests were put in the same thread as the event processing. So I'll be adding the falcosidekick and an external mq now, which should fix most of these dropped events.
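
The general idea of keeping a slow alert sink off the event-processing path can be sketched as follows. This is an illustration only, not Falco or falcosidekick code: `AsyncOutput`, `slow_deliver` and `max_pending` are hypothetical names, and the 10 ms sleep stands in for a slow external call.

```python
import queue
import threading
import time


class AsyncOutput:
    """Hands alerts to a worker thread so the caller never waits on delivery."""

    def __init__(self, deliver, max_pending=1000):
        self._deliver = deliver                       # slow, blocking callable
        self._pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0                              # alerts discarded when the queue is full
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def emit(self, alert):
        """Queue an alert without blocking; count it as dropped if the queue is full."""
        try:
            self._pending.put_nowait(alert)
        except queue.Full:
            self.dropped += 1

    def _run(self):
        while True:
            alert = self._pending.get()
            if alert is None:                         # sentinel: stop the worker
                break
            self._deliver(alert)

    def close(self):
        """Let the worker drain the queued alerts, then stop it."""
        self._pending.put(None)
        self._worker.join()


delivered = []


def slow_deliver(alert):
    time.sleep(0.01)                                  # stand-in for a slow external call
    delivered.append(alert)


out = AsyncOutput(slow_deliver)
for i in range(20):
    out.emit(f"alert-{i}")                            # returns immediately
out.close()
print(len(delivered), out.dropped)
```

The bounded queue makes the trade-off explicit: if the sink cannot keep up, alerts are counted as dropped at the output stage instead of stalling whatever is producing them.
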

Thanks for your hints!

@leogr
Member

leogr commented Sep 23, 2021

Hey @eelkonio

you're welcome! I am happy to have been helpful :)
Btw, your solution is perfectly fine, I can confirm (similar situations happened to me in the past, and I solved them in a similar way).

I also want to give you more context. Since Falco 0.27.0, alert processing has been offloaded to another thread (see #1451).
AFAIK, it mitigates most of those kinds of issues, though it might not be the definitive solution for all circumstances (in particular, it cannot solve situations where the calls take a very long time or stay pending indefinitely).
So I'm just curious to know which Falco version you are using.
Anyway, adding falcosidekick will help for sure. Likely, it's the best solution for your case.

@eelkonio

Hi leogr,

I used Falco 0.29.1, so that other thread should have been present. However, might there be limits to how many messages per second this thread can handle before it starts stalling?

As I said I saw 458K and 600+K of dropped events per second a few times. That may also have been caused by the cpu limits on the container that runs both/all these threads. I will remove that too to see if it makes any difference.

Thanks again - love the product and hope to get it working fine soon!

@leogr
Member

leogr commented Sep 23, 2021

> I used Falco 0.29.1 so that other thread should have been present. However, there may be limits to how many messages per second this thread can handle before it starts stalling?

There's no hard limit. It's just a matter of how many resources Falco can use. Furthermore, the output mechanism is still sequential. For example, if just one call blocks indefinitely, all subsequent calls will be stalled (Falco will try to emit a warning if an output consumer blocks for more than 2 seconds, see here). In such a situation, Falco cannot operate and starts to drop events.

For this reason, using a responsive consumer (like falcosidekick) is still beneficial.

@bsda

bsda commented Sep 23, 2021

I am currently running 0.29.1 installed via helm in GKE(1.20.9-gke.1001) and only logging to stdout, I'm seeing 1000's of n_drops_buffer and n_drops_bug drops. I've tried many (if not all) of the suggestions from #1403 and can't seem to get rid of these. I'll try to get a full report when I get some time, below is an idea of the numbers I am seeing from a single node.

[image: drop counts observed on a single node]

@leogr
Member

leogr commented Sep 23, 2021

Just trying to guess a possible issue. Is the BPF JIT compiler enabled?
(ie. the kernel has CONFIG_BPF_JIT enabled and net.core.bpf_jit_enable is set to 1)

@bsda

bsda commented Sep 23, 2021

I have CONFIG_BPF_JIT=y and

net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
net.core.bpf_jit_kallsyms = 1
net.core.bpf_jit_limit = 264241152
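
A check like this can be scripted when collecting reports from several nodes. The sketch below parses `sysctl`-style output such as the lines above; the helper names are hypothetical. It treats both `1` (JIT enabled) and `2` (JIT enabled with debug output to the kernel log) as enabled.

```python
def parse_sysctl(text):
    """Parse sysctl-style 'key = value' lines into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without an '=' separator
            settings[key.strip()] = value.strip()
    return settings


def bpf_jit_enabled(settings):
    """True when net.core.bpf_jit_enable is 1 or 2, False otherwise."""
    return settings.get("net.core.bpf_jit_enable") in ("1", "2")


reported = """\
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
net.core.bpf_jit_kallsyms = 1
net.core.bpf_jit_limit = 264241152
"""
settings = parse_sysctl(reported)
print(bpf_jit_enabled(settings))
```
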

@leogr
Member

leogr commented Oct 6, 2021

If you are using Falco with the K8s support enabled, could you please try the latest release (Falco 0.30.0)?

It comes with several fixes that reduce resource consumption when fetching metadata from the K8s API server, which may indirectly alleviate the event dropping problem. I'm not sure that is the case, but any testing and feedback are very useful for us.
Thank you in advance :)

@leogr
Member

leogr commented Oct 14, 2021

> I am currently running 0.29.1 installed via helm in GKE(1.20.9-gke.1001) and only logging to stdout, I'm seeing 1000's of n_drops_buffer and n_drops_bug drops. I've tried many (if not all) of the suggestions from #1403 and can't seem to get rid of these. I'll try to get a full report when I get some time, below is an idea of the numbers I am seeing from a single node.
>
> [image]

Hey @bsda,
@FedeDP and I recently tried Falco 0.30.0 on a test GKE cluster. We weren't able to reproduce any n_drops_buffer, though I noticed some n_drops_bug during the start-up phase. We tried Falco deployed via helm alongside a couple of pods running stress-ng on a 2-node cluster (COS).
Let me know if you have any chance to share a report. Thanks.

@fntlnz fntlnz removed their assignment Nov 10, 2021
@poiana
Contributor

poiana commented Dec 10, 2021

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

@poiana poiana closed this as completed Dec 10, 2021
@poiana
Contributor

poiana commented Dec 10, 2021

@poiana: Closing this issue.

