Improve falco benchmarking, performance, and regression tooling to better track system resources impact #2296

happy-dude · 2022-11-22T20:31:21Z

Motivation

Hey team, while evaluating the relationship between Falco, system resources, and detection rules, I was wondering if there is a way to better monitor and correlate the impact of Falco config and rule changes. With this information, I can better optimize and tune Falco for our unique environment.

This generally falls along the lines of a Falco benchmarking or instrumentation toolchain. For comparison, osquery ships a tool that reports some info on its queries and configuration.

Additionally, it was discussed in the Slack community that something during CI/CD would be useful as well for regression testing.

Feature

Userspace instrumentation/benchmarking tool to correlate impact of config settings and rules on system resources
Incorporate CI/CD tooling for rules to better track performance improvements/regressions to code changes
Provide recommendations on how to improve problematic rules?
Possible documentation improvements: a few blog posts (falco, sysdig, book) go over performance impact and considerations in depth, but fewer cover them in a consumable "general best practices" way.

Additional context

See #2222, libs#531, and the Slack thread for more info.

jasondellaluce · 2022-11-30T17:45:34Z

Adding a milestone to not lose track of the conversation. Thanks for opening this!

/milestone 0.34.0

incertum · 2022-12-20T00:25:55Z

@happy-dude please see some initial progress on adding native support for resource utilization metrics in #2333. Would you have additional thoughts on the metrics collected / planned / still missing that would ultimately set the stage for perf benchmarking and regression tests? Thanks a bunch in advance!

jasondellaluce · 2023-01-10T17:36:08Z

/milestone 0.35.0

incertum · 2023-03-11T05:02:30Z

@happy-dude I published a public HackMD proposing a test matrix, https://hackmd.io/-nwsFyySTEKsjmjGHCyPRg?view, that uses the newly introduced base_syscalls config setting, which will be released in Falco 0.35. See also #2433.

Additional note: Creating realistic enough synthetic workloads is notoriously challenging. Benchmarking on actual real-life servers with a lot of activity tends to give more meaningful numbers.

happy-dude · 2023-03-13T22:35:07Z

Hey @incertum , thanks for the test matrix!

I've reviewed some of the items in the test matrix and will be running the following:

modern-bpf probe
following falco_rules.yaml file:

- macro: spawned_process
  condition: (evt.type in (execve, execveat) and evt.dir=< and proc.name=iShouldNeverAlert)

- rule: TEST Simple Spawned Process
  desc: >
    Test base_syscalls config option, ref https://hackmd.io/-nwsFyySTEKsjmjGHCyPRg?view
  enabled: true
  condition: >
    spawned_process
  output: |
    <...output format...>
  priority: WARNING

For the following tests:

Baseline, bare-minimum, only spawned processes

base_syscalls: [clone, clone3, fork, vfork, execve, execveat]

Baseline, spawned processes, but also turning on all process related syscalls Falco uses to keep the smart process cache table in memory up to date

base_syscalls: [chdir, chroot, clone, clone3, fchdir, fork, setgid, setpgid, setresgid, setresuid, setsid, vfork]

Network accept

base_syscalls: [clone, clone3, fork, vfork, execve, execveat, getsockopt, socket, bind, accept, accept4, close]

Network connect

base_syscalls: [clone, clone3, fork, vfork, execve, execveat, getsockopt, socket, connect, close]

File opens (tends to be the highest volume of syscalls; can then also run a test with all syscalls related to file operations other than opening, such as symlinking)

base_syscalls: [clone, clone3, fork, vfork, execve, execveat, open, openat, openat2, close]

Is there an expected result or output format you would like to see the evaluation delivered as?

edit: added close syscalls to base_syscall sets.

happy-dude · 2023-03-13T22:57:19Z

I had to revert my changes to the rules file because it started logging a lot and ballooned the size of the events logfile relatively quickly 😅

EDIT: adjusted my alert rule into something that should never alert:

- macro: spawned_process
  condition: (evt.type in (execve, execveat) and evt.dir=< and proc.name=iShouldNeverAlert)

incertum · 2023-03-14T22:18:02Z

Updated the HackMD to suggest still adding a simple filter to the test rule. I also forgot to add close to conditions 3-5; that is now fixed as well.

incertum · 2023-03-14T22:42:17Z

Is there an expected results or output format you would like to see the evaluation delivered as?

Thoughts:

These system call categories are broad but very relevant for basic detections. Workloads vary a lot, but spawned processes are generally lower in volume than network-related system calls and others -> it is good to confirm that Falco never drops a single event for spawned processes when monitoring only spawned processes, even when Falco runs on a very busy server with 96 CPUs.
Ranking by event volume on many servers often follows a pattern: spawned processes, network connect, network accept, ... [large gap] file opens (listed from lowest to highest volume of syscall events) -> these tests can therefore help show if and when kernel-side drops start to occur.
After these simple tests, you can check more custom sets of syscalls depending on what you aim to monitor. I/O-related syscalls, or even just the memory-related ones like mmap, will definitely be much higher in volume than file opens.

Lastly we are working on exposing syscall counters as part of Falco's new native resource utilization metrics support (planned for Falco 0.35) -> once we have these counters, we can derive even better conclusions
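For readers who want to try the metrics mentioned here, a minimal sketch of what enabling them in falco.yaml might look like follows. The key names are an assumption recalled from the default config shipped around Falco 0.35 and are not confirmed anywhere in this thread, so check them against the falco.yaml of the release you run.

```yaml
# Sketch only: key names are assumed from the Falco 0.35 default falco.yaml,
# not taken from this thread. Verify against your installed version.
metrics:
  enabled: true                        # turn on periodic metrics snapshots
  interval: 1h                         # how often a snapshot is emitted
  resource_utilization_enabled: true   # CPU / memory usage of the Falco process
  kernel_event_counters_enabled: true  # kernel-side event and drop counters
```

Comparing the drop counters across the base_syscalls test sets above is one way to see at which syscall volume kernel-side drops begin.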

incertum · 2023-03-14T22:46:25Z

... you would like to see the evaluation delivered as?

No need to report any numbers back. Hoping these tests help you understand the unique workload footprint on your servers better. In general, longer term we need to try to perhaps push some more filters kernel side for the super high volume system calls ...

incertum · 2023-04-25T17:37:03Z

Note that the base_syscalls option is now in its final form for the Falco 0.35 release, see

falco/falco.yaml

Lines 544 to 546 in dad382e

base_syscalls:
  repair: false
  custom_set: []
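
As an illustration, the "file opens" test set from the matrix earlier in this thread could be expressed in this final form roughly as below. This is a sketch that assumes custom_set accepts the same list of syscall names that the earlier flat base_syscalls list took; the syscall names are copied from that test set.

```yaml
# Sketch only: the "file opens" test set rewritten for the final option layout.
# Assumes custom_set takes the same syscall names as the earlier flat list.
base_syscalls:
  repair: false
  custom_set: [clone, clone3, fork, vfork, execve, execveat, open, openat, openat2, close]
```

The other test sets would follow the same pattern, swapping in their own syscall lists.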

FedeDP · 2023-05-29T09:19:31Z

/milestone 0.36.0

poiana · 2023-08-27T13:33:15Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

jasondellaluce · 2023-08-28T08:08:52Z

/remove-lifecycle stale

incertum · 2023-08-28T15:11:19Z

Closing this in favor of #2435.

happy-dude added the kind/feature label Nov 22, 2022

poiana added this to the 0.34.0 milestone Nov 30, 2022

leogr moved this to 🆕 New in Falco Roadmap Nov 30, 2022

leogr added this to Falco Roadmap Nov 30, 2022

poiana modified the milestones: 0.34.0, 0.35.0 Jan 10, 2023

poiana modified the milestones: 0.35.0, 0.36.0 May 29, 2023

LucaGuerra removed this from Falco Roadmap Jul 7, 2023

poiana added the lifecycle/stale label Aug 27, 2023

poiana removed the lifecycle/stale label Aug 28, 2023

incertum closed this as completed Aug 28, 2023

incertum mentioned this issue Dec 7, 2023

Introduce conditional kernel-side event filtering falcosecurity/libs#1557

Open
