Improve falco benchmarking, performance, and regression tooling to better track system resources impact #2296

Closed
happy-dude opened this issue Nov 22, 2022 · 14 comments

@happy-dude
Contributor

happy-dude commented Nov 22, 2022

Motivation

Hey team, while evaluating and understanding the relationship between Falco, system resources, and detection rules, I was wondering if there was a way to better monitor and correlate the impact of Falco config and rule changes. With this information, I can better optimize and tune Falco for our unique environment.

This generally falls along the lines of a Falco benchmarking or instrumentation toolchain. For comparison, osquery provides a tool that gives some info on its queries and configuration.

Additionally, it was discussed in the Slack community that something during CI/CD would be useful as well for regression testing.

Feature

  • Userspace instrumentation/benchmarking tool to correlate impact of config settings and rules on system resources
  • Incorporate CI/CD tooling for rules to better track performance improvements/regressions to code changes
  • Provide recommendations on how to improve problematic rules?
  • Possible documentation improvements, as there are few blog posts (falco, sysdig, book) that go over performance impact and considerations in sufficient depth, and fewer still that do so in a consumable "general best practices" way.

Additional context

See #2222, libs#531, Slack thread for more info

@jasondellaluce
Contributor

Adding a milestone to not lose track of the conversation. Thanks for opening this!

/milestone 0.34.0

@poiana poiana added this to the 0.34.0 milestone Nov 30, 2022
@leogr leogr moved this to 🆕 New in Falco Roadmap Nov 30, 2022
@incertum
Contributor

@happy-dude please see some initial progress on adding native support for resource utilization metrics in #2333. Do you have additional thoughts on the metrics collected, planned, or still missing that would ultimately set the stage for perf benchmarking and regression tests? Thanks a bunch in advance!

@jasondellaluce
Contributor

/milestone 0.35.0

@poiana poiana modified the milestones: 0.34.0, 0.35.0 Jan 10, 2023
@incertum
Contributor

@happy-dude I published a public HackMD proposing a test matrix, https://hackmd.io/-nwsFyySTEKsjmjGHCyPRg?view, using the newly introduced base_syscalls config setting, which will be released in Falco 0.35. See also #2433.

Additional note: Creating realistic enough synthetic workloads is notoriously challenging. Benchmarking on actual real-life servers with a lot of activity tends to give more meaningful numbers.
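
For quick smoke tests where a real server is not available, a crude process-spawning burst can at least exercise the spawned-process path. The following is a minimal sketch under that assumption, not a realistic workload; the `spawn_burst` helper and the burst size are hypothetical and chosen only for illustration.

```python
import subprocess
import sys
import time


def spawn_burst(count: int) -> float:
    """Spawn `count` short-lived child processes one after another.

    Returns the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    for _ in range(count):
        # Each child is a fresh interpreter that exits immediately, so every
        # iteration produces one process creation plus one exec.
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 50  # hypothetical burst size
    elapsed = spawn_burst(n)
    print(f"spawned {n} processes in {elapsed:.2f}s ({n / elapsed:.1f}/s)")
```

A burst like this says nothing about network or file activity, which is why numbers from busy production servers remain the more meaningful reference.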

@happy-dude
Contributor Author

happy-dude commented Mar 13, 2023

Hey @incertum, thanks for the test matrix!

I've reviewed some of the items in the test matrix and will be running the following:

  • modern-bpf probe
  • the following falco_rules.yaml file:
- macro: spawned_process
  condition: (evt.type in (execve, execveat) and evt.dir=< and proc.name=iShouldNeverAlert)

- rule: TEST Simple Spawned Process
  desc: >
    Test base_syscalls config option, ref https://hackmd.io/-nwsFyySTEKsjmjGHCyPRg?view
  enabled: true
  condition: >
    spawned_process
  output: |
    <...output format...>
  priority: WARNING

For the following tests:

  1. Baseline, bare-minimum, only spawned processes
base_syscalls: [clone, clone3, fork, vfork, execve, execveat]
  2. Baseline, spawned processes, but also turning on all process-related syscalls Falco uses to keep the smart process cache table in memory up to date
base_syscalls: [chdir, chroot, clone, clone3, fchdir, fork, setgid, setpgid, setresgid, setresuid, setsid, vfork]
  3. Network accept
base_syscalls: [clone, clone3, fork, vfork, execve, execveat, getsockopt, socket, bind, accept, accept4, close]
  4. Network connect
base_syscalls: [clone, clone3, fork, vfork, execve, execveat, getsockopt, socket, connect, close]
  5. File opens (tends to be the highest volume of syscalls; can then also run a test with all syscalls related to file operations other than just opening, such as symlinking etc.)
base_syscalls: [clone, clone3, fork, vfork, execve, execveat, open, openat, openat2, close]

Is there an expected result or output format you would like to see the evaluation delivered in?

edit: added close syscalls to base_syscall sets.

@happy-dude
Contributor Author

happy-dude commented Mar 13, 2023

I had to revert my changes to the rules file because it started logging a lot and ballooned the size of the events logfile relatively quickly 😅

EDIT: adjusted my alert rule into something that should never alert:

- macro: spawned_process
  condition: (evt.type in (execve, execveat) and evt.dir=< and proc.name=iShouldNeverAlert)

@incertum
Contributor

incertum commented Mar 14, 2023

Updated the HackMD: it now suggests still adding a simple filter to the test rule. I had also forgotten to add close to conditions 3-5; that is now fixed as well.

@incertum
Contributor

Is there an expected result or output format you would like to see the evaluation delivered in?

Thoughts:

  • These system call categories are broad but very relevant for basic detections. Workloads vary a lot, yet spawned processes are generally lower in volume than network-related system calls and the like. It is therefore good to confirm that Falco never drops a single event when monitoring only spawned processes, even when it runs on a very busy server with 96 CPUs.
  • On many servers, event volume follows a common ranking, listed here from lowest to highest volume of syscall events: spawned processes, network connect, network accept, ... [large gap] file opens. These tests can therefore help show if and when kernel-side drops start to occur.
  • After these simple tests, you can check more custom sets of syscalls depending on what you aim to monitor. I/O-related syscalls, or even just the memory-related ones such as mmap, will definitely be much higher in volume than file opens.

Lastly, we are working on exposing syscall counters as part of Falco's new native resource utilization metrics support (planned for Falco 0.35). Once we have these counters, we can draw even better conclusions.
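
To make the comparison across test runs concrete, here is a minimal sketch of how per-run counters could be turned into a drop percentage and ranked. The `RunCounters` type, its field names, and the sample numbers are all hypothetical placeholders; they are not Falco's actual metric names or measured results.

```python
from dataclasses import dataclass


@dataclass
class RunCounters:
    """Counters for one test run; field names are hypothetical placeholders."""

    name: str
    events_seen: int     # events the kernel driver attempted to deliver
    events_dropped: int  # events lost kernel side

    @property
    def drop_pct(self) -> float:
        """Dropped events as a percentage of events seen (0.0 if none seen)."""
        if self.events_seen == 0:
            return 0.0
        return 100.0 * self.events_dropped / self.events_seen


def rank_by_drops(runs):
    """Return the runs sorted from lowest to highest drop percentage."""
    return sorted(runs, key=lambda r: r.drop_pct)


if __name__ == "__main__":
    # Made-up numbers, purely to show the shape of the comparison.
    runs = [
        RunCounters("file opens", 1_000_000, 25_000),
        RunCounters("spawned processes", 10_000, 0),
        RunCounters("network connect", 200_000, 400),
    ]
    for run in rank_by_drops(runs):
        print(f"{run.name:20s} {run.drop_pct:6.2f}% dropped")
```

A run whose drop percentage is non-zero for the spawned-processes set would be the first thing to investigate, since that set is expected to be the lowest in volume.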

@incertum
Contributor

... you would like to see the evaluation delivered as?

No need to report any numbers back. Hopefully these tests help you better understand the unique workload footprint on your servers. In general, longer term, we probably need to push more filters kernel side for the super high volume system calls ...

@incertum
Contributor

Note: the base_syscalls option is now in its final form for the Falco 0.35 release, see

falco/falco.yaml

Lines 544 to 546 in dad382e

base_syscalls:
  repair: false
  custom_set: []
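
The five syscall sets listed earlier in this thread were written in the earlier flat-list form. As a rough sketch, they could be rendered into the repair/custom_set layout shown above like this; the `render_base_syscalls` helper and the set labels are hypothetical, and the output is built by plain string formatting rather than a YAML library.

```python
# Syscall sets copied from the five tests listed earlier in this thread.
# The dictionary keys are informal labels, not Falco identifiers.
TEST_SETS = {
    "spawned processes (bare minimum)": [
        "clone", "clone3", "fork", "vfork", "execve", "execveat",
    ],
    "process state": [
        "chdir", "chroot", "clone", "clone3", "fchdir", "fork", "setgid",
        "setpgid", "setresgid", "setresuid", "setsid", "vfork",
    ],
    "network accept": [
        "clone", "clone3", "fork", "vfork", "execve", "execveat",
        "getsockopt", "socket", "bind", "accept", "accept4", "close",
    ],
    "network connect": [
        "clone", "clone3", "fork", "vfork", "execve", "execveat",
        "getsockopt", "socket", "connect", "close",
    ],
    "file opens": [
        "clone", "clone3", "fork", "vfork", "execve", "execveat",
        "open", "openat", "openat2", "close",
    ],
}


def render_base_syscalls(custom_set, repair=False):
    """Return a falco.yaml fragment using the repair/custom_set layout."""
    lines = [
        "base_syscalls:",
        f"  repair: {'true' if repair else 'false'}",
        f"  custom_set: [{', '.join(custom_set)}]",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    for name, syscalls in TEST_SETS.items():
        print(f"# {name}")
        print(render_base_syscalls(syscalls))
        print()
```

Each printed fragment is meant to replace the base_syscalls block in falco.yaml for one test run; repair is left at false here to match the default shown above.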

@FedeDP
Contributor

FedeDP commented May 29, 2023

/milestone 0.36.0

@poiana poiana modified the milestones: 0.35.0, 0.36.0 May 29, 2023
@poiana
Contributor

poiana commented Aug 27, 2023

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@jasondellaluce
Contributor

/remove-lifecycle stale

@incertum
Contributor

Closing this in favor of #2435.
