Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

new(falco): introduce new metrics w/ Falco internal: metrics snapshot option and new metrics config #2333

Merged
merged 16 commits into from
May 23, 2023

Conversation

incertum
Copy link
Contributor

@incertum incertum commented Dec 19, 2022

Signed-off-by: Melissa Kilby melissa.kilby.oss@gmail.com

What type of PR is this?

Uncomment one (or more) /kind <> lines:

/kind bug

/kind cleanup

/kind design

/kind documentation

/kind failing-test

/kind feature

/kind release

If contributing rules or changes to rules, please make sure to also uncomment one of the following line:

/kind rule-update

/kind rule-create

Any specific area of the project related to this PR?

Uncomment one (or more) /area <> lines:

/area build

/area engine

/area rules

/area tests

/area proposals

/area CI

What this PR does / why we need it:

Initial solution to Falco native resource utilization metrics logs support.

Consider these key points about the new metrics feature in Falco:

  • It introduces a redesigned stats/metrics system.
  • Native support for resource utilization metrics and specialized performance metrics.
  • Metrics are emitted as monotonic counters at predefined intervals (snapshots).
  • All metrics are consolidated into a single log message, adhering to the established rules schema and naming conventions.
  • Additional info fields complement the metrics and facilitate customized statistical analyses and correlations.
  • The metrics framework is designed for easy future extension.

Additional highlights:

  • Includes old / established stats / counters in one place
  • For bpf drivers we now support libbpf stats
  • More extensions (syscall counters and prometheus exporter option) are planned

Which issue(s) this PR fixes:

#2222

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

new(falco): introduce new metrics w/ Falco internal: metrics snapshot option and new metrics config

@incertum
Copy link
Contributor Author

Continuing to stage here, but the expectation would be to move CPU and memory resource utilization to libsinsp and expose via proper APIs, besides it's still unclear how the concept of a unified stats module will look like in Falco.

Meanwhile, wondering about memory, to say the least it is a bit confusing how different tools compute it and knowing not just what it means, but also what implications for tool optimizations are etc. Perhaps worth a discussion to decide what we think is meaningful for Falco optimizations, especially the Kubernetes use case?

Added reading /sys/fs/cgroup/memory/memory.usage_in_bytes, but there is more potential in /sys/fs/cgroup/memory/memory.stat and have been reading through a couple of resources and I am still unclear how container_memory_working_set_bytes is calculated, also cAdvisor source fetches info differently, not from fs ... @gnosek just curious if you had more thoughts as well. I think it would be important to optimize Falco for setting the right limits, so end users don't run into unexpected OOM kills, also something I have been chatting with @Andreagit97 . It would be wonderful having all these metrics in one Falco log and not needing to stitch together numbers from 2-3 external sources.

@happy-dude
Copy link
Contributor

happy-dude commented Dec 20, 2022

Would this be the right place to request metrics/counters on syscalls inspected?

A scenario that played out in my organization is that a endpoint-detection tool was experiencing abnormally high CPU usage, and emitted stats of syscalls it was inspecting.

One particular syscall was encountered much more than the rest, and we did some further digging (via other tooling and auditd) and discovered that the latest version of a service daemon was experiencing errors and needed to be patched.

While Falco's high CPU usage was not the source of the issue, being able to get an aggregate summary of what falco is doing during it's uptime is helpful. This is slightly different from getting a direct read of events Falco is triggering on (which I believe is already exported as a metric), because an event does not have to be a true positive to affect Falco's overall performance.

@incertum
Copy link
Contributor Author

👋 @happy-dude +1 here, would like an option of making the logs even more verbose for instance if CPU exceeds a threshold as well. Adding it to the linked issue.

Existing stats for kernel side event_drops had been extended to be more verbose, they now expose some coarse-grained syscall categories counters. However, we should have more options in userspace. Would similar bucketing (but more buckets) of syscalls work?

Those stats can also help making informed decisions on price tag associated with each evt type and maybe we discover more effective ways of discarding uninteresting events within an evt type earlier as well, but first we need to find a good way to only select the syscalls that are truly needed for the state engine based on Falco rules, see RIP POC#1, thinking about a better way now.

Meanwhile, the scap-open util binary has become really good for testing stuff and measuring perf under different configurations as the kernel tracing accounts for most CPU.

@gnosek
Copy link
Contributor

gnosek commented Dec 20, 2022

@incertum, ages ago, the most relevant memory metric for me turned out to be PSS (not RSS), which accounts for page sharing between processes (e.g. a 2MB mapping shared between two processes is accounted as 1MB to each one), but I don't think there's one metric that fits all cases (e.g. PSS fails when you want to determine how much memory you'll get back by killing that process, which I suppose the OOM killer is interested in).

TL;DR, every choice we make will be wrong for somebody.

@happy-dude, yeah, some more detailed stats would be nice. Still, to track down the issue you were having, you'd need per (tid-or-pid, syscall) stats and I'm afraid that's overkill in the general case (there's a pretty big chance you'd need to (re)allocate memory for every event while the stats table grows).

I'm ->this<- close to suggesting that we use chisels for this (so that we can run arbitrary lua code on each event) but that will only make the perf issues worse, if any. Or we take the memory hit and we shove the stats in sinsp_threadinfo, but I'd rather make this optional.

BTW, sysdig (as opposed to falco) would be a better tool IMO to track down performance issues. At the very least, it supports chisels out of the box ;) (there's e.g. topscalls which does roughly what you want)

@happy-dude
Copy link
Contributor

Existing stats for kernel side event_drops had been extended to be more verbose, they now expose some coarse-grained syscall categories counters. However, we should have more options in userspace. Would similar bucketing (but more buckets) of syscalls work?

This is great! Agree that dropped events gives an overview of the cost for monitoring the syscall/ currently loaded rules.

At the cost of overhead, I'd like a verbose/debug flag that exposes per-syscall being inspected by Falco.
I agree with @gnosek that this is likely excessive and is better investigated with other tooling.

Meanwhile, the scap-open util binary has become really good for testing stuff and measuring perf under different configurations as the kernel tracing accounts for most CPU.

I'm catching up on the discussion with scap-open, and it seems pretty useful. Could be something my org can incorporate scap-open as part of our linting workflow as a proactive measure when crafting new rules 🤔

@incertum
Copy link
Contributor Author

incertum commented Dec 20, 2022

@happy-dude re scap-open not quite yet given atm a static set of syscalls is activated regardless of Falco rules. As @FedeDP confirmed such a capability to only activate the syscalls needed based on Falco rules MUST land for 0.35.

Started a HackMD document to further discuss such counters per event type categories. Would you be able to help creating the right buckets? As I will open a libs PR with proper APIs could try adding them as well. Ty!

@gnosek yup yup, will add PSS and try to find out more about container_memory_working_set as well ..

@leogr
Copy link
Member

leogr commented Jan 10, 2023

/milestone 0.35.0

@poiana
Copy link
Contributor

poiana commented Apr 11, 2023

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@incertum
Copy link
Contributor Author

/remove-lifecycle stale

@incertum incertum force-pushed the falco-resource-utilization-init branch from f4059d0 to 1be3286 Compare April 26, 2023 14:11
@poiana poiana added size/L and removed size/XL labels Apr 26, 2023
incertum and others added 13 commits May 22, 2023 13:30
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
…s to MB

Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
* renaming to `metrics` for technical clarity
* adopt Prometheus like metrics interval settings

Co-authored-by: Jason Dellaluce <jasondellaluce@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Signed-off-by: Jason Dellaluce <jasondellaluce@gmail.com>
Signed-off-by: Jason Dellaluce <jasondellaluce@gmail.com>
* prefix counters and stats belonging to kernel space w/ `k.` else `u.` for userspace
* add n_drops_perc from old stats writer schema
* revert one change: file output shall reflect exact same "output_fields" key as rule output, note that src is already part of the "output_fields" schema.

Co-authored-by: Jason Dellaluce <jasondellaluce@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Co-authored-by: Stanley Chan <pocketgamer5000@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
…rics interval

Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
…s falco. and scap.

* Ensure each metric field name more consistently adheres to the grammar used in Falco rules:
  * `falco.`: new field class representing userspace counters, statistics, resource utilization, or necessary information fields
  * `scap.`: new field class represents counters and statistics mostly obtained from Falco's kernel instrumentation before events are sent to userspace, but can include scap userspace stats as well
* minor cleanup

Co-authored-by: Jason Dellaluce <jasondellaluce@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
…rity

Co-authored-by: Jason Dellaluce <jasondellaluce@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
@incertum incertum force-pushed the falco-resource-utilization-init branch from 0cf6784 to 82d56f9 Compare May 22, 2023 13:33
@incertum
Copy link
Contributor Author

@jasondellaluce feedback incorporated except:

We should not format every value as string: json supports numbers, so I think we should format numbers as numbers, and strings as strings. This could probably be attained easily by passing down nlohman json array instead of a map<string,string> to handle_msg() and to the output queue control message.

This would require changing formatting also for Falco rules. Right now we use the exact same formatting and I would keep it that way in this PR and rather uniformly refactor handle_msg in a follow PR if we all determine it won't break anything or cause any other unexpected side effects. For the most part casting from string in your ETL is no problem at all and in fact many Falco rules fields are numeric today.

@leogr and @jasondellaluce and @happy-dude polished the config description a lot, please take another look.

@happy-dude
Copy link
Contributor

polished the config description a lot, please take another look.

Read through the falco.yaml doc changes and it is clear what the config entails; thanks for the improvements 🥳 !

Copy link
Contributor

@jasondellaluce jasondellaluce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@poiana
Copy link
Contributor

poiana commented May 22, 2023

LGTM label has been added.

Git tree hash: 16e6a97d509759fa56febc3d4c014b315be3d981

@jasondellaluce
Copy link
Contributor

I'll take care of right number formatting in a follow-up PR.

{
msg.delta = msg.stats;
char metric_name[STATS_NAME_MAX] = "falco.";
strncat(metric_name, utilization[stat].name, sizeof(metric_name) - strlen(metric_name));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If src contains n or more bytes, strncat() writes n+1 bytes to
       dest (n from src plus the terminating null byte).  Therefore, the
       size of dest must be at least strlen(dest)+n+1.

We should do sizeof(metric_name) - strlen(metric_name) - 1 here to avoid missing the terminator char.
We can do it in a subsequent PR but please do that :) @jasondellaluce

Copy link
Contributor

@FedeDP FedeDP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, strncat note aside.
We can fix it in a subsequent PR!

Copy link
Contributor

@FedeDP FedeDP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@poiana
Copy link
Contributor

poiana commented May 23, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: FedeDP, incertum, jasondellaluce

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [FedeDP,jasondellaluce]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@incertum
Copy link
Contributor Author

LGTM, strncat note aside. We can fix it in a subsequent PR!

Same comment here as in another PR:

@FedeDP please note in libs we do it the way it currently is e.g. kmod_path = strncat(cwd, KMOD_DEFAULT_PATH, FILENAME_MAX - strlen(cwd)) hence we would need a commit in libs to uniformly change it as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

7 participants