-
Notifications
You must be signed in to change notification settings - Fork 905
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
new(falco): introduce new metrics w/ Falco internal: metrics snapshot
option and new metrics
config
#2333
new(falco): introduce new metrics w/ Falco internal: metrics snapshot
option and new metrics
config
#2333
Conversation
Continuing to stage here, but the expectation would be to move CPU and memory resource utilization to Meanwhile, wondering about Added reading |
82abdba
to
f4059d0
Compare
Would this be the right place to request metrics/counters on syscalls inspected? A scenario that played out in my organization is that a endpoint-detection tool was experiencing abnormally high CPU usage, and emitted stats of syscalls it was inspecting. One particular syscall was encountered much more than the rest, and we did some further digging (via other tooling and auditd) and discovered that the latest version of a service daemon was experiencing errors and needed to be patched. While Falco's high CPU usage was not the source of the issue, being able to get an aggregate summary of what falco is doing during it's uptime is helpful. This is slightly different from getting a direct read of events Falco is triggering on (which I believe is already exported as a metric), because an event does not have to be a true positive to affect Falco's overall performance. |
👋 @happy-dude +1 here, would like an option of making the logs even more verbose for instance if CPU exceeds a threshold as well. Adding it to the linked issue. Existing stats for kernel side event_drops had been extended to be more verbose, they now expose some coarse-grained syscall categories counters. However, we should have more options in userspace. Would similar bucketing (but more buckets) of syscalls work? Those stats can also help making informed decisions on price tag associated with each evt type and maybe we discover more effective ways of discarding uninteresting events within an evt type earlier as well, but first we need to find a good way to only select the syscalls that are truly needed for the state engine based on Falco rules, see RIP POC#1, thinking about a better way now. Meanwhile, the scap-open util binary has become really good for testing stuff and measuring perf under different configurations as the kernel tracing accounts for most CPU. |
@incertum, ages ago, the most relevant memory metric for me turned out to be PSS (not RSS), which accounts for page sharing between processes (e.g. a 2MB mapping shared between two processes is accounted as 1MB to each one), but I don't think there's one metric that fits all cases (e.g. PSS fails when you want to determine how much memory you'll get back by killing that process, which I suppose the OOM killer is interested in). TL;DR, every choice we make will be wrong for somebody. @happy-dude, yeah, some more detailed stats would be nice. Still, to track down the issue you were having, you'd need per (tid-or-pid, syscall) stats and I'm afraid that's overkill in the general case (there's a pretty big chance you'd need to (re)allocate memory for every event while the stats table grows). I'm ->this<- close to suggesting that we use chisels for this (so that we can run arbitrary lua code on each event) but that will only make the perf issues worse, if any. Or we take the memory hit and we shove the stats in sinsp_threadinfo, but I'd rather make this optional. BTW, sysdig (as opposed to falco) would be a better tool IMO to track down performance issues. At the very least, it supports chisels out of the box ;) (there's e.g. |
This is great! Agree that dropped events gives an overview of the cost for monitoring the syscall/ currently loaded rules. At the cost of overhead, I'd like a verbose/debug flag that exposes per-syscall being inspected by Falco.
I'm catching up on the discussion with scap-open, and it seems pretty useful. Could be something my org can incorporate scap-open as part of our linting workflow as a proactive measure when crafting new rules 🤔 |
@happy-dude re Started a HackMD document to further discuss such counters per event type categories. Would you be able to help creating the right buckets? As I will open a libs PR with proper APIs could try adding them as well. Ty! @gnosek yup yup, will add PSS and try to find out more about |
/milestone 0.35.0 |
Issues go stale after 90d of inactivity. Mark the issue as fresh with Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with Provide feedback via https://github.com/falcosecurity/community. /lifecycle stale |
/remove-lifecycle stale |
f4059d0
to
1be3286
Compare
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
…s to MB Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
* renaming to `metrics` for technical clarity * adopt Prometheus like metrics interval settings Co-authored-by: Jason Dellaluce <jasondellaluce@gmail.com> Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Signed-off-by: Jason Dellaluce <jasondellaluce@gmail.com>
Signed-off-by: Jason Dellaluce <jasondellaluce@gmail.com>
* prefix counters and stats belonging to kernel space w/ `k.` else `u.` for userspace * add n_drops_perc from old stats writer schema * revert one change: file output shall reflect exact same "output_fields" key as rule output, note that src is already part of the "output_fields" schema. Co-authored-by: Jason Dellaluce <jasondellaluce@gmail.com> Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Co-authored-by: Stanley Chan <pocketgamer5000@gmail.com> Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
…rics interval Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
…s falco. and scap. * Ensure each metric field name more consistently adheres to the grammar used in Falco rules: * `falco.`: new field class representing userspace counters, statistics, resource utilization, or necessary information fields * `scap.`: new field class represents counters and statistics mostly obtained from Falco's kernel instrumentation before events are sent to userspace, but can include scap userspace stats as well * minor cleanup Co-authored-by: Jason Dellaluce <jasondellaluce@gmail.com> Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
…rity Co-authored-by: Jason Dellaluce <jasondellaluce@gmail.com> Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
0cf6784
to
82d56f9
Compare
@jasondellaluce feedback incorporated except:
This would require changing formatting also for Falco rules. Right now we use the exact same formatting and I would keep it that way in this PR and rather uniformly refactor @leogr and @jasondellaluce and @happy-dude polished the config description a lot, please take another look. |
Read through the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
LGTM label has been added. Git tree hash: 16e6a97d509759fa56febc3d4c014b315be3d981
|
I'll take care of right number formatting in a follow-up PR. |
{ | ||
msg.delta = msg.stats; | ||
char metric_name[STATS_NAME_MAX] = "falco."; | ||
strncat(metric_name, utilization[stat].name, sizeof(metric_name) - strlen(metric_name)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If src contains n or more bytes, strncat() writes n+1 bytes to
dest (n from src plus the terminating null byte). Therefore, the
size of dest must be at least strlen(dest)+n+1.
We should do sizeof(metric_name) - strlen(metric_name) - 1
here to avoid missing the terminator char.
We can do it in a subsequent PR but please do that :) @jasondellaluce
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, strncat
note aside.
We can fix it in a subsequent PR!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: FedeDP, incertum, jasondellaluce The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Same comment here as in another PR: @FedeDP please note in libs we do it the way it currently is e.g. |
Signed-off-by: Melissa Kilby melissa.kilby.oss@gmail.com
What type of PR is this?
/kind design
/kind feature
Any specific area of the project related to this PR?
/area engine
What this PR does / why we need it:
Initial solution to Falco native resource utilization metrics logs support.
Consider these key points about the new
metrics
feature in Falco:Additional highlights:
libbpf
statsWhich issue(s) this PR fixes:
#2222
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: