fix(outputs): expose outputs_queue config for memory control #2711
Conversation
falco.yaml (Outdated)

# recovery: 1 means simply exit (default behavior).
# recovery: 2 means empty the queue and then continue.
queue_capacity_outputs:
  items: 0
@leogr as per our chat, preserving the old default (unbounded and exiting). Obviously we can change the encoding etc.
Furthermore, @leogr I would like to better understand the interrelationships with, for instance, output_timeout, buffered_outputs, outputs.rate and outputs.max_burst. Their descriptions are also still rather cryptic, especially the ones for outputs. In other words, I am looking for a scaling-factor recommendation: if you change, say,

outputs:
  rate: 100000
  max_burst: 100000000

what queue capacity should you choose in relation? Maybe the timeout default is not right? Do we block for too long? How was the current default derived?
Also, reading this raises some question marks ... because not having a recovery strategy here as well appears suboptimal:

# It's important to note that Falco outputs will not be discarded from the
# output queue. This means that if an output channel becomes blocked
# indefinitely, it indicates a potential issue that needs to be addressed by the
# user.
output_timeout: 2000
Furthermore, @leogr I would like to better understand the interrelationships with for instance output_timeout, buffered_outputs and outputs.rate and outputs.max_burst.

Let me explain them one by one in a logical order (from upstream to downstream).
- (step 0: main thread) the rule engine processes incoming events
- (step 1: main thread) outputs.rate and outputs.max_burst configure a simple rate limiter. It is applied immediately after a rule's condition is met and immediately before pushing the message into the queue (see the sketch after this list):

  falco/userspace/falco/app/actions/process_events.cpp, lines 328 to 344 in e0c6c9d

	// As the inspector has no filter at its level, all
	// events are returned here. Pass them to the falco
	// engine, which will match the event against the set
	// of rules. If a match is found, pass the event to
	// the outputs.
	std::unique_ptr<falco_engine::rule_result> res = s.engine->process_event(source_engine_idx, ev);
	if(res)
	{
		if (!rate_limiter_enabled || rate_limiter.claim())
		{
			s.outputs->handle_event(res->evt, res->rule, res->source,
			                        res->priority_num, res->format, res->tags);
		}
		else
		{
			falco_logger::log(LOG_DEBUG, "Skipping rate-limited notification for rule " + res->rule + "\n");
		}
	}

  It was a legacy way to protect the outputs system from being flooded by a peak of messages. However, this mechanism should be discouraged since the rate limiter discards those events, so one can potentially lose critical alerts. For this reason, it's not enabled by default.
- (step 2) here's the queue
- (step 3a: in another thread) A worker pops messages from the queue and fans them out to each (enabled) output channel implementation.
- (step 3b: in another thread) A watchdog checks that the output channel implementation is not blocking the worker (and thus the whole queue) for longer than the value configured in output_timeout. When the timeout is reached, a log message is triggered as a last resort.
- (step 4: in the same thread as 3a) each enabled output channel implementation receives the message and outputs it. Among those output implementations, some support the buffered_outputs option (for example, the file output and the stdout output). The actual meaning of "buffered" may be slightly different depending on the specific output, and not all outputs support it. Anyway, it generally refers to enabling (or not) the buffer on the underlying byte stream of the output channel implementation.
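As a reference for step 1, here is a minimal, self-contained token-bucket sketch of the kind of rate limiter that outputs.rate (tokens added per second) and outputs.max_burst (bucket capacity) configure. It is illustrative only: the class name and the claim() semantics are assumptions, not Falco's actual token-bucket implementation.

	#include <algorithm>
	#include <chrono>

	// Minimal token-bucket sketch (illustrative, not Falco's implementation).
	// "rate" is tokens added per second, "max_burst" is the bucket capacity.
	class simple_token_bucket
	{
	public:
		simple_token_bucket(double rate, double max_burst)
			: m_rate(rate), m_max_tokens(max_burst), m_tokens(max_burst),
			  m_last(std::chrono::steady_clock::now()) {}

		// Returns true if one token could be claimed; false means "rate limited",
		// i.e. the alert would be discarded at this point.
		bool claim()
		{
			auto now = std::chrono::steady_clock::now();
			double elapsed = std::chrono::duration<double>(now - m_last).count();
			m_last = now;
			// Refill proportionally to the elapsed time, capped at the burst size.
			m_tokens = std::min(m_max_tokens, m_tokens + elapsed * m_rate);
			if(m_tokens < 1.0)
			{
				return false;
			}
			m_tokens -= 1.0;
			return true;
		}

	private:
		double m_rate;
		double m_max_tokens;
		double m_tokens;
		std::chrono::steady_clock::time_point m_last;
	};

The key scaling intuition this shows: rate bounds the sustained alert throughput, while max_burst bounds how large a momentary spike can pass before claims start failing.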
Why is output_timeout so important?

When an output consumer blocks (i.e. imagine the program output or any I/O-related op that happens downstream of Falco) and the alert cannot be delivered within the given deadline, Falco uses stderr to notify the user of a problem it cannot address.

Why stderr? Because the output channels are blocked.

Why can't Falco address the problem? Because such a situation indicates an I/O problem downstream of Falco (for example, a network slowdown) or, more rarely, a misconfiguration (for instance, imagine configuring the program output with a program that blocks for more than 2 seconds; you can simulate this using sleep 10).

So output_timeout is just an emergency mechanism to notify the user that something (outside Falco) is blocking Falco and that Falco is giving up because it can't handle the problem. It has no direct relation to Falco's performance or to its behavior during normal operations.
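For illustration only, here is a heavily simplified sketch of this pattern: a worker thread performs the (potentially blocking) delivery while the caller waits on a deadline and falls back to stderr when the deadline expires. Everything here (deliver_to_outputs, deliver_with_watchdog, the overall structure) is an assumption for the sketch, not Falco's actual watchdog code.

	#include <chrono>
	#include <condition_variable>
	#include <cstdio>
	#include <mutex>
	#include <string>
	#include <thread>

	// Hypothetical stand-in for a blocking output call downstream of Falco
	// (e.g. a program output that runs "sleep 10").
	static void deliver_to_outputs(const std::string& msg)
	{
		(void)msg;
		std::this_thread::sleep_for(std::chrono::seconds(10));
	}

	// Illustrative watchdog sketch: a worker delivers one alert while a deadline
	// is enforced; past the deadline we fall back to stderr, because the regular
	// output channels are exactly what is blocked.
	static void deliver_with_watchdog(const std::string& msg, long output_timeout_ms)
	{
		std::mutex mtx;
		std::condition_variable cv;
		bool done = false;

		std::thread worker([&] {
			deliver_to_outputs(msg); // may block on slow I/O
			{
				std::lock_guard<std::mutex> lk(mtx);
				done = true;
			}
			cv.notify_one();
		});

		std::unique_lock<std::mutex> lk(mtx);
		if(!cv.wait_for(lk, std::chrono::milliseconds(output_timeout_ms),
		                [&] { return done; }))
		{
			// Last resort: notify the user on stderr that something outside
			// Falco is blocking the outputs.
			std::fprintf(stderr, "Timeout while delivering an alert to the outputs\n");
		}
		lk.unlock();
		worker.join();
	}

	int main()
	{
		deliver_with_watchdog("example alert", 2000); // 2000 ms, mirroring output_timeout
		return 0;
	}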
thank you Leo for the good explanation!
What about deprecating the rate_limiter (just put a deprecation warning in Falco and a [DEPRECATED] label in the config) in this release (0.36) and then removing it in 0.37? WDYT?
+1 on this @Andreagit97:
Don't wanna come across as always being in favor of just getting rid of things, but at the same time, we do need to improve technical clarity and clean things up. Referring to it as "technical debt" without any negative connotation -- it's normal in every project.
@Andreagit97 would you be willing to run point on this? Asking because you know I will be out most of Sep and will limit code contributions now for this release, but happy to help as reviewer!
I don't have a strong opinion on the rate limiter. I wonder if it can still be helpful with some use cases or not.
I vaguely recall I had a conversation with @jasondellaluce a while ago.
Anyway, I'm okay with removing it if we get a consensus. In such a case, I recommend putting a deprecation warning for at least 1 release before completely removing it.
Hmm, I am just confused that we then silently drop events without even counting these drops. Is there another way to do it? If we keep it, can we at least add drop counters?
I guess it comes back to Falco being a bit of a different type of service, where the end goal would be to not produce too many alerts and to try to never drop an event.
yeah I will open an issue to deprecate it and see if we reach consensus
This PR may bring feature or behavior changes in the Falco engine and may require the engine version to be bumped. Please double check the userspace/engine/falco_engine_version.h file. See versioning for FALCO_ENGINE_VERSION.

/hold
userspace/falco/falco_outputs.cpp (Outdated)

@@ -268,8 +276,21 @@ inline void falco_outputs::push(const ctrl_msg& cmsg)
 {
 	if (!m_queue.try_push(cmsg))
 	{
-		fprintf(stderr, "Fatal error: Output queue reached maximum capacity. Exiting.\n");
-		exit(EXIT_FAILURE);
+		switch (m_recovery)
Not sure if we are 100% thread safe here 🤔
Probably not; maybe @jasondellaluce, once back, may have additional ideas that we could also address in a follow-up PR, since Jason worked a lot on the thread-safety aspects all around.
what would happen in multi-source scenarios?
I just left a few comments.
Also, please keep in mind that this PR should be rebased, and we must make sure the changes introduced by #2663 are preserved (i.e. the WASM build support).
Finally, a question. Since we are dropping a message from the queue, what will happen when a control message is lost (for example, ctrl_msg_type::CTRL_MSG_STOP or ctrl_msg_type::CTRL_MSG_REOPEN)?
I wonder if it can be a problem in some cases 🤔
Let me rebase very carefully and then implement each change request very carefully later on. Also @Andreagit97 great suggestions all around!
Good question, we need to check.
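To make the concern concrete, here is a simplified sketch (assumed names and structure, not Falco's actual worker code) of a consumer loop whose graceful shutdown and file reopening depend on control messages traveling through the same queue as alerts. If CTRL_MSG_STOP were dropped on push, a loop like this would never return.

	#include <string>
	#include <tbb/concurrent_queue.h>

	// Assumed message shape, mirroring the identifiers quoted above.
	enum class ctrl_msg_type { CTRL_MSG_OUTPUT, CTRL_MSG_STOP, CTRL_MSG_REOPEN };
	struct ctrl_msg { ctrl_msg_type type; std::string payload; };

	// Simplified consumer loop: it only returns once CTRL_MSG_STOP is popped.
	// If the stop message was dropped on push, this loop would never terminate.
	void worker_loop(tbb::concurrent_bounded_queue<ctrl_msg>& queue)
	{
		ctrl_msg cmsg;
		while(true)
		{
			queue.pop(cmsg); // blocks until a message is available
			switch(cmsg.type)
			{
			case ctrl_msg_type::CTRL_MSG_STOP:
				return; // graceful shutdown depends on this message arriving
			case ctrl_msg_type::CTRL_MSG_REOPEN:
				// reopen file/program outputs (omitted)
				break;
			default:
				// deliver cmsg.payload to every enabled output channel (omitted)
				break;
			}
		}
	}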
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Hi 👋 kindly checking in on the status of this PR. Eager to give my best to get this over the finish line; at the same time, please note I have limited availability next week and after that I am out until after the release (something I have communicated well early on). Rating this PR as important to be able to get to the bottom of memory leaks. Falco can effectively leak memory in production today, as confirmed by multiple adopters (including myself).
The PR LGTM! I still have an open point, unrelated to the PR, that I would like to understand: why do we have 2 queues, one for outputs and another for metrics? @jasondellaluce @leogr
@@ -258,6 +260,18 @@ void falco_configuration::load_yaml(const std::string& config_name, const yaml_h
}

	m_buffered_outputs = config.get_scalar<bool>("buffered_outputs", false);
	m_outputs_queue_capacity = config.get_scalar<size_t>("outputs_queue.capacity", DEFAULT_OUTPUTS_QUEUE_CAPACITY_UNBOUNDED_MAX_LONG_VALUE);
Can we use an int64_t instead of size_t? In this way, the DEFAULT_OUTPUTS_QUEUE_CAPACITY could simply be -1.
We can; at the same time I always prefer to stay close to the framework, here TBB, and they use size_t. Let me know if it's a [nit] or if we significantly gain from this change, then I'll do it.
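For illustration, here is a minimal sketch of how a size_t capacity read from the config could be applied to a TBB bounded queue, assuming the convention from the diff above that the unbounded default is encoded as the maximum size_t value. The constant spelling and the helper name apply_outputs_queue_capacity are assumptions, not the actual Falco code.

	#include <cstddef>
	#include <limits>
	#include <tbb/concurrent_queue.h>

	// Assumed encoding (mirrors the diff above): "unbounded" is the max value.
	constexpr std::size_t DEFAULT_OUTPUTS_QUEUE_CAPACITY_UNBOUNDED =
	    std::numeric_limits<std::size_t>::max();

	// Hypothetical helper: only bound the queue when an explicit capacity was set.
	template<typename T>
	void apply_outputs_queue_capacity(tbb::concurrent_bounded_queue<T>& queue,
	                                  std::size_t capacity)
	{
		if(capacity != DEFAULT_OUTPUTS_QUEUE_CAPACITY_UNBOUNDED)
		{
			// concurrent_bounded_queue::set_capacity() makes try_push() fail
			// once the queue holds `capacity` items.
			queue.set_capacity(static_cast<std::ptrdiff_t>(capacity));
		}
		// Otherwise leave the queue with its default (unbounded) capacity.
	}

With an int64_t config value the sentinel could indeed simply be -1, at the cost of a sign conversion before handing the value to TBB.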
This is a good question; actually, the worker thread won't stop because it doesn't receive the stop message... My real question is, should a user run Falco with alert drops at all? I mean, if you drop alerts it means that Falco is no longer useful, so yes, I get the point of having a bounded queue, but if the user faces some alert drops, either there is a bug or they have to configure Falco with a larger queue capacity 🤔
On a second thought, I'm asking myself if we really want the "empty" recovery strategy. IMO what we need to achieve is to prevent the OOM killer from killing us and to provide users with a graceful shutdown. If we drop alerts, this is a symptom of a bug, because we have a dedicated thread that just outputs logs all the time, so it's almost impossible that it cannot keep up with the main thread. It could happen if the tool is misconfigured (too noisy, so a rule misconfiguration; too many outputs, so a tool misconfiguration). In my opinion, in this case we want to report the problem to the user and stop Falco instead of "hiding" the issue by keeping up with drops. Moreover, as pointed out by @leogr, we could face issues if we drop CTRL messages... WDYT? @incertum @jasondellaluce @leogr Please let's try to reach a consensus before Melissa goes away; in this way we can close this workstream 👍
Currently there can also be issues with the control messages possibly being stuck for a long time. Personally, I think the control messages should not be in the alerts output queue and should rather be processed directly or on a separate hot path. @leogr and I chatted and concluded it's not a high priority. I would vouch for cleaning it up in a separate PR regardless.
I do agree that Falco shall not be used with random alert sampling / dropping, or with dropping based on the queue filling up. Here is how I think about this initial change: it's a first step towards making data-driven decisions (vs speculating) to see how much of an issue this can be in production. It helps determine whether this could cause the memory leaks on some servers, or it can show that this is not a problem and we have to look elsewhere. Based on some field experience we can come back and optimize this. For example, it would be interesting to see if there could be few kernel-side drops, yet because Falco is used for too-verbose auditing / alerting we fail to keep up in userspace, and so on.
Let's experiment with it. Maybe
ACK, a few more thoughts: I believe the new Falco metrics counters will be helpful to gain a better understanding of what could be going on on troublesome servers, hence we do count the new drops just like the kernel-side drops. In addition, adding the capacity config is also good software engineering practice and effectively makes the kernel buffer and the output queue equivalent in terms of end-user-exposed pre-defined bounds. Re your last point about what the end user should do after possibly finding out that the output queue is not working... I would defer this discussion until we have some data.
Yep, since they are experimental we could also introduce them for now, but I would consider it again before promoting them as stable.
I agree with experimenting with this. Also, if we merge this soon, we can include it in Falco 0.36.0 RC1 and give it a try. We will then have a couple of weeks to fine-tune it if needed.
LGTM label has been added. Git tree hash: 6bc1e99a5c1e81dc9d56395894c2f5836fdc7162
/hold
Found a thread safety hazard. Overall looks good!
switch (m_outputs_queue_recovery)
{
case falco_common::RECOVERY_EXIT:
	falco_logger::log(LOG_ERR, "Fatal error: Output queue out of memory. Exiting ...");
Why not throw an exception here? It would cause one open event-source thread to be stopped (and, in cascade, all the others open in parallel), and Falco to terminate gracefully.
Sounds good will do!
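A minimal sketch of what that could look like, with stand-in types (the falco_common enum and falco_exception here are simplified assumptions, not the real definitions): throwing lets the event-source thread unwind so Falco can terminate gracefully instead of calling exit().

	#include <stdexcept>
	#include <string>

	// Assumed stand-ins for Falco types, for illustration only.
	namespace falco_common { enum recovery { RECOVERY_EXIT, RECOVERY_EMPTY }; }
	struct falco_exception : std::runtime_error { using std::runtime_error::runtime_error; };

	// Sketch: on a full queue, throw instead of calling exit(EXIT_FAILURE) so the
	// calling thread unwinds and the process can shut down gracefully.
	void on_queue_full(falco_common::recovery mode)
	{
		switch(mode)
		{
		case falco_common::RECOVERY_EXIT:
			throw falco_exception("Fatal error: Output queue out of memory.");
		case falco_common::RECOVERY_EMPTY:
			// drop queued messages and continue (see the drop-counter sketch below)
			break;
		}
	}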
userspace/falco/falco_outputs.cpp (Outdated)

	falco_logger::log(LOG_ERR, "Fatal error: Output queue out of memory. Exiting ...");
	exit(EXIT_FAILURE);
case falco_common::RECOVERY_EMPTY:
	m_outputs_queue_num_drops += m_queue.size();
Quoting the docs of falco_outputs:

"All methods in this class are thread-safe. The output framework supports a multi-producer model where messages are stored in a queue and consumed by each configured output asynchronously."

Since m_queue is a thread-safe object as guaranteed by TBB, I'd say m_outputs_queue_num_drops should be atomic here.
good catch!
Also realized that when counting the number of drops for the empty-queue case I think I need to add +1; will do this later.
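Here is a minimal sketch of what the empty-and-continue branch could look like with an atomic drop counter, under the assumptions that producers push concurrently (hence std::atomic) and that the +1 accounts for the message that failed to be pushed, in addition to the messages flushed from the queue. Names mirror the snippet above, but the code is illustrative, not the final implementation.

	#include <atomic>
	#include <cstdint>
	#include <string>
	#include <tbb/concurrent_queue.h>

	struct ctrl_msg { std::string payload; }; // simplified message type (assumed)

	// Shared state, shapes assumed for illustration.
	static tbb::concurrent_bounded_queue<ctrl_msg> m_queue;
	static std::atomic<uint64_t> m_outputs_queue_num_drops{0};

	// Called by any producer thread when try_push() fails on the bounded queue.
	void recover_by_emptying(const ctrl_msg& cmsg)
	{
		uint64_t dropped = 1; // +1: cmsg itself could not be pushed and is lost
		ctrl_msg discarded;
		// Drain the queue with thread-safe try_pop() calls so producers can make
		// progress again; count every message we throw away.
		while(m_queue.try_pop(discarded))
		{
			++dropped;
		}
		// std::atomic because multiple producer threads may hit this path at once.
		m_outputs_queue_num_drops.fetch_add(dropped, std::memory_order_relaxed);
		(void)cmsg;
	}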
Co-authored-by: Jason Dellaluce <jasondellaluce@gmail.com> Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
…recovery 'empty' Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
/approve
LGTM label has been added. Git tree hash: 935425a0859a89c2261e84d27cc2b0106435b8ae
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Andreagit97, incertum, leogr The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/unhold
What type of PR is this?
/kind bug
/kind cleanup
Any specific area of the project related to this PR?
/area engine
What this PR does / why we need it:
The TBB output queue is currently unbounded and is very likely the cause of memory-leak issues in production when the pipe is not holding up; see some simulated results in #2495 (comment).
We discussed that exposing new configs is best practice, offering adopters more options to control consumption and limits beyond the external cgroups limits. It's also in line with the kernel buffer being bounded and its size being exposed as a config.
After all, no one wants SREs or system admins calling out that Falco may be leaking memory.
There could still be other issues. Therefore, the best next step is to merge an acceptable solution, test it in deployments, and go from there.
@leogr @jasondellaluce
Which issue(s) this PR fixes:
#2495 (comment)
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: