Auditbeat: Add backpressure_strategy option (#7157) #7185
Conversation
```go
ms.log.Warn("setting backlog wait time is not supported in this kernel. Enabling workaround.")
ms.backpressureStrategy |= bsUserSpace
} else {
	return errors.New("kernel backlog wait time not supported by kernel, but required by backpressure_strategy.")
```
error strings should not be capitalized or end with punctuation or a newline
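A minimal sketch of the flagged string rewritten to follow that convention (the variable name and package are assumptions for illustration):

```go
package auditd

import "errors"

// Reworded per the Go convention noted above: lower-case, no trailing
// punctuation, so the message composes cleanly when callers wrap it
// (e.g. "failed to configure audit: kernel backlog wait time ...").
var errBacklogWaitTime = errors.New("kernel backlog wait time not supported by kernel, but required by backpressure_strategy")
```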
I like what you have done here with `auto` mode. I tested it on CentOS 6 (`vagrant up centos6`). Here are some thoughts:
- With the `userspace` strategy, add a default rate limit to help ensure that the backlog stays low enough to prevent waiting. We can advise users to test and tune their `rate_limit` when on older kernels.
- Increase the default `reassembler.queue_size` to 8k.
- Increase the default `backlog_limit` from 2^13 to 2^14 (see the sizing sketch just below).
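For scale, a quick self-contained check of what that backlog increase pins in kernel memory, using the ~9000-byte message size quoted later in this thread:

```go
package main

import "fmt"

func main() {
	// Each slot in the kernel audit backlog holds a message struct of
	// roughly 9000 bytes (figure quoted later in this thread).
	const msgSize = 9000
	for _, limit := range []int{1 << 13, 1 << 14} {
		fmt.Printf("backlog_limit=%d -> ~%d MB of kernel memory\n",
			limit, limit*msgSize/1000000) // prints ~73 MB and ~147 MB
	}
}
```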
These are some less critical ideas we should discuss.
- Change `reassembler_lost` to be `reassemblerlostMetric.Add(int64(count))`. And potentially change the name again to indicate that it is the number of sequence number gaps rather than individual messages. The other `*_lost` metrics are based on individual messages. How about `reassembler_seq_gaps`?
- Add an `auditd.received_msgs` metric based on the number of successful receive calls.
- Consider removing the `sent async status request seq=43` message. It seems kind of verbose.
- By default, add a rule to exempt the Auditbeat PID from auditing (`-A exit,never -F pid=$(pgrep auditbeat) -S all`). This will ensure that none of the Auditbeat syscalls are "backlogged" regardless of whatever user-defined rules there are.
- Consider adding more than one worker for parsing queued events.
```go
// Fan the stream buffer out to multiple consumers, then wait for all of
// them to finish before returning.
var wg sync.WaitGroup
wg.Add(int(ms.config.StreamBufferConsumers))
for i := 0; i < int(ms.config.StreamBufferConsumers); i++ {
	go func() {
		defer wg.Done()
		for {
			select {
			case <-reporter.Done():
				return
			case msgs := <-out:
				reporter.Event(buildMetricbeatEvent(msgs, ms.config))
			}
		}
	}()
}
wg.Wait()
```
Note that the audit backlog is an array of audit message structs of ~9000 bytes each. Setting the backlog size to 2^14 will allocate 147 megabytes of kernel memory, which seems a little much. I would much prefer Auditbeat to do more buffering in userspace memory that can be reclaimed when it is no longer needed.
This adds a new configuration option, `backpressure_strategy`, to the auditd module in Auditbeat. It allows setting different ways in which Auditbeat can mitigate or avoid backpressure propagating into the kernel and impacting audited processes. The possible values are:
- "kernel": Auditbeat will set the backlog_wait_time in the kernel's audit framework to 0. This causes events to be discarded in the kernel if the audit backlog queue fills to capacity. Requires a 3.14 kernel or newer.
- "userspace": Auditbeat will drop events when there is backpressure from the publishing pipeline.
- "both": the "kernel" and "userspace" strategies at the same time.
- "auto" (default): the "kernel" strategy will be used, if supported. Otherwise it will fall back to "userspace".
- "none": no backpressure mitigation measures will be enabled.

Closes elastic#7157
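A minimal sketch of how these strategies could be represented as bit flags. `bsUserSpace` and the `|=` combination appear in the diff earlier in this thread; the other constants and the parsing helper are assumptions for illustration:

```go
package main

import "fmt"

type backpressureStrategy uint8

const (
	bsKernel    backpressureStrategy = 1 << iota // zero the kernel backlog_wait_time
	bsUserSpace                                  // drop events on pipeline backpressure
	bsAuto                                       // resolve to kernel or userspace at runtime
)

// parseStrategy is a hypothetical helper mapping config strings to flags.
func parseStrategy(s string) (backpressureStrategy, error) {
	switch s {
	case "kernel":
		return bsKernel, nil
	case "userspace":
		return bsUserSpace, nil
	case "both":
		return bsKernel | bsUserSpace, nil
	case "auto", "":
		return bsAuto, nil
	case "none":
		return 0, nil
	}
	return 0, fmt.Errorf("invalid backpressure_strategy %q", s)
}

func main() {
	s, _ := parseStrategy("both")
	fmt.Printf("both -> kernel=%v userspace=%v\n", s&bsKernel != 0, s&bsUserSpace != 0)
}
```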
This sets a default rate limit of 5000 audit events per second when Auditbeat is configured to drop events in user space instead of in the kernel.
Added a default rule to ignore system calls executed by Auditbeat itself.
Originally there was only one consumer for the auditd stream buffer. This patch allows setting the number of consumers with the new `stream_buffer_consumers` setting in the auditd module. By default it uses as many consumers as there are CPUs, with a maximum of 4.
```go
// with a max of `maxDefaultStreamBufferConsumers`
if numConsumers == 0 {
	if numConsumers = runtime.NumCPU(); numConsumers > maxDefaultStreamBufferConsumers {
		numConsumers = maxDefaultStreamBufferConsumers
```
I am not sure this makes much sense, as opposed to just setting a default of 2.
Overall LGTM. Eager to get this into a snapshot and do some wider testing.
```go
// By default (stream_buffer_consumers=0) use as many consumers as local CPUs
// with a max of `maxDefaultStreamBufferConsumers`
if numConsumers == 0 {
	if numConsumers = runtime.NumCPU(); numConsumers > maxDefaultStreamBufferConsumers {
```
Would it make sense to use `runtime.GOMAXPROCS(-1)` instead of `runtime.NumCPU`?
I think so, although it could be beneficial to have more goroutines than CPUs (after all, goroutines aren't threads).
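For reference, a tiny self-contained comparison of the two calls being discussed; `runtime.GOMAXPROCS(-1)` only queries the current setting, since any argument below 1 leaves it unchanged:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// NumCPU reports the logical CPUs visible to the process.
	fmt.Println("runtime.NumCPU():       ", runtime.NumCPU())
	// GOMAXPROCS(-1) returns the scheduler's current limit without
	// changing it; it can be lower than NumCPU when the user has set
	// the GOMAXPROCS environment variable.
	fmt.Println("runtime.GOMAXPROCS(-1): ", runtime.GOMAXPROCS(-1))
}
```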
```go
case <-reporter.Done():
	return
case msgs := <-out:
	reporter.Event(buildMetricbeatEvent(msgs, ms.config))
```
One thing that concerns me is that the channel to which `reporter.Event` publishes its events only has a buffer size of a single event. It feels like this could very quickly become a bottleneck, especially with this PR adding concurrency in event processing. Has anyone maybe done some profiling that could disprove this?
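To make the concern concrete, a self-contained sketch of the difference a buffered stage makes; the event type, buffer size, and pipeline shape are illustrative assumptions, not Auditbeat's actual reporter internals:

```go
package main

import "fmt"

func main() {
	// A size-1 channel (as described above) forces each producer to wait
	// on the consumer; a larger buffer lets parsers run ahead of the
	// publisher during bursts. 512 is an arbitrary illustrative size.
	events := make(chan int, 512)

	go func() {
		for i := 0; i < 1000; i++ {
			events <- i // blocks only when the buffer is full
		}
		close(events)
	}()

	published := 0
	for range events {
		published++ // stand-in for reporter.Event(...)
	}
	fmt.Println("published", published, "events")
}
```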
Thanks for pointing this out. I'm going to do some more testing this week. And I'll investigate this.
Could we see this in 6.x as well? Kudos for the detailed commit message; however, public-facing docs are missing. Will they be tackled in a follow-up PR?
@pmoust yes, docs will be ready when this is released
+1 to backport this into 6.x if possible (I didn't see it documented in https://www.elastic.co/guide/en/beats/auditbeat/6.3/auditbeat-module-auditd.html so I assume it's not in 6.3.x?)
@jordansissel this will be in 6.4
This adds a new configuration option, `backpressure_strategy`, to the auditd module in Auditbeat. It allows setting different ways in which Auditbeat can mitigate or avoid backpressure propagating into the kernel and impacting audited processes.

The possible values are:
- "kernel": Auditbeat will set the backlog_wait_time in the kernel's audit framework to 0. This causes events to be discarded in the kernel if the audit backlog queue fills to capacity. Requires a 3.14 kernel or newer.
- "userspace": Auditbeat will drop events when there is backpressure from the publishing pipeline. If no rate_limit is set, it will apply a rate limit of 5000. Users should test their setup and adjust the rate_limit option accordingly.
- "both": the "kernel" and "userspace" strategies at the same time.
- "auto" (default): the "kernel" strategy will be used, if supported. Otherwise it will fall back to "userspace".
- "none": no backpressure mitigation measures will be enabled.

Closes #7157
Other Changes:
- Increase the default `reassembler.queue_size` to 8192.
- Change the reassembler lost metric to count sequence gaps. It was renamed to `auditd.reassembler_seq_gaps`.
- Add a received metric that counts the total number of received messages, called `auditd.received_msgs`.
- The auditd module now ignores its own syscall invocations by adding a kernel audit rule that excludes events from its own PID. This rule is added whenever the user has defined audit rules.
- Make the number of stream buffer consumers configurable. Originally there was only one consumer for the auditd stream buffer. This patch allows setting the number of consumers with the new `stream_buffer_consumers` setting in the auditd module. By default it uses as many consumers as GOMAXPROCS, with a maximum of 4.