
Auditbeat: Add backpressure_strategy option (#7157) #7185

Merged
10 commits merged into elastic:master on Jun 5, 2018

Conversation

@adriansr (Contributor) commented May 28, 2018

This adds a new configuration option, "backpressure_strategy", to the auditd
module in Auditbeat. It allows choosing how Auditbeat mitigates or prevents
backpressure from propagating into the kernel and impacting audited processes
(a configuration sketch follows the list of values below).

The possible values are:

  • "kernel": Auditbeat will set the backlog_wait_time in the kernel's
    audit framework to 0. This causes events to be discarded in the kernel
    if the audit backlog queue fills to capacity. Requires a 3.14 kernel
    or newer.
  • "userspace": Auditbeat will drop events when there is backpressure
    from the publishing pipeline. If no rate_limit is set, a rate limit of
    5000 will be applied. Users should test their setup and adjust the
    rate_limit option accordingly.
  • "both": the "kernel" and "userspace" strategies at the same time.
  • "auto" (default): The "kernel" strategy will be used if supported;
    otherwise it will fall back to "userspace".
  • "none": No backpressure mitigation measures will be enabled.
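
For reference, a minimal auditbeat.yml sketch using the new option. This is a sketch only: the module block layout is the standard Beats module configuration, and only backpressure_strategy and rate_limit come from this change.

    - module: auditd
      # One of: kernel, userspace, both, auto (default), none
      backpressure_strategy: auto
      # Only relevant to the userspace strategy; if unset, a limit of 5000 is applied.
      #rate_limit: 5000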

Closes #7157

Other Changes:

  • Increase default reassembler.queue_size to 8192.

  • Change reassembler lost metric to count sequence gaps. It was renamed to auditd.reassembler_seq_gaps.

  • Add received metric that counts the total number of received messages. It's called auditd.received_msgs.

  • The auditd module now ignores its own syscall invocations by adding a kernel audit rule that ignores events from its own PID. This rule is added whenever the user has defined audit rules.

  • Make the number of stream buffer consumers configurable.

    Originally there was only one consumer for the auditd stream buffer.
    This patch allows setting the number of consumers with the new
    stream_buffer_consumers setting in the auditd module.

    By default it will use as many consumers as GOMAXPROCS, with a maximum
    of 4 (see the sketch below).
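
A minimal, self-contained sketch of that defaulting logic. The constant name maxDefaultStreamBufferConsumers appears in the diff excerpts quoted later in this thread; the helper function here is assumed for illustration, not taken from the patch, and an earlier revision discussed below used runtime.NumCPU instead of GOMAXPROCS.

	package main

	import (
		"fmt"
		"runtime"
	)

	// maxDefaultStreamBufferConsumers caps the default number of stream
	// buffer consumers (name taken from the diff excerpts below).
	const maxDefaultStreamBufferConsumers = 4

	// defaultStreamBufferConsumers is a hypothetical helper: a configured
	// value of 0 means "use GOMAXPROCS, capped at 4".
	func defaultStreamBufferConsumers(configured int) int {
		if configured != 0 {
			return configured
		}
		// runtime.GOMAXPROCS with an argument < 1 only reports the
		// current setting without changing it.
		if n := runtime.GOMAXPROCS(-1); n < maxDefaultStreamBufferConsumers {
			return n
		}
		return maxDefaultStreamBufferConsumers
	}

	func main() {
		fmt.Println(defaultStreamBufferConsumers(0))
	}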

		ms.log.Warn("setting backlog wait time is not supported in this kernel. Enabling workaround.")
		ms.backpressureStrategy |= bsUserSpace
	} else {
		return errors.New("kernel backlog wait time not supported by kernel, but required by backpressure_strategy.")


error strings should not be capitalized or end with punctuation or a newline

@andrewkroh (Member) left a comment:

I like what you have done here with auto mode. I tested it on CentOS 6 (vagrant up centos6). Here are some thoughts:

  • With the userspace strategy, add a default rate limit to help ensure that the backlog stays low enough to prevent waiting. We can advise users to test and tune their rate_limits when on older kernels.

  • Increase the default reassembler.queue_size to 8k.

  • Increase the default backlog_limit from 2^13 to 2^14.


These are some less critical ideas we should discuss.

  • Change reassembler_lost to be reassemblerLostMetric.Add(int64(count)). And potentially change the name again to indicate that it is the number of sequence number gaps rather than individual messages. The other *_lost metrics are based on individual messages. How about reassembler_seq_gaps?

  • Add auditd.received_msgs metric based on number of successful receive calls.

  • Consider removing the sent async status request seq=43 message. It seems kind of verbose.

  • By default, add a rule to exempt the Auditbeat PID from auditing (-A exit,never -F pid=$(pgrep auditbeat) -S all). This will ensure that none of Auditbeat's syscalls are "backlogged", regardless of whatever user-defined rules there are.

  • Consider adding more than one worker for parsing queued events.

	// Fan out: start StreamBufferConsumers goroutines that drain the
	// stream buffer's output channel until the reporter signals shutdown.
	var wg sync.WaitGroup
	wg.Add(int(ms.config.StreamBufferConsumers))
	for i := 0; i < int(ms.config.StreamBufferConsumers); i++ {
		go func() {
			defer wg.Done()
			for {
				select {
				case <-reporter.Done():
					return
				case msgs := <-out:
					reporter.Event(buildMetricbeatEvent(msgs, ms.config))
				}
			}
		}()
	}
	wg.Wait()

@praseodym (Contributor) commented:
Note that the audit backlog is an array of audit message structs of ~9000 bytes each. Setting the backlog size to 2^14 will allocate 147 megabytes of kernel memory, which seems a little much. I would much prefer Auditbeat to do more buffering in userspace memory that can be reclaimed when it is no longer needed.
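
(For reference, the arithmetic behind that figure, using the ~9000-byte struct size cited above: 2^14 × 9000 bytes = 16384 × 9000 ≈ 147 MB.)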

Commit messages from the other commits in this PR:

  • This sets a default rate limit of 5000 audit events per second when
    Auditbeat is configured to drop events in user-space instead of in the
    kernel.
  • Added a default rule to ignore system calls executed by Auditbeat itself.
  • Originally there was only one consumer for the auditd stream buffer.
    This patch allows setting up a number of consumers with the new
    `stream_buffer_consumers` setting in the auditd module. By default it
    will use as many consumers as CPUs, with a maximum of 4.
	// with a max of `maxDefaultStreamBufferConsumers`
	if numConsumers == 0 {
		if numConsumers = runtime.NumCPU(); numConsumers > maxDefaultStreamBufferConsumers {
			numConsumers = maxDefaultStreamBufferConsumers
@adriansr (Contributor Author):

I am not sure this makes much sense, as opposed to just setting a default of 2.

@andrewkroh (Member) left a comment:

Overall LGTM. Eager to get this into a snapshot and do some wider testing.

	// By default (stream_buffer_consumers=0) use as many consumers as local CPUs
	// with a max of `maxDefaultStreamBufferConsumers`
	if numConsumers == 0 {
		if numConsumers = runtime.NumCPU(); numConsumers > maxDefaultStreamBufferConsumers {
Member commented:

Would it make sense to use runtime.GOMAXPROCS(-1) instead of runtime.NumCPU?

Contributor commented:

I think so, although it could be beneficial to have more goroutines than CPUs (after all, goroutines aren't threads).

	case <-reporter.Done():
		return
	case msgs := <-out:
		reporter.Event(buildMetricbeatEvent(msgs, ms.config))
Contributor commented:

One thing that concerns me is that the channel to which reporter.Event publishes its events only has a buffer size of a single event. It feels like this could very quickly become a bottleneck, especially with this PR adding concurrency in event processing.

Has anyone maybe done some profiling that could disprove this?

Member commented:

Thanks for pointing this out. I'm going to do some more testing this week. And I'll investigate this.

@andrewkroh merged commit 124c8a2 into elastic:master on Jun 5, 2018
@pmoust (Member) commented Jul 4, 2018

Could we see this in 6.x as well?

Kudos for the detailed commit message; however, public-facing docs are missing. Will they be tackled in a follow-up PR?

@adriansr (Contributor Author) replied:
@pmoust yes, docs will be ready when this is released

@jordansissel (Contributor) commented Jul 25, 2018

+1 to backport this into 6.x if possible (I didn't see it documented in https://www.elastic.co/guide/en/beats/auditbeat/6.3/auditbeat-module-auditd.html so I assume it's not in 6.3.x?)

@adriansr (Contributor Author) replied:
@jordansissel this will be in 6.4
