
Auditbeat: Add backpressure_strategy option (#7157) #7185

Merged
10 commits merged into elastic:master on Jun 5, 2018

Conversation

@adriansr (Contributor) commented May 28, 2018

This adds a new configuration option, "backpressure_strategy", to the auditd
module in Auditbeat. It allows choosing how Auditbeat mitigates or prevents
backpressure from propagating into the kernel and impacting audited processes
(a configuration sketch follows the list of values below).

The possible values are:

  • "kernel": Auditbeat will set the backlog_wait_time in the kernel's
    audit framework to 0. This causes events to be discarded in the kernel
    if the audit backlog queue fills to capacity. Requires a 3.14 kernel
    or newer.
  • "userspace": Auditbeat will drop events when there is backpressure
    from the publishing pipeline. If no rate_limit is set, a rate limit of
    5000 will be applied. Users should test their setup and adjust the
    rate_limit option accordingly.
  • "both": the "kernel" and "userspace" strategies at the same time.
  • "auto" (default): The "kernel" strategy will be used if supported;
    otherwise it will fall back to "userspace".
  • "none": No backpressure mitigation measures will be enabled.
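
For reference, a minimal auditbeat.yml sketch using the new option. This is a sketch only: the module block layout is the standard Beats module configuration, and only backpressure_strategy and rate_limit come from this change.

    - module: auditd
      # One of: kernel, userspace, both, auto (default), none
      backpressure_strategy: auto
      # Only relevant to the userspace strategy; if unset, a limit of 5000 is applied.
      #rate_limit: 5000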

Closes #7157

Other Changes:

  • Increase default reassembler.queue_size to 8192.

  • Change reassembler lost metric to count sequence gaps. It was renamed to auditd.reassembler_seq_gaps.

  • Add received metric that counts the total number of received messages. It's called auditd.received_msgs.

  • The auditd module now ignores its own syscall invocations by adding a kernel audit rule that ignores events from its own PID. This rule is added whenever the user has defined audit rules.

  • Make the number of stream buffer consumers configurable.

    Originally there was only one consumer for the auditd stream buffer.
    This patch allows setting the number of consumers with the new
    stream_buffer_consumers setting in the auditd module.

    By default it will use as many consumers as GOMAXPROCS, with a maximum
    of 4 (see the sketch below).
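
A minimal, self-contained sketch of that defaulting logic. The constant name maxDefaultStreamBufferConsumers appears in the diff excerpts quoted later in this thread; the helper function here is assumed for illustration, not taken from the patch, and an earlier revision discussed below used runtime.NumCPU instead of GOMAXPROCS.

	package main

	import (
		"fmt"
		"runtime"
	)

	// maxDefaultStreamBufferConsumers caps the default number of stream
	// buffer consumers (name taken from the diff excerpts below).
	const maxDefaultStreamBufferConsumers = 4

	// defaultStreamBufferConsumers is a hypothetical helper: a configured
	// value of 0 means "use GOMAXPROCS, capped at 4".
	func defaultStreamBufferConsumers(configured int) int {
		if configured != 0 {
			return configured
		}
		// runtime.GOMAXPROCS with an argument < 1 only reports the
		// current setting without changing it.
		if n := runtime.GOMAXPROCS(-1); n < maxDefaultStreamBufferConsumers {
			return n
		}
		return maxDefaultStreamBufferConsumers
	}

	func main() {
		fmt.Println(defaultStreamBufferConsumers(0))
	}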

		ms.log.Warn("setting backlog wait time is not supported in this kernel. Enabling workaround.")
		ms.backpressureStrategy |= bsUserSpace
	} else {
		return errors.New("kernel backlog wait time not supported by kernel, but required by backpressure_strategy.")


error strings should not be capitalized or end with punctuation or a newline

@andrewkroh (Member) left a comment:

I like what you have done here with auto mode. I tested it on CentOS 6 (vagrant up centos6). Here are some thoughts:

  • With the userspace strategy, add a default rate limit to help ensure that the backlog stays low enough to prevent waiting. We can advise users to test and tune their rate_limits when on older kernels.

  • Increase the default reassembler.queue_size to 8k.

  • Increase the default backlog_limit from 2^13 to 2^14.


These are some less critical ideas we should discuss.

  • Change reassembler_lost to be reassemblerLostMetric.Add(int64(count)). And potentially change the name again to indicate that it is the number of sequence number gaps rather than individual messages. The other *_lost metrics are based on individual messages. How about reassembler_seq_gaps?

  • Add auditd.received_msgs metric based on number of successful receive calls.

  • Consider removing the sent async status request seq=43 message. It seems kind of verbose.

  • By default, add a rule to exempt the Auditbeat PID from auditing (-A exit,never -F pid=$(pgrep auditbeat) -S all). This will ensure that none of Auditbeat's syscalls are "backlogged", regardless of whatever user-defined rules there are.

  • Consider adding more than one worker for parsing queued events.

	// Fan out: start StreamBufferConsumers goroutines that drain the
	// stream buffer's output channel until the reporter signals shutdown.
	var wg sync.WaitGroup
	wg.Add(int(ms.config.StreamBufferConsumers))
	for i := 0; i < int(ms.config.StreamBufferConsumers); i++ {
		go func() {
			defer wg.Done()
			for {
				select {
				case <-reporter.Done():
					return
				case msgs := <-out:
					reporter.Event(buildMetricbeatEvent(msgs, ms.config))
				}
			}
		}()
	}
	wg.Wait()

@praseodym (Contributor) commented:
Note that the audit backlog is an array of audit message structs of ~9000 bytes each. Setting the backlog size to 2^14 will allocate 147 megabytes of kernel memory, which seems a little much. I would much prefer Auditbeat to do more buffering in userspace memory that can be reclaimed when it is no longer needed.
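
(For reference, the arithmetic behind that figure, using the ~9000-byte struct size cited above: 2^14 × 9000 bytes = 16384 × 9000 ≈ 147 MB.)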

Commit messages from the other commits in this PR:

  • This sets a default rate limit of 5000 audit events per second when
    Auditbeat is configured to drop events in user-space instead of in the
    kernel.
  • Added a default rule to ignore system calls executed by Auditbeat itself.
  • Originally there was only one consumer for the auditd stream buffer.
    This patch allows setting up a number of consumers with the new
    `stream_buffer_consumers` setting in the auditd module. By default it
    will use as many consumers as CPUs, with a maximum of 4.
	// with a max of `maxDefaultStreamBufferConsumers`
	if numConsumers == 0 {
		if numConsumers = runtime.NumCPU(); numConsumers > maxDefaultStreamBufferConsumers {
			numConsumers = maxDefaultStreamBufferConsumers
@adriansr (Contributor Author):

I am not sure this makes much sense, as opposed to just setting a default of 2.

@andrewkroh (Member) left a comment:

Overall LGTM. Eager to get this into a snapshot and do some wider testing.

	// By default (stream_buffer_consumers=0) use as many consumers as local CPUs
	// with a max of `maxDefaultStreamBufferConsumers`
	if numConsumers == 0 {
		if numConsumers = runtime.NumCPU(); numConsumers > maxDefaultStreamBufferConsumers {
Member commented:

Would it make sense to use runtime.GOMAXPROCS(-1) instead of runtime.NumCPU?

Contributor commented:

I think so, although it could be beneficial to have more goroutines than CPUs (after all, goroutines aren't threads).

	case <-reporter.Done():
		return
	case msgs := <-out:
		reporter.Event(buildMetricbeatEvent(msgs, ms.config))
Contributor commented:

One thing that concerns me is that the channel to which reporter.Event publishes its events only has a buffer size of a single event. It feels like this could very quickly become a bottleneck, especially with this PR adding concurrency in event processing.

Has anyone maybe done some profiling that could disprove this?

Member commented:

Thanks for pointing this out. I'm going to do some more testing this week. And I'll investigate this.

@andrewkroh merged commit 124c8a2 into elastic:master on Jun 5, 2018
@pmoust (Member) commented Jul 4, 2018

Could we see this in 6.x as well?

Kudos for the detailed commit message; however, public-facing docs are missing. Will they be tackled in a follow-up PR?

@adriansr (Contributor Author) replied:
@pmoust yes, docs will be ready when this is released

@jordansissel (Contributor) commented Jul 25, 2018

+1 to backport this into 6.x if possible (I didn't see it documented in https://www.elastic.co/guide/en/beats/auditbeat/6.3/auditbeat-module-auditd.html so I assume it's not in 6.3.x?)

@adriansr (Contributor Author) replied:
@jordansissel this will be in 6.4
