[Auditbeat] Avoid having Linux wait on clearing a backlog #7157

andrewkroh · 2018-05-22T14:14:38Z

Back-pressure from Auditbeat is propagated to the kernel via the unicast netlink socket buffer and can cause delays in the kernel. The propagation of back-pressure was implemented with the assumption that the kernel drops messages when the backlog queue is full. This assumption is true, but it has an unwanted side effect. When the backlog queue is full, the kernel will wait for the queue to "drain a little" before providing a buffer to the waiting auditable syscall. If the queue doesn't free up, the kernel will log a warning and continue with the syscall.

The waiting period is defined by the audit_backlog_wait_time variable. Prior to v3.14 the variable was not configurable. Then in v3.14 a commit was made to make this configurable through the audit system.

We need to make two changes for Auditbeat:

- Add ability to set backlog wait time go-libaudit#34 - For Linux 3.14+, set the backlog_wait_time to 0 by default to ensure that Auditbeat doesn't cause any blocking.
- Modify Auditbeat such that the socket-reading goroutine does not block when the output is blocked (e.g. back-pressure from the publisher pipeline, which can be mitigated by spooling to disk) or when processing of events cannot keep up with the rate from the kernel.
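
The second change amounts to a non-blocking hand-off between the socket reader and whatever consumes the events. A minimal Go sketch of that idea follows. It is not the Auditbeat implementation: the `event` type and the `forward` and `drain` helpers are hypothetical names used only for illustration, and a buffered channel stands in for the real processing queue.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a raw audit message.
type event struct{ seq int }

// forward tries to hand ev to the queue without blocking.
// It returns false when the queue is full and the event was dropped.
func forward(queue chan<- event, ev event) bool {
	select {
	case queue <- ev:
		return true
	default:
		// Queue is full: drop instead of stalling the reader,
		// so the netlink socket keeps being drained.
		return false
	}
}

// drain simulates a reader loop that receives n events and
// returns how many were dropped because the queue was full.
func drain(queue chan<- event, n int) (dropped int) {
	for i := 0; i < n; i++ {
		if !forward(queue, event{seq: i}) {
			dropped++
		}
	}
	return dropped
}

func main() {
	// Small buffer and no consumer, to simulate a blocked output.
	queue := make(chan event, 2)
	dropped := drain(queue, 5)
	fmt.Printf("queued=%d dropped=%d\n", len(queue), dropped)
}
```

The trade-off is explicit: events are lost in userspace under back-pressure, but the reader never stops emptying the socket buffer, so the kernel backlog does not fill up because of a slow output.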

For confirmed bugs, please report:

Version: 6.2.x - 6.3.0
Operating System: Linux
Discuss Forum URL: https://discuss.elastic.co/t/auditbeat-impacting-system-performance/131290
Steps to Reproduce:
- Enable the auditd module in unicast mode.
- Audit some high volume syscalls.
- Block the output in some way (bring down LS) or suspend the Auditbeat process.
- Wait for the kernel's audit_backlog_limit to be exceeded. (Messages will start showing up in the kernel log with "audit: backlog limit exceeded". The message is rate limited.)
- Syscalls that are auditable will wait for the audit_backlog_wait_time period.

Workarounds:

If you have kernel v3.14 or newer and the auditd package installed, then you can manually set the audit_backlog_wait_time to 0 with `sudo auditctl --backlog_wait_time 0`.
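
Whether the workaround applies depends on the 3.14 cut-off. As a rough illustration, the Go sketch below compares a kernel release string against 3.14. This is an assumption-laden example, not how Auditbeat or go-libaudit detect support: `atLeast314` and `leadingInt` are hypothetical helpers, the release strings in `main` are sample inputs, and a version comparison cannot account for distribution kernels that add or omit features.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// leadingInt parses the decimal digits at the start of s, so that
// a component such as "14-rc1" yields 14.
func leadingInt(s string) (int, error) {
	end := 0
	for end < len(s) && s[end] >= '0' && s[end] <= '9' {
		end++
	}
	return strconv.Atoi(s[:end])
}

// atLeast314 reports whether a kernel release string such as
// "3.10.0-862.el7.x86_64" denotes version 3.14 or newer.
func atLeast314(release string) (bool, error) {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return false, fmt.Errorf("unexpected kernel release %q", release)
	}
	major, err := leadingInt(parts[0])
	if err != nil {
		return false, fmt.Errorf("bad major version in %q: %w", release, err)
	}
	minor, err := leadingInt(parts[1])
	if err != nil {
		return false, fmt.Errorf("bad minor version in %q: %w", release, err)
	}
	return major > 3 || (major == 3 && minor >= 14), nil
}

func main() {
	// Sample release strings, chosen only for illustration.
	samples := []string{
		"2.6.32-696.el6.x86_64",
		"3.10.0-862.el7.x86_64",
		"4.15.0-20-generic",
	}
	for _, r := range samples {
		ok, err := atLeast314(r)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: 3.14 or newer = %t\n", r, ok)
	}
}
```

On a live system the release string could be taken from `uname -r`. Probing the audit subsystem directly would be more reliable than comparing version numbers.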

dilchenko · 2018-05-23T02:47:55Z

https://bugzilla.redhat.com/show_bug.cgi?id=1437426 deep sigh - not supported even on CentOS 7 :(

Remains to be tested, but I found some references in Redhat docs for RHEL 7 to this option: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/security_guide/sec-defining_audit_rules_and_controls

~~CentOS 7 servers we have are on 3.10 kernel, so I think this might have been back-ported.~~

Edit: further investigation, this seems like it has been ... discontinued?

kholia · 2018-05-23T05:29:22Z

@dilchenko Even the latest CentOS version 7.5 doesn't have this feature.

kholia · 2018-05-23T05:32:43Z

On Ubuntu 18.04 LTS the default value of backlog_wait_time option is 15000.

praseodym · 2018-05-23T17:18:48Z

I’ve hit exactly this issue in a production environment today. Backpressure in Logstash caused Auditbeat to stop reading its buffer and thus made the kernel on multiple machines grind to a halt. Because Auditbeat was also running on the Logstash receiver box, it actually caused a cascading failure due to our Logstash box becoming unresponsive as well. Manual intervention was required to get the Logstash box up and running again, after which everything recovered.

praseodym · 2018-05-23T17:36:10Z

Also: ~~I haven’t tested this yet, but maybe~~ using socket_type: multicast could be another workaround? Update: this seems to be working.

dilchenko · 2018-05-23T18:50:01Z

The reason I researched the RHEL7 status is that this is the latest version of a major distribution, and it does not have support for backlog_wait_time. This means Auditbeat is currently capable of basically breaking things in production. In retrospect, I am glad we noticed it right away on a busy box with a specific syscall volume/pattern. If this had happened during, say, a traffic spike, I would have been seriously confused: based on the [fairly extensive] telemetry we collect, I would have had a hard time connecting the ill effects of back-pressure to their cause (the audit framework being blocking-ish).

To put it differently: we are offering a product that is guaranteed to break production systems for any customer not running a 3.14+ kernel with that setting tuned. I just want to re-emphasize the importance of this issue.

Some time soon, I will get to testing this on our systems with higher limits for rate/backlog. But we won't be able to roll Auditbeat out without a workaround for the waiting issue, because we would be running a chance of breaking production. The workaround needs to be suitable for use on a 2.6 or, at least, a 3.10.xxx kernel. AFAIU, RHEL7 will stick to the 3.10 kernel, so the best case is that they backport the backlog_wait_time support.

praseodym · 2018-05-23T19:20:29Z

RHEL/CentOS 6 is not EOL until November 2020, so we'll be stuck with kernel 2.6 for another while as well.

Audit netlink multicast is supported since kernel 3.16 so that's probably not in RHEL7 either.

This adds a new configuration option, `backpressure_strategy`, to the auditd module in Auditbeat. It allows setting different ways in which Auditbeat can mitigate or prevent backpressure from propagating into the kernel and impacting audited processes.

The possible values are:

- "kernel": Auditbeat will set the backlog_wait_time in the kernel's audit framework to 0. This causes events to be discarded in the kernel if the audit backlog queue fills to capacity. Requires a 3.14 kernel or newer.
- "userspace": Auditbeat will drop events when there is backpressure from the publishing pipeline. If no rate_limit is set, then it will set a rate limit of 5000. Users should test their setup and adjust the rate_limit option accordingly.
- "both": "kernel" and "userspace" strategies at the same time.
- "auto" (default): The "kernel" strategy will be used, if supported. Otherwise it will fall back to "userspace".
- "none": No backpressure mitigation measures will be enabled.

Closes #7157

Other Changes:

* Increase default `reassembler.queue_size` to 8192.
* Change reassembler lost metric to count sequence gaps. It was renamed to `auditd.reassembler_seq_gaps`.
* Add received metric that counts the total number of received messages. It's called `auditd.received_msgs`.
* Auditd module ignores its own syscall invocations by adding a kernel audit rule that ignores events from its own PID. This rule is added anytime the user has defined audit rules.
* Make the number of stream buffer consumers configurable. Originally there was only one consumer for the auditd stream buffer. This patch allows setting up a number of consumers with the new `stream_buffer_consumers` setting in Auditd. By default it will use as many consumers as GOMAXPROCS, with a maximum of 4.
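
The selection rules described in this commit message can be sketched in Go as below. This is an illustrative reading of the description, not the code from the pull request: the `mitigation` type and the `resolve` and `defaultConsumers` functions are hypothetical names, treating an empty value as "auto" is an assumption based on "auto" being the default, and the sketch does not model what happens when "kernel" is requested on a kernel that lacks support.

```go
package main

import (
	"fmt"
	"runtime"
)

// mitigation records which measures end up enabled (hypothetical type).
type mitigation struct {
	Kernel    bool // set backlog_wait_time to 0 in the kernel
	Userspace bool // drop events in Auditbeat under back-pressure
}

// resolve maps a backpressure_strategy value to the enabled measures.
// kernelSupported says whether backlog_wait_time can be set at all.
func resolve(strategy string, kernelSupported bool) (mitigation, error) {
	switch strategy {
	case "kernel":
		return mitigation{Kernel: true}, nil
	case "userspace":
		return mitigation{Userspace: true}, nil
	case "both":
		return mitigation{Kernel: true, Userspace: true}, nil
	case "auto", "": // assumption: unset behaves like the default
		if kernelSupported {
			return mitigation{Kernel: true}, nil
		}
		return mitigation{Userspace: true}, nil
	case "none":
		return mitigation{}, nil
	default:
		return mitigation{}, fmt.Errorf("unknown backpressure_strategy %q", strategy)
	}
}

// defaultConsumers mirrors the described default for
// stream_buffer_consumers: as many as GOMAXPROCS, capped at 4.
func defaultConsumers(procs int) int {
	if procs > 4 {
		return 4
	}
	return procs
}

func main() {
	for _, supported := range []bool{true, false} {
		m, err := resolve("auto", supported)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("auto, kernel support=%t -> %+v\n", supported, m)
	}
	fmt.Println("consumers:", defaultConsumers(runtime.GOMAXPROCS(0)))
}
```

Under this reading, a RHEL/CentOS 7 host on a 3.10 kernel with the default setting would end up with the "userspace" strategy only.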

praseodym · 2018-06-05T15:48:03Z

Can we get this cherry picked to 6.x as well?

Added documentation for the `backpressure_strategy` option on the auditd module.

jordansissel · 2019-01-10T19:16:36Z

> Can we get this cherry picked to 6.x as well?

Noting for posterity: This was released with auditbeat 6.4.0.

ossie-git · 2019-09-12T04:02:59Z

Just curious. Although this was closed, I wonder what the best approach is for systems that don't support audit_backlog_wait_time (basically all RHEL/CentOS 7 versions). Is dropping events in userspace the recommended approach? Also, considering how widely used RHEL/CentOS 7 are, wouldn't it be preferable if some in-memory or on-disk temporary cache was added as an option to handle this scenario?

andrewkroh added bug Auditbeat labels May 22, 2018

praseodym mentioned this issue May 23, 2018

Metricbeat - Audit module - Audit events logged to kernel log #4513

Closed

adriansr self-assigned this May 24, 2018

adriansr mentioned this issue May 28, 2018

Auditbeat: Add backpressure_strategy option (#7157) #7185

Merged

kholia mentioned this issue May 29, 2018

[Metricbeat] - Support reporting of kernel audit subsystem statistics #7191

Closed

andrewkroh closed this as completed in #7185 Jun 5, 2018

adriansr mentioned this issue Jul 10, 2018

Use a separate audit client for lost event monitoring #7561

Merged

adriansr added a commit to adriansr/beats that referenced this issue Jul 23, 2018

Auditd: Document the backpressure_strategy option (elastic#7157)

0963660

Added documentation for the `backpressure_strategy` option on the auditd module.

andrewkroh pushed a commit that referenced this issue Jul 24, 2018

Auditd: Document the backpressure_strategy option (#7157)

7e79bb4

Added documentation for the `backpressure_strategy` option on the auditd module.

ikoniaris mentioned this issue Oct 24, 2019

Configurable audit backlog wait time setting on Linux osquery/osquery#5930

Closed

theopolis mentioned this issue Oct 28, 2019

Configurable audit backlog wait time setting on Linux osquery/osquery#5952

Closed
