Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auditbeat: kauditd deadlocks when netlink channel is congested during startup #26031

Closed
adriansr opened this issue Jun 1, 2021 · 8 comments · Fixed by #26032
Closed

Auditbeat: kauditd deadlocks when netlink channel is congested during startup #26031

adriansr opened this issue Jun 1, 2021 · 8 comments · Fixed by #26032

Comments

@adriansr
Copy link
Contributor

adriansr commented Jun 1, 2021

If the netlink channel used to talk to kauditd is congested, Auditbeat's auditd module initialization can fail when setting the Audit PID:

2021-05-28T16:59:12.545Z	ERROR	[auditd]	auditd/audit_linux.go:154	Failure receiving audit events	{"error": "failed to set audit PID (current audit PID 0): error receiving audit reply: no buffer space available", "errorVerbose": "no buffer space available\nerror receiving audit reply\ngithub.com/elastic/go-libaudit/v2.(*AuditClient).getReply\n\t/home/azureuser/go/pkg/mod/github.com/elastic/go-libaudit/v2@v2.2.0/audit.go:492\ngithub.com/elastic/go-libaudit/v2.(*AuditClient).set\n\t/home/azureuser/go/pkg/mod/github.com/elastic/go-libaudit/v2@v2.2.0/audit.go:531\ngithub.com/elastic/go-libaudit/v2.(*AuditClient).SetPID\n\t/home/azureuser/go/pkg/mod/github.com/elastic/go-libaudit/v2@v2.2.0/audit.go:323\ngithub.com/elastic/beats/v7/auditbeat/module/auditd.(*MetricSet).initClient\n\t/home/azureuser/go/src/github.com/elastic/beats/auditbeat/module/auditd/audit_linux.go:358\ngithub.com/elastic/beats/v7/auditbeat/module/auditd.(*MetricSet).receiveEvents\n\t/home/azureuser/go/src/github.com/elastic/beats/auditbeat/module/auditd/audit_linux.go:386\ngithub.com/elastic/beats/v7/auditbeat/module/auditd.(*MetricSet).Run\n\t/home/azureuser/go/src/github.com/elastic/beats/auditbeat/module/auditd/audit_linux.go:151\ngithub.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).run\n\t/home/azureuser/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:203\ngithub.com/elastic/beats/v7/metricbeat/mb/module.(*Wrapper).Start.func1\n\t/home/azureuser/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:147\nruntime.goexit\n\t/home/azureuser/.gvm/versions/go1.15.12.linux.amd64/src/runtime/asm_amd64.s:1374\nfailed to set audit PID (current audit PID 0)\ngithub.com/elastic/beats/v7/auditbeat/module/auditd.(*MetricSet).initClient\n\t/home/azureuser/go/src/github.com/elastic/beats/auditbeat/module/auditd/audit_linux.go:362\ngithub.com/elastic/beats/v7/auditbeat/module/auditd.(*MetricSet).receiveEvents\n\t/home/azureuser/go/src/github.com/elastic/beats/auditbeat/module/auditd/audit_linux.go:386\ngithub.com/elastic/beats/v7/auditbeat/module/auditd.(*MetricSet).Run\n\t/home/azureuser/go/src/github.com/elastic/beats/auditbeat/module/auditd/audit_linux.go:151\ngithub.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).run\n\t/home/azureuser/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:203\ngithub.com/elastic/beats/v7/metricbeat/mb/module.(*Wrapper).Start.func1\n\t/home/azureuser/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:147\nruntime.goexit\n\t/home/azureuser/.gvm/versions/go1.15.12.linux.amd64/src/runtime/asm_amd64.s:1374"}

cleaned up error:

failed to set audit PID (current audit PID 0): error receiving audit reply: no buffer space available
no buffer space available
error receiving audit reply
github.com/elastic/go-libaudit/v2.(*AuditClient).getReply
	/go/pkg/mod/github.com/elastic/go-libaudit/v2@v2.2.0/audit.go:492
github.com/elastic/go-libaudit/v2.(*AuditClient).set
	/go/pkg/mod/github.com/elastic/go-libaudit/v2@v2.2.0/audit.go:531
github.com/elastic/go-libaudit/v2.(*AuditClient).SetPID
	/go/pkg/mod/github.com/elastic/go-libaudit/v2@v2.2.0/audit.go:323
github.com/elastic/beats/v7/auditbeat/module/auditd.(*MetricSet).initClient
	/go/src/github.com/elastic/beats/auditbeat/module/auditd/audit_linux.go:356
github.com/elastic/beats/v7/auditbeat/module/auditd.(*MetricSet).receiveEvents
	/go/src/github.com/elastic/beats/auditbeat/module/auditd/audit_linux.go:384
github.com/elastic/beats/v7/auditbeat/module/auditd.(*MetricSet).Run
	/go/src/github.com/elastic/beats/auditbeat/module/auditd/audit_linux.go:151
github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).run
	/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:203
github.com/elastic/beats/v7/metricbeat/mb/module.(*Wrapper).Start.func1
	/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:147
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1374
failed to set audit PID (current audit PID 0)
github.com/elastic/beats/v7/auditbeat/module/auditd.(*MetricSet).initClient
	/go/src/github.com/elastic/beats/auditbeat/module/auditd/audit_linux.go:360
github.com/elastic/beats/v7/auditbeat/module/auditd.(*MetricSet).receiveEvents
	/go/src/github.com/elastic/beats/auditbeat/module/auditd/audit_linux.go:384
github.com/elastic/beats/v7/auditbeat/module/auditd.(*MetricSet).Run
	/go/src/github.com/elastic/beats/auditbeat/module/auditd/audit_linux.go:151
github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).run
	/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:203
github.com/elastic/beats/v7/metricbeat/mb/module.(*Wrapper).Start.func1
	/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:147
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1374

This error triggers closing of the netlink channel, which deadlocks the kernel when the closing routine first tries to set the Audit PID to zero.

This causes the Auditbeat process to block indefinitely:

(kernel stack)

[<0>] netlink_attachskb+0x1b9/0x1d0
[<0>] netlink_unicast+0xab/0x260
[<0>] audit_receive_msg+0x7e3/0xe60
[<0>] audit_receive+0x47/0xa0
[<0>] netlink_unicast+0x198/0x260
[<0>] netlink_sendmsg+0x2e2/0x3d0
[<0>] sock_sendmsg+0x3e/0x50
[<0>] SYSC_sendto+0x101/0x190
[<0>] SyS_sendto+0xe/0x10
[<0>] do_syscall_64+0x80/0x1e0
[<0>] entry_SYSCALL_64_after_hwframe+0x41/0xa6
[<0>] 0xffffffffffffffff

(usermode stack)

1 @ 0x55cfadf8f4e5 0x55cfadf8ee34 0x55cfadf8b6ce 0x55cfaf63abf7 0x55cfaf639c72 0x55cfaf63ca7e 0x55cfadf7ecee 0x55cfaf6395d9 0x55cfaf63956c 0x55cfaf6566bf 0x55cfaf626079 0x55cfaf62949b 0x55cfadf736e1
#	0x55cfadf8f4e4	syscall.Syscall6+0x4								/home/azureuser/.gvm/versions/go1.15.12.linux.amd64/src/syscall/asm_linux_amd64.s:41
#	0x55cfadf8ee33	syscall.sendto+0x93								/home/azureuser/.gvm/versions/go1.15.12.linux.amd64/src/syscall/zsyscall_linux_amd64.go:1595
#	0x55cfadf8b6cd	syscall.Sendto+0xad								/home/azureuser/.gvm/versions/go1.15.12.linux.amd64/src/syscall/syscall_unix.go:298
#	0x55cfaf63abf6	github.com/elastic/go-libaudit/v2.(*NetlinkClient).Send+0xd6			/home/azureuser/go/pkg/mod/github.com/elastic/go-libaudit/v2@v2.2.0/netlink.go:140
#	0x55cfaf639c71	github.com/elastic/go-libaudit/v2.(*AuditClient).set+0xd1			/home/azureuser/go/pkg/mod/github.com/elastic/go-libaudit/v2@v2.2.0/audit.go:521
#	0x55cfaf63ca7d	github.com/elastic/go-libaudit/v2.(*AuditClient).Close.func1+0xdd		/home/azureuser/go/pkg/mod/github.com/elastic/go-libaudit/v2@v2.2.0/audit.go:443
#	0x55cfadf7eced	sync.(*Once).doSlow+0xed							/home/azureuser/.gvm/versions/go1.15.12.linux.amd64/src/sync/once.go:66
#	0x55cfaf6395d8	sync.(*Once).Do+0x98								/home/azureuser/.gvm/versions/go1.15.12.linux.amd64/src/sync/once.go:57
#	0x55cfaf63956b	github.com/elastic/go-libaudit/v2.(*AuditClient).Close+0x2b			/home/azureuser/go/pkg/mod/github.com/elastic/go-libaudit/v2@v2.2.0/audit.go:436
#	0x55cfaf6566be	github.com/elastic/beats/v7/auditbeat/module/auditd.(*MetricSet).Run+0x5fe	/home/azureuser/go/src/github.com/elastic/beats/auditbeat/module/auditd/audit_linux.go:155
#	0x55cfaf626078	github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).run+0x6b8	/home/azureuser/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:203
#	0x55cfaf62949a	github.com/elastic/beats/v7/metricbeat/mb/module.(*Wrapper).Start.func1+0x25a	/home/azureuser/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:147

and any other process attempting to send a message to kauditd also blocks:

[ 1088.476058] INFO: task systemd:1 blocked for more than 120 seconds.
[ 1088.480734]       Not tainted 4.15.0-1113-azure #126~16.04.1-Ubuntu
[ 1088.485363] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1088.502837] systemd         D    0     1      0 0x00000000
[ 1088.502840] Call Trace:
[ 1088.502850]  __schedule+0x3d6/0x8b0
[ 1088.502855]  ? __wake_up_common_lock+0x8e/0xc0
[ 1088.502858]  schedule+0x36/0x80
[ 1088.502861]  schedule_preempt_disabled+0xe/0x10
[ 1088.502863]  __mutex_lock.isra.5+0x270/0x500
[ 1088.502867]  ? __kmalloc_node_track_caller+0x201/0x2c0
[ 1088.502869]  __mutex_lock_slowpath+0x13/0x20
[ 1088.502870]  ? __mutex_lock_slowpath+0x13/0x20
[ 1088.502872]  mutex_lock+0x2f/0x40
[ 1088.502875]  audit_receive+0x2a/0xa0
[ 1088.502878]  netlink_unicast+0x198/0x260
[ 1088.502880]  netlink_sendmsg+0x2e2/0x3d0
[ 1088.502885]  sock_sendmsg+0x3e/0x50
[ 1088.502886]  SYSC_sendto+0x101/0x190
[ 1088.502889]  ? handle_mm_fault+0xe7/0x220
[ 1088.502894]  ? __do_page_fault+0x2ae/0x4d0
[ 1088.502897]  SyS_sendto+0xe/0x10
[ 1088.502900]  do_syscall_64+0x80/0x1e0
[ 1088.502903]  entry_SYSCALL_64_after_hwframe+0x41/0xa6
[ 1088.502906] RIP: 0033:0x7f56ee6e2ab3
[ 1088.502907] RSP: 002b:00007ffd03bd8688 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 1088.502909] RAX: ffffffffffffffda RBX: 0000000000000080 RCX: 00007f56ee6e2ab3
[ 1088.502910] RDX: 0000000000000080 RSI: 00007ffd03bd86c0 RDI: 0000000000000028
[ 1088.502911] RBP: 0000000000000028 R08: 00007ffd03bd86a0 R09: 000000000000000c
[ 1088.502912] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd03bd86c0
[ 1088.502913] R13: 00007ffd03bd86a0 R14: 000000000000046a R15: 000000000000006f

The only solution is to kill -9 the auditbeat process.

For confirmed bugs, please report:

  1. Disable all modules from Auditbeat except auditd, with default settings. Stop any auditd service.
  2. Restart Auditbeat in a loop (helper script (use root))
  3. In a separate terminal, run as many instances as possible of the netlink message flooder as root. In my case it deadlocks quick with 8~12 instances.
  4. Have a separate root terminal ready before the deadlock, to recover the system (killall -9 auditbeat).
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jun 1, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jun 1, 2021
adriansr added a commit to adriansr/beats that referenced this issue Jun 1, 2021
This fixes a deadlock when the netlink channel is congested
(initialization fails with "no buffer space available" / errno=ENOBUFS).

Closes elastic#26031
@wenlxie
Copy link

wenlxie commented Jun 2, 2021

@adriansr Thanks for the fix.
I debug this issue yesterday and find the same deadlock behavior.

@newly12
Copy link
Contributor

newly12 commented Jun 2, 2021

@adriansr thanks for this issue as well as the fix. Apart from fixing the deadlock,

  • is there a way to panic auditbeat when such issue happens? If that's possible auditbeat could be restarted either via something systemd or containerd(if running in a container/pod).
  • if not, any chance to detect that auditd module has failed from its API or command line? for the same reason, we could simply add a liveiness/readiness probe to restart it.

since this could also impact/break file integrity module, it seems to make sense to panic auditbeat and have it start over again. any thoughts?

@andrewstucki
Copy link
Contributor

@newly12

since this could also impact/break file integrity module, it seems to make sense to panic auditbeat and have it start over again. any thoughts?

So, the problem here is that the netlink socket buffer used to do the userspace/kernel communication for the audit subsystem is full and nothing is draining it from userspace, so kernel tasks lock up when the kernel waits for space to publish a message over the socket. The file integrity module uses inotify to event on file operations and is independent of the audit framework, so this is completely independent other than the issue of having more systematic effects.

@newly12
Copy link
Contributor

newly12 commented Jun 2, 2021

@andrewstucki I understand auditd is just a submodule of auditbeat, however it's kind of special especially in a k8s environment, i.e. there can only be an auditbeat in a node with auditd module enabled(unlike file integrity module) if I understand correctly(which is why it conflicts with auditd service?), while file integrity module could be turned on multiple paths, and much more paths if autodiscover is enabled against multiple pods.

The fix seems to quit auditd module silently(only error logs) without another chance for auto remediation, like I mentioned previously. Regardless, this doesn't have to be fixed in the same PR, or tracked in the same issue. :)

@andrewstucki
Copy link
Contributor

@newly12 we can discuss more in a https://discuss.elastic.co/ if you're interested, but the fix actually maintains the previous behavior of the auditd module, which, if it fails to register itself as the main audit userspace daemon, returns an error that eventually exits the process. Previously, the error handling logic would occasionally deadlock on this error path due to socket buffer congestion, this fixes the deadlock and maintains the old behavior. Any other changes to the error handling logic would likely result in a breaking change in the behavior of auditbeat and should really come in the form of an enhancement request to redesign the auditbeat startup handling behavior.

WRT the conflict with the auditd service, this is more about the unicast v. multicast behavior of the netlink socket used to communicate between the kernel and userspace -- if set to unicast mode the audit framework assumes you have one primary userspace service consuming events, in multicast mode, you can have multiple services consuming.

I understand that one module quitting a beat affects other modules that the same beat is configured to run, but I don't see how this is particular to the file integrity modules. That said, if you have a clear use-case for re-working the startup and error handling of auditbeat, I'd suggest you open an enhancement request laying out your thoughts. Thanks!

@newly12
Copy link
Contributor

newly12 commented Jun 2, 2021

Thanks. I didn't realize the difference between unifcast vs multicast modes.

but the fix actually maintains the previous behavior of the auditd module, which, if it fails to register itself as the main audit userspace daemon, returns an error that eventually exits the process

Is this the case that would exit auditbeat process even with other moduke i.e. file integrity enabled? The reason I am asking is that I thought auditd as a module is not a critical component/path to auditbeat.

adriansr added a commit that referenced this issue Jun 7, 2021
This fixes a deadlock when the netlink channel is congested
(initialization fails with "no buffer space available" / errno=ENOBUFS).

Closes #26031
mergify bot pushed a commit that referenced this issue Jun 7, 2021
This fixes a deadlock when the netlink channel is congested
(initialization fails with "no buffer space available" / errno=ENOBUFS).

Closes #26031

(cherry picked from commit 3b50a28)
mergify bot pushed a commit that referenced this issue Jun 7, 2021
This fixes a deadlock when the netlink channel is congested
(initialization fails with "no buffer space available" / errno=ENOBUFS).

Closes #26031

(cherry picked from commit 3b50a28)
mergify bot pushed a commit that referenced this issue Jun 7, 2021
This fixes a deadlock when the netlink channel is congested
(initialization fails with "no buffer space available" / errno=ENOBUFS).

Closes #26031

(cherry picked from commit 3b50a28)
adriansr added a commit that referenced this issue Jun 9, 2021
This fixes a deadlock when the netlink channel is congested
(initialization fails with "no buffer space available" / errno=ENOBUFS).

Closes #26031

(cherry picked from commit 3b50a28)

Co-authored-by: Adrian Serrano <adrisr83@gmail.com>
adriansr pushed a commit that referenced this issue Jun 9, 2021
…6172)

This fixes a deadlock when the netlink channel is congested
(initialization fails with "no buffer space available" / errno=ENOBUFS).

Closes #26031

(cherry picked from commit 3b50a28)
adriansr pushed a commit that referenced this issue Jun 9, 2021
…6173)

This fixes a deadlock when the netlink channel is congested
(initialization fails with "no buffer space available" / errno=ENOBUFS).

Closes #26031

(cherry picked from commit 3b50a28)
leweafan pushed a commit to leweafan/beats that referenced this issue Apr 28, 2023
…UFS (elastic#26173)

This fixes a deadlock when the netlink channel is congested
(initialization fails with "no buffer space available" / errno=ENOBUFS).

Closes elastic#26031

(cherry picked from commit 551baaa)
@scottgeary
Copy link

@newly12 We've had to monitor Auditbeat using ss to see if has open netlink sockets to the kernel:

# Note that SS has limited filtering for netlink fields
CONNECTIONS=`ss --no-header --family=netlink sport audit:${YOUR_AUDITBEAT_PID}`

It appears to have 2 connections in a normal state, when the auditd module is operational. We restart the systemd service if this is <2. And it's mostly a very similar buffer error at boot.

Would love it if Auditbeat could monitor itself, or just exit when any of the modules die/hang (auditd or file_integrity).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants