auditd: Fix kernel deadlock after ENOBUFS #26032
This fixes a deadlock when the netlink channel is congested (initialization fails with "no buffer space available" / errno=ENOBUFS). Closes elastic#26031
Pinging @elastic/security-external-integrations (Team:Security-External Integrations)
A more correct fix would be to refactor the auditd module or go-libaudit to consume audit messages before initialization and until the netlink connection is closed for good, but for a bugfix I'd rather avoid a complex refactor.
I was a bit puzzled that errno=ENOBUFS is consistently received only by the SetPID operation. It seems that it can only be triggered from ACK responses, and it can't happen until Auditbeat is configured as the audit daemon. That would explain why we don't see it for other initialization calls like GetStatus.
	// EBADFD, or any other error). This happens because the fd is closed.
	go func() {
		for {
			_, err := client.Netlink.Receive(true, discard)
This is the only reader at this point, correct? Do we have to worry about any sort of data races in the case of a successfully started client that also has a read loop, or we don't care because we're shutting down anyway?
The main reader from auditd may be reading at the same time. In my tests, there's no issue with having both readers running in parallel.
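The draining goroutine in the excerpt above can be illustrated with a self-contained toy model. This is only a sketch under stated assumptions, not the code from this PR: `fakeSocket`, `errAgain`, `errClosed`, and `closeWithDrain` are hypothetical names, and the fake socket simply assumes that `Close` cannot complete while unread messages are pending. It does not use go-libaudit or a real netlink socket.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// Hypothetical stand-ins for the errno values a real socket would return.
var (
	errAgain  = errors.New("no message available") // plays the role of EAGAIN
	errClosed = errors.New("socket closed")        // plays the role of EBADF
)

// fakeSocket is a toy model (an assumption, not the kernel's behavior):
// Close blocks for as long as unread messages are pending.
type fakeSocket struct {
	mu      sync.Mutex
	pending int
	closed  bool
}

// Receive consumes one pending message without blocking.
func (s *fakeSocket) Receive() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	switch {
	case s.closed:
		return errClosed
	case s.pending == 0:
		return errAgain
	}
	s.pending--
	return nil
}

// Close polls until the queue is empty, then marks the socket closed.
func (s *fakeSocket) Close() error {
	for {
		s.mu.Lock()
		if s.pending == 0 {
			s.closed = true
			s.mu.Unlock()
			return nil
		}
		s.mu.Unlock()
		time.Sleep(time.Millisecond)
	}
}

// closeWithDrain discards messages in a goroutine while Close runs.
// The goroutine exits on the first error other than errAgain, which
// happens once the socket is closed.
func closeWithDrain(s *fakeSocket) error {
	done := make(chan struct{})
	go func() {
		defer close(done)
		for {
			switch err := s.Receive(); err {
			case nil: // message discarded, keep draining
			case errAgain:
				time.Sleep(time.Millisecond)
			default:
				return
			}
		}
	}()
	err := s.Close()
	<-done // wait for the drainer so the fields can be read safely
	return err
}

func main() {
	s := &fakeSocket{pending: 1000}
	err := closeWithDrain(s)
	fmt.Println(err, s.pending, s.closed)
}
```

Calling `s.Close()` directly on a socket with pending messages and no reader would never return in this model, which is the deadlock shape the drainer avoids.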
Thanks for the fix.
@wenlxie I've seen some cases where the SetPID succeeds but still the next SetPID in Close blocks. It's much less likely though.
…UFS (elastic#26173) This fixes a deadlock when the netlink channel is congested (initialization fails with "no buffer space available" / errno=ENOBUFS). Closes elastic#26031 (cherry picked from commit 551baaa)
What does this PR do?
This fixes a deadlock when the netlink channel is congested (initialization fails with "no buffer space available" / errno=ENOBUFS).
Apart from preventing the deadlock by consuming events from the netlink channel during close, it adds code to handle the unlikely ENOBUFS during initialization, to prevent failure of the auditd module.
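As an illustration of handling ENOBUFS during initialization, here is a minimal sketch of a bounded retry. It is not the code added by this PR: `retryOnENOBUFS`, `flakyOp`, and the `reset` callback are hypothetical names, and the idea that `reset` would drain and reopen the connection between attempts is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// retryOnENOBUFS runs op up to attempts times. It retries only when op
// fails with ENOBUFS; success or any other error is returned immediately.
// reset is a hypothetical hook called between attempts, for example to
// drain and reopen a congested connection.
func retryOnENOBUFS(attempts int, op func() error, reset func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); !errors.Is(err, syscall.ENOBUFS) {
			return err
		}
		if i < attempts-1 {
			if rerr := reset(); rerr != nil {
				return fmt.Errorf("failed to recover from ENOBUFS: %w", rerr)
			}
		}
	}
	return err // still ENOBUFS after the last attempt
}

// flakyOp returns an operation that fails with ENOBUFS for the first
// `failures` calls and succeeds afterwards. It exists only for the demo.
func flakyOp(failures int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return syscall.ENOBUFS
		}
		return nil
	}
}

func main() {
	noop := func() error { return nil }
	fmt.Println(retryOnENOBUFS(3, flakyOp(2), noop)) // succeeds on the third attempt
	fmt.Println(retryOnENOBUFS(2, flakyOp(5), noop)) // gives up with ENOBUFS
}
```

Bounding the number of attempts keeps a persistently congested channel from turning the retry into another hang.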
Why is it important?
Prevents a deadlock that has been reported on large deployments where Auditbeat is restarted frequently and the hosts generate a large number of auditd events.
The deadlock is reported to happen around 3 times per ~5000 daily restarts.
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files
[ ] I have added tests that prove my fix is effective or that my feature works
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
How to test this PR locally
See #26031
Related issues