Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

auditd: Fix kernel deadlock after ENOBUFS #26032

Merged
merged 2 commits into from
Jun 7, 2021

Conversation

adriansr
Copy link
Contributor

@adriansr adriansr commented Jun 1, 2021

What does this PR do?

This fixes a deadlock when the netlink channel is congested (initialization fails with "no buffer space available" / errno=ENOBUFS).

Apart from preventing the deadlock by consuming events from the netlink channel during close, it adds code to handle the unlikely ENOBUFS during initialization, to prevent failure of the auditd module.

Why is it important?

Prevents a deadlock that has been reported on large deployments where Auditbeat is restarted frequently and the hosts have a large amount of auditd events.

The deadlock is reported to happen around 3 times for ~5000 daily restarts.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

See #26031

Related issues

This fixes a deadlock when the netlink channel is congested
(initialization fails with "no buffer space available" / errno=ENOBUFS).

Closes elastic#26031
@adriansr adriansr requested a review from a team as a code owner June 1, 2021 08:11
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jun 1, 2021
@adriansr adriansr added Auditbeat backport-v7.12.0 Automated backport with mergify backport-v7.13.0 Automated backport with mergify backport-v7.14.0 Automated backport with mergify bug Team:Security-External Integrations labels Jun 1, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jun 1, 2021
@adriansr adriansr added the review label Jun 1, 2021
@adriansr
Copy link
Contributor Author

adriansr commented Jun 1, 2021

A more correct fix is to refactor the auditd module or go-libaudit to consume audit messages before initialization and until the netlink connection is closed for good. But for a bugfix, I'd rather avoid a complex refactor.

I was a bit puzzled that errno=ENOBUFS is consistently received only by the SetPID operation. It seems that it can only be triggered from ACK responses and it can't happen until Auditbeat is configured as the Audit daemon. That would explain why we don't see it for other initialization calls like GetStatus.

@elasticmachine
Copy link
Collaborator

elasticmachine commented Jun 1, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: Pull request #26032 updated

  • Start Time: 2021-06-02T13:53:00.537+0000

  • Duration: 73 min 5 sec

  • Commit: 238ede0

Test stats 🧪

Test Results
Failed 0
Passed 1004
Skipped 216
Total 1220

Trends 🧪

Image of Build Times

Image of Tests

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 1004
Skipped 216
Total 1220

auditbeat/module/auditd/audit_linux.go Outdated Show resolved Hide resolved
auditbeat/module/auditd/audit_linux.go Show resolved Hide resolved
// EBADFD, or any other error). This happens because the fd is closed.
go func() {
for {
_, err := client.Netlink.Receive(true, discard)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only reader at this point, correct? Do we have to worry about any sort of data races in the case of a successfully started client that also has a read loop, or we don't care because we're shutting down anyway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is the main reader from auditd that may be reading at the same time. In my tests, there's no issue of having both readers in parallel

@wenlxie
Copy link

wenlxie commented Jun 2, 2021

Thanks for the fix.
Another fix maybe simple:
When there is a setPid error during init, then not do setPid() operation in client.close()?

@adriansr
Copy link
Contributor Author

adriansr commented Jun 2, 2021

@wenlxie I've seen some cases where the SetPID succeeds but still the next SetPID in Close blocks. It's much less likely though.

@adriansr adriansr merged commit 3b50a28 into elastic:master Jun 7, 2021
mergify bot pushed a commit that referenced this pull request Jun 7, 2021
This fixes a deadlock when the netlink channel is congested
(initialization fails with "no buffer space available" / errno=ENOBUFS).

Closes #26031

(cherry picked from commit 3b50a28)
mergify bot pushed a commit that referenced this pull request Jun 7, 2021
This fixes a deadlock when the netlink channel is congested
(initialization fails with "no buffer space available" / errno=ENOBUFS).

Closes #26031

(cherry picked from commit 3b50a28)
mergify bot pushed a commit that referenced this pull request Jun 7, 2021
This fixes a deadlock when the netlink channel is congested
(initialization fails with "no buffer space available" / errno=ENOBUFS).

Closes #26031

(cherry picked from commit 3b50a28)
adriansr added a commit that referenced this pull request Jun 9, 2021
This fixes a deadlock when the netlink channel is congested
(initialization fails with "no buffer space available" / errno=ENOBUFS).

Closes #26031

(cherry picked from commit 3b50a28)

Co-authored-by: Adrian Serrano <adrisr83@gmail.com>
adriansr pushed a commit that referenced this pull request Jun 9, 2021
…6172)

This fixes a deadlock when the netlink channel is congested
(initialization fails with "no buffer space available" / errno=ENOBUFS).

Closes #26031

(cherry picked from commit 3b50a28)
adriansr pushed a commit that referenced this pull request Jun 9, 2021
…6173)

This fixes a deadlock when the netlink channel is congested
(initialization fails with "no buffer space available" / errno=ENOBUFS).

Closes #26031

(cherry picked from commit 3b50a28)
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
…UFS (elastic#26173)

This fixes a deadlock when the netlink channel is congested
(initialization fails with "no buffer space available" / errno=ENOBUFS).

Closes elastic#26031

(cherry picked from commit 551baaa)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Auditbeat backport-v7.12.0 Automated backport with mergify backport-v7.13.0 Automated backport with mergify backport-v7.14.0 Automated backport with mergify bug review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Auditbeat: kauditd deadlocks when netlink channel is congested during startup
5 participants