Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Winlogbeat] Tolerate faults when Windows Event Log session is interrupted #28191

Merged
merged 6 commits into from
Oct 6, 2021

Conversation

taylor-swanson
Copy link
Contributor

@taylor-swanson taylor-swanson commented Sep 29, 2021

What does this PR do?

  • Added a retry mechanism to winlog/input and winlogbeat to reopen a
    session to Windows Event Log when certain error conditions are encountered.
  • This applies when opening a session to Windows Event Log and
    also when reading from Windows Event Log.

Why is it important?

  • Filebeat/winlogbeat should be able to tolerate disruptions with the Windows Event Log service without having to be manually restarted by the user.

Checklist

  • My code follows the style guidelines of this project
    • I have commented my code, particularly in hard-to-understand areas
    • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Reproducing the issue

  • Run filebeat (with winlog input) or winlogbeat on Windows system. The channel subscription doesn't
    matter, it can be Security or other channels. Make sure debug logging and logging to STDERR is enabled.
.\winlogbeat -e -d "*"
.\filebeat -e -d "*"
  • You should eventually see messages every second indicating no logs to read. This is normal.
  • Stop the Windows Event Log service.
  • The filebeat/winlogbeat log should report errors (usually RPC cancelled). The messages logged every second no longer appear.
  • Restarting the Windows Event Log service has no effect. New event logs are not picked up by filebeat/winlogbeat.
  • The only way to get filebeat/winlogbeat watching logs again is to restart it.

Testing the fix

  • Deploy fixed binary to system
  • Run filebeat/winlogbeat as before
  • Stop the Windows Event Log service.
  • Watch the logs for "Recoverable error". The handle to Windows Event Log should be reopened within a matter of seconds.
  • Start the Windows Event Log service.
  • New events should flow in as they are generated.

Logs

The following logs are from filebeat. Winlogbeat produces similar logs.

Logs showing normal activity (when no event logs are being generated):

2021-09-29T07:09:39.778-0700    DEBUG   [eventlog_detail]       eventlog/wineventlog.go:298     WinEventLog[Security] No more events
2021-09-29T07:09:40.788-0700    DEBUG   [eventlog_detail]       eventlog/wineventlog.go:298     WinEventLog[Security] No more events
2021-09-29T07:09:41.794-0700    DEBUG   [eventlog_detail]       eventlog/wineventlog.go:298     WinEventLog[Security] No more events
2021-09-29T07:09:42.799-0700    DEBUG   [eventlog_detail]       eventlog/wineventlog.go:298     WinEventLog[Security] No more events
...

Logs indicating original problem (error may vary):

2021-09-29T07:09:51.939-0700    WARN    eventlog/wineventlog.go:314     WinEventLog[Security] EventHandles returned error The remote procedure call was cancelled.
2021-09-29T07:09:51.939-0700    ERROR   [input.winlog]  winlog/input.go:123     Error occured while reading from Windows Event Log 'Security': The remote procedure call was cancelled.       {"id": "DF05023D528B0A9F", "input_source": "Security", "eventlog": "Security"}

...

2021-09-29T07:09:51.939-0700    DEBUG   [eventlog]      eventlog/wineventlog.go:284     WinEventLog[Security] Closing handle
2021-09-29T07:09:51.939-0700    ERROR   [input.winlog]  compat/compat.go:122    Input 'winlog' failed with: input.go:130: input DF05023D528B0A9F failed (id=DF05023D528B0A9F)
        The remote procedure call was cancelled.        {"id": "DF05023D528B0A9F"}

Logs showing issue resolved by fix (error may vary):

2021-09-29T07:00:28.159-0700    ERROR   [input.winlog]  winlog/input.go:110     Encountered recoverable error when opening Windows Event Log: The RPC server is unavailable.  {"id": "DF05023D528B0A9F", "input_source": "Security", "eventlog": "Security"}

...

2021-09-29T07:00:33.165-0700    DEBUG   [input.winlog]  winlog/input.go:116     Windows Event Log 'Security' opened successfully     {"id": "DF05023D528B0A9F", "input_source": "Security", "eventlog": "Security"}

Related issues

- Added a retry mechanism to winlog/input and winlogbeat to reopen a
session to Windows Event Log when certain error conditions are encountered.
- This applies when opening a session to Windows Event Log and
also when reading from Windows Event Log.
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Sep 29, 2021
@elasticmachine
Copy link
Collaborator

elasticmachine commented Sep 29, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2021-10-04T13:21:10.492+0000

  • Duration: 127 min 18 sec

  • Commit: 199572e

Test stats 🧪

Test Results
Failed 0
Passed 47072
Skipped 5330
Total 52402

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

@taylor-swanson taylor-swanson marked this pull request as ready for review September 29, 2021 20:23
@elasticmachine
Copy link
Collaborator

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

Copy link
Member

@andrewkroh andrewkroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just one suggestion from me.

openErr := api.Open(evtCheckpoint)
if eventlog.IsRecoverable(openErr) {
log.Errorf("Encountered recoverable error when opening Windows Event Log: %v", openErr)
time.Sleep(time.Second * 5)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using timed.Wait(cancelCtx, 5 * time.Second) would be better because it would allow the sleep to unblock if cancellation occurs.

@taylor-swanson taylor-swanson merged commit 3c1731d into elastic:master Oct 6, 2021
@taylor-swanson taylor-swanson deleted the winevent-rpc-failed branch October 6, 2021 13:57
mergify bot pushed a commit that referenced this pull request Oct 6, 2021
…upted (#28191)

- Added a retry mechanism to winlog/input and winlogbeat to reopen a
session to Windows Event Log when certain error conditions are encountered.
- This applies when opening a session to Windows Event Log and
also when reading from Windows Event Log.

(cherry picked from commit 3c1731d)
taylor-swanson added a commit that referenced this pull request Oct 7, 2021
…t Log session is interrupted (#28283)

* [Winlogbeat] Tolerate faults when Windows Event Log session is interrupted (#28191)

- Added a retry mechanism to winlog/input and winlogbeat to reopen a
session to Windows Event Log when certain error conditions are encountered.
- This applies when opening a session to Windows Event Log and
also when reading from Windows Event Log.

(cherry picked from commit 3c1731d)

Co-authored-by: Taylor Swanson <90622908+taylor-swanson@users.noreply.github.com>
v1v added a commit to v1v/beats that referenced this pull request Oct 11, 2021
* upstream/master: (73 commits)
  Remove GCP support from Functionbeat (elastic#28253)
  Move labels and annotations under kubernetes.namespace. (elastic#27917)
  Update go release version 1.17.1 (elastic#27543)
  Osquerybeat: Runner and Fetcher unit tests (elastic#28290)
  Osquerybeat: Improve handling of osquery.autoload file, allow customizations (elastic#28289)
  seccomp: allow clone3 syscall for x86 (elastic#28117)
  packetbeat/protos/dns: don't render missing A and AAAA addresses from truncated records (elastic#28297)
  [7.x] [DOCS] Update api_key example on elasticsearch output (elastic#28288)
  [cloud][docker] use the private docker namespace (elastic#28286)
  Update aws-lambda-go library version to 1.13.3 (elastic#28236)
  Deprecate common.Float (elastic#28280)
  Filebeat: Change compatibility test stage to test against previous minor instead of 7.11 (elastic#28274)
  x-pack/filebeat/module/threatintel/misp: add support for secondary object attribute handling (elastic#28124)
  Explicitly pass http config to doppler consumer (elastic#28277)
  processors/actions/add_fields: Do not panic if event.Fields is nil map (elastic#28219)
  Resolved timestamp for defender atp (elastic#28272)
  [Winlogbeat] Tolerate faults when Windows Event Log session is interrupted (elastic#28191)
  [elastic-agent] proxy requests to subprocesses to their metrics endpoints (elastic#28165)
  Build cloud docker images for elastic-agent (elastic#28134)
  Upgrade k8s go-client library (elastic#28228)
  ...
Icedroid pushed a commit to Icedroid/beats that referenced this pull request Nov 1, 2021
…upted (elastic#28191)

- Added a retry mechanism to winlog/input and winlogbeat to reopen a
session to Windows Event Log when certain error conditions are encountered.
- This applies when opening a session to Windows Event Log and
also when reading from Windows Event Log.
@nicpenning
Copy link
Contributor

I am excited for this fix, as it seems this happens a lot to the Windows Defender channel.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v7.16.0 Automated backport with mergify bug Winlogbeat
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Winlogbeat] Continue to read channel after Windows Event Log service has been restarted
4 participants