Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce auto detection of format #18095

Merged
merged 12 commits into from
May 14, 2020
Merged

Introduce auto detection of format #18095

merged 12 commits into from
May 14, 2020

Conversation

ycombinator
Copy link
Contributor

@ycombinator ycombinator commented Apr 29, 2020

What does this PR do?

This PR introduces auto-detection of Logstash's log file format (plaintext or JSON) and calls the appropriate ingest pipeline for parsing.

Why is it important?

The logstash Filebeat module has always has the ability to parse either plaintext or JSON logs emitted by Logstash. Prior to this PR users would need to manually choose a format by specifying the var.format configuration setting in their Logstash module configuration.

With this PR they will no longer need to manually choose the format; the module will auto-detect it for them. This is in line with what we do in the elasticsearch Filebeat module.

This change is also a requirement for migration modules to packages (see elastic/package-registry#270 (comment)).

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works Tests already exist.
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Apr 29, 2020
@ycombinator ycombinator added Team:Services (Deprecated) Label for the former Integrations-Services team and removed needs_team Indicates that the issue/PR needs a Team:* label labels Apr 29, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/integrations-services (Team:Services)

@ycombinator ycombinator added the in progress Pull request is currently in progress. label Apr 29, 2020
@elasticmachine
Copy link
Collaborator

elasticmachine commented Apr 29, 2020

💚 Build Succeeded

Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

Test stats 🧪

Test Results
Failed 0
Passed 2781
Skipped 418
Total 3199

@ycombinator ycombinator marked this pull request as ready for review May 13, 2020 10:20
@ycombinator ycombinator added [zube]: In Review Filebeat Filebeat needs_backport PR is waiting to be backported to other branches. Feature:Stack Monitoring v7.9.0 v8.0.0 and removed [zube]: In Progress in progress Pull request is currently in progress. labels May 13, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/stack-monitoring (Stack monitoring)

@ycombinator ycombinator requested a review from mtojek May 13, 2020 10:22
Copy link
Contributor

@mtojek mtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. It's great that it was possible to achieve the goal only by using configuration. The only concerns I have are mixed files (plain text combined together with JSON):

[2019-11-20T19:04:48,468][WARN ][org.logstash.dissect.Dissector][the_pipeline_id] Dissector mapping, pattern not found {"field"=>"message", "pattern"=>"%{LogLineTimeStamp->}\t%{Healthy}\t%{Fatals}\t%{Errors}\t%{Warnings}\t%{TimeToBuildPatternsCache}\t%{CachedPatternsCount}\t%{MessagesEnqueued}\t%{DropMsgNoSubscribers}\t%{MessagesEnqueued}\t%{TotalDests}\t%{CycleProcTime}\t%{TimeSinceNap}\t%{QUtilPermilAvg}\t%{QUtilPermilMax}\t%{QUtilPermilCount}\t%{NotifierRequests}\t%{NotifierProcessedRequests}\t%{NotifierRequestsChangeDynamicSubs}\t%{NotifierSentRequestsChangeExtDynamicSubs}\t%{NotifierProcessedRequestsDropped}\t%{NotifierBadTargets}\t%{NotifierCycleTimeNetAvg}\t%{NotifierCycleTimeNetCount}\t%{NotifierUtilAvg->}", "event"=>{"fields"=>{"pipeline"=>"mypipeline", "indexprefix"=>"idx", "regid"=>"w", "env"=>"production"}, "beat"=>{"version"=>"6.8.3", "hostname"=>"myhostname", "name"=>"myname"}, "message"=>"msg", "tags"=>["production", "beats_input_codec_plain_applied"], "host"=>{"name"=>"myhostname"}}}

I'm not blocking this PR.

@@ -9,7 +9,7 @@ This file is generated! See scripts/docs_collector.py
== Logstash module

The +{modulename}+ module parse logstash regular logs and the slow log, it will support the plain text format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: parses

multiline:
pattern: ^\[[0-9]{4}-[0-9]{2}-[0-9]{2}
pattern: ^(\[[0-9]{4}-[0-9]{2}-[0-9]{2}|{)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to trick this pattern with { character? I mean having a plain text file with curly brackets inside.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's a good point. If there's a multiline log event where, say, the 2nd line starts with a {, then this pattern breaks down. Unfortunately, I'm not really sure how to handle this scenario well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about extending this pattern to {"level" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For JSON-formatted logs, each log line is a JSON object. Being an object, I don't want to depend on a specific property, e.g. level, being the first one.

@ycombinator
Copy link
Contributor Author

The only concerns I have are mixed files (plain text combined together with JSON).

If a plain text log event has JSON anywhere after the first character, it should be handled fine. The problem only comes with plain text log events that have { as the first character of a line, which could happen in multiline plaintext events (see our other discussion in the comment about this).

@mtojek
Copy link
Contributor

mtojek commented May 13, 2020

I wonder if we can use the fact that a plain text file will always be a plain text file (and also the other way round).

@mtojek
Copy link
Contributor

mtojek commented May 13, 2020

Another idea, possible stupid one.

Try to parse the line as JSON and fallback to plain text.

@zube zube bot reopened this May 14, 2020
@zube zube bot closed this May 14, 2020
@zube zube bot reopened this May 14, 2020
@zube zube bot closed this May 14, 2020
@zube zube bot reopened this May 14, 2020
@elasticmachine
Copy link
Collaborator

elasticmachine commented May 14, 2020

💚 Build Succeeded

Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

Test stats 🧪

Test Results
Failed 0
Passed 2781
Skipped 418
Total 3199

@ycombinator
Copy link
Contributor Author

ycombinator commented May 14, 2020

@mtojek Turns out it's not a matter of simply adding a closing ] to the regex pattern. That is, we can't just change it from:

^(\[[0-9]{4}-[0-9]{2}-[0-9]{2}|({.+}))

to:

^(\[[0-9]{4}-[0-9]{2}-[0-9]{2}\]|({.+}))

That's because the timestamp pattern is incomplete in the regex. It only accounts for the date part, not the time part. So either we have to change the regex to:

^((\[[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\])|({.+}))

or to:

^((\[[0-9]{4}-[0-9]{2}-[0-9]{2}[^\]]+\])|({.+}))

I'm not sure either change, for the sake of completeness, is worth the extra processing. The purpose of this regex is simply to detect if a new multiline event should be started (and the previous one completed) or not. So I'm going to leave the regex as-is.

@ycombinator ycombinator merged commit 6bdc7d7 into elastic:master May 14, 2020
@ycombinator ycombinator deleted the fb-ls-multi-pipeline branch May 14, 2020 22:50
v1v added a commit to v1v/beats that referenced this pull request May 15, 2020
…w-oss

* upstream/master: (27 commits)
  Disable host fields for "cloud", panw, cef modules (elastic#18223)
  [docs] Rename monitoring collection from legacy internal collection to legacy collection (elastic#18504)
  Introduce auto detection of format (elastic#18095)
  Add additional fields to address issue elastic#18465 for googlecloud audit log (elastic#18472)
  Fix libbeat import path in seccomp policy template (elastic#18418)
  Address Okta input issue elastic#18530 (elastic#18534)
  [Ingest Manager] Avoid Chown on windows (elastic#18512)
  Fix Cisco ASA/FTD msgs that use a host name as NAT address (elastic#18376)
  [CI] Optimise stash/unstash performance (elastic#18473)
  Libbeat: Remove global loggers from libbeat/metric and libbeat/cloudid (elastic#18500)
  Fix PANW bad mapping of client/source and server/dest packets and bytes (elastic#18525)
  Add a file lock to the data directory on startup to prevent multiple agents. (elastic#18483)
  Followup to 12606 (elastic#18316)
  changed input from syslog to tcp/udp due to unsupported RFC (elastic#18447)
  Improve ECS field mappings in Sysmon module. (elastic#18381)
  [Elastic Agent] Cleaner output of inspect command  (elastic#18405)
  [Elastic Agent] Pick up version from libbeat (elastic#18350)
  Update communitybeats.asciidoc (elastic#18470)
  [Metricbeat] Change visualization interval from 15m to >=15m (elastic#18466)
  docs: Fix typo in kerberos docs (elastic#18503)
  ...
ycombinator added a commit that referenced this pull request May 15, 2020
* Introduce auto detection of format

* Update docs

* Auto detect format for slowlogs

* Exclude JSON logs from multiline matching

* Adding CHANGELOG entry

* Fix typo

* Parsing everything as JSON first

* Going back to old processor definitions

* Adding Known Issues section in doc

* Completing regex pattern

* Updating regex pattern

* Generating docs
@ycombinator ycombinator removed the needs_backport PR is waiting to be backported to other branches. label May 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Stack Monitoring Filebeat Filebeat Team:Services (Deprecated) Label for the former Integrations-Services team v7.9.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make logstash Filebeat module use multiple pipelines
4 participants