Introduce auto detection of format #18095
Pinging @elastic/integrations-services (Team:Services)
Pinging @elastic/stack-monitoring (Stack monitoring)
LGTM. It's great that the goal could be achieved using configuration alone. My only concern is mixed files (plain text combined with JSON):
[2019-11-20T19:04:48,468][WARN ][org.logstash.dissect.Dissector][the_pipeline_id] Dissector mapping, pattern not found {"field"=>"message", "pattern"=>"%{LogLineTimeStamp->}\t%{Healthy}\t%{Fatals}\t%{Errors}\t%{Warnings}\t%{TimeToBuildPatternsCache}\t%{CachedPatternsCount}\t%{MessagesEnqueued}\t%{DropMsgNoSubscribers}\t%{MessagesEnqueued}\t%{TotalDests}\t%{CycleProcTime}\t%{TimeSinceNap}\t%{QUtilPermilAvg}\t%{QUtilPermilMax}\t%{QUtilPermilCount}\t%{NotifierRequests}\t%{NotifierProcessedRequests}\t%{NotifierRequestsChangeDynamicSubs}\t%{NotifierSentRequestsChangeExtDynamicSubs}\t%{NotifierProcessedRequestsDropped}\t%{NotifierBadTargets}\t%{NotifierCycleTimeNetAvg}\t%{NotifierCycleTimeNetCount}\t%{NotifierUtilAvg->}", "event"=>{"fields"=>{"pipeline"=>"mypipeline", "indexprefix"=>"idx", "regid"=>"w", "env"=>"production"}, "beat"=>{"version"=>"6.8.3", "hostname"=>"myhostname", "name"=>"myname"}, "message"=>"msg", "tags"=>["production", "beats_input_codec_plain_applied"], "host"=>{"name"=>"myhostname"}}}
I'm not blocking this PR.
@@ -9,7 +9,7 @@ This file is generated! See scripts/docs_collector.py
== Logstash module

The +{modulename}+ module parse logstash regular logs and the slow log, it will support the plain text format
nit: parses
   multiline:
-    pattern: ^\[[0-9]{4}-[0-9]{2}-[0-9]{2}
+    pattern: ^(\[[0-9]{4}-[0-9]{2}-[0-9]{2}|{)
Is it possible to trick this pattern with a `{` character? I mean having a plain text file with curly brackets inside.
Yes, it's a good point. If there's a multiline log event where, say, the 2nd line starts with a `{`, then this pattern breaks down. Unfortunately, I'm not really sure how to handle this scenario well.
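To make the failure mode concrete, here is a minimal Python sketch of how a start-of-event pattern groups lines. It assumes Filebeat-style `negate: true` / `match: after` semantics (a line that does not match the pattern is appended to the previous event). The `group_events` helper and the sample log lines are made up for illustration and are not part of the module.

```python
import re

# New multiline pattern from the diff: an event starts with "[YYYY-MM-DD" or "{".
START_OF_EVENT = re.compile(r"^(\[[0-9]{4}-[0-9]{2}-[0-9]{2}|{)")


def group_events(lines):
    """Hypothetical helper: append non-matching lines to the previous event."""
    events = []
    for line in lines:
        if START_OF_EVENT.search(line) or not events:
            events.append(line)  # pattern matched: start a new event
        else:
            events[-1] += "\n" + line  # continuation of the previous event
    return events


# One logical plain text event whose second line happens to start with "{".
lines = [
    "[2019-11-20T19:04:48,468][WARN ][logstash.example] something failed, payload:",
    '{"field": "message"}',
    "  more detail",
]
events = group_events(lines)
print(len(events))  # prints 2: the event is split at the "{" line
```

Under these assumptions the single event is cut in two, which is the scenario raised in the review.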
How about extending this pattern to `{"level"`?
For JSON-formatted logs, each log line is a JSON object. Being an object, I don't want to depend on a specific property, e.g. `level`, being the first one.
If a plain text log event has JSON anywhere after the first character, it should be handled fine. The problem only comes with plain text log events that have
I wonder if we can use the fact that a plain text file will always be a plain text file (and the other way round).
Another idea, possibly a stupid one: try to parse the line as JSON and fall back to plain text.
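The "try JSON first, fall back to plain text" idea can be sketched in a few lines of Python. This is only an illustration of the suggestion, not how the module's ingest pipelines implement it; `parse_line` is a hypothetical name.

```python
import json


def parse_line(line):
    """Return ("json", obj) for a JSON object line, else ("plain", line)."""
    try:
        obj = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return ("plain", line)
    # Logstash JSON logs are objects, so treat any other JSON value as plain text.
    if isinstance(obj, dict):
        return ("json", obj)
    return ("plain", line)


print(parse_line('{"level": "INFO", "message": "started"}')[0])
print(parse_line("[2019-11-20T19:04:48,468][WARN ] something failed")[0])
print(parse_line("{ this only looks like JSON")[0])
```

A plain text line that merely starts with `{` falls through to the plain text branch here, at the cost of attempting a JSON parse on every line.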
@mtojek Turns out it's not a matter of simply adding a closing
to:
That's because the timestamp pattern is incomplete in the regex. It only accounts for the date part, not the time part. So either we have to change the regex to:
or to:
I'm not sure either change, for the sake of completeness, is worth the extra processing. The purpose of this regex is simply to detect if a new multiline event should be started (and the previous one completed) or not. So I'm going to leave the regex as-is.
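As a rough check of that reasoning, the Python sketch below compares the date-only prefix with a stricter date-and-time prefix on a few sample lines. The stricter pattern is a hypothetical stand-in written for this example; it is not one of the alternatives proposed in the comment above, and the sample lines are invented.

```python
import re

# Date-only prefix, as in the original multiline pattern.
DATE_ONLY = re.compile(r"^\[[0-9]{4}-[0-9]{2}-[0-9]{2}")

# Hypothetical stricter prefix that also requires the time part and closing "]".
DATE_AND_TIME = re.compile(
    r"^\[[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\]"
)

samples = [
    "[2019-11-20T19:04:48,468][WARN ][logstash.example] something failed",
    "  at org.example.Foo.bar(Foo.java:1)",
    "Caused by: java.lang.IllegalStateException: boom",
]

# For each line: does it start a new event according to each pattern?
decisions = [
    (bool(DATE_ONLY.search(s)), bool(DATE_AND_TIME.search(s))) for s in samples
]
print(decisions)
```

On lines like these both patterns make the same start-of-event decision, so the longer pattern would add matching work without changing the grouping.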
…w-oss
* upstream/master: (27 commits)
  Disable host fields for "cloud", panw, cef modules (elastic#18223)
  [docs] Rename monitoring collection from legacy internal collection to legacy collection (elastic#18504)
  Introduce auto detection of format (elastic#18095)
  Add additional fields to address issue elastic#18465 for googlecloud audit log (elastic#18472)
  Fix libbeat import path in seccomp policy template (elastic#18418)
  Address Okta input issue elastic#18530 (elastic#18534)
  [Ingest Manager] Avoid Chown on windows (elastic#18512)
  Fix Cisco ASA/FTD msgs that use a host name as NAT address (elastic#18376)
  [CI] Optimise stash/unstash performance (elastic#18473)
  Libbeat: Remove global loggers from libbeat/metric and libbeat/cloudid (elastic#18500)
  Fix PANW bad mapping of client/source and server/dest packets and bytes (elastic#18525)
  Add a file lock to the data directory on startup to prevent multiple agents. (elastic#18483)
  Followup to 12606 (elastic#18316)
  changed input from syslog to tcp/udp due to unsupported RFC (elastic#18447)
  Improve ECS field mappings in Sysmon module. (elastic#18381)
  [Elastic Agent] Cleaner output of inspect command (elastic#18405)
  [Elastic Agent] Pick up version from libbeat (elastic#18350)
  Update communitybeats.asciidoc (elastic#18470)
  [Metricbeat] Change visualization interval from 15m to >=15m (elastic#18466)
  docs: Fix typo in kerberos docs (elastic#18503)
  ...
* Introduce auto detection of format
* Update docs
* Auto detect format for slowlogs
* Exclude JSON logs from multiline matching
* Adding CHANGELOG entry
* Fix typo
* Parsing everything as JSON first
* Going back to old processor definitions
* Adding Known Issues section in doc
* Completing regex pattern
* Updating regex pattern
* Generating docs
What does this PR do?
This PR introduces auto-detection of Logstash's log file format (plaintext or JSON) and calls the appropriate ingest pipeline for parsing.
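A minimal Python sketch of the idea follows. It assumes detection is based on whether the message starts with `{`; the function name and the pipeline labels are hypothetical and do not reflect the module's actual pipeline names.

```python
def choose_pipeline(message):
    """Hypothetical dispatcher: pick a parsing path from the first character."""
    # Assumption: a JSON-formatted Logstash log line is a JSON object,
    # so it starts with "{"; anything else is treated as plaintext.
    if message.startswith("{"):
        return "json"
    return "plaintext"


print(choose_pipeline('{"level": "INFO", "message": "started"}'))
print(choose_pipeline("[2019-11-20T19:04:48,468][WARN ] something failed"))
```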
Why is it important?
The `logstash` Filebeat module has always had the ability to parse either plaintext or JSON logs emitted by Logstash. Prior to this PR, users needed to manually choose a format by specifying the `var.format` configuration setting in their Logstash module configuration.
With this PR they no longer need to manually choose the format; the module will auto-detect it for them. This is in line with what we do in the `elasticsearch` Filebeat module.
This change is also a requirement for migrating modules to packages (see elastic/package-registry#270 (comment)).
Checklist
- I have commented my code, particularly in hard-to-understand areas
- I have added tests that prove my fix is effective or that my feature works (Tests already exist.)
- I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.
Related issues