Delineate internal metadata fields with a boolean identifier #295
Comments
@lukesteensen so I talked to @binarylogic about this earlier today. For So two questions:
That's a great question. The intent is definitely that every event following the same path through the topology should be treated the same way. I feel like the real answer is to move the explicit vs implicit tracking up out of the actual map of fields and up into the type system (i.e. figure it out in
Yeah, I think this is something we will have to think about. Second question, @lukesteensen @binarylogic: for stdin and for file, what would the host be?
The host is the local host for those sources. I assume Rust has a function for retrieving that? In regards to the mixed payload types, I'm of the mindset right now that we follow whatever the first record does. Given Luke's comment, it should be very edge-casey that we'll see mixed records. For sinks that batch we definitely should not be mixing encoding types. The only time that is appropriate is for sinks like the
@binarylogic does that mean like
I defer to @lukesteensen for that one then.
It should match what we interpolate into configs as
Even then I don't think it's appropriate. The "type" of record that comes out of a sink should always be consistent for a given config imo. We could maybe have an encoder option that'd let people opt into mixed encodings, but I'd be pretty surprised if anyone wanted it. To me it feels like a recipe for random unexpected failures in downstream components.
Agree, we've decided to make the
This is the successor to #256
There is a complex matrix of possible behaviors when you introduce structured vs non-structured data types across different topologies. This has been discussed at length:
The purpose of this issue is to materialize the conclusion we came to today in Slack so that we can move forward with a number of other dependent issues, such as specifying the hostname key in various sinks, or specifying a target key when parsing a log message, etc.
Quick Background
host & line keys in Record struct #155
host back to a defined Record field.
host key in the Splunk sink (or any sink that required host): Make splunk to use record host field #276
Solution
The solution today, which was proposed by @michaelfairley, is to delineate fields that were implicitly and explicitly set with a boolean. In other words, to change the structured map from string => string to string => (bool, string). This means we have one single map representing all structured data, with a simple boolean telling us if it was implicitly or explicitly set.
Examples
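Before the individual examples, here is a minimal sketch of the map change described under Solution, from string => string to string => (bool, string). It is an illustration under stated assumptions, not the project's actual Record type: the names Record, insert_implicit, insert_explicit, and is_entirely_implicit are hypothetical. The boolean follows the convention used in the examples, where false means implicitly set.

```rust
use std::collections::HashMap;

/// Hypothetical record type: one single map for all structured data.
/// The bool is `false` when a source set the field implicitly and
/// `true` when a transform (or the user) set it explicitly.
#[derive(Debug, Default)]
struct Record {
    structured: HashMap<String, (bool, String)>,
}

impl Record {
    /// Set a field implicitly, as a source would for "message" or "host".
    fn insert_implicit(&mut self, key: &str, value: &str) {
        self.structured
            .insert(key.to_string(), (false, value.to_string()));
    }

    /// Set a field explicitly, as a parser or add_field transform would.
    fn insert_explicit(&mut self, key: &str, value: &str) {
        self.structured
            .insert(key.to_string(), (true, value.to_string()));
    }

    /// True when no field was explicitly set, i.e. the record still
    /// counts as "unstructured" from a sink's point of view.
    fn is_entirely_implicit(&self) -> bool {
        self.structured.values().all(|(explicit, _)| !*explicit)
    }
}
```

With this shape, a record that only carries source-provided fields reports is_entirely_implicit() as true, and it flips to false as soon as any transform calls insert_explicit.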
TCP -> TCP
Input: "Hello word" raw text line.
tcp (in) receives data and represents it as:
Where false means that the data was implicitly set.
tcp (out) is written:
Only the raw "message" field is written to the TCP sink. This is because the data structure is recognized as "unstructured" since all values are false (implicitly set). This is the default behavior for unstructured data in the TCP sink.
TCP -> JSON Parser -> TCP
Input: '{"msg": "Hello word", "key": "val"}'
tcp (in) receives data and represents it as:
Where false means that the data was implicitly set.
transform (json parser) transforms the data into:
tcp (out) is written:
You'll notice the tcp.out declaration includes a required encoder option since it is receiving structured data. This will be handled via Implement topology-aware validations #235. You'll also notice that metadata fields are not included by default. This is because these are internal/transparent fields that are only used when necessary or explicitly included (happy to hear arguments otherwise).
TCP -> Splunk
Input: "Hello word" raw text line.
tcp (in) receives data and represents it as:
Where false means that the data was implicitly set.
splunk (out) forwards the "message" but also specifies the host since Splunk requires this metadata. By default, the sink looks for the "host" key since this is one of our "common" keys, but the user can change that by setting host_field in the sources.out declaration.
Requirements Checklist
This is a checklist to ensure we're handling all of the little details that come with this change. If it helps, we can break these out into separate issues, because I would assume they'll be separate PRs.
timestamp, message, and host. Alternatively, we could namespace the keys like _timestamp, _message, and _host. Feel free to go with these or choose entirely different names, I'm indifferent.
Sources
file source adds a timestamp field to represent when the record was received.
file source includes the host key for the local server.
file source includes a file_key config option to control the "file" context key name.
file source includes a host_key config option to control the "host" context key name.
file source behavior is tested.
syslog source adds a timestamp field to represent when the record was received.
syslog source includes the host key when in tcp mode; this represents the host of the client.
syslog source includes the host key when in unix mode; this should be the local host.
syslog source includes a host_key config option to control the "host" context key name.
syslog source behavior is tested.
stdin source adds a timestamp field to represent when the record was received.
stdin source includes the host key for the local server.
stdin source includes a host_key config option to control the "host" context key name.
stdin source behavior is tested.
tcp source adds a timestamp field to represent when the record was received.
tcp source includes the host key for the remote server.
tcp source includes a host_key config option to control the "host" context key name.
tcp source behavior is tested.
Transforms
add_field transform sets any added fields as explicit.
json_parser transform decodes data and sets all decoded fields as explicit.
regex_parser transform sets any extracted fields as explicit.
Sinks
cloudwatch_logs sink forwards the message field only if the record is entirely implicitly structured.
cloudwatch_logs sink encodes data (explicit and implicit keys) as JSON, regardless of whether the entire map is implicitly structured.
cloudwatch_logs sink maps the timestamp field to Cloudwatch's timestamp field and drops that field before encoding the data.
console sink prints the message field if the record is entirely implicitly structured.
console sink encodes the data to JSON if the record is not entirely implicitly structured. This payload should include all keys.
console sink provides an encoding option accepting json or text. If left unspecified, the behavior is dynamic, choosing the encoding on a per-record basis based on its explicit structured state.
elasticsearch sink encodes data (explicit and implicit keys) as JSON, regardless of whether the entire map is implicitly structured.
http sink only includes the raw message if the record is entirely implicitly structured. This should be text/plain, newline delimited.
http sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured. This should be application/ndjson (newline delimited).
http sink encodes data (explicit and implicit keys) as JSON, regardless of whether the entire map is implicitly structured.
kafka sink only includes the raw message if the record is entirely implicitly structured.
kafka sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
kinesis sink only includes the raw message if the record is entirely implicitly structured.
kinesis sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
s3 sink only includes the raw message if the record is entirely implicitly structured.
s3 sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
splunk sink only includes the raw message if the record is entirely implicitly structured.
splunk sink maps the host field appropriately (should this also be dropped?)
kinesis sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
tcp sink only includes the raw message if the record is entirely implicitly structured.
tcp sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
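Most of the sink items above share one rule: write only the raw message when the record is entirely implicitly structured, otherwise encode all keys (explicit and implicit) as JSON. The following is a minimal sketch of that rule plus the host lookup a sink like Splunk needs. Everything named here is an assumption for illustration: the Fields alias and the encode, escape, and host_for_sink functions are hypothetical, the JSON writer is hand-rolled only to keep the sketch dependency-free (it escapes just quotes and backslashes), and a BTreeMap is used so the key order is deterministic.

```rust
use std::collections::BTreeMap;

/// Hypothetical alias for the proposed map: key => (explicit?, value).
type Fields = BTreeMap<String, (bool, String)>;

/// Escape the two characters that would break a JSON string literal.
/// A real sink would use a JSON library instead of this helper.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

/// Encode a record following the checklist wording: the raw message if
/// nothing was explicitly set, otherwise every key as one JSON object.
fn encode(fields: &Fields) -> String {
    let entirely_implicit = fields.values().all(|(explicit, _)| !*explicit);
    if entirely_implicit {
        // Unstructured record: only the raw "message" value is written.
        return fields
            .get("message")
            .map(|(_, value)| value.clone())
            .unwrap_or_default();
    }
    let body: Vec<String> = fields
        .iter()
        .map(|(key, (_, value))| format!("\"{}\":\"{}\"", escape(key), escape(value)))
        .collect();
    format!("{{{}}}", body.join(","))
}

/// Look up the host for a sink that requires it. `host_field` mirrors the
/// user-configurable key name and falls back to the common "host" key.
fn host_for_sink<'a>(fields: &'a Fields, host_field: Option<&str>) -> Option<&'a str> {
    fields
        .get(host_field.unwrap_or("host"))
        .map(|(_, value)| value.as_str())
}
```

For a record holding only implicit message and host fields, encode returns the raw message text. Once a transform adds an explicit field, the same call returns a JSON object containing all keys. Making the encoding a property of the sink config rather than of each record, as suggested in the comments, would replace the per-record check with a fixed choice.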