Delineate internal metadata fields with a boolean identifier #295

Closed
43 tasks done
binarylogic opened this issue Apr 17, 2019 · 8 comments

Labels: domain: data model, domain: logs, type: enhancement

@binarylogic (Contributor) commented Apr 17, 2019

This is the successor to #256

There is a complex matrix of possible behaviors when you introduce structured vs. non-structured data types across different topologies. This has been discussed at length.

The purpose of this issue is to materialize the conclusion we came to today in Slack so that we can move forward with a number of other dependent issues, such as specifying the hostname key in various sinks, or specifying a target key when parsing a log message, etc.

Quick Background

  1. The issue was originally raised via [RFC] Drop the host & line keys in Record struct #155
  2. The "Structured vs Non-Structured" RFC was created to address this: https://www.notion.so/timber/Raw-vs-Structured-Data-for-Sinks-f4a73d250d88427ca677ea0954e3ea75
  3. This was implemented via Refactor Record and use bytes instead of String #204, which introduced a performance regression in the TCP source.
  4. Perf improvements #269 resolved this performance regression by moving host back to a defined Record field.
  5. This change caused a regression around how we handle the host key in the Splunk sink (or any sink that required host): Make splunk to use record host field #276
  6. Which then led to this comment Make splunk to use record host field #276 (comment)
  7. Which then led to another long Slack discussion.

Solution

The solution today, which was proposed by @michaelfairley, is to delineate fields that were implicitly and explicitly set with a boolean. In other words, to change the structured map from string => string to string => (bool, string). This means we have one single map representing all structured data, with a simple boolean telling us if it was implicitly or explicitly set.
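
Roughly, in Rust, the proposal looks something like this (a minimal sketch; the names are illustrative, not the actual Record definition):

    use std::collections::HashMap;

    // Sketch only: every value carries a flag telling us whether it was
    // explicitly set (parser, add_field, ...) or implicitly set by a source
    // (timestamp, host, the raw message).
    struct Record {
        fields: HashMap<String, (String, bool)>, // key => (value, explicit)
    }

    impl Record {
        // A record counts as "unstructured" when no field was explicitly set.
        fn is_unstructured(&self) -> bool {
            self.fields.values().all(|(_, explicit)| !explicit)
        }
    }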

Examples

TCP -> TCP

[sources.in]
  type = "tcp"
  # ...

[sinks.out]
  inputs = ["in"]
  type = "tcp"
  # ...
  1. Input: "Hello word" raw text line.

  2. tcp (in) receives data and represents it as:

    {
      "timestamp" => (<timestamp>, false)
      "message" => ("Hello world", false),
      "host" => ("my.host.com", false)
    }
    

    Where false means that the data was implicitly set.

  3. tcp (out) writes:

    Hello world\n
    

    Only the raw "message" field is written to the TCP sink. This is because the data structured is recognized as "unstructured" since all values are false (implcitly set). This is is the default behavior for unsutrdcutred data in the TCP sink.
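
A rough sketch of that decision, reusing the hypothetical Record type from above (not the actual sink code):

    // Sketch only: the TCP sink's default behavior under this proposal.
    // If every field is implicit, write just the raw "message"; otherwise
    // the record is structured and needs a configured encoder.
    fn encode_line(record: &Record) -> Option<String> {
        if record.is_unstructured() {
            record
                .fields
                .get("message")
                .map(|(value, _)| format!("{}\n", value))
        } else {
            None // handled by the configured encoder, see the next example
        }
    }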

TCP -> JSON Parser -> TCP

[sources.in]
  type = "tcp"
  # ...

[transforms.json]
  inputs = ["in"]
  type = "parser"
  format = "json"
  # ...

[sinks.out]
  inputs = ["json"]
  type = "tcp"
  encoder = "json" # required
  # ...
  1. Input: '{"msg": "Hello word", "key": "val"}'

  2. tcp (in) receives data and represents it as:

    {
      "timestamp" => (<timestamp>, false)
      "message" => ('{"msg": "Hello word"}', false),
      "host" => ("my.host.com", false)
    }
    

    Where false means that the data was implicitly set.

  3. transform (json parser) transforms the data into:

    {
      "timestamp" => (<timestamp>, false)
      "message" => ('{"msg": "Hello word"}', false),
      "host" => ("my.host.com", false),
      "msg" => ("Hello world", true),
      "key" => ("val", true)
    }
    
  4. tcp (out) writes:

    {"msg": "Hello world", "key": "val"}
    

    You'll notice the sinks.out declaration includes a required encoder option since it is receiving structured data. This will be handled via Implement topology-aware validations #235. You'll also notice that metadata fields are not included by default. This is because these are internal/transparent fields that are only used when necessary or explicitly included (happy to hear arguments otherwise). A sketch of this JSON path follows below.
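
A rough sketch of that JSON path, again reusing the hypothetical Record type above and assuming serde_json (sketch only, not the actual encoder):

    use serde_json::{Map, Value};

    // Sketch only: encode the explicitly set fields as JSON and leave out the
    // implicit/internal metadata (timestamp, host, message) unless the user
    // opts in elsewhere.
    fn encode_json(record: &Record) -> String {
        let map: Map<String, Value> = record
            .fields
            .iter()
            .filter(|(_, (_, explicit))| *explicit)
            .map(|(key, (value, _))| (key.clone(), Value::String(value.clone())))
            .collect();
        Value::Object(map).to_string()
    }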

TCP -> Splunk

[sources.in]
  type = "tcp"
  # ...

[sinks.out]
  inputs = ["in"]
  type = "splunk"
  1. Input: "Hello word" raw text lin..

  2. tcp (in) receives data and represents it as:

    {
      "timestamp" => (<timestamp>, false)
      "message" => ("Hello world", false),
      "host" => ("my.host.com", false)
    }
    

    Where false means that the data was implicitly set.

  3. splunk (out) forwards the "message" but also specifies the host, since Splunk requires this metadata. By default, the sink looks for the "host" key, since this is one of our "common" keys, but the user can change that by setting the host_field option in the sinks.out declaration (a sketch of this lookup follows below).
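
A rough sketch of that lookup (sketch only; host_field is the proposed option name, whatever we end up calling it):

    // Sketch only: resolve the host value the Splunk sink should send.
    // Default to the common "host" key, but let the user override it via
    // the sink's host_field option.
    fn resolve_host<'a>(record: &'a Record, host_field: Option<&str>) -> Option<&'a str> {
        let key = host_field.unwrap_or("host");
        record.fields.get(key).map(|(value, _)| value.as_str())
    }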

Requirements Checklist

This is a checklist to ensure we're handling all of the little details that come with this change. If it helps, we can break these out into separate issues, because I would assume they'll be separate PRs.

  • Decide on reserved field names. Ex: timestamp, message, and host. Alternatively, we could namespace the keys like _timestamp, _message, and _host. Feel free to go with these or choose entirely different names; I'm indifferent.

Sources

  • file source adds a timestamp field to represent when the record was received
  • file source includes the host key for the local server.
  • file source includes a file_key config option to control the "file" context key name.
  • file source includes a host_key config option to control the "host" context key name.
  • The above file source behavior is tested.
  • syslog source adds a timestamp field to represent when the record was received
  • syslog source includes the host key when in tcp mode; this represents the host of the client.
  • syslog source includes the host key when in unix mode; this should be the local host.
  • syslog source includes a host_key config option to control the "host" context key name.
  • The above syslog source behavior is tested.
  • stdin source adds a timestamp field to represent when the record was received
  • stdin source includes the host key for the local server.
  • stdin source includes a host_key config option to control the "host" context key name.
  • The above stdin source behavior is tested.
  • tcp source adds a timestamp field to represent when the record was received
  • tcp source includes the host key for the remote server.
  • tcp source includes a host_key config option to control the "host" context key name.
  • The above tcp source behavior is tested.

Transforms

  • add_field transform sets any added fields as explicit.
  • json_parser transform decodes data and sets all decoded fields as explicit.
  • regex_parser transform sets any extracted fields as explicit.

Sinks

  • cloudwatch_logs sink forwards the message field only if the record is entirely implicitly structured.
  • cloudwatch_logs sink encodes data (explicit and implicit keys) as JSON, regardless of whether the entire map is implicitly structured.
  • cloudwatch_logs sink maps the timestamp field to Cloudwatch's timestamp field and drops that field before encoding the data.
  • console sink prints the message field if the record is entirely implicitly structured.
  • console sink encodes the data to JSON if the record is not entirely implicitly structured. This payload should include all keys.
  • console sink provides an encoding option with json and text values. If left unspecified, the behavior is dynamic, choosing the encoding on a per-record basis based on its explicit structured state.
  • elasticsearch sink encodes data (explicit and implicit keys) as JSON, regardless of whether the entire map is implicitly structured.
  • http sink only includes the raw message if the record is entirely implicitly structured. This should be text/plain, new line delimited.
  • http sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured. This should be application/ndjson (new line delimited).
  • http sink encodes data (explicit and implicit keys) as JSON, regardless of whether the entire map is implicitly structured.
  • kafka sink only includes the raw message if the record is entirely implicitly structured.
  • kafka sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
  • kinesis sink only includes the raw message if the record is entirely implicitly structured.
  • kinesis sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
  • s3 sink only includes the raw message if the record is entirely implicitly structured.
  • s3 sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
  • splunk sink only includes the raw message if the record is entirely implicitly structured.
  • splunk sink maps the host field appropriately (should this also be dropped?)
  • splunk sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
  • tcp sink only includes the raw message if the record is entirely implicitly structured.
  • tcp sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
@LucioFranco (Contributor)

@lukesteensen so I talked to @binarylogic about this earlier today. For http and splunk, how do we handle partial JSON events and non-JSON events in the same batch? Say we get an event that is not structured, so we encode it as text/plain, and then the following event is structured, so we encode it as application/x-ndjson. These would end up in the same batch; we can no longer use the same Content-Type for the request, and thus the request would fail.

So two questions:

  1. How do we handle multiple encodings in a batch and how does the batch communicate this to the batch fn we provide in the sink?
  2. How do we set the content type dynamically for each type of event in the http sinks?

@lukesteensen (Member)

That's a great question. The intent is definitely that every event following the same path through the topology should be treated the same way. I feel like the real answer is to move the explicit vs. implicit tracking out of the actual map of fields and up into the type system (i.e. figure it out in input_type and output_type based on the config), but I don't have a great handle on how much work that would be.

@LucioFranco (Contributor)

Yeah, I think this is something we will have to think about.

Second question, @lukesteensen @binarylogic: for stdin and for file, what would the host be?

@binarylogic (Contributor, Author)

The host is the local host for those sources. I assume Rust has a function for retrieving that?

In regards to the mixed payload types, I'm of the mindset right now that we follow whatever the first record does. Given Luke's comment, it should be very edge-casey that we'll see mixed records. For sinks that batch we definitely should not be mixing encoding types. The only time that is appropriate is for sinks like the console sink, where the events are streamed.

@LucioFranco (Contributor)

@binarylogic does that mean like Lucio.Macbook host or the ip? For the IP we would need to query the network interfaces; I don't think there is anything in the standard library for this.

@binarylogic (Contributor, Author)

I defer to @lukesteensen for that one then.

@lukesteensen (Member)

does that mean like Lucio.Macbook host or the ip?

It should match what we interpolate into configs as $HOSTNAME, which we figure out here. Basically check env vars and fall back to hostname::get_hostname().
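
Roughly (sketch only; assumes the hostname crate's get_hostname() and that the env var we check is HOSTNAME):

    // Sketch of the $HOSTNAME resolution described above: prefer the
    // environment, fall back to the OS hostname.
    fn local_hostname() -> Option<String> {
        std::env::var("HOSTNAME")
            .ok()
            .or_else(hostname::get_hostname)
    }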

For sinks that batch we definitely should not be mixing encoding types. The only time that is appropriate is for sinks like the console sink, where the events are streamed.

Even then I don't think it's appropriate. The "type" of record that comes out of a sink should always be consistent for a given config imo. We could maybe have an encoder option that'd let people opt into mixed encodings, but I'd be pretty surprised if anyone wanted it. To me it feels like a recipe for random unexpected failures in downstream components.

@binarylogic (Contributor, Author) commented May 15, 2019

To me it feels like a recipe for random unexpected failures in downstream components.

Agreed. We've decided to make the encoding option required for any sink that batches; this way there are no surprises.

binarylogic added the domain: logs and domain: data model labels and removed the event type: log label on Aug 6, 2020