Delineate internal metadata fields with a boolean identifier #295

Closed
43 tasks done
binarylogic opened this issue Apr 17, 2019 · 8 comments

Labels: domain: data model, domain: logs, type: enhancement

@binarylogic (Contributor) commented Apr 17, 2019

This is the successor to #256

There is a complex matrix of possible behaviors when you introduce structured vs. non-structured data types across different topologies. This has been discussed at length.

The purpose of this issue is to materialize the conclusion we came to today in Slack so that we can move forward with a number of other dependent issues, such as specifying the hostname key in various sinks, or specifying a target key when parsing a log message, etc.

Quick Background

  1. The issue was originally raised via [RFC] Drop the host & line keys in Record struct #155
  2. The "Structured vs Non-Structured" RFC was created to address this: https://www.notion.so/timber/Raw-vs-Structured-Data-for-Sinks-f4a73d250d88427ca677ea0954e3ea75
  3. This was implemented via Refactor Record and use bytes instead of String #204, which introduced a performance regression in the TCP source.
  4. Perf improvements #269 resolved this performance regression by moving host back to a defined Record field.
  5. This change caused a regression around how we handle the host key in the Splunk sink (or any sink that required host): Make splunk to use record host field #276
  6. Which then led to this comment Make splunk to use record host field #276 (comment)
  7. Which then led to another long Slack discussion.

Solution

The solution today, which was proposed by @michaelfairley, is to delineate fields that were implicitly and explicitly set with a boolean. In other words, to change the structured map from string => string to string => (bool, string). This means we have one single map representing all structured data, with a simple boolean telling us if it was implicitly or explicitly set.
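
Roughly, in Rust, the proposal looks something like this (a minimal sketch; the names are illustrative, not the actual Record definition):

    use std::collections::HashMap;

    // Sketch only: every value carries a flag telling us whether it was
    // explicitly set (parser, add_field, ...) or implicitly set by a source
    // (timestamp, host, the raw message).
    struct Record {
        fields: HashMap<String, (String, bool)>, // key => (value, explicit)
    }

    impl Record {
        // A record counts as "unstructured" when no field was explicitly set.
        fn is_unstructured(&self) -> bool {
            self.fields.values().all(|(_, explicit)| !explicit)
        }
    }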

Examples

TCP -> TCP

[sources.in]
  type = "tcp"
  # ...

[sinks.out]
  inputs = ["in"]
  type = "tcp"
  # ...
  1. Input: "Hello word" raw text line.

  2. tcp (in) receives data and represents it as:

    {
      "timestamp" => (<timestamp>, false)
      "message" => ("Hello world", false),
      "host" => ("my.host.com", false)
    }
    

    Where false means that the data was implicitly set.

  3. tcp (out) writes:

    Hello world\n
    

    Only the raw "message" field is written to the TCP sink. This is because the data structured is recognized as "unstructured" since all values are false (implcitly set). This is is the default behavior for unsutrdcutred data in the TCP sink.
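
A rough sketch of that decision, reusing the hypothetical Record type from above (not the actual sink code):

    // Sketch only: the TCP sink's default behavior under this proposal.
    // If every field is implicit, write just the raw "message"; otherwise
    // the record is structured and needs a configured encoder.
    fn encode_line(record: &Record) -> Option<String> {
        if record.is_unstructured() {
            record
                .fields
                .get("message")
                .map(|(value, _)| format!("{}\n", value))
        } else {
            None // handled by the configured encoder, see the next example
        }
    }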

TCP -> JSON Parser -> TCP

[sources.in]
  type = "tcp"
  # ...

[transforms.json]
  inputs = ["in"]
  type = "parser"
  format = "json"
  # ...

[sinks.out]
  inputs = ["json"]
  type = "tcp"
  encoder = "json" # required
  # ...
  1. Input: '{"msg": "Hello word", "key": "val"}'

  2. tcp (in) receives data and represents it as:

    {
      "timestamp" => (<timestamp>, false)
      "message" => ('{"msg": "Hello word"}', false),
      "host" => ("my.host.com", false)
    }
    

    Where false means that the data was implicitly set.

  3. transform (json parser) transforms the data into:

    {
      "timestamp" => (<timestamp>, false)
      "message" => ('{"msg": "Hello word"}', false),
      "host" => ("my.host.com", false),
      "msg" => ("Hello world", true),
      "key" => ("val", true)
    }
    
  4. tcp (out) writes:

    {"msg": "Hello world", "key": "val"}
    

    You'll notice the sinks.out declaration includes a required encoder option since it is receiving structured data. This will be handled via Implement topology-aware validations #235. You'll also notice that metadata fields are not included by default. This is because these are internal/transparent fields that are only used when necessary or explicitly included (happy to hear arguments otherwise). A sketch of this JSON path follows below.
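
A rough sketch of that JSON path, again reusing the hypothetical Record type above and assuming serde_json (sketch only, not the actual encoder):

    use serde_json::{Map, Value};

    // Sketch only: encode the explicitly set fields as JSON and leave out the
    // implicit/internal metadata (timestamp, host, message) unless the user
    // opts in elsewhere.
    fn encode_json(record: &Record) -> String {
        let map: Map<String, Value> = record
            .fields
            .iter()
            .filter(|(_, (_, explicit))| *explicit)
            .map(|(key, (value, _))| (key.clone(), Value::String(value.clone())))
            .collect();
        Value::Object(map).to_string()
    }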

TCP -> Splunk

[sources.in]
  type = "tcp"
  # ...

[sinks.out]
  inputs = ["in"]
  type = "splunk"
  1. Input: "Hello word" raw text lin..

  2. tcp (in) receives data and represents it as:

    {
      "timestamp" => (<timestamp>, false)
      "message" => ("Hello world", false),
      "host" => ("my.host.com", false)
    }
    

    Where false means that the data was implicitly set.

  3. splunk (out) forwards the "message" but also specifies the host, since Splunk requires this metadata. By default, the sink looks for the "host" key, since this is one of our "common" keys, but the user can change that by setting the host_field option in the sinks.out declaration (a sketch of this lookup follows below).
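
A rough sketch of that lookup (sketch only; host_field is the proposed option name, whatever we end up calling it):

    // Sketch only: resolve the host value the Splunk sink should send.
    // Default to the common "host" key, but let the user override it via
    // the sink's host_field option.
    fn resolve_host<'a>(record: &'a Record, host_field: Option<&str>) -> Option<&'a str> {
        let key = host_field.unwrap_or("host");
        record.fields.get(key).map(|(value, _)| value.as_str())
    }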

Requirements Checklist

This is a checklist to ensure we're handling all of the little details that come with this change. If it helps, we can break these out into separate issues, because I would assume they'll be separate PRs.

  • Decide on reserved field names. Ex: timestamp, message, and host. Alternatively, we could namespace the keys like _timestamp, _message, and _host. Feel free to go with these or choose entirely different names; I'm indifferent.

Sources

  • file source adds a timestamp field to represent when the record was received
  • file source includes the host key for the local server.
  • file source includes a file_key config option to control the "file" context key name.
  • file source includes a host_key config option to control the "host" context key name.
  • The above file source behavior is tested.
  • syslog source adds a timestamp field to represent when the record was received
  • syslog source includes the host key when in tcp mode; this represents the host of the client.
  • syslog source includes the host key when in unix mode; this should be the local host.
  • syslog source includes a host_key config option to control the "host" context key name.
  • The above syslog source behavior is tested.
  • stdin source adds a timestamp field to represent when the record was received
  • stdin source includes the host key for the local server.
  • stdin source includes a host_key config option to control the "host" context key name.
  • The above stdin source behavior is tested.
  • tcp source adds a timestamp field to represent when the record was received
  • tcp source includes the host key for the remote server.
  • tcp source includes a host_key config option to control the "host" context key name.
  • The above tcp source behavior is tested.

Transforms

  • add_field transform sets any added fields as explicit.
  • json_parser transform decodes data and sets all decoded fields as explicit.
  • regex_parser transform sets any extracted fields as explicit.

Sinks

  • cloudwatch_logs sink forwards the message field only if the record is entirely implicitly structured.
  • cloudwatch_logs sink encodes data (explicit and implicit keys) as JSON, regardless of whether the entire map is implicitly structured.
  • cloudwatch_logs sink maps the timestamp field to Cloudwatch's timestamp field and drops that field before encoding the data.
  • console sink prints the message field if the record is entirely implicitly structured.
  • console sink encodes the data to JSON if the record is not entirely implicitly structured. This payload should include all keys.
  • console sink provides an encoding option with json and text values. If left unspecified, the behavior is dynamic, choosing the encoding on a per-record basis based on its explicit structured state.
  • elasticsearch sink encodes data (explicit and implicit keys) as JSON, regardless of whether the entire map is implicitly structured.
  • http sink only includes the raw message if the record is entirely implicitly structured. This should be text/plain, new line delimited.
  • http sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured. This should be application/ndjson (new line delimited).
  • http sink encodes data (explicit and implicit keys) as JSON, regardless of whether the entire map is implicitly structured.
  • kafka sink only includes the raw message if the record is entirely implicitly structured.
  • kafka sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
  • kinesis sink only includes the raw message if the record is entirely implicitly structured.
  • kinesis sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
  • s3 sink only includes the raw message if the record is entirely implicitly structured.
  • s3 sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
  • splunk sink only includes the raw message if the record is entirely implicitly structured.
  • splunk sink maps the host field appropriately (should this also be dropped?)
  • splunk sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
  • tcp sink only includes the raw message if the record is entirely implicitly structured.
  • tcp sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
@LucioFranco (Contributor)

@lukesteensen so I talked to @binarylogic about this earlier today. For http and splunk, how do we handle partial JSON events and non-JSON events in the same batch? Say we get an event that is not structured, so we encode it as text/plain, and then the following event is structured, so we encode it as application/x-ndjson. These would end up in the same batch; we can no longer use the same Content-Type for the request, and thus the request would fail.

So two questions:

  1. How do we handle multiple encodings in a batch and how does the batch communicate this to the batch fn we provide in the sink?
  2. How do we set the content type dynamically for each type of event in the http sinks?

@lukesteensen (Member)

That's a great question. The intent is definitely that every event following the same path through the topology should be treated the same way. I feel like the real answer is to move the explicit vs. implicit tracking out of the actual map of fields and up into the type system (i.e. figure it out in input_type and output_type based on the config), but I don't have a great handle on how much work that would be.

@LucioFranco (Contributor)

Yeah, I think this is something we will have to think about.

Second question, @lukesteensen @binarylogic: for stdin and for file, what would the host be?

@binarylogic (Contributor, Author)

The host is the local host for those sources. I assume Rust has a function for retrieving that?

In regards to the mixed payload types, I'm of the mindset right now that we follow whatever the first record does. Given Luke's comment, it should be very edge-casey that we'll see mixed records. For sinks that batch we definitely should not be mixing encoding types. The only time that is appropriate is for sinks like the console sink, where the events are streamed.

@LucioFranco (Contributor)

@binarylogic does that mean like Lucio.Macbook host or the ip? For the IP we would need to query the network interfaces; I don't think there is anything in the standard library for this.

@binarylogic (Contributor, Author)

I defer to @lukesteensen for that one then.

@lukesteensen (Member)

does that mean like Lucio.Macbook host or the ip?

It should match what we interpolate into configs as $HOSTNAME, which we figure out here. Basically check env vars and fall back to hostname::get_hostname().
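
Roughly (sketch only; assumes the hostname crate's get_hostname() and that the env var we check is HOSTNAME):

    // Sketch of the $HOSTNAME resolution described above: prefer the
    // environment, fall back to the OS hostname.
    fn local_hostname() -> Option<String> {
        std::env::var("HOSTNAME")
            .ok()
            .or_else(hostname::get_hostname)
    }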

For sinks that batch we definitely should not be mixing encoding types. The only time that is appropriate is for sinks like the console sink, where the events are streamed.

Even then I don't think it's appropriate. The "type" of record that comes out of a sink should always be consistent for a given config imo. We could maybe have an encoder option that'd let people opt into mixed encodings, but I'd be pretty surprised if anyone wanted it. To me it feels like a recipe for random unexpected failures in downstream components.

@binarylogic (Contributor, Author) commented May 15, 2019

To me it feels like a recipe for random unexpected failures in downstream components.

Agreed. We've decided to make the encoding option required for any sink that batches; this way there are no surprises.

binarylogic added the domain: logs and domain: data model labels and removed the event type: log label on Aug 6, 2020