Formalize Vector's configuration schema #9115

binarylogic · 2021-09-11T19:59:24Z

There are a number of problems with Vector's configuration that can be solved with a single source of truth that drives documentation, validation, and translation:

- Multiple sources of truth. Vector's configuration schema is defined in .cue files for documentation purposes, but the real schema is defined within the code via serde macros. This creates misalignment that results in bug reports and surprising configuration errors.
- Difficult to test. A common bug in Vector is configuration options not decoding as expected. Testing a configuration option currently requires end-to-end testing of the whole feature, which makes it much harder to cover. We should be able to test the configuration decoding step directly, without the overhead of full integration tests.
- Poor backward/forward compatibility. Vector's backward compatibility consists of defining aliases for old fields. Old fields are never pruned, and users are not notified that they need to adjust their naming.
- No deprecation strategy. There is no strategy for deprecating Vector's options, and users are not alerted when they use a deprecated option. Moreover, we don't prune deprecated options since it's not clear which ones are deprecated.
- Difficult to integrate. Users use different tools to define and validate Vector configuration. For example, a user created a Vector jsonnet library to more easily define Vector configuration. The lack of a common schema makes this very difficult.
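
To illustrate the "difficult to test" point, here is a minimal sketch of what testing the decoding step in isolation could look like. Everything in it is hypothetical: `DemoSinkConfig`, its fields, the default of 1000, and the hand-rolled `decode` function are stand-ins chosen so the example needs no external crates. Vector's real decoding goes through serde, which this sketch does not attempt to reproduce.

```rust
use std::collections::BTreeMap;

/// Hypothetical, pared-down sink configuration (not one of Vector's real types).
#[derive(Debug, PartialEq)]
pub struct DemoSinkConfig {
    pub bucket: String,
    pub batch_max_events: usize,
}

/// Decode a flat key/value map into `DemoSinkConfig`.
/// This is a dependency-free stand-in for a serde-based decoder.
pub fn decode(raw: &BTreeMap<&str, &str>) -> Result<DemoSinkConfig, String> {
    // Reject unknown keys up front so that typos surface as errors.
    for key in raw.keys() {
        if !["bucket", "batch_max_events"].contains(key) {
            return Err(format!("unknown field `{}`", key));
        }
    }
    // `bucket` is required.
    let bucket = raw
        .get("bucket")
        .ok_or_else(|| "missing field `bucket`".to_string())?
        .to_string();
    // `batch_max_events` is optional and falls back to an assumed default.
    let batch_max_events = match raw.get("batch_max_events") {
        Some(value) => value
            .parse::<usize>()
            .map_err(|e| format!("invalid `batch_max_events`: {}", e))?,
        None => 1000,
    };
    Ok(DemoSinkConfig { bucket, batch_max_events })
}
```

Because `decode` is a pure function, a unit test can feed it a map and assert on the result without starting a sink or any other component.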

Related Issues

I tried my best to reference all of the relevant issues, but I am certain there are many more.

Cross cutting concerns

Deprecation and backward compatibility
- Know which version a deprecation was introduced
- Handle backward compatibility between versions
- Handle notifying the user that they are using deprecated options
Configuration validation
- Validate a JSON payload against the Vector schema outside of the Vector binary
- Real-time validation within an editor
Documentation
- Drive documentation by becoming the source of truth (possibly replacing or augmenting the cue data)
UI
- Create a real contract with the upcoming UI
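
The deprecation concerns above (knowing the version a deprecation was introduced in, and notifying the user) could be served by attaching a small metadata record to each deprecated option. The following is only a sketch under that assumption: the `Deprecation` struct, the `deprecation_warnings` function, and the message wording are all hypothetical, and nothing like this is claimed to exist in Vector today.

```rust
/// Hypothetical deprecation record for a single configuration option.
pub struct Deprecation {
    /// Name of the deprecated option.
    pub field: &'static str,
    /// Version in which the deprecation was introduced.
    pub since: &'static str,
    /// Option the user should switch to, if there is one.
    pub replacement: Option<&'static str>,
}

/// Return one warning per deprecated option that appears in `used`.
/// Warnings follow the order of `table`, not the order of `used`.
pub fn deprecation_warnings(used: &[&str], table: &[Deprecation]) -> Vec<String> {
    table
        .iter()
        .filter(|d| used.contains(&d.field))
        .map(|d| match d.replacement {
            Some(new_name) => format!(
                "`{}` is deprecated since {}; use `{}` instead",
                d.field, d.since, new_name
            ),
            None => format!("`{}` is deprecated since {}", d.field, d.since),
        })
        .collect()
}
```

With a table like this, the list of keys found in a user's configuration can be checked at load time, and the recorded `since` version gives a basis for deciding when an option is old enough to prune.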

Proposal

To solve this we should converge on a common schema specification for Vector's configuration. JSON schema jumps out as the winner since it is easily understood by humans, parseable, extendable, and supported by many different languages and tools. We can achieve this a couple of ways:

1. Derive a JSON schema from our cue definitions.
2. Manually maintain a JSON schema and incorporate that into our cue definitions.
3. Any others?

I prefer 1 since cue is much more flexible. It allows us to reduce boilerplate, incorporate stricter validation, etc. I think we should consider decoupling our reference cue definitions from the website and defining a separate library. This library's single purpose is to expose Vector's internal configuration schema in a purpose agnostic format. Then our cue data for documentation can include this library and augment it as necessary.


spencergilbert · 2021-09-13T14:08:11Z

IIRC cue can't directly export json schema today - would we be looking to upstream support, or do some intermediary transformations with other tools?

lukesteensen · 2021-09-17T01:47:22Z

Very much agree with the problem statements here. What's not addressed in the proposal is how to ensure the actual Rust code remains in sync with the specification. Given that we want a single source of truth, that leaves our options as either (1) generating Rust code from an external source of truth (e.g. JSON Schema, cue, etc), or (2) generating those external representations from the Rust code.

There's prior art for going in either direction (e.g. quicktype and the schemars crate), but both have meaningful downsides. Generating Rust structs would make them less explicit in the code, less ergonomic to work with, etc. Generating JSON schemas from hand-written Rust would require a compile-and-run cycle of Vector to update the derived definitions for use with things like the website, which is a big dependency.

My biased opinion is to lean towards generating JSON schemas from the Rust code. The real source of truth is how Vector behaves, which is governed by the code. Making that code generated would obscure it, making it harder to debug, onboard contributors, etc. And if we really want to ensure that the behavior is always in line with an external definition, it'd be hard to avoid that compile step anyway.
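
As a rough illustration of the "generate the schema from the Rust code" direction, the sketch below renders a JSON Schema object from a table of field descriptors that lives next to the code. The `Field` struct and `render_schema` function are hypothetical and written by hand here; the sketch does not escape names, handles only flat objects, and is not how schemars or any existing crate works internally.

```rust
/// Hypothetical descriptor for one configuration field.
pub struct Field {
    pub name: &'static str,
    /// JSON Schema type name, e.g. "string" or "integer".
    pub json_type: &'static str,
    pub required: bool,
}

/// Render a flat JSON Schema object from field descriptors.
/// Names are inserted verbatim, so they must not need JSON escaping.
pub fn render_schema(fields: &[Field]) -> String {
    // One `"name":{"type":"..."}` entry per field.
    let properties: Vec<String> = fields
        .iter()
        .map(|f| format!("\"{}\":{{\"type\":\"{}\"}}", f.name, f.json_type))
        .collect();
    // Quoted names of the required fields only.
    let required: Vec<String> = fields
        .iter()
        .filter(|f| f.required)
        .map(|f| format!("\"{}\"", f.name))
        .collect();
    format!(
        "{{\"type\":\"object\",\"properties\":{{{}}},\"required\":[{}],\"additionalProperties\":false}}",
        properties.join(","),
        required.join(",")
    )
}
```

The point of the sketch is the direction of the dependency: the descriptors sit beside the code that decodes the configuration, and the external schema is an output that can be regenerated whenever the code changes.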

Probably the nicest option for generating JSON schemas from our config structs would be to follow the pattern of procedural macros like serde and schemars, but build our own layer on top. That'd be a little bit of work, but you can imagine a final definition looking something like the following:

#[derive(VectorConfig)]
pub struct S3SinkConfig {
    pub bucket: String,
    pub key_prefix: Option<Template>,
    pub options: S3Options,
    pub region: RegionOrEndpoint,
    pub encoding: EncodingConfig<Encoding>,
    #[default("gzip")]
    pub compression: Compression,
    pub batch: BatchConfig,
    pub request: TowerRequestConfig,
    #[deprecated(since = "0.15.0", replacement = "auth")]
    pub assume_role: Option<String>,
    pub auth: AwsAuthentication,
}

This would let us have a library of common behaviors in terms of nesting, notification for deprecated fields, parsing human-friendly units for options like counts and byte sizes, etc. It could also get rid of some of our existing boilerplate like inventory::submit, and maybe clean up our currently quite hacky version of env var interpolation.
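
One of the shared behaviors mentioned above, parsing human-friendly units for byte sizes, might look roughly like the sketch below. The function name `parse_byte_size` and the accepted units (`b`, `kib`, `mib`, `gib`, case-insensitive) are assumptions made for illustration; the thread does not specify which units Vector would accept.

```rust
/// Parse a size such as "512", "10MiB" or "1 kib" into a byte count.
/// The set of accepted units is an assumption made for this sketch.
pub fn parse_byte_size(input: &str) -> Result<u64, String> {
    let s = input.trim();
    // Split at the first character that is not an ASCII digit.
    let split = s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len());
    let (digits, unit) = s.split_at(split);
    let value: u64 = digits
        .parse()
        .map_err(|_| format!("invalid number in `{}`", input))?;
    // Units are binary multiples and matched case-insensitively.
    let multiplier: u64 = match unit.trim().to_ascii_lowercase().as_str() {
        "" | "b" => 1,
        "kib" => 1 << 10,
        "mib" => 1 << 20,
        "gib" => 1 << 30,
        other => return Err(format!("unknown unit `{}`", other)),
    };
    // Fail instead of silently wrapping on very large inputs.
    value
        .checked_mul(multiplier)
        .ok_or_else(|| format!("`{}` overflows u64", input))
}
```

Keeping a helper like this in one shared place would mean every option that takes a size accepts the same spellings and reports the same errors.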

And most importantly, we could either delegate to something like schemars or generate our own implementation of the logic to output a JSON schema that aligns with how that struct will be parsed (which would be a great proptest target btw, with the property being anything that satisfies the schema should be parsed and anything that doesn't should fail).
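
The property described above (anything that satisfies the schema parses, anything that does not is rejected) can be sketched on a toy example. The code below is not a real proptest target: it checks a fixed list of candidates instead of generating inputs, and both the "schema" and the parser are hypothetical stand-ins for a single port-like option.

```rust
/// Toy "schema" for a port: one to five ASCII digits with a value in 1..=65535.
pub fn satisfies_schema(input: &str) -> bool {
    if input.is_empty() || input.len() > 5 || !input.bytes().all(|b| b.is_ascii_digit()) {
        return false;
    }
    // At most five digits, so the value always fits in a u32.
    let value: u32 = input.parse().unwrap();
    (1..=65535).contains(&value)
}

/// Toy parser standing in for the real decoding logic.
pub fn parse_port(input: &str) -> Option<u16> {
    input.parse::<u16>().ok().filter(|&port| port != 0)
}

/// Return the first candidate on which the schema and the parser disagree.
pub fn first_disagreement(candidates: &[&str]) -> Option<String> {
    for &candidate in candidates {
        if satisfies_schema(candidate) != parse_port(candidate).is_some() {
            return Some(candidate.to_string());
        }
    }
    None
}
```

Even this toy version shows the kind of drift the property would catch: Rust's integer parsing accepts a leading `+`, so `"+80"` parses successfully while the digit-only schema rejects it, and `first_disagreement` reports that input.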

I'm not entirely set on the idea given the complexity and downsides, but I do think it has enough potential to be worth some discussion.

jszwedko · 2022-04-08T17:23:54Z

Closing this in lieu of #12141. This is more of a problem statement.

binarylogic added the type: enhancement, domain: config, and domain: external docs labels Sep 11, 2021

binarylogic changed the title ~~Drive Vector's configuration with a JSON schema definition~~ Drive Vector's configuration decoding with a JSON schema definition Sep 11, 2021

binarylogic changed the title ~~Drive Vector's configuration decoding with a JSON schema definition~~ Formalize Vector's configuration schema Oct 5, 2021

binarylogic mentioned this issue Oct 5, 2021

Formalize Vector's configuration schema RFC #9481

Closed

binarylogic mentioned this issue Oct 20, 2021

ARC was not enabled by default #9727

Closed

binarylogic mentioned this issue Dec 9, 2021

enhancement(config): Add acknowledgement config to globals #10374

Merged

jszwedko mentioned this issue Dec 28, 2021

rename of splunk_hec token to default_token not documented #10610

Closed

jszwedko mentioned this issue Feb 1, 2022

Configuration error. error=sinks.my_sink_id: invalid type: unit value, expected string or map at line 19 column 9 #11141

Closed

binarylogic added the needs: rfc label Feb 13, 2022

tobz self-assigned this Feb 22, 2022

leebenson mentioned this issue Feb 25, 2022

feat(cli): Add vector config subcommand to output a normalized configuration #11442

Merged

hhromic mentioned this issue Apr 7, 2022

Config: decoding and encoding in sources/sinks are not fully validated #12127

Closed

jszwedko closed this as completed Apr 8, 2022

shymega mentioned this issue May 25, 2022

Improve error when a table is passed for a single key #2066

Open
