Formalize Vector's configuration schema #9115

Closed
10 of 22 tasks
binarylogic opened this issue Sep 11, 2021 · 3 comments
Labels
  • domain: config (Anything related to configuring Vector)
  • domain: external docs (Anything related to Vector's external, public documentation)
  • needs: rfc (Needs an RFC before work can begin.)
  • type: enhancement (A value-adding code change that enhances its existing functionality.)

Comments

@binarylogic
Contributor

binarylogic commented Sep 11, 2021

There are a number of problems with Vector's configuration that can be solved with a single source of truth that drives documentation, validation, and translation:

  1. Multiple sources of truth. Vector's configuration schema is defined in .cue files for documentation purposes, but the real schema is defined within the code via serde macros. This creates misalignment that results in bug reports and surprising configuration errors.
  2. Difficult to test. A common bug in Vector is configuration options not decoding as expected. Testing a configuration option currently requires full end-to-end testing of the feature, which makes these bugs much harder to catch. We should be able to test the configuration decoding step directly, without the overhead of full integration tests.
  3. Poor backward/forward compatibility. Vector's backward compatibility consists of defining aliases for old fields. Old fields are never pruned and users are not notified that they need to adjust their naming.
  4. No deprecation strategy. There is no strategy for deprecating Vector's options, and users are not alerted when they use a deprecated option. Moreover, we don't prune deprecated options since it's not clear which ones are deprecated.
  5. Difficult to integrate. Users use different tools to define and validate Vector configuration. For example, a user created a Vector jsonnet library to more easily define Vector configuration. Lack of a common schema makes this very difficult.

Related Issues

I tried my best to reference all of the relevant issues, but I am certain there are many more.

Cross cutting concerns

  1. Deprecation and backward compatibility
    • Know which version a deprecation was introduced
    • Handle backward compatibility between versions
    • Handle notifying the user that they are using deprecated options
  2. Configuration validation.
    • Validate a JSON payload against the Vector schema outside of the Vector binary
    • Real-time validation within an editor
  3. Documentation
    • Drive documentation by becoming the source of truth (possibly replacing or augmenting the cue data)
  4. UI
    • Create a real contract with the upcoming UI
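
As a rough illustration of the first concern (knowing when an option was deprecated and notifying the user), here is a minimal sketch. The `Deprecation` struct, the `deprecation_warnings` function, and the registry layout are all hypothetical names invented for this example; nothing here is existing Vector code.

```rust
use std::collections::BTreeMap;

/// Hypothetical metadata for one deprecated option (not part of Vector).
#[derive(Debug, Clone, PartialEq)]
struct Deprecation {
    /// Version in which the option was deprecated.
    since: &'static str,
    /// Name of the option that replaces it, if any.
    replacement: Option<&'static str>,
}

/// Returns one warning per deprecated key found in `config_keys`.
fn deprecation_warnings(
    registry: &BTreeMap<&'static str, Deprecation>,
    config_keys: &[&str],
) -> Vec<String> {
    config_keys
        .iter()
        .filter_map(|key| {
            registry.get(*key).map(|dep| match dep.replacement {
                Some(new) => format!(
                    "option `{}` is deprecated since {}; use `{}` instead",
                    key, dep.since, new
                ),
                None => format!("option `{}` is deprecated since {}", key, dep.since),
            })
        })
        .collect()
}

fn main() {
    // Assumed example entry: an old option superseded by a newer one.
    let mut registry = BTreeMap::new();
    registry.insert(
        "assume_role",
        Deprecation { since: "0.15.0", replacement: Some("auth") },
    );

    for warning in deprecation_warnings(&registry, &["bucket", "assume_role"]) {
        eprintln!("{}", warning);
    }
}
```

Recording the version alongside each deprecation is what would later make it possible to prune options on a schedule rather than keeping aliases forever.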

Proposal

To solve this we should converge on a common schema specification for Vector's configuration. JSON Schema jumps out as the winner since it is easily understood by humans, parseable, extensible, and supported by many different languages and tools. We can achieve this in a couple of ways:

  1. Derive a JSON schema from our cue definitions.
  2. Manually maintain a JSON schema and incorporate that into our cue definitions.
  3. Any others?

I prefer 1 since cue is much more flexible. It allows us to reduce boilerplate, incorporate stricter validation, etc. I think we should consider decoupling our reference cue definitions from the website and defining a separate library. This library's single purpose would be to expose Vector's internal configuration schema in a purpose-agnostic format. Our cue data for documentation could then include this library and augment it as necessary.

@binarylogic added the type: enhancement, domain: config, and domain: external docs labels on Sep 11, 2021
@binarylogic changed the title from "Drive Vector's configuration with a JSON schema definition" to "Drive Vector's configuration decoding with a JSON schema definition" on Sep 11, 2021
@spencergilbert
Contributor

IIRC cue can't directly export json schema today - would we be looking to upstream support, or do some intermediary transformations with other tools?

@lukesteensen
Member

Very much agree with the problem statements here. What's not addressed in the proposal is how to ensure the actual Rust code remains in sync with the specification. Given that we want a single source of truth, that leaves our options as either (1) generating Rust code from an external source of truth (e.g. JSON Schema, cue, etc), or (2) generating those external representations from the Rust code.

There's prior art for going in either direction (e.g. quicktype and the schemars crate), but both have meaningful downsides. Generating Rust structs would make them less explicit in the code, less ergonomic to work with, etc. Generating JSON schemas from hand-written Rust would require a compile-and-run cycle of Vector to update the derived definitions for use with things like the website, which is a big dependency.

My biased opinion is to lean towards generating JSON schemas from the Rust code. The real source of truth is how Vector behaves, which is governed by the code. Making that code generated would obscure it, making it harder to debug, onboard contributors, etc. And if we really want to ensure that the behavior is always in line with an external definition, it'd be hard to avoid that compile step anyway.
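
To make the "generate the schema from the Rust code" direction a bit more concrete, here is a minimal, standard-library-only sketch. The `Configurable` trait, `FieldSchema`, `json_schema`, and `ExampleSinkConfig` are all hypothetical names made up for illustration; a real implementation would more likely build on a crate such as schemars, and the trait impl would be generated rather than hand-written as it is here.

```rust
/// Hypothetical description of one configuration field.
struct FieldSchema {
    name: &'static str,
    json_type: &'static str,
    required: bool,
}

/// Hypothetical trait that generated code could implement for each config struct.
trait Configurable {
    fn fields() -> Vec<FieldSchema>;
}

/// Renders a minimal JSON Schema object for `T` as a string.
fn json_schema<T: Configurable>() -> String {
    let fields = T::fields();
    let properties: Vec<String> = fields
        .iter()
        .map(|f| format!("\"{}\":{{\"type\":\"{}\"}}", f.name, f.json_type))
        .collect();
    let required: Vec<String> = fields
        .iter()
        .filter(|f| f.required)
        .map(|f| format!("\"{}\"", f.name))
        .collect();
    format!(
        "{{\"type\":\"object\",\"properties\":{{{}}},\"required\":[{}]}}",
        properties.join(","),
        required.join(",")
    )
}

/// Illustrative config struct; the fields are assumptions for this example.
#[allow(dead_code)]
struct ExampleSinkConfig {
    bucket: String,
    key_prefix: Option<String>,
}

// Hand-written here; in practice this impl is what would be generated.
impl Configurable for ExampleSinkConfig {
    fn fields() -> Vec<FieldSchema> {
        vec![
            FieldSchema { name: "bucket", json_type: "string", required: true },
            FieldSchema { name: "key_prefix", json_type: "string", required: false },
        ]
    }
}

fn main() {
    println!("{}", json_schema::<ExampleSinkConfig>());
}
```

Because the schema is derived from the same type the program decodes into, the two cannot drift apart without a code change, which is the property this comment argues for.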

Probably the nicest option for generating JSON schemas from our config structs would be to follow the pattern of procedural macros like serde and schemars, but build our own layer on top. That'd be a little bit of work, but you can imagine a final definition looking something like the following:

```rust
// Imagined derive macro; `VectorConfig` does not exist yet.
#[derive(VectorConfig)]
pub struct S3SinkConfig {
    pub bucket: String,
    pub key_prefix: Option<Template>,
    pub options: S3Options,
    pub region: RegionOrEndpoint,
    pub encoding: EncodingConfig<Encoding>,
    // Default applied when the option is omitted.
    #[default("gzip")]
    pub compression: Compression,
    pub batch: BatchConfig,
    pub request: TowerRequestConfig,
    // Deprecation metadata: when it was deprecated and what replaces it.
    #[deprecated(since = "0.15.0", replacement = "auth")]
    pub assume_role: Option<String>,
    pub auth: AwsAuthentication,
}
```

This would let us have a library of common behaviors in terms of nesting, notification for deprecated fields, parsing human-friendly units for options like counts and byte sizes, etc. It could also get rid of some of our existing boilerplate like inventory::submit, and maybe clean up our currently quite hacky version of env var interpolation.
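
One of the shared behaviors mentioned above, parsing human-friendly units, could look roughly like the following. This is only a sketch: the function name `parse_byte_size` and the set of accepted suffixes (`B`, `KiB`, `MiB`, `GiB`, case-insensitive, integers only) are assumptions chosen for the example, not Vector's actual rules.

```rust
/// Parses sizes such as "512", "2kib", or "10 MiB" into a byte count.
/// The accepted suffixes are an assumption made for this illustration.
fn parse_byte_size(input: &str) -> Result<u64, String> {
    let s = input.trim();
    // Split at the first character that is not an ASCII digit.
    let split = s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len());
    let (digits, unit) = s.split_at(split);

    let value: u64 = digits
        .parse()
        .map_err(|_| format!("invalid number in `{}`", input))?;

    let multiplier: u64 = match unit.trim().to_ascii_lowercase().as_str() {
        "" | "b" => 1,
        "kib" => 1 << 10,
        "mib" => 1 << 20,
        "gib" => 1 << 30,
        other => return Err(format!("unknown unit `{}` in `{}`", other, input)),
    };

    // Reject values that do not fit in a u64 instead of wrapping silently.
    value
        .checked_mul(multiplier)
        .ok_or_else(|| format!("`{}` overflows u64", input))
}

fn main() {
    for raw in ["512", "10 MiB", "2kib", "10 parsecs"] {
        match parse_byte_size(raw) {
            Ok(bytes) => println!("{} -> {} bytes", raw, bytes),
            Err(err) => println!("{} -> error: {}", raw, err),
        }
    }
}
```

Putting this in one shared helper means every option that takes a size would accept the same spellings and report the same errors.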

And most importantly, we could either delegate to something like schemars or generate our own implementation of the logic to output a JSON schema that aligns with how that struct will be parsed (which would be a great proptest target btw, with the property being anything that satisfies the schema should be parsed and anything that doesn't should fail).
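
The property described here (the schema accepts an input exactly when the parser does) can be sketched without any dependencies. The example below checks a fixed list of candidates rather than generated ones, so it is a simplified stand-in for a real proptest target; `SCHEMA_ENUM`, `schema_accepts`, `parse_compression`, and `disagreements` are hypothetical names, and the allowed values are assumed for illustration.

```rust
#[derive(Debug, PartialEq)]
enum Compression {
    None,
    Gzip,
}

/// Values a (hypothetical) generated schema would list under `enum`.
const SCHEMA_ENUM: &[&str] = &["none", "gzip"];

/// Stand-in for schema validation of a single string option.
fn schema_accepts(value: &str) -> bool {
    SCHEMA_ENUM.iter().any(|allowed| *allowed == value)
}

/// Stand-in for the real decoding step.
fn parse_compression(value: &str) -> Result<Compression, String> {
    match value {
        "none" => Ok(Compression::None),
        "gzip" => Ok(Compression::Gzip),
        other => Err(format!("unknown compression `{}`", other)),
    }
}

/// Returns every candidate on which the schema and the parser disagree.
fn disagreements<'a>(candidates: &[&'a str]) -> Vec<&'a str> {
    candidates
        .iter()
        .copied()
        .filter(|c| schema_accepts(c) != parse_compression(c).is_ok())
        .collect()
}

fn main() {
    let candidates = ["none", "gzip", "zstd", "", "GZIP"];
    let bad = disagreements(&candidates);
    if bad.is_empty() {
        println!("schema and parser agree on all {} inputs", candidates.len());
    } else {
        println!("schema and parser disagree on: {:?}", bad);
    }
}
```

With a property-testing crate, the fixed candidate list would be replaced by generated inputs, but the assertion stays the same: the list of disagreements must be empty.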

I'm not entirely set on the idea given the complexity and downsides, but I do think it has enough potential to be worth some discussion.

@jszwedko
Member

jszwedko commented Apr 8, 2022

Closing this in lieu of #12141. This is more of a problem statement.
