new schema.yml syntax #790

Closed
drewbanin opened this issue Jun 12, 2018 · 6 comments
Labels
dbt-docs [dbt feature] documentation site, powered by metadata artifacts

@drewbanin
Contributor

Related: #375

schema.yml files currently exist solely to specify schema tests for models. The schema.yml syntax should be extended to account for:

  • model metadata
    • markdown descriptions
  • column definitions
    • markdown description
    • type
    • extras (column encodings for redshift / bq partition keys?)
  • tests, as before

Bonus:

  • schemas should be able to "extend" other schemas, eg. snowplow_sessions <- snowplow_sessions_tmp

Proposed syntax:

snowplow_sessions:
    comment: "A table of sessions sourced from snowplow events"

    options:
        strict: True
        extends: snowplow_sessions_tmp

    columns:
        - name: session_id
          comment: "The unique id for the session"
          tests:
              - unique
              - not_null
              - relationships:
                    to: snowplow_page_views
                    field: session_id

options

strict
options: True | False
default: False

If true, the columns specified in the columns section must match the actual columns in the model in the database. If there is a mismatch (either too many, or not enough columns), then an error will be raised. If false, then the check will not occur.
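As a rough illustration of the strict check described above, here is a minimal sketch. The function name `check_strict` and the plain-list inputs are assumptions made for this example; they are not part of dbt.

```python
def check_strict(documented, actual):
    """Compare documented column names against the model's real columns.

    Returns a list of error messages; an empty list means the two sets match.
    Both arguments are plain iterables of column names (an assumption for
    this sketch, not an existing dbt interface).
    """
    documented_set, actual_set = set(documented), set(actual)
    errors = []
    # Columns listed in schema.yml that the database relation does not have.
    for name in sorted(documented_set - actual_set):
        errors.append(f"documented column '{name}' does not exist in the model")
    # Columns in the database relation that schema.yml does not mention.
    for name in sorted(actual_set - documented_set):
        errors.append(f"model column '{name}' is not documented")
    return errors
```

With `strict: False` the caller would simply skip this function.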

extends:
options: null | model name | list<model name>
default: null

If a model name is provided, then this model will "inherit" the schema from the parent model. This will entail copying over descriptions, column definitions, strictness, etc. This will be exceedingly useful for "chains" of models which share a similar schema, as duplicating the documentation would be both time-consuming and error-prone.
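One possible reading of the inheritance described above, sketched over already-parsed schema entries. The function `resolve_schema` and the merge rules (child values override parent values; later parents override earlier ones) are assumptions for illustration, since the proposal does not pin them down.

```python
def resolve_schema(name, schemas, _seen=()):
    """Return the schema for `name` with everything inherited via `extends`.

    `schemas` maps model name -> parsed schema.yml entry. The merge rules
    here are hypothetical: the child wins over its parents.
    """
    if name in _seen:
        raise ValueError("circular extends: " + " -> ".join(_seen + (name,)))
    entry = schemas[name]
    options = entry.get("options", {})

    # `extends` may be null, a single model name, or a list of model names.
    parents = options.get("extends")
    if parents is None:
        parents = []
    elif isinstance(parents, str):
        parents = [parents]

    merged = {"comment": None, "strict": False, "columns": {}}
    for parent in parents:
        inherited = resolve_schema(parent, schemas, _seen + (name,))
        if inherited["comment"] is not None:
            merged["comment"] = inherited["comment"]
        merged["strict"] = inherited["strict"]
        for col_name, col in inherited["columns"].items():
            merged["columns"][col_name] = dict(col)

    # Overlay the model's own values on top of whatever it inherited.
    if "comment" in entry:
        merged["comment"] = entry["comment"]
    if "strict" in options:
        merged["strict"] = options["strict"]
    for col in entry.get("columns", []):
        base = merged["columns"].get(col["name"], {})
        merged["columns"][col["name"]] = {**base, **col}
    return merged
```

Under these rules, `snowplow_sessions` would only need to list the columns whose documentation or tests differ from `snowplow_sessions_tmp`.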

comments

Comments can either be long-form, unstructured Markdown, or they can contain a ref to a documentation node. These documentation nodes will live in markdown files, wrapped in docs blocks, eg:

{% docs model.snowplow_sessions %}

### Lorem ipsum
- dolor sit amet
- consectetur adipiscing elit
- sed do eiusmod

{% enddocs %}

This block will serve a few purposes:

  1. typing markdown inside of yaml is terrible
  2. putting these blocks in .md files will make text editors behave sanely
  3. the docs definitions can be referenced in multiple places, eg. for a column that appears in many models

This is a super natural use case for jinja. I can totally imagine writing macros to render tables, enforce docs guidelines, render links, etc etc etc.
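To make the docs blocks above concrete, here is a small sketch that collects them from the text of a markdown file. It uses a regular expression purely for illustration; a real implementation would presumably go through the Jinja parser so that macros work. The name `extract_docs` is hypothetical.

```python
import re

# Matches {% docs some.name %} ... {% enddocs %}, capturing the name and body.
DOCS_BLOCK = re.compile(
    r"\{%\s*docs\s+(?P<name>[\w.]+)\s*%\}(?P<body>.*?)\{%\s*enddocs\s*%\}",
    re.DOTALL,
)


def extract_docs(text):
    """Return a mapping of docs-block name -> markdown body (whitespace-trimmed)."""
    return {
        match.group("name"): match.group("body").strip()
        for match in DOCS_BLOCK.finditer(text)
    }
```

A comment in schema.yml could then refer to an entry by name, which is what allows one description to be reused across many models.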

Implementation

Each entry in the schema.yml files should be munged into the same JSON schema used for catalog entries. The two are very similar: they have comments, a list of columns, and those columns have names / types / etc. If we keep the data structures similar, then it should be easy to overlay the schema and catalog data on top of the manifest data for dbt docs purposes.

We should preserve backwards compatibility for schema tests either by 1) adding a version number header or 2) just continuing to parse the constraints section of the old schema.yml files.
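A sketch of how option 1 could work on an already-parsed file. It assumes the header is a top-level `version` key, that versioned files keep their entries under a `models` list, and that files without the key use the old layout (model names at the top level, each with a `constraints` section). The function name `split_schema_file` is made up for this example.

```python
def split_schema_file(contents):
    """Return (version, entries) for an already-parsed schema.yml mapping.

    Files without a `version` key are treated as the legacy format, where
    every top-level key is a model name with a `constraints` section.
    """
    if "version" not in contents:
        legacy = {
            name: (body or {}).get("constraints", {})
            for name, body in contents.items()
        }
        return 1, legacy

    version = contents["version"]
    if version != 2:
        raise ValueError(f"unsupported schema.yml version: {version!r}")
    # Versioned layout (assumed): a list of entries, each carrying its own name.
    return version, {model["name"]: model for model in contents.get("models", [])}
```

One caveat of keying on `version`: a legacy file containing a model literally named `version` would be misread, which is an argument for the explicit header over silent detection.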

@drewbanin drewbanin added this to the Betsy Ross (unreleased) milestone Jun 12, 2018
@drewbanin drewbanin added the dbt-docs [dbt feature] documentation site, powered by metadata artifacts label Jun 12, 2018
@drewbanin drewbanin mentioned this issue Jun 26, 2018
@drewbanin
Contributor Author

cc @jthandy @cmcarthur

@jthandy
Member

jthandy commented Jun 27, 2018

strict

the mechanism you're proposing to test (count of columns) doesn't feel right--seems like each individual column should be validated for existence if we're going to have this option at all. i also don't feel like this is something that must be prioritized for the initial release.

extends

are we planning on implementing this in the near-term? i love the idea but am just worried that it adds near-term complexity.

comment

i would propose not calling this comment, but rather docs. i'd like to be consistent from the beginning with how we're referring to documentation throughout dbt.

@drewbanin
Contributor Author

@jthandy sure, for strict, I meant more that both cases will be checked: documenting a column that does not exist, or failing to document a column that does exist, will result in an error.

I spoke with @cmcarthur about the implementation for extends, and we probably need to use networkx to actually traverse the whole extends graph. I want to make sure the decisions / implementation we choose now make it feasible to add extends in the future, but I agree, it might not be suitable for v1.
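For a sense of what traversing the extends graph involves, here is a minimal sketch that orders models so every parent is resolved before its children. It swaps networkx for the standard library's `graphlib` (Python 3.9+) to stay dependency-free; the function name `extends_order` and the input shape are assumptions.

```python
from graphlib import CycleError, TopologicalSorter


def extends_order(extends):
    """Order models so that each parent comes before the models extending it.

    `extends` maps a model name to the list of models it extends.
    Raises ValueError if the extends relationships form a cycle.
    """
    sorter = TopologicalSorter(extends)
    try:
        return list(sorter.static_order())
    except CycleError as exc:
        # graphlib reports the offending cycle as the second exception argument.
        raise ValueError(f"circular extends chain: {exc.args[1]}") from exc
```

The same ordering could be produced with `networkx.topological_sort` on a directed graph of parent-to-child edges.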

I feel the same way about comment. I think docs might not be exactly right though... maybe description? To me, docs are comprised of a description, tests, lineage, sample data, etc.

@cmcarthur
Member

👍 for description

@drewbanin when we discussed this yesterday we came up with a very different looking schema.yml format, can you post the updated structure here (with models and sources)?

@jthandy
Member

jthandy commented Jun 27, 2018

👍 for description

@drewbanin
Contributor Author

After speaking with @cmcarthur, we're going to scope these schema definitions under a models: key, and we're also going to require a version indicator. That will look like:

version: 2

models:
  - name: events
    description: "a description..."

    columns:
        - name: event_time
          description: "def"
          tests:
              - primary_key
              - unique

sources:
  - name: snowplow
    description: "Snowplow dataset"
    tables:
        - name: snowplow_event_2
          description: An immutable log of events collected by Snowplow
          sql_table_name: snowplow.web_page
          columns:
            - name: collector_tstamp
              description: Timestamp for the event recorded by the collector

See #814 for more information on the sources section. This change will allow sources and schemas to conceivably live in the same file. @jthandy and I also discussed potentially renaming tests to properties, as this section could eventually include things that are not strictly tests.
