
strict field validation for schema.yml #1570

Closed · jwerderits opened this issue Jun 25, 2019 · 26 comments
Labels: stale (Issues that have gone stale)

@jwerderits (Contributor)
Issue: field validation for schema.yml

Issue description

Invalid/nonexistent field names can have descriptions in schema.yml that are populated in the documentation.
If a field is deleted, but was previously documented (correctly), the docs will indicate that that field still exists.

Fix

Apply column validation for fields that don't have tests applied to them

Steps to reproduce

You can create any column name and add a description for it in schema.yml, and it will be populated in the documentation:

    columns:
        - name: deleted_column_name
          description: this field has been deleted
@drewbanin (Contributor)

Hey @jwerderits - thanks for making this issue! How do you think dbt should handle this in practice? Do you think a dbt run should fail? Or dbt docs generate? Or something else?

@jwerderits (Contributor, Author)

@drewbanin I think it makes the most sense for dbt test to fail on this condition, because it is already involved with schema validation. Encapsulating the logic there surfaces a warning that there are errors within the project without causing a run or docs generate to fail. It might also be useful for dbt docs generate to help find where the erroneous field(s) exist.

@drewbanin drewbanin changed the title field validation for schema.yml strict field validation for schema.yml Jul 1, 2019
@jack-arthurton-cko

This would be really useful! Would it be possible to extend the validation to check for the opposite scenario, columns which exist in the model but don't have an entry in the schema.yml file?

@drewbanin (Contributor)

@JackArthurton yeah! I think that's a great idea. I'm imagining that this would be opt-in, so you could annotate a schema.yml specification with strict: true (or similar), which would throw an error if the schema specification and the model are mismatched.
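For illustration only, a sketch of what such an opt-in annotation might look like (strict is hypothetical here, not an existing dbt property):

models:
  - name: my_model
    strict: true
    columns:
      - name: column_1
        description: must match a column that actually exists in the model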

@mattm commented Jul 30, 2019

Drew pointed me to this issue when I asked about this on Slack.

My question was:

Is there a way to configure tests to identify columns that are in the results but not covered by a schema test?

Love the strict: true idea - that seems like it would do the trick.

@stumelius

Any progress regarding field validation?

I'd like to be able to define the schema like this:

models:
  - name: my_model
    columns:
      - name: column_1
        tests:
          - exists

@drewbanin (Contributor)

hey @smomni - we haven't prioritized this one yet! If you're in a pinch, you can actually define your own custom schema test in your dbt project: https://docs.getdbt.com/docs/custom-schema-tests

If you create a macro in your macros/ directory like this:

{% macro test_exists(model, column_name) %}

    select count({{ column_name }})
    from {{ model }}
    where 1=0
    limit 1

{% endmacro %}

Then you'll be able to assert that columns exist with:

models:
  - name: my_model
    columns:
      - name: column_1
        tests:
          - exists

I think the version of this that we add to dbt natively will be a little bit smarter than this. We can just run a single query to find all of the columns in a table, then check them against the columns in the schema file. The schema test I shared above will run one query per column, which probably isn't optimal.
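For reference, a rough sketch of that single-query approach (this is not dbt's built-in behaviour; the test name matches_schema_columns is made up here, and it assumes the model can be found in the graph):

{% test matches_schema_columns(model) %}
    {% if execute %}
        {# one metadata call for the columns that actually exist in the warehouse #}
        {% set actual = adapter.get_columns_in_relation(model) | map(attribute="name") | map("lower") | list %}

        {# columns documented in schema.yml, looked up from the dbt graph #}
        {% set node = graph.nodes.values()
            | selectattr("resource_type", "equalto", "model")
            | selectattr("name", "equalto", model.identifier)
            | first %}
        {% set documented = node.columns.keys() | map("lower") | list %}

        {# emit one row per documented-but-missing column, so any mismatch fails the test #}
        {% for col in documented if col not in actual %}
            select '{{ col }}' as missing_column {% if not loop.last %}union all{% endif %}
        {% else %}
            select 'none' as missing_column where 1=0
        {% endfor %}
    {% endif %}
{% endtest %}

Applied at the model level under tests: in schema.yml, it would fail whenever a documented column no longer exists in the table.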

@mjhoshea

Hey guys, this looks like it could be a really useful idea. For some context: we want to create a self-service ELT pipeline whereby people can create their own datasets from upstream sources. To mitigate the risk of undocumented datasets entering the warehouse, it would be nice to require people to document newly created tables and views with descriptions.

The ideal implementation would be a project-level flag that requires schemas to have descriptions. This would then be used as part of CI/CD to stop undocumented schemas getting to production.

@ArafathC commented May 6, 2020

(quoting @drewbanin's custom exists schema test from above)

This approach doesn't fail if the column is not present in the table.

@drewbanin (Contributor)

@ArafathC are you sure about that? The query should fail because the specified column does not exist in the table. I'm curious if you could elaborate on what you mean.

@ArafathC commented May 8, 2020

@drewbanin Sorry about that. I was able to double-check and confirm it works.

@github-actions (bot)

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

The github-actions bot added the stale (Issues that have gone stale) label on Dec 21, 2021.
@alexrosenfeld10 (Contributor) commented Feb 20, 2022

Hey @drewbanin any activity here? I would love to make use of such a feature.

Internally, I wrote an app that does all this kind of checking on our existing reporting warehouse, which doesn't use dbt. I'm migrating it over to use dbt, and this is definitely a gap we're seeing. Thanks in advance for any updates / help

@alexrosenfeld10 (Contributor)

Also, side note: I do think there's an error in the macro posted above. It needs to be limit 0, because otherwise it will return one row with a value of 0, and the fact that it returns rows means the test will fail.
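Applying that suggested fix to the earlier macro, as a sketch (whether the original passes depends on dbt version; newer releases fail a test on any returned rows):

{% macro test_exists(model, column_name) %}

    {# returns zero rows when the column exists, and errors if it does not #}
    select count({{ column_name }})
    from {{ model }}
    where 1=0
    limit 0

{% endmacro %}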

@alexrosenfeld10 (Contributor)

Just found this: https://github.com/calogica/dbt-expectations/tree/0.5.1/#table-shape. It's pretty awesome and gets me most of what I need here. 👍
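For anyone landing here, a sketch of how those table-shape tests can be wired up (model and column names are illustrative, and the test names are taken from that package's table-shape section; requires dbt-expectations installed as a package):

models:
  - name: my_model
    tests:
      - dbt_expectations.expect_table_columns_to_match_set:
          column_list: ["column_1", "column_2"]
    columns:
      - name: column_1
        tests:
          - dbt_expectations.expect_column_to_exist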

@benhinchley

This is what I've been using to check if all the columns defined exist.

{% test columns_exist(model) %}
    {% if execute %}
        {% set models = [] %}
        {% for node in graph.nodes.values() | selectattr("resource_type", "equalto", "model") | selectattr("name", "equalto", model.identifier) %}
            {% do models.append(node) %}
        {% endfor %}

        {% if models|count != 1 %}
            {{ exceptions.raise_compiler_error(model.identifier ~ " not found in graph") }}
        {% endif %}

        {% set test_model = models|first %}

        with column_check as (
            select
            {% for column_name in test_model.columns %}
                {{ column_name }},
            {% endfor %}
            from {{ model }}
            where 1=0
            limit 1
        )

        select * from column_check
    {% endif %}
{% endtest %}
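For context, a test defined this way (taking only model) would be attached at the model level in schema.yml, roughly like this (the model name is illustrative):

models:
  - name: my_model
    tests:
      - columns_exist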

@yummydum commented Apr 9, 2022

I think this feature would be very useful as well. It would be very nice if I could enforce the following:

  • Every column in the yaml file exists in the database
  • All columns in the database are in the yaml file
  • There is a description on the column

Are there plans to prioritize this one, @drewbanin? It seems like dbt-expectations does not cover this use case.

@schylarbrock commented May 16, 2022

@benhinchley Thank you for the help! One quick thing: I think you will get an error from the trailing comma here (for the last column). I believe it should be updated to (edit on line 17):

{% test columns_exist(model) %}
    {% if execute %}
        {% set models = [] %}
        {% for node in graph.nodes.values() | selectattr("resource_type", "equalto", "model") | selectattr("name", "equalto", model.identifier) %}
            {% do models.append(node) %}
        {% endfor %}

        {% if models|count != 1 %}
            {{ exceptions.raise_compiler_error(model.identifier ~ " not found in graph") }}
        {% endif %}

        {% set test_model = models|first %}

        with column_check as (
            select
            {% for column_name in test_model.columns %}
                {{ column_name }} {% if not loop.last %},{% endif %}
            {% endfor %}
            from {{ model }}
            where 1=0
            limit 1
        )

        select * from column_check
    {% endif %}
{% endtest %}

@benhinchley

(quoting @schylarbrock's trailing-comma fix above)

Ahh yep, I'm predominantly using BigQuery these days, where trailing commas in the select statement are allowed.

@elyobo commented Jun 22, 2022

Agree with all of @yummydum's suggestions, with the aim that the schema be complete (all columns present and described) and accurate (at least in that it doesn't document columns that do not exist). It looks like @benhinchley's macro ensures the latter part of this (documented columns must exist) but there doesn't seem to be a solution for the other requirements in here.

Is it worth reopening the issue (seems to have been closed by a bot, not closed as fixed or will not fix)?

@Klimmy commented Apr 7, 2023

@benhinchley, @schylarbrock, thank you for the solution with a test code. Very handy!

I've adjusted it to be a singular one (not a generic one). So with this version, we don't need to specify a test for every model in schema.yml.

{% set models = [] %}

{% for node in graph.nodes.values() | selectattr("resource_type", "equalto", "model") %}
    {% if node.columns|count > 0 %}

        {% do models.append(node) %}

    {% endif %}
    
{% endfor %}

{% for model in models %}
    
    SELECT
    NULL AS placeholder
    FROM {{ model.database ~ "." ~ model.schema ~ "." ~ model.name }}
    WHERE 1=0
    {% for column_name in model.columns %}
        AND {{ column_name }} IS NULL
    {% endfor %}
    {% if not loop.last %}UNION ALL{% endif %}

{% endfor %}
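For anyone reusing this: assuming default project settings, a singular test like the above lives as a .sql file under the project's tests/ directory and runs as part of dbt test.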

@aadarshsingh191198

So, to summarise, there are four possible cases:

  1. Column present in the schema, not present in the model
  2. Column present in the model, not present in the schema
  3. Model present but not listed in the schema
  4. Model listed in the schema but not present

Using the test mentioned by @Klimmy:
a. We can successfully test for 1. The test failed when the schema had columns that the model didn't have.
b. We can't test for 2. The test passed when the schema didn't have columns that the model had.
c. We can't test for 3. The test passed when the schema didn't have a model that the pipeline had.
d. We can't test for 4. The test passed when the schema had a model which the pipeline didn't have.

Now, the interesting part:

d. dbt throws a warning during the first compilation/run for case 4.
c. The {% if node.columns|count > 0 %} check is where case 3 can be caught, so we can throw an exception when it is false: if the model is not listed in the schema, the test will fail (see the sketch below).

This way, I am able to cover 3 out of 4 cases. Please correct me if there is any concern with my approach.
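A minimal sketch of that exception idea, changing only the loop from @Klimmy's snippet (the error message wording is illustrative):

{% for node in graph.nodes.values() | selectattr("resource_type", "equalto", "model") %}
    {% if node.columns|count > 0 %}
        {% do models.append(node) %}
    {% else %}
        {{ exceptions.raise_compiler_error(node.name ~ " has no columns documented in schema.yml") }}
    {% endif %}
{% endfor %}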

A doubt: what I didn't understand about the snippet by @Klimmy is why we need

 {% if node.columns|count > 0 %}

        {% do models.append(node) %}

    {% endif %}

Why can't we just do {% do models.append(node) %} without checking the non-zero-columns condition?

@Infiniverse

Just coming to dbt and hitting this issue. It doesn't sound too hard to resolve. Why has it not been added to dbt Core yet?

@ryanb8 commented Jun 27, 2024

You can also use the contract enforced property to have the model fail at runtime!

https://docs.getdbt.com/reference/resource-configs/contract

schema.yml snippet:

    config:
      contract:
        enforced: true
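A slightly fuller sketch of that setup (model and column names are illustrative): with an enforced contract, every column must be declared with a data_type, and dbt fails the build if the model's result doesn't match the declaration.

models:
  - name: my_model
    config:
      contract:
        enforced: true
    columns:
      - name: column_1
        data_type: string
      - name: column_2
        data_type: int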

@elycenok-wowcorp commented Jul 19, 2024

I've tried enforced:

      - name: random_name
        data_type: string

but the test doesn't fail.

@elycenok-wowcorp commented Jul 20, 2024

OK, problem solved:

  1. Realised the config was overridden in the .sql file, so adding the contract there did the trick:

{{
    config(
        materialized="table",
        alias="....",
        contract={ "enforced": True }
    )
}}

  2. After that, dbt run started to fail due to the missing columns in the YAML file. I generated the missing columns in YAML using the dbt-codegen package, and also found and raised a documentation issue: on hub.getdbt.com you've got an error in the arguments of generate_model_yaml (model_names, not model_name) dbt-codegen#176
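For reference, the dbt-codegen call that generates those column entries looks roughly like this (the model name is illustrative; requires the dbt-codegen package):

dbt run-operation generate_model_yaml --args '{"model_names": ["my_model"]}'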
