Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/add interval arg to values every n datepart #110

Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
19 commits
Select commit Hold shift + click to select a range
74441b7
add `interval` argument
lewisarmistead Sep 28, 2021
894ceec
update docs for explaining new `interval` arg
lewisarmistead Sep 28, 2021
42a8dc8
expect_table_columns_to_match_ordered_list: refactor row_number to us…
clausherther Sep 28, 2021
cd25dea
handle data types for `mod` and incorporate windowing
lewisarmistead Oct 5, 2021
2028910
Add support for dbt 0.21 (#116)
clausherther Oct 6, 2021
6114264
Update README.md
clausherther Oct 6, 2021
e42393a
remove unintentional styling changes
lewisarmistead Oct 7, 2021
b3c5231
Fix expect_row_values_to_have_data_for_every_n_datepart errors when b…
jeremyyeo Oct 8, 2021
ed0b4bd
add `interval` argument
lewisarmistead Sep 28, 2021
ab08f6f
update docs for explaining new `interval` arg
lewisarmistead Sep 28, 2021
771a6d8
handle data types for `mod` and incorporate windowing
lewisarmistead Oct 5, 2021
de94e83
remove unintentional styling changes
lewisarmistead Oct 7, 2021
e46833e
Merge branch 'feature/add_interval_arg_to_values_every_n_datepart' of…
clausherther Oct 8, 2021
dc449e2
Change datepart param and fix formatting
clausherther Oct 8, 2021
9f10973
Reformat join to match prior style
clausherther Oct 8, 2021
b9a6082
simplify tie-out of model data to spine with interval truncation
lewisarmistead Oct 26, 2021
0c8b21a
Add schema test
clausherther Oct 26, 2021
e505eb2
replace calls to subquery with calls directly to columns in `model_da…
lewisarmistead Oct 27, 2021
66c4243
add comments/examples for new interval additions
lewisarmistead Nov 9, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 10 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,13 @@
# dbt-expectations v0.4.3

## Fixes
* Fixes incompatibility on Snowflake with use of `row_number()` without `order by` in `expect_table_columns_to_match_ordered_list`([#112](https://github.com/calogica/dbt-expectations/pull/112))

## Features

## Under the hood
* Supports dbt 0.21.x

# dbt-expectations v0.4.2

## Features
Expand Down
12 changes: 10 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@

## Install

`dbt-expectations` currently supports `dbt 0.20.x`.
`dbt-expectations` currently supports `dbt 0.21.x`.

Check [dbt Hub](https://hub.getdbt.com/calogica/dbt_expectations/latest/) for the latest installation instructions, or [read the docs](https://docs.getdbt.com/docs/package-management) for more information on installing packages.

Expand Down Expand Up @@ -914,7 +914,13 @@ tests:

Expects model to have values for every grouped `date_part`.

For example, this tests whether a model has data for every `day` (grouped on `date_col`) from either a specified `start_date` and `end_date`, or for the `min`/`max` value of the specified `date_col`.
For example, this tests whether a model has data for every `day` (grouped on `date_col`) between either:

- The `min`/`max` value of the specified `date_col` (default).
- A specified `test_start_date` and/or `test_end_date`.
- if `test_start_date` or `test_end_date` are not specified, `min`/`max` of `date_col` are used, respectively

Note: `test_end_date` is exclusive (e.g. a test with `test_end_date` value of `'2020-01-05'` will pass for a model's `max` `date_col` of `'2021-01-04'`).

*Applies to:* Model, Seed, Source

Expand All @@ -924,3 +930,5 @@ tests:
date_col: date_day
date_part: day
```

The `interval` argument will optionally group `date_part` by a given integer to test data presence at a lower granularity, e.g. adding `interval: 7` to the example above will test whether a model has data for each 7-`day` period instead of for each `day`.
2 changes: 1 addition & 1 deletion dbt_project.yml
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
name: 'dbt_expectations'
version: '0.4.0'

require-dbt-version: [">=0.20.0", "<0.21.0"]
require-dbt-version: [">=0.20.0", "<0.22.0"]
config-version: 2

target-path: "target"
Expand Down
4 changes: 4 additions & 0 deletions integration_tests/models/schema_tests/schema.yml
Original file line number Diff line number Diff line change
Expand Up @@ -68,6 +68,10 @@ models:
- dbt_expectations.expect_row_values_to_have_data_for_every_n_datepart:
date_col: date_day
date_part: day
- dbt_expectations.expect_row_values_to_have_data_for_every_n_datepart:
date_col: date_day
date_part: day
interval: 7

columns:
- name: date_day
Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
{%- test expect_row_values_to_have_data_for_every_n_datepart(model,
date_col,
date_part="day",
interval=None,
row_condition=None,
test_start_date=None,
test_end_date=None) -%}
Expand All @@ -21,11 +22,11 @@

{% endset %}

{% endif %}
{%- set dr = run_query(sql) -%}
{%- set db_start_date = dr.columns[0].values()[0].strftime('%Y-%m-%d') -%}
{%- set db_end_date = dr.columns[1].values()[0].strftime('%Y-%m-%d') -%}

{%- set dr = run_query(sql) -%}
{%- set db_start_date = dr.columns[0].values()[0].strftime('%Y-%m-%d') -%}
{%- set db_end_date = dr.columns[1].values()[0].strftime('%Y-%m-%d') -%}
{% endif %}

{% if not test_start_date %}
{% set start_date = db_start_date %}
Expand All @@ -42,12 +43,53 @@
with base_dates as (

{{ dbt_date.get_base_dates(start_date=start_date, end_date=end_date, datepart=date_part) }}
{% if interval %}
{#
Filter the date spine created above down to the interval granularity using a modulo operation.
The number of date_parts after the start_date divided by the integer interval will produce no remainder for the desired intervals,
e.g. for 2-day interval from a starting Jan 1, 2020:
params: start_date = '2020-01-01', date_part = 'day', interval = 2
date spine created above: [2020-01-01, 2020-01-02, 2020-01-03, 2020-01-04, 2020-01-05, ...]
The first parameter to the `mod` function would be the number of days between the start_date and the spine date, i.e. [0, 1, 2, 3, 4 ...]
The second parameter to the `mod` function would be the integer interval, i.e. 2
This modulo operation produces the following remainders: [0, 1, 0, 1, 0, ...]
Filtering the spine only where this remainder == 0 will return a spine with every other day as desired, i.e. [2020-01-01, 2020-01-03, 2020-01-05, ...]
#}
where mod(
cast({{ dbt_utils.datediff("'" ~ start_date ~ "'", 'date_' ~ date_part, date_part) }} as {{ dbt_utils.type_int() }}),
cast({{interval}} as {{ dbt_utils.type_int() }})
) = 0
{% endif %}

),
model_data as (

select
{% if not interval %}

cast({{ dbt_utils.date_trunc(date_part, date_col) }} as {{ dbt_expectations.type_datetime() }}) as date_{{ date_part }},

{% else %}
{#
Use a modulo operator to determine the number of intervals that a date_col is away from the interval-date spine
and subtracts that amount to effectively slice each date_col record into its corresponding spine bucket,
e.g. given a date_col of with records [2020-01-01, 2020-01-02, 2020-01-03, 2020-01-11, 2020-01-12]
if we want to slice these dates into their 2-day buckets starting Jan 1, 2020 (start_date = '2020-01-01', date_part='day', interval=2),
the modulo operation described above will produce these remainders: [0, 1, 0, 0, 1]
subtracting that number of days from the observations will produce records [2020-01-01, 2020-01-01, 2020-01-03, 2020-01-11, 2020-01-11],
all of which align with records from the interval-date spine
#}
{{dbt_utils.dateadd(
date_part,
"mod(
cast(" ~ dbt_utils.datediff("'" ~ start_date ~ "'", date_col, date_part) ~ " as " ~ dbt_utils.type_int() ~ " ),
cast(" ~ interval ~ " as " ~ dbt_utils.type_int() ~ " )
) * (-1)",
"cast( " ~ dbt_utils.date_trunc(date_part, date_col) ~ " as " ~ dbt_expectations.type_datetime() ~ ")"
)}} as date_{{ date_part }},

{% endif %}

count(*) as row_cnt
from
{{ model }} f
Expand All @@ -58,6 +100,7 @@ model_data as (
date_{{date_part}}

),

final as (

select
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@

{% for col_name in relation_column_names %}
select
{{ loop.index }} as relation_column_idx,
'{{ col_name }}' as relation_column
{% if not loop.last %}union all{% endif %}
{% endfor %}
Expand All @@ -15,34 +16,17 @@

{% for col_name in column_list %}
select
{{ loop.index }} as input_column_idx,
'{{ col_name }}' as input_column

{% if not loop.last %}union all{% endif %}
{% endfor %}
),
relation_columns_sequence as (
-- (BigQuery won't let you use a window function without a "from" clause)
select
relation_column,
row_number() over() as relation_column_idx
from
relation_columns

),
input_columns_sequence as (
-- (BigQuery won't let you use a window function without a "from" clause)
select
input_column,
row_number() over() as input_column_idx
from
input_columns

)
select *
from
relation_columns_sequence r
relation_columns r
full outer join
input_columns_sequence i on r.relation_column = i.input_column and r.relation_column_idx = i.input_column_idx
input_columns i on r.relation_column = i.input_column and r.relation_column_idx = i.input_column_idx
where
-- catch any column in input list that is not in the sequence of table columns
-- or any table column that is not in the input sequence
Expand Down