Feature/add interval arg to values every n datepart #110
Conversation
Adds an optional interval argument for checking presence every n date_parts instead of every date_part: [expect_row_values_to_have_data_for_every_n_datepart](https://github.com/calogica/dbt-expectations/tree/0.4.2#expect_row_values_to_have_data_for_every_n_datepart)
@lewisarmistead thanks for this PR! However, looks like at least in Postgres, both arguments to … The following config fails for me:

```yaml
- name: timeseries_data_extended
  tests:
    - dbt_expectations.expect_table_columns_to_match_ordered_list:
        column_list: ["date_day", "row_value", "row_value_log"]
    - dbt_expectations.expect_row_values_to_have_data_for_every_n_datepart:
        date_col: date_day
        date_part: day
    - dbt_expectations.expect_row_values_to_have_data_for_every_n_datepart:
        date_col: date_day
        date_part: day
        interval: 2
```
Also tried with
@lewisarmistead btw, I'm not sure I super understand the use case (or the implementation of it). I know you're mostly interested in hourly intervals, but something like …
Re: the type issue above, this may work (I can't see needing non-integer arguments?)
…e loop.index (calogica#112)
* Fixes calogica#111 - refactor row_number to use loop.index
* Update CHANGELOG
@clausherther - apologies for the intermittent attention here, but thanks for the great feedback! I started work on your suggestions and will push up some changes for your review soon.

RE: integer arguments for the …

RE: checking data presence over a period rather than each X interval - I did misunderstand this, and I'm close to getting the proper aggregation developed for the condition when the interval arg is passed. I'll have something for your review soon on that front too.
@lewisarmistead thanks for working on this and contributing in the first place! Let me know if I can help with anything here.
* This test will handle the mod function, which only takes integer arguments, more stably. It also aggregates row counts across intervals when joining on the date spine to correctly detect data presence in the target model
* update conditions based on interval
* update styling
* update styling
Thanks @clausherther - I added some changes reflecting the comments above. I didn't want to slow down execution of the original test, so I favored no-op conditions over readability/concision. Let me know your thoughts!
…oth start and end dates are set (calogica#115)
* fix none when both test dates are set
* Add support for dbt 0.21 (calogica#116)
* Update README.md
* fix none when both test dates are set
* Update README

Co-authored-by: Claus Herther <claus@calogica.com>
```sql
from
    base_date_windows d
    left join model_data f
        on f.date_{{ date_part }} >= d.date_{{ date_part }} and f.date_{{ date_part }} < d.interval_end
```
@lewisarmistead fyi, BigQuery won't let you do a left join like that
```
LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.
```
```sql
base_date_windows d
left join
    model_data f
        on d.date_{{ date_part }} <= f.date_{{ date_part }} and
```
@lewisarmistead FYI, BigQuery won't let you do a left join like this:

```
LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.
```
The condition added to the model_data CTE is meant to emulate (kind of) Snowflake's [`TIME_SLICE`](https://docs.snowflake.com/en/sql-reference/functions/time_slice.html), which should allow exact matches to the base_dates CTE for better time bucketing.
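Snowflake's `TIME_SLICE` snaps each timestamp to the start of its fixed-width bucket, which is what makes an exact-match (equality) join against the date spine possible on warehouses like BigQuery. The following Python sketch illustrates that bucketing idea only; the function name and the day/hour units are illustrative assumptions, not part of the PR:

```python
from datetime import datetime, timedelta

def time_slice(ts, start, interval, date_part="day"):
    """Snap ts to the start of its n-<date_part> bucket, anchored at `start`.

    Mirrors the idea of Snowflake's TIME_SLICE: buckets are half-open
    [bucket_start, bucket_start + interval) windows. Only 'day' and
    'hour' granularities are sketched here.
    """
    unit = {"day": timedelta(days=1), "hour": timedelta(hours=1)}[date_part]
    elapsed = (ts - start) // unit            # whole dateparts since the anchor
    return start + (elapsed // interval) * interval * unit

start = datetime(2021, 1, 1)
# With interval=2 days, Jan 1-2 share a bucket and Jan 3-4 start the next.
print(time_slice(datetime(2021, 1, 2, 13, 30), start, 2))  # 2021-01-01 00:00:00
print(time_slice(datetime(2021, 1, 3), start, 2))          # 2021-01-03 00:00:00
```

Because every timestamp in a bucket maps to the same bucket start, the spine and the model data can then be joined on equality of that single column.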
@clausherther - I recently pushed some changes that I think fix the BigQuery issue and simplify the additions in this PR overall. I wound up spinning up an instance of BigQuery, so I've now tested on Snowflake and BigQuery. Apologies for the wait here, and thanks for the feedback!
Couple of small comments, I think we're almost there! Generally, though: could we add a few comments to explain what the mod and other additions do? I'm afraid a year (or a month?) from now, I might not remember what these sections do. Thanks!!
```sql
{{dbt_utils.dateadd(date_part, 'interval_diff', 'truncated_date')}} as date_{{ date_part }},
count(*) as row_cnt
from (
    select
```
Could we break this out into a CTE instead of a subquery? Generally, we try to stay away from subqueries.
```sql
select
    cast({{ dbt_utils.date_trunc(date_part, date_col) }} as {{ dbt_expectations.type_datetime() }}) as truncated_date,
    mod(
        cast({{dbt_utils.datediff("'"~start_date~"'", date_col, date_part)}} as {{ dbt_utils.type_int() }}),
```
Nit pick: we'll want a space before and after the `~` operator, e.g. `"'" ~ start_date ~ "'"`
```sql
from (
    select
        cast({{ dbt_utils.date_trunc(date_part, date_col) }} as {{ dbt_expectations.type_datetime() }}) as truncated_date,
        mod(
```
Is this mod transform the same as on line 47? If so, I'd like to refactor this in a way that doesn't repeat code.
@clausherther - thanks for the feedback. To address some of the comments you left, I removed the subquery in the model_data CTE and instead passed the columns from that subquery into the …

I like your idea of making the use of these …
```sql
{% macro datetime_interval_diff(start_date, date_col, date_part, interval) %}
mod(
    cast({{ dbt_utils.datediff("'" ~ start_date ~ "'", date_col, date_part) }} as {{ dbt_utils.type_int() }}),
    cast({{interval}} as {{ dbt_utils.type_int() }})
)
{% endmacro %}
```
I'm open to any ideas on this front, so let me know your thoughts!
I'm also noticing upon rereading that creating a macro for how we're using …
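To make the macro's arithmetic concrete: `mod(datediff(start_date, date_col, date_part), interval)` gives each row's offset within its interval-sized bucket, and the offsets reset to 0 at every bucket boundary. A hedged Python sketch of that arithmetic (day granularity only; the names here are mine, not the PR's):

```python
from datetime import datetime, timedelta

START = datetime(2021, 1, 1)

def interval_diff(date_col, interval, start_date=START):
    """mod(datediff(start_date, date_col, 'day'), interval), i.e. the row's
    offset within its `interval`-day bucket. Day granularity only."""
    diff = (date_col - start_date) // timedelta(days=1)  # datediff in days
    return diff % interval

# With interval=3, offsets cycle 0, 1, 2 and reset at each bucket boundary;
# dates with offset 0 are the bucket starts the date spine should match.
offsets = [interval_diff(START + timedelta(days=i), 3) for i in range(7)]
print(offsets)  # [0, 1, 2, 0, 1, 2, 0]
```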
Hi @lewisarmistead! Really appreciate you refactoring this, I think even with the section that concats the mod statement, this looks a lot cleaner to me (we had two uses of mod previously, I think?)
Once we add some comments to the outlined section, I think we're good to go. Thanks!
```sql
cast({{ dbt_utils.date_trunc(date_part, date_col) }} as {{ dbt_expectations.type_datetime() }}) as date_{{ date_part }},
{% else %}
```
Would be great to add comments, along with an example showing sample data, of what this section does.
I added some comments but had some difficulty keeping them brief. Please edit as you see fit, and feel free to reach out with any questions.
Thanks, @clausherther! Let me know if you need anything else.
LGTM!
@lewisarmistead thanks for all the code comments and hanging in there with me as we worked through this!

Happy to contribute - thanks for the feedback, @clausherther!
This PR addresses issue #109 by adding an optional argument for grouping by count of `date_part` in `expect_row_values_to_have_data_for_every_n_datepart`.

On the note from that issue about whether `mod` will work across databases, I saw that dbt Labs uses `mod` in one of their cross-DB utility macros, so I'm assuming it's safe. I verified that the changes run using Snowflake, and I saw that BigQuery, Postgres, Redshift, & Spark all have the same `mod` function as well.
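As an end-to-end illustration of what this test checks, here is a hypothetical Python sketch with made-up data: build bucket starts every `interval` days from a date spine, bucket the observed rows the same way, and flag any bucket that ends up with no rows. All names and the day-only granularity are assumptions for illustration, not the macro's actual implementation:

```python
from datetime import date, timedelta

def missing_buckets(observed, start, end, interval):
    """Bucket-start dates (every `interval` days in [start, end)) with no
    observed rows. A sketch of the spine-join-and-count idea only."""
    day = timedelta(days=1)
    spine = [start + timedelta(days=i)
             for i in range(0, (end - start) // day, interval)]
    def bucket(d):  # snap a date to its bucket start
        return start + timedelta(days=(d - start) // day // interval * interval)
    seen = {bucket(d) for d in observed}
    return [b for b in spine if b not in seen]

start, end = date(2021, 1, 1), date(2021, 1, 7)
observed = [date(2021, 1, 1), date(2021, 1, 2), date(2021, 1, 6)]
# Buckets of 2 days: Jan 1-2 has data, Jan 3-4 is empty, Jan 5-6 has data.
print(missing_buckets(observed, start, end, 2))  # [datetime.date(2021, 1, 3)]
```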