Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Add optional lookback_interval param to insert_by_period materialization #394

Closed
wants to merge 23 commits into from
Closed

Feature: Add optional lookback_interval param to insert_by_period materialization #394

wants to merge 23 commits into from

Conversation

GClunies
Copy link

@GClunies GClunies commented Jul 18, 2021

This is a:

  • bug fix PR with no breaking changes — please ensure the base branch is master
  • new functionality — please ensure the base branch is the latest dev/ branch
  • a breaking change — please ensure the base branch is the latest dev/ branch

Description & motivation

The insert_by_period materialization does not work with sessionization models like segment_web_page_views__sessionized from the dbt Segment package due to some unusually complex logic. Since these sessionization models are often very large and complex, I would like to be able to use the insert_by_period materialization with them.

This is my first PR to a dbt repo, let alone a dev branch, so please let me know if any additional work is needed for this PR.

Also, huge thanks to Scott Barber on the dbt-labs customer support engineering team to get me started on refactoring the macro! His input and guidance were a massive help and this was a great learning experience for me.

Describe your changes, and why you're making them.

This PR adds an optional lookback_interval parameter to the insert_by_period materialization, allowing it to be used with sessionization models like segment_web_page_views__sessionized from the dbt Segment package.

This is achieved by allowing the end user to include __PERIOD_FILTER_WITH_LOOKBACK__ in their model SQL, which the materialization replaces at runtime of each period using this modified macro logic in the materialization code.

{% macro default__get_period_sql(target_cols_csv, sql, timestamp_field, period, start_timestamp, stop_timestamp, offset, lookback_interval) -%}

  {%- set period_filter -%}
    ("{{timestamp_field}}" >  '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' and
     "{{timestamp_field}}" <= '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' + interval '1 {{period}}' and
     "{{timestamp_field}}" <  '{{stop_timestamp}}'::timestamp)
  {%- endset -%}

  {%- set period_filter_with_lookback -%}
    ("{{timestamp_field}}" >  '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' - interval '{{lookback_interval}}' and
     "{{timestamp_field}}" <= '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' + interval '1 {{period}}' and
     "{{timestamp_field}}" <  '{{stop_timestamp}}'::timestamp)
  {%- endset -%}

  {%- set filtered_sql = sql | replace("__PERIOD_FILTER__", period_filter) | replace("__PERIOD_FILTER_WITH_LOOKBACK__", period_filter_with_lookback) -%}

  select
    {{target_cols_csv}}
  from (
    {{filtered_sql}}
  )

{%- endmacro %}

In addition to the integration tests I have added in this PR, I have tested the application of these changes to the materialization on my sessionization models @Surfline to ensure idempotency of session_ids is still maintained. I have added documentation on usage to the README.

The materialization can also be used on any model where loading by period is desired, but a lookback interval is also required to filter the upstream data (which is typically accomplished using incremental logic).

Checklist

  • I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses, this helps us understand what has been covered)
    • BigQuery
    • Postgres
    • Redshift
    • Snowflake
  • I have "dispatched" any new macro(s) so non-core adapters can also use them (e.g. the star() source)
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have added an entry to CHANGELOG.md

clrcrl and others added 21 commits June 6, 2021 15:08
* Tidy up changelog

* Add 0.7.0 entry to changelog

* Add order_by argument to get_column_values (#349)

* Add slugify macro to utils, use in pivot macro (#314)

* 0.20.0 compatibility (#371)

* Explicitly redefine Redshift -> default

* Upgrade generic tests

* Rm namespaces macro. New dispatch syntax

* Run tests with 0.20.0rc1

* Update changelog, readme

Co-authored-by: Jeremy Cohen <jeremy@fishtownanalytics.com>

* Simplify concat (#373)

* Postgres also have an alternative concat binary operation (#296)

* Update default implementation of concat macro

Co-authored-by: Christophe Duong <christophe.duong@gmail.com>

Co-authored-by: Jeremy Cohen <jeremy@fishtownanalytics.com>
Co-authored-by: Christophe Duong <christophe.duong@gmail.com>
* power and pow are synonyms (except in TSQL)

* contrib

* power is more crossdb friendly than pow

Co-authored-by: Jeremy Cohen <jeremy@fishtownanalytics.com>
…materialization was in an old version of dbt_utils. not v0.7.0
@GClunies GClunies changed the title Feature: Add optionallookback_interval' param to insert by period` materialization for use with dbt Segment package models Feature: Add optionallookback_interval param to insert by period materialization for use with dbt Segment package models Jul 18, 2021
@GClunies GClunies changed the title Feature: Add optionallookback_interval param to insert by period materialization for use with dbt Segment package models Feature: Add optional lookback_interval param to insert by period materialization for use with dbt Segment package models Jul 18, 2021
@GClunies GClunies changed the title Feature: Add optional lookback_interval param to insert by period materialization for use with dbt Segment package models Feature: Add optional lookback_interval param to insert_by_period materialization Jul 18, 2021
@GClunies
Copy link
Author

I also see issue dbt-labs/dbt-labs-experimental-features#32 talking about modernizing this materialization. But, it looks like all of the changes discussed there have happened in the dbtvault repo and are largely Snowflake focused.

Would be great to modernize the original redshift version as well (or a universal materialization). But that work seems like a separate PR.

@barberscott
Copy link

@jtcohen6 since you have done a fair amount of work with Segment in the past, thought you might have the context to review this -- it's an application of insert_by_period that allows for a window to use for sessionization-type queries.

@jasnonaz
Copy link
Contributor

Hi @GClunies! This is a really fantastic addition to the codebase and clearly has a ton of thought and care put into it. We are actually in the process of determining what the future of the insert_by_period materialization is. It's less battle tested and cross compatible and much more experimental than the level of assurance you would normally expect in dbt-utils and it'd quite likely we will be moving it into a repo for experimental functionality in the near future. We'd love this addition and other things like cross database functionality to get added when we do that.

In the meantime I'm going to close this as it isn't a good fit for utils right now but we really, really do appreciate the contribution and look forward to merging this in when we ultimately determine where insert_by_period is going to live.

@jasnonaz jasnonaz closed this Aug 30, 2021
@GClunies
Copy link
Author

@jasnonaz thanks for the kind words and update! Agree that a separate repo makes a lot of sense. Feel free to ping me in the new repo when you get to it. For now I have this as a macro in our dbt project so no risk of a breaking change for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants