
Modernize legacy insert by period materialization #410

Conversation


@HorvathDanielMarton HorvathDanielMarton commented Sep 3, 2021

This is a:

  • bug fix PR with no breaking changes — please ensure the base branch is master
  • new functionality — please ensure the base branch is the latest dev/ branch
  • a breaking change — please ensure the base branch is the latest dev/ branch

Description & motivation

The goal of this change is to make the insert_by_period materialization compatible with Snowflake while using the currently latest dbt release (0.21.0b2) and keeping the same functionality.

The insert_by_period materialization was introduced in this Discourse post, and the intention behind it is to help with “problematic” initial builds of data models. Here, problematic means that some tables are so large and/or complex that a simple --full-refresh is not adequate: the build will be incredibly slow and inefficient, and can even get stuck.

Unfortunately, the currently available version does not support Snowflake, and it’s advised to refactor it using the refreshed code of the incremental materialization as a starting point. As a result, this change resolves dbt-labs/dbt-labs-experimental-features#32.

The materialization itself was rebuilt from the current incremental materialization, and the modifications made to it were inspired by the original insert_by_period.

The macros that were originally in insert_by_period.sql were moved to a helpers.sql file, where Snowflake versions of get_period_boundaries() and get_period_sql() are introduced. A new check_for_period_filter() macro has also been added.

The materialization appears fully functional and so far works as expected; nevertheless, I'm still testing it against production models.

Ideas

  1. Currently, the config only accepts a single period of day, week, month, etc. as the chunk unit. However, we might want to build our models fortnightly (i.e. in 2-week chunks). We could introduce an optional configuration that lets us use a configurable number of periods per chunk. The fortnightly example could look like this:
{{
    config(
        materialized = "insert_by_period",
        timestamp_field = "created_at",
        period = 'week',
        interval_size = 2,
        start_date = "2021-01-01",
        stop_date = "2021-08-31"
    )
}}
  2. In some of our more complex models with multiple sources, we might need to offset the __PERIOD_FILTER__ by some time (for example a one-month lookback, but it should be configurable). The materialization currently doesn't support this, which is a real shortcoming for some of our models where insert_by_period would come in handy because of their massive size & complexity. It also seems our team is not the only one to bump into this: Feature: Add optional lookback_interval param to insert_by_period materialization #394
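Both ideas above amount to changing how the chunk windows are generated. A minimal Python sketch of that window generation, assuming hypothetical interval_size and lookback parameters (neither exists in the current materialization, and weekly periods are approximated as 7 days):

```python
from datetime import datetime, timedelta

def period_windows(start, stop, period_days=1, interval_size=1,
                   lookback=timedelta(0)):
    # Each chunk spans `interval_size` periods; the lower bound of each
    # window is pushed back by `lookback` so late-arriving source data can
    # be re-scanned. Both parameters are hypothetical additions.
    step = timedelta(days=period_days * interval_size)
    windows, cursor = [], start
    while cursor < stop:
        upper = min(cursor + step, stop)  # clamp the final chunk to stop_date
        windows.append((cursor - lookback, upper))
        cursor = upper
    return windows

# Fortnightly chunks: period='week' (7 days), interval_size=2
ws = period_windows(datetime(2021, 1, 1), datetime(2021, 8, 31),
                    period_days=7, interval_size=2)
```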

A weird behavior

The materialization can behave oddly if it’s executed on an already existing model, because the “last incremental run” may overlap the specified date range by only a tiny margin. Take the following config as an example:

{{
    config(
        materialized = "insert_by_period",
        timestamp_field = "created_at",
        period = 'day',
        start_date = "2021-08-28",
        stop_date = "2021-08-31"
    )
}}

This builds a model with events where 2021-08-28 < created_at < 2021-08-31, so it covers 3 days and is built in 3 steps, assuming a fresh start and an uninterrupted run. If the run is terminated after the 1st day is built, the next run will again consist of 3 steps. The last of those steps will likely insert only a handful of events (many orders of magnitude fewer than the previous ones), as in this CLI output:

[Screenshot (2021-08-31): CLI output where the final step inserts only a handful of records]

This may raise some eyebrows, but it makes perfect sense if we look at the query’s where clause - which is basically this:

where created_at > '2021-08-30 23:59:59.852'::timestamp
  and created_at <= '2021-08-31 23:59:59.852'::timestamp -- this isn't relevant at all
  and created_at <  '2021-08-30 23:59:59.999'::timestamp

So the last iteration only looks for events that happened between 2021-08-30 23:59:59.852 and 2021-08-30 23:59:59.999, a window of just 147 milliseconds. It would be nice to handle this so it doesn't raise suspicion that something is buggy.

This is most conspicuous for large models with a lot of events.
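The arithmetic behind that sliver can be reproduced in a few lines of Python. The boundary values are taken from the where clause above; the windowing itself is a simplified sketch of what the materialization effectively evaluates on the resumed run:

```python
from datetime import datetime, timedelta

start = datetime(2021, 8, 28, 23, 59, 59, 852000)  # max(created_at) of the partial build
stop  = datetime(2021, 8, 30, 23, 59, 59, 999000)  # effective stop_timestamp
offset = 2                                          # third (last) daily step

lower = start + timedelta(days=offset)
upper = min(start + timedelta(days=offset + 1), stop)  # stop clamps the last step
window = upper - lower
print(window)  # 0:00:00.147000
```

The final step scans a 147 ms slice, which is why it inserts almost nothing.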

Checklist

  • I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses, this helps us understand what has been covered)
    • BigQuery
    • Postgres
    • Redshift
    • Snowflake
  • I have "dispatched" any new macro(s) so non-core adapters can also use them (e.g. the star() source)
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have added an entry to CHANGELOG.md

@HorvathDanielMarton HorvathDanielMarton changed the title Modernize legacy insert by period macro Modernize legacy insert by period materialization Sep 6, 2021
select
coalesce(max({{timestamp_field}}), '{{start_date}}')::timestamp as start_timestamp,
coalesce(
{{dbt_utils.dateadd('millisecond',
nanosecond is more accurate, i.e. timestamp(9); millisecond only covers timestamp(3).


select
start_timestamp,
stop_timestamp,

Downstream, load_result truncates these timestamp(9) values to timestamp(6), which breaks the timestamp comparisons further downstream and inserts duplicates if the incremental load is rerun after a completed load.

The fix is to cast both start_timestamp and stop_timestamp here to varchar, so that load_result gets a string and not a timestamp.

@joellabes
Contributor

Whew! Thanks for all your work on this @HorvathDanielMarton.

I just spent some time talking about this very macro in the dbt Community Slack as well.

In short, we (the dbt Labs Developer Experience team) are very interested in the future of this materialization because it does some important work. What we're not certain about yet is exactly how it fits into the wider dbt ecosystem:

  • Should it be the fifth native materialization?
  • An additional set of config on the existing incremental materialization?
  • Be expanded to cover all four core adapters in scope of this package (Snowflake, BigQuery, Redshift, Postgres)?
  • Be extracted from this package and remain a Redshift-only oddity? Unlikely, given your work here.

I promise this PR won't languish forever, but please bear with us while we work out where we're headed 🧭

@joellabes
Contributor

Worth noting: anyone else who comes along in the meantime is of course welcome to make a copy of the code in this PR and apply it to their own project!

@HorvathDanielMarton
Author

Worth noting: anyone else who comes along in the meantime is of course welcome to make a copy of the code in this PR and apply it to their own project!

@joellabes Thank you! To make it easier for the curious, just update your packages.yml to include the branch as follows

packages:
  - git: "https://github.com/sspinc/dbt-utils.git"
    revision: fix-legacy-insert-by-period-macro

and execute dbt deps. After that you’ll be ready to play around with this materialization.


{%- set period_filter -%}
("{{timestamp_field}}" > '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' and
"{{timestamp_field}}" <= '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' + interval '1 {{period}}' and


This line will not work properly if period = "month". For example:

select  '2018-12-31 00:00:00'::timestamp + interval '4 month' + interval '1 month', 
        '2018-12-31 00:00:00'::timestamp + interval '5 month'

2019-05-30 00:00:00 vs. 2019-05-31 00:00:00, so one day will not be processed.
Redshift can handle the parentheses, like '{{start_timestamp}}'::timestamp + (interval '{{offset}} {{period}}' + interval '1 {{period}}'), but Snowflake will not accept them.
Suggestion (which I cannot test right away): calculate offset_plus_one as offset + 1 and use it when setting the filter, like

  {%- set period_filter -%}
    ("{{timestamp_field}}" >  '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' and
     "{{timestamp_field}}" <= '{{start_timestamp}}'::timestamp + interval '{{offset_plus_one}} {{period}}' and
     "{{timestamp_field}}" <  '{{stop_timestamp}}'::timestamp)
  {%- endset -%}
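The gap comes from end-of-month clamping in month arithmetic. A short Python sketch of month addition that mimics how these engines generally evaluate interval 'n month' (day clamped to the target month's last day) reproduces it:

```python
import calendar
from datetime import date

def add_months(d, n):
    # Advance the month by n, clamping the day to the target month's
    # last day, as SQL engines generally do for interval 'n month'.
    month_index = d.month - 1 + n
    year, month = d.year + month_index // 12, month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

d = date(2018, 12, 31)
chained = add_months(add_months(d, 4), 1)  # interval '4 month' + interval '1 month'
direct  = add_months(d, 5)                 # interval '5 month'
print(chained, direct)  # 2019-05-30 2019-05-31
```

Chained addition clamps to April 30 first and never recovers the 31st, so one day falls between consecutive windows.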


I tested this block and it works ok. Feel free to use it.

{% macro get_period_sql(target_cols_csv, sql, timestamp_field, period, start_timestamp, stop_timestamp, offset) -%}

  {%- set offset_plus_one = offset + 1 -%}

  {%- set period_filter -%}
    ("{{timestamp_field}}" >  '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' and
     "{{timestamp_field}}" <= '{{start_timestamp}}'::timestamp + interval '{{offset_plus_one}} {{period}}' and
     "{{timestamp_field}}" <  '{{stop_timestamp}}'::timestamp)
  {%- endset -%}

  {%- set filtered_sql = sql | replace("__PERIOD_FILTER__", period_filter) -%}

  select
    {{target_cols_csv}}
  from (
    {{filtered_sql}}
  )

{%- endmacro %}

@laconc laconc added enhancement New feature or request and removed enhancement New feature or request labels Dec 2, 2021
@joellabes joellabes mentioned this pull request Jan 23, 2022
12 tasks
@joellabes
Contributor

Folks on this thread will probably be interested in this Discussion: #487

I'd appreciate your input!

@smitsrr

smitsrr commented Sep 19, 2022

@joellabes
Contributor

Unsurprisingly, I can't transfer a PR from one repo to another, but if you want to reopen the PR over there I'll be happy to merge it 👍

@joellabes joellabes closed this Sep 20, 2022
Development

Successfully merging this pull request may close these issues.

Modernize insert_by_period materialization
6 participants