
Fix legacy insert by period materialization #1

Open · wants to merge 31 commits into base: master
Conversation

@HorvathDanielMarton HorvathDanielMarton commented Aug 27, 2021

This is a:

  • bug fix PR with no breaking changes — please ensure the base branch is master
  • new functionality — please ensure the base branch is the latest dev/ branch
  • a breaking change — please ensure the base branch is the latest dev/ branch

Description & motivation

The goal of this change is to make the insert_by_period materialization compatible with Snowflake while using the latest available dbt (0.21.0b2) and keeping the same functionality.

The insert_by_period materialization was introduced in this discourse post and the intention behind it is to help with “problematic” initial builds of data models. Here, problematic means that some tables are just so large and/or complex that a simple --full-refresh is not adequate because the build will be incredibly slow, inefficient, and can even get stuck.

Unfortunately, the currently available version does not support Snowflake and it’s advised to refactor it by using the refreshed code for the incremental materialization as a good starting point. As a result, this change addresses dbt-labs/dbt-labs-experimental-features#32.

The materialization itself was built entirely from the current incremental materialization, and the modifications made to it were inspired by the original insert_by_period.

The macros that were originally in insert_by_period.sql were moved to a helpers.sql file, where the Snowflake versions of get_period_boundaries() and get_period_sql() are introduced. Also, a new check_for_period_filter() macro has been added.

The materialization seems fully functional and has so far worked as expected; nevertheless, I'm still testing it against production models.

Ideas

  1. Currently, in the config we can only set a period of day, week, month, etc. as a unit. However, we might want to build our models fortnightly (i.e., in 2-week chunks). We could introduce an optional configuration that lets us use a configurable number of periods as a chunk. The fortnightly example could look like this:
{{
    config(
        materialized = "insert_by_period",
        timestamp_field = "created_at",
        period = 'week',
        interval_size = 2,
        start_date = "2021-01-01",
        stop_date = "2021-08-31"
    )
}}
  2. In some of our more complex models with multiple sources, we might need to offset the __PERIOD_FILTER__ by some time (for example a one-month lookback, but it should be configurable). Currently, this materialization doesn't support this, which is surely a shortcoming for some of our models where insert_by_period would come in handy because of their massive size and complexity. Also, it seems our team is not the only one that bumped into this: Feature: Add optional lookback_interval param to insert_by_period materialization dbt-labs/dbt-utils#394
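A hypothetical config for idea 2, borrowing the lookback_interval name from the linked dbt-utils proposal (the parameter does not exist in this PR; its name, unit, and semantics here are assumptions for illustration):

```sql
{{
    config(
        materialized = "insert_by_period",
        timestamp_field = "created_at",
        period = 'month',
        lookback_interval = 1,
        start_date = "2021-01-01",
        stop_date = "2021-08-31"
    )
}}
```

The idea would be that each period's lower boundary in the generated __PERIOD_FILTER__ is shifted back by one period, so the model can see a month of earlier events when building each chunk.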

A weird behavior

The materialization can produce weird behavior if it's executed on an already existing model, because the "last incremental run" may have a very short overlap with the specified date range. Take the following config as an example:

{{
    config(
        materialized = "insert_by_period",
        timestamp_field = "created_at",
        period = 'day',
        start_date = "2021-08-28",
        stop_date = "2021-08-31"
    )
}}

This is going to build a model with events where 2021-08-28 < created_at < 2021-08-31, so it covers 3 days and will be built in 3 steps, given a fresh start and an uninterrupted run. If the run is terminated after building the 1st day, the next run will again have 3 steps. It's not unlikely that the last step will only insert a handful of events (many magnitudes smaller than the previous ones), like in this CLI output:

[Screenshot: CLI output from 2021-08-31 showing the final step inserting only a handful of rows]

This may raise some eyebrows, but it makes perfect sense if we look at the query’s where clause - which is basically this:

where created_at >  '2021-08-30 23:59:59.852'::timestamp
  and created_at <= '2021-08-31 23:59:59.852'::timestamp -- this isn't relevant at all
  and created_at <  '2021-08-30 23:59:59.999'::timestamp

So the last iteration only looks for events that happened between 2021-08-30 23:59:59.852 and 2021-08-30 23:59:59.999, a window of well under a second. It would be nice to handle this, so it doesn't raise suspicion that something is buggy.

This is most conspicuous for large models with a lot of events.
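To make the arithmetic behind this concrete, here is a minimal Python sketch (not the macro's actual code) of how period-sized chunking with a capped upper bound produces a sub-second final window. The timestamps are taken from the WHERE clause above; the helper name is invented for illustration:

```python
from datetime import datetime, timedelta
import math

def period_windows(start, stop, period=timedelta(days=1)):
    # Split (start, stop] into period-sized chunks; the final chunk is
    # truncated at `stop`, which is how a tiny last window can arise.
    count = math.ceil((stop - start) / period)
    return [
        (start + i * period, min(start + (i + 1) * period, stop))
        for i in range(count)
    ]

# Fresh start: three clean day-sized steps.
full = period_windows(datetime(2021, 8, 28), datetime(2021, 8, 31))

# Resumed run: the lower bound is the max(created_at) already loaded,
# a few milliseconds before midnight, and the upper bound is capped by
# the newest event in the source.
resumed = period_windows(
    datetime(2021, 8, 30, 23, 59, 59, 852000),
    datetime(2021, 8, 30, 23, 59, 59, 999000),
)
# `resumed` contains a single window spanning roughly 147 milliseconds,
# matching the "far from even being a second" step seen in the CLI output.
```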

Checklist

  • I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses, this helps us understand what has been covered)
    • BigQuery
    • Postgres
    • Redshift
    • Snowflake
  • I have "dispatched" any new macro(s) so non-core adapters can also use them (e.g. the star() source)
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have added an entry to CHANGELOG.md

@HorvathDanielMarton HorvathDanielMarton added the in progress Still in progresss label Aug 27, 2021
etoulas commented Aug 31, 2021

One more question.

I encountered another issue with Postgres, as described in the same comment on dbt-labs/dbt-labs-experimental-features#32, but I couldn't track down what happened to it in your new code. Perhaps you can comment on whether this part follows the same logic as before, and whether the error will reappear.

The error I got (after fixing the subquery alias above) was: "cannot create temporary relation in non-temporary schema".

I "solved" it by changing the schema parameter's default value to None in insert_by_period_materialization.sql#L130:

    {%- set tmp_relation = api.Relation.create(identifier=tmp_identifier,
                                               schema=None, type='table') -%}  -- this is line 130

This potentially breaks other features or supported DB engines, but my setup was fine with that.

@HorvathDanielMarton (Author)

> One more question.
>
> I encountered another issue with Postgres, as described in the same comment dbt-labs#192 (comment), but I couldn't track down what happened to it in your new code. Perhaps you can comment on whether this part follows the same logic as before, and whether the error will reappear.
>
> The error I got (after fixing the subquery alias above) was: "cannot create temporary relation in non-temporary schema".
>
> I "solved" it by changing the schema parameter's default value to None in insert_by_period_materialization.sql#L130:
>
>     {%- set tmp_relation = api.Relation.create(identifier=tmp_identifier,
>                                                schema=None, type='table') -%}  -- this is line 130
>
> This potentially breaks other features or supported DB engines, but my setup was fine with that.

I updated the code as a direct result of your question, hopefully for the better!

As I started from the incremental materialization, I took a look and realized that api.Relation.create() is no longer used there, but I still used it on L80 of the materialization. In dbt-labs/dbt-labs-experimental-features#32, the make_temp_relation() function is specifically called out in the description, so I updated the code to use that instead.

Unfortunately, I can’t test this on Postgres, but I tested on Snowflake, and the updated code works as expected. I think using a dbt core function is the way to go forward.
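For context, the replacement roughly follows the pattern dbt's own incremental materialization uses: derive the temp relation from the target relation via make_temp_relation() instead of constructing it with api.Relation.create(). A sketch (the variable names and the default `__dbt_tmp` suffix behavior are assumptions):

```sql
{#- Let dbt pick an adapter-appropriate schema and suffix for the
    temporary relation, instead of hard-coding schema=None. -#}
{%- set tmp_relation = make_temp_relation(target_relation) -%}
```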

Thanks for your comment!

@HorvathDanielMarton HorvathDanielMarton removed the in progress Still in progresss label Aug 31, 2021
@HorvathDanielMarton HorvathDanielMarton force-pushed the fix-legacy-insert-by-period-macro branch from 19dd99f to f82f2ee on September 1, 2021 at 14:55
@HorvathDanielMarton (Author)

All the checks passed for Postgres, so I think we are good there. The rest of the integration checks won’t pass here, in this repository. Thank you again @etoulas for your comments. I opened a PR with this change in dbt-utils too, so if you’d like to follow the conversation, here it is: dbt-labs#410

@rgabo thanks again for your continuous support on this. What do you think, should we 1) wait for some feedback on the other PR or 2) merge this back locally and modify it later if it’s necessary based on the review we receive? Also, do you have any idea why their CI didn't run for our PR?


rgabo commented Sep 6, 2021

@HorvathDanielMarton with regards to this PR, we can simply use the functionality from the fix-legacy-insert-by-period-macro branch and pull in dbt-utils from our fork for the time being. If there is a new release of dbt-utils while the PR is being reviewed and improved, we are going to need to bring the branch up-to-speed anyways.

With that in mind, we can close this PR and keep all relevant discussion in dbt-labs#410.

As far as CI goes, it is likely that PRs are not automatically run against CI; we'll likely need a maintainer to sprinkle some magic on the PR to trigger it 😄

5 participants