
Fix legacy insert by period materialization #1

Open · wants to merge 31 commits into base: master
Conversation

@HorvathDanielMarton HorvathDanielMarton commented Aug 27, 2021

This is a:

  • bug fix PR with no breaking changes — please ensure the base branch is master
  • new functionality — please ensure the base branch is the latest dev/ branch
  • a breaking change — please ensure the base branch is the latest dev/ branch

Description & motivation

The goal of this change is to make the insert_by_period materialization compatible with Snowflake while using the latest available dbt (0.21.0b2) and keeping the same functionality.

The insert_by_period materialization was introduced in this discourse post and the intention behind it is to help with “problematic” initial builds of data models. Here, problematic means that some tables are just so large and/or complex that a simple --full-refresh is not adequate because the build will be incredibly slow, inefficient, and can even get stuck.

Unfortunately, the currently available version does not support Snowflake and it’s advised to refactor it by using the refreshed code for the incremental materialization as a good starting point. As a result, this change addresses dbt-labs/dbt-labs-experimental-features#32.

The materialization itself was built entirely from the current incremental materialization, and the modifications made to it were inspired by the original insert_by_period.

The macros that were originally in insert_by_period.sql were moved to a helpers.sql file, where the Snowflake versions of get_period_boundaries() and get_period_sql() are introduced. Also, a new check_for_period_filter() macro has been added.

The materialization seems fully functional and has so far worked as expected; nevertheless, I'm still testing it against production models.

Ideas

  1. Currently, in the config we can only set a period of day, week, month, etc. as a unit. However, we might want to build our models fortnightly (i.e., in 2-week chunks). We could introduce an optional configuration that lets us use a configurable number of periods as a chunk. The fortnightly example could look like this:
{{
    config(
        materialized = "insert_by_period",
        timestamp_field = "created_at",
        period = 'week',
        interval_size = 2,
        start_date = "2021-01-01",
        stop_date = "2021-08-31"
    )
}}
  2. In some of our more complex models with multiple sources, we might need to offset the __PERIOD_FILTER__ by some time (for example a one-month lookback, but it should be configurable). Currently, this materialization doesn't support this, which is surely a shortcoming for some of our models where insert_by_period would come in handy because of their massive size and complexity. Also, it seems our team is not the only one that bumped into this: Feature: Add optional lookback_interval param to insert_by_period materialization dbt-labs/dbt-utils#394
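A hypothetical config for idea 2, borrowing the lookback_interval name from the linked dbt-utils proposal (the parameter does not exist in this PR; its name, unit, and semantics here are assumptions for illustration):

```sql
{{
    config(
        materialized = "insert_by_period",
        timestamp_field = "created_at",
        period = 'month',
        lookback_interval = 1,
        start_date = "2021-01-01",
        stop_date = "2021-08-31"
    )
}}
```

The idea would be that each period's lower boundary in the generated __PERIOD_FILTER__ is shifted back by one period, so the model can see a month of earlier events when building each chunk.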

A weird behavior

The materialization can produce weird behavior if it's executed on an already existing model, because the "last incremental run" may have a very short overlap with the specified date range. Take the following config as an example:

{{
    config(
        materialized = "insert_by_period",
        timestamp_field = "created_at",
        period = 'day',
        start_date = "2021-08-28",
        stop_date = "2021-08-31"
    )
}}

This is going to build a model with events where 2021-08-28 < created_at < 2021-08-31, so it covers 3 days and will be built in 3 steps, given a fresh start and an uninterrupted run. If the run is terminated after building the 1st day, the next run will again have 3 steps. It's not unlikely that the last step will only insert a handful of events (many magnitudes smaller than the previous ones), like in this CLI output:

[Screenshot: CLI output from 2021-08-31 showing the final step inserting only a handful of rows]

This may raise some eyebrows, but it makes perfect sense if we look at the query’s where clause - which is basically this:

where created_at >  '2021-08-30 23:59:59.852'::timestamp
  and created_at <= '2021-08-31 23:59:59.852'::timestamp -- this isn't relevant at all
  and created_at <  '2021-08-30 23:59:59.999'::timestamp

So the last iteration only looks for events that happened between 2021-08-30 23:59:59.852 and 2021-08-30 23:59:59.999, a window of well under a second. It would be nice to handle this, so it doesn't raise suspicion that something is buggy.

This is most conspicuous for large models with a lot of events.
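To make the arithmetic behind this concrete, here is a minimal Python sketch (not the macro's actual code) of how period-sized chunking with a capped upper bound produces a sub-second final window. The timestamps are taken from the WHERE clause above; the helper name is invented for illustration:

```python
from datetime import datetime, timedelta
import math

def period_windows(start, stop, period=timedelta(days=1)):
    # Split (start, stop] into period-sized chunks; the final chunk is
    # truncated at `stop`, which is how a tiny last window can arise.
    count = math.ceil((stop - start) / period)
    return [
        (start + i * period, min(start + (i + 1) * period, stop))
        for i in range(count)
    ]

# Fresh start: three clean day-sized steps.
full = period_windows(datetime(2021, 8, 28), datetime(2021, 8, 31))

# Resumed run: the lower bound is the max(created_at) already loaded,
# a few milliseconds before midnight, and the upper bound is capped by
# the newest event in the source.
resumed = period_windows(
    datetime(2021, 8, 30, 23, 59, 59, 852000),
    datetime(2021, 8, 30, 23, 59, 59, 999000),
)
# `resumed` contains a single window spanning roughly 147 milliseconds,
# matching the "far from even being a second" step seen in the CLI output.
```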

Checklist

  • I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses, this helps us understand what has been covered)
    • BigQuery
    • Postgres
    • Redshift
    • Snowflake
  • I have "dispatched" any new macro(s) so non-core adapters can also use them (e.g. the star() source)
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have added an entry to CHANGELOG.md

@HorvathDanielMarton HorvathDanielMarton added the in progress Still in progresss label Aug 27, 2021
etoulas commented Aug 31, 2021

One more question.

I encountered another issue with Postgres, as described in the same comment on dbt-labs/dbt-labs-experimental-features#32, but I couldn't track down what happened to it in your new code. Perhaps you can comment on whether this part follows the same logic as before, and whether the error will reappear.

The error I got (after fixing the subquery alias above) was: "cannot create temporary relation in non-temporary schema".

I "solved" it by changing the schema parameter's default value to None in insert_by_period_materialization.sql#L130:

    {%- set tmp_relation = api.Relation.create(identifier=tmp_identifier,
                                               schema=None, type='table') -%}  -- this is line 130

This potentially breaks other features or supported DB engines, but my setup was fine with that.

@HorvathDanielMarton (Author)

> One more question.
>
> I encountered another issue with Postgres, as described in the same comment dbt-labs#192 (comment), but I couldn't track down what happened to it in your new code. Perhaps you can comment on whether this part follows the same logic as before, and whether the error will reappear.
>
> The error I got (after fixing the subquery alias above) was: "cannot create temporary relation in non-temporary schema".
>
> I "solved" it by changing the schema parameter's default value to None in insert_by_period_materialization.sql#L130:
>
>     {%- set tmp_relation = api.Relation.create(identifier=tmp_identifier,
>                                                schema=None, type='table') -%}  -- this is line 130
>
> This potentially breaks other features or supported DB engines, but my setup was fine with that.

I updated the code as a direct result of your question, hopefully for the better!

As I started from the incremental materialization, I took a look and realized that api.Relation.create() is no longer used there, but I still used it on L80 of the materialization. In dbt-labs/dbt-labs-experimental-features#32, the make_temp_relation() function is specifically called out in the description, so I updated the code to use that instead.

Unfortunately, I can’t test this on Postgres, but I tested on Snowflake, and the updated code works as expected. I think using a dbt core function is the way to go forward.
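For context, the replacement roughly follows the pattern dbt's own incremental materialization uses: derive the temp relation from the target relation via make_temp_relation() instead of constructing it with api.Relation.create(). A sketch (the variable names and the default `__dbt_tmp` suffix behavior are assumptions):

```sql
{#- Let dbt pick an adapter-appropriate schema and suffix for the
    temporary relation, instead of hard-coding schema=None. -#}
{%- set tmp_relation = make_temp_relation(target_relation) -%}
```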

Thanks for your comment!

@HorvathDanielMarton HorvathDanielMarton removed the in progress Still in progresss label Aug 31, 2021
@HorvathDanielMarton HorvathDanielMarton force-pushed the fix-legacy-insert-by-period-macro branch from 19dd99f to f82f2ee on September 1, 2021 at 14:55
@HorvathDanielMarton (Author)

All the checks passed for Postgres, so I think we are good there. The rest of the integration checks won’t pass here, in this repository. Thank you again @etoulas for your comments. I opened a PR with this change in dbt-utils too, so if you’d like to follow the conversation, here it is: dbt-labs#410

@rgabo thanks again for your continuous support on this. What do you think, should we 1) wait for some feedback on the other PR or 2) merge this back locally and modify it later if it’s necessary based on the review we receive? Also, do you have any idea why their CI didn't run for our PR?


rgabo commented Sep 6, 2021

@HorvathDanielMarton with regards to this PR, we can simply use the functionality from the fix-legacy-insert-by-period-macro branch and pull in dbt-utils from our fork for the time being. If there is a new release of dbt-utils while the PR is being reviewed and improved, we are going to need to bring the branch up-to-speed anyways.

With that in mind, we can close this PR and keep all relevant discussion in dbt-labs#410.

As far as CI goes, it is likely that PRs are not automatically run against CI; we'll likely need a maintainer to sprinkle some magic on the PR to trigger it 😄

5 participants