Modernize insert_by_period materialization #32
This would be really useful for us, for use with our dbtvault package, which currently uses Snowflake. Starting to experiment! Has anything changed since this issue was created that would be useful to know?
Hey @DVAlexHiggs - check out the latest docs on custom materializations over here: https://docs.getdbt.com/docs/writing-code-in-dbt/extending-dbts-programming-environment/creating-new-materializations. We've also added some helper functions that should simplify materialization code a whole bunch. Check out the default incremental materialization for some inspiration: https://github.com/fishtown-analytics/dbt/blob/dev/octavius-catto/core/dbt/include/global_project/macros/materializations/incremental/incremental.sql
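For context, here is a minimal, hedged sketch of what a materialization skeleton can look like with those helpers (`make_temp_relation`, `run_hooks`, `statement`, `create_table_as`). It is a simplified illustration based on the linked docs and the incremental materialization, not the actual insert_by_period code:

```sql
{% materialization insert_by_period, default %}

  {#- Sketch only: resolve the target and a temp relation using the newer
      helpers instead of constructing relations by hand -#}
  {%- set target_relation = this.incorporate(type='table') -%}
  {%- set tmp_relation = make_temp_relation(target_relation) -%}

  {{ run_hooks(pre_hooks) }}

  {#- Build the model SQL into the temp relation; a real insert_by_period
      implementation would loop over periods and insert each batch into
      the target relation -#}
  {% call statement('main') -%}
    {{ create_table_as(True, tmp_relation, sql) }}
  {%- endcall %}

  {{ run_hooks(post_hooks) }}

  {{ return({'relations': [target_relation]}) }}

{% endmaterialization %}
```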
Hi. We've recently released v0.7.0 of dbtvault, which includes a modified Snowflake implementation of this macro. I believe it is a more modern version, using some of the new adapter functions and such. Have a look here: We've also added a number of additional features, such as inferred date ranges and a 'BASE LOAD' feature which gets around initial load problems. How useful is this for dbt-utils? I am happy to modify it accordingly and contribute it back.
@DVAlexHiggs thanks a lot for writing the comment above! I'm a Snowflake dbt user, and after trying the …
I'd like to see Postgres and other warehouses supported. I got this running on Postgres 11 by making two small modifications (to get rid of the errors).

On the first try I got the error "subquery in FROM must have an alias"; adding an alias to the subquery fixed it:

```sql
select
    {{target_cols_csv}}
from (
    {{filtered_sql}}
) t -- this is line 49
```

Once that was fixed I got another error saying "cannot create temporary relation in non-temporary schema"; setting `schema=None` when creating the temp relation fixed it:

```sql
{%- set tmp_relation = api.Relation.create(identifier=tmp_identifier,
                                           schema=None, type='table') -%} -- this is line 130
```

This did the trick and it worked for me on Postgres.
@fernandobrito I've made some modifications to the macro here for Snowflake. It's a bit flaky - it always fails the first time with an …
We have a fully working version with extra features included in our dbtvault package:
Oh wow, thank you @DVAlexHiggs!!
To anyone for whom this materialization is relevant and who is interested: we have a working version of …
Thanks for your work on this, @HorvathDanielMarton. I'm using dbtvault (see link above) just for their implementation of this macro, and although it works great, extra dependencies mean extra time waiting until all of them catch up with new dbt releases, so having it on …
I can confirm that @etoulas's solution works for PostgreSQL. Can this be added to the package?
@moltar we're in the process of extracting the insert_by_period materialization from this project and moving it to the experiments repo, to better reflect its level of maturity. Once that move is complete, we'd welcome a PR to incorporate this fix!
Looks like the experiments repo is now ready for this edit suggestion! https://github.com/dbt-labs/dbt-labs-experimental-features/blob/main/insert_by_period/macros/insert_by_period_materialization.sql
Sure is! Bring on the PRs (although the CI testing is currently busted 😬)
A universal approach, if you have source history tables and would like to replay them sequentially in the order they were imported and produce CDC, is to take advantage of macros and the built-in snapshots. I needed to do this for BigQuery and did it manually because I am fairly new to creating custom macros. Basically, we have tons of source tables that get multiple data deliveries each time interval; we just append the newest results, and we record each delivery date in an existing field.

Strategy summary: create a handful of macros and a snapshot model, and you literally don't have to change how you ETL records to maintain CDC (a rough sketch of steps 2-3 follows after this comment):

1. Create a macro to create a brand new table.
2. Create a macro to store the date value for the `min_cdc_date` variable.
3. Create a macro to store the date value for the `max_cdc_date` variable.
4. Create a macro to: … & also check …
9. Create a macro that conditionally populates the `snapshot_sql` variable.

Wrap the macros logically into one or two wrapper macros and call those in your snapshot model, and it should create a CDC table for you out of already-historical data. If logically nested, you can keep your pipelines exactly the same, with the addition of these macros being called in the snapshot models. Just execute the `sql` variable.

Possible alternative: create a view for each date you want, `select * from` those, and drop them. Or even a macro for custom views, but the SQL for it may get tricky depending on the CDC type, whether there were schema changes, etc.

BigQuery-specific: import into the existing source history table. Using the time travel feature and the ability to log source model run times (adding a date_time column), just query the table using time travel (it keeps the past 7 days of records):

1. Create a copy of the existing source table.
3. Create a snapshot model for daily loads.

Then schedule a query to delete the minimum partition daily and have the snapshot model run right after it, and you should be able to build a CDC table from already-populated history tables.
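Purely as an illustration of what steps 2 and 3 above might look like, here is a hedged sketch of a macro that fetches a CDC boundary date from a source history table. The macro name, relation, and `delivery_date` column are placeholders, not taken from the comment above:

```sql
{#- Hypothetical sketch for steps 2-3: fetch the minimum delivery date from a
    source history table so it can back a `min_cdc_date`-style variable -#}
{% macro get_min_cdc_date(relation, date_column='delivery_date') %}
    {% set query %}
        select min({{ date_column }}) as min_cdc_date
        from {{ relation }}
    {% endset %}

    {% if execute %}
        {% set results = run_query(query) %}
        {# run_query returns an agate table; take the first value of the first column #}
        {{ return(results.columns[0].values()[0]) }}
    {% endif %}
{% endmacro %}
```

A matching `get_max_cdc_date` macro would be symmetrical, and the BigQuery time travel alternative mentioned at the end of the comment corresponds to BigQuery's `FOR SYSTEM_TIME AS OF` clause.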
The insert_by_period materialization was written ~2 years ago, and since then, we've improved the way we write materializations, e.g.:
- the `non_destructive` flag
- `make_temp_relation`
Additionally, this materialization doesn't work on warehouses other than Redshift (see dbt-labs/dbt-utils#189)
It's time to give this some love: check out the refreshed code for the incremental materialization as a good starting point.
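As one concrete (and hedged) illustration of the difference: where the old materialization built its temporary relation by hand, e.g. the `api.Relation.create(...)` call patched in the Postgres workaround above, newer materialization code can derive it with the `make_temp_relation` helper. A sketch, assuming a `target_relation` is already in scope:

```sql
{#- Sketch only: derive the temp relation from the target relation
    instead of constructing it manually with api.Relation.create(...) -#}
{%- set tmp_relation = make_temp_relation(target_relation) -%}
```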