Gets columns to update from config for BQ and Snowflake #3100
Conversation
Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA. In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin. CLA has not been signed by users: @prratek
Thanks for getting this started, @prratek!

Thoughts from me:
- Naming: Is `update_columns` specific enough? Should we call this `incremental_update_columns`, or is that too unwieldy?
- Configs: Should users be able to set `update_columns` in `dbt_project.yml`? If so, we'll need to add it to the SnowflakeConfig and BigQueryConfig classes.
- We'll want a test case to run on Snowflake and BigQuery. I think it could be a straightforward addition to `001_simple_copy_test`.
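As a rough sketch of how such a model-level config might look (the model, column names, and the `update_columns` key reflect the naming under discussion here, not a finalized API):

```sql
-- models/users.sql (hypothetical model)
{{
    config(
        materialized = 'incremental',
        unique_key = 'id',
        update_columns = ['email', 'updated_at']
    )
}}

select id, email, updated_at from {{ ref('stg_users') }}
```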
plugins/bigquery/dbt/include/bigquery/macros/materializations/incremental.sql
Co-authored-by: Jeremy Cohen <jtcohen6@gmail.com>
`incremental_update_columns` is a bit of a mouthful but seems more consistent with `incremental_strategy`, so we could go with that - I'll make the change.

I could be persuaded either way! I feel like I've regretted being too vague more often than I've regretted being too explicit.

Re: configs - What do you think? This seems like a model-specific config, so I'm not sure what the benefit is of being able to set it in `dbt_project.yml`. But then I don't see the harm either, so idk 🤷♂️

Yeah, I think that's right. It's marginally better to have it. You never know!
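If the config were added to the adapter config classes, setting it project-wide might look roughly like this (project name, model path, and column names are made up for illustration, and the `update_columns` key is the name under discussion, not a finalized one):

```yaml
# dbt_project.yml (hypothetical project layout)
models:
  my_project:
    users:
      +materialized: incremental
      +update_columns: ['email', 'updated_at']
```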
plugins/snowflake/dbt/include/snowflake/macros/materializations/incremental.sql
Co-authored-by: Jeremy Cohen <jtcohen6@gmail.com>
@jtcohen6 it shouldn't be possible to specify
@prratek Thanks for the Draft PR.
@bramanscg that's a good catch! Paraphrasing to make sure I understand you correctly: an insert into an incremental model should always use all the columns in the destination table, and so the config should only impact "update" behavior. As it stands, this PR just sets

I wonder if there's also some complexity we need to think about with behavior for the
@prratek I wonder if the change needs to be made inside `merge.sql` (macro `default__get_merge_sql`, line 38): `{% if unique_key %}`
Hmm, I'm not sure we'd want to modify

Another possible approach could be to have
Fair enough. For our implementation, I am creating a separate macro `bigquery__get_merge_sql`, and it's so far giving me the expected result.

{% macro bigquery__get_merge_sql(target, source, unique_key, dest_columns, predicates) -%}
{% endmacro %}
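The macro body above isn't shown in the comment; as a hedged sketch only, an adapter-specific override along these lines could restrict the update clause while keeping the full column list for inserts (the `merge_update_columns` config key, table aliases, and overall shape are assumptions mirroring the diff proposed later in this thread, not the commenter's actual code):

```sql
{% macro bigquery__get_merge_sql(target, source, unique_key, dest_columns, predicates) -%}
    {#- Full column list, used for the insert branch -#}
    {%- set dest_cols_csv = get_quoted_csv(dest_columns | map(attribute="name")) -%}
    {#- User-configured subset, used only for the update branch -#}
    {%- set update_columns = config.get('merge_update_columns',
            default = dest_columns | map(attribute="name") | list) -%}

    merge into {{ target }} as DBT_INTERNAL_DEST
    using {{ source }} as DBT_INTERNAL_SOURCE
    on DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}

    when matched then update set
        {% for column_name in update_columns -%}
            {{ adapter.quote(column_name) }} = DBT_INTERNAL_SOURCE.{{ adapter.quote(column_name) }}
            {%- if not loop.last %}, {%- endif %}
        {%- endfor %}

    when not matched then insert ({{ dest_cols_csv }})
    values ({{ dest_cols_csv }})

{%- endmacro %}
```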
Good point @bramanscg! We definitely want to use the full `dest_columns` when inserting new records, and the user-configured subset only when updating existing rows.

I think the right move here is updating `default__get_merge_sql`, which is used by both Snowflake and BigQuery, but not by any database that does not support `merge`, e.g. Postgres/Redshift.

This functionality is exclusive to the `merge` DML statement, and it's only relevant if a `unique_key` is defined. For instance, there's no effect for the Snowflake `delete+insert` or BigQuery `insert_overwrite` strategies. This feels obvious in hindsight, but it only just clicked for me. That makes me think a better name for this config would be `merge_update_columns`. What do you both think?

In any case, if we make the following changes, I think this will "just work":
In any case, if we make the following changes, I think this will "just work":
$ git diff
diff --git a/core/dbt/include/global_project/macros/materializations/common/merge.sql b/core/dbt/include/global_project/macros/materializations/common/merge.sql
index b2f9c7f4..b3f1e0a4 100644
--- a/core/dbt/include/global_project/macros/materializations/common/merge.sql
+++ b/core/dbt/include/global_project/macros/materializations/common/merge.sql
@@ -18,6 +18,8 @@
{% macro default__get_merge_sql(target, source, unique_key, dest_columns, predicates) -%}
{%- set predicates = [] if predicates is none else [] + predicates -%}
{%- set dest_cols_csv = get_quoted_csv(dest_columns | map(attribute="name")) -%}
+ {%- set update_columns = config.get('merge_update_columns',
+ default = dest_columns | map(attribute="name") | list) -%}
{%- set sql_header = config.get('sql_header', none) -%}
{% if unique_key %}
@@ -37,8 +39,8 @@
{% if unique_key %}
when matched then update set
- {% for column in dest_columns -%}
- {{ adapter.quote(column.name) }} = DBT_INTERNAL_SOURCE.{{ adapter.quote(column.name) }}
+ {% for column_name in update_columns -%}
+ {{ adapter.quote(column_name) }} = DBT_INTERNAL_SOURCE.{{ adapter.quote(column_name) }}
{%- if not loop.last %}, {%- endif %}
{%- endfor %}
{% endif %}
Pro: `get_merge_sql` checks for the `merge_update_columns` config. We don't need to update the incremental materializations at all, and we don't need to change the signature of `get_merge_sql` at all.

Con: In our grand future vision of materializations, we want to move away from one-off macros constantly grabbing values off the `config`, and instead make the materialization the sole puller and pusher of all `config` values. We're not in that world just yet, and we're hardly digging ourselves a deeper hole here in the meantime.
@prratek Could you give this a go locally and see if it works for you with some test cases? If so, we can jam on how to write the integration tests for Snowflake and BigQuery.
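To make the intended behavior concrete: with the diff earlier in this thread, a model configured with, say, `merge_update_columns = ['email', 'updated_at']` on a table with columns `id`, `email`, and `updated_at` would render roughly this merge statement (table names and quoting here are illustrative, not actual dbt output):

```sql
merge into analytics.users as DBT_INTERNAL_DEST
using analytics.users__dbt_tmp as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.id = DBT_INTERNAL_DEST.id

-- only the configured subset is updated for matched rows...
when matched then update set
    "email" = DBT_INTERNAL_SOURCE."email",
    "updated_at" = DBT_INTERNAL_SOURCE."updated_at"

-- ...but inserts still use the full destination column list
when not matched then insert ("id", "email", "updated_at")
values ("id", "email", "updated_at")
```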
Awesome work @prratek! As soon as you sign the CLA, I can give this a final review.
I did sign it, but after the check had already been marked as failed. Is there a way to get the bot to re-check?
@cla-bot check |
The cla-bot has been summoned, and re-checked this pull request!
Oh, I think I see what happened: you submitted the form with your GitHub account name as
@cla-bot check |
The cla-bot has been summoned, and re-checked this pull request!
@prratek Thanks for all the hard work on this!
Thanks for your help along the way!
Can anyone suggest how to update only specific columns in an incremental model in dbt?
I am interested in an option which does exactly the opposite of `merge_update_columns`.

My use case: when I run models, I always have a date that I don't want to update in merge statements, because it is the creation date of the record. So I would like to have a `merge_no_update_columns` option in incremental models to keep original creation dates when records are updated.

It would be easier to configure `merge_no_update_columns`, because if the table contains several columns, we only need to list the one to exclude in the parameter, and if the model gains new columns in the future, we don't have to modify the config. It is easy to add this new option `merge_no_update_columns` in the merge.sql macro:

However, it seems to be difficult to have these two options (`merge_update_columns` and `merge_no_update_columns`) in the same config. Have you got an idea, and what was the use case for adding `merge_update_columns`? Thank you
{% set update_cols = dbt_utils.star(this, except = ['dont', 'update', 'these', 'columns']) %}
{{ config(
    materialized = 'incremental',
    merge_update_columns = update_cols
) }}
@jtcohen6, thanks for your response, which is a very good idea! With this solution, we have to create a new macro like `star_array` which returns an array of column names (because the original `star` macro only returns a string).

However, I'm new to the Jinja syntax and I have an issue with this line:

So, what is the right syntax to set `merge_update_columns` to the `update_cols` value? I hope that it's possible to assign the value in this config block! Thank you
@ppo-05 Ah, you're right. I thought about this a bit more, and the solution I suggested above will not work. Because the

So I think we're back to: you can certainly achieve this functionality by reimplementing the
@jtcohen6 @prratek Finally, is it so confusing to have these two parameters in the config? We could consider that `merge_update_columns` has priority over `merge_no_update_columns` and write that in the docs (so if both parameters are defined, only `merge_update_columns` will be used). These two options are both useful, depending on the use case. The new code would be this (I have tested it and it is OK).

At the beginning of the `default__get_merge_sql` macro:

Between "merge into" and "when not matched then insert":
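The precedence described here (use `merge_update_columns` if set; otherwise fall back to all destination columns minus `merge_no_update_columns`) could be sketched at the top of `default__get_merge_sql` roughly like this. This is a hedged sketch of that idea, not the tested code from the comment above, which isn't shown:

```sql
{#- Hypothetical sketch: merge_update_columns wins when both configs are set -#}
{%- set dest_col_names = dest_columns | map(attribute="name") | list -%}
{%- set no_update_columns = config.get('merge_no_update_columns', default=[]) -%}
{%- set update_columns = config.get('merge_update_columns',
        default = dest_col_names | reject('in', no_update_columns) | list) -%}
```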
resolves #1862

Description

Incremental models currently default to updating all columns. For databases that support `merge` statements, this PR allows the user to pass in an `update_columns` config parameter to selectively update only a subset of columns, by replacing the call to `adapter.get_columns_in_relation` in the materialization for BigQuery and Snowflake.

Checklist
- I have updated `CHANGELOG.md` and added information about my change to the "dbt next" section.