
Gets columns to update from config for BQ and Snowflake #3100

Merged
merged 30 commits into dbt-labs:develop from specify-cols-to-update
Mar 22, 2021

Conversation

@prratek (Contributor) commented Feb 13, 2021

resolves #1862

Description

Incremental models currently update all columns by default. For databases that support merge statements, this PR lets the user pass an update_columns config parameter to update only a subset of columns, by replacing the call to adapter.get_columns_in_relation in the BigQuery and Snowflake materializations.
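
For example, a model might opt in like this (a minimal sketch; the model and column names are illustrative, not from this PR):

{{ config(
    materialized = 'incremental',
    unique_key = 'user_id',
    update_columns = ['email', 'updated_at']
) }}

select * from {{ ref('stg_users') }}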

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot (bot) commented Feb 13, 2021

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin.

CLA has not been signed by users: @prratek

@jtcohen6 (Contributor) left a comment

Thanks for getting this started, @prratek!

Thoughts from me:

  • Naming: Is update_columns specific enough? Should we call this incremental_update_columns, or is that too unwieldy?
  • Configs: Should users be able to set update_columns in dbt_project.yml? If so, we'll need to add it to the SnowflakeConfig and BigQueryConfig classes.
  • We'll want a test case that runs on both Snowflake and BigQuery. I think it could be a straightforward addition to 001_simple_copy_test.


@prratek (Contributor, Author) commented Feb 16, 2021

Thanks for getting this started, @prratek!

Thoughts from me:

  • Naming: Is update_columns specific enough? Should we call this incremental_update_columns, or is that too unwieldy?
  • Configs: Should users be able to set update_columns in dbt_project.yml? If so, we'll need to add it to the SnowflakeConfig and BigQueryConfig classes.
  • We'll want a test case that runs on both Snowflake and BigQuery. I think it could be a straightforward addition to 001_simple_copy_test.

incremental_update_columns is a bit of a mouthful but seems more consistent with incremental_strategy, so we could go with that - I'll make the change.

Re: configs - What do you think? This seems like a model-specific config, so I'm not sure what the benefit is of being able to set it in dbt_project.yml. But then I don't see the harm either, so idk 🤷‍♂️

And I'll take a look at the contents of 001_simple_copy_test and see if I can make sense of it haha. I'm pretty new to the dbt codebase and might ask for pointers. I'll also need to look through the other failing tests in the CI pipeline to see what changes are needed, right?

@jtcohen6 (Contributor) left a comment

incremental_update_columns is a bit of a mouthful but seems more consistent with incremental_strategy, so we could go with that - I'll make the change.

I could be persuaded either way! I feel like I've regretted being too vague more often than I've regretted being too explicit.

Re: configs - What do you think? This seems like a model-specific config, so I'm not sure what the benefit is of being able to set it in dbt_project.yml. But then I don't see the harm either, so idk 🤷‍♂️

Yeah, I think that's right. It's marginally better to have it. You never know!


@prratek (Contributor, Author) commented Feb 20, 2021

@jtcohen6 it shouldn't be possible to specify incremental_update_columns without also specifying a uniqueness constraint with unique_key, right?


@bramanscg commented

@prratek Thanks for the draft PR.
Am I right in saying that if incremental_update_columns is specified in the config, the same set of columns would be used in both the insert and the update parts of the merge query? Shouldn't the macro loop through the configured columns only for the update part, but use all the columns in the relation for the insert?

@prratek (Contributor, Author) commented Feb 22, 2021

@prratek Thanks for the draft PR.
Am I right in saying that if incremental_update_columns is specified in the config, the same set of columns would be used in both the insert and the update parts of the merge query? Shouldn't the macro loop through the configured columns only for the update part, but use all the columns in the relation for the insert?

@bramanscg that's a good catch! Paraphrasing to make sure I understand you correctly - an insert into an incremental model should always use all the columns in the destination table and so the config should only impact "update" behavior. As it stands, this PR just sets dest_columns and lets downstream logic handle the rest, and I'm not sure what behavior that would entail - I need to figure out how get_merge_sql uses dest_columns.

I wonder if there's also some complexity we need to think about with behavior for the insert_overwrite strategy on BigQuery.

@bramanscg commented

@prratek I wonder if the change needs to be made inside merge.sql (macro default__get_merge_sql: line 38)

{% if unique_key %}
when matched then update set
    {% set incremental_update_columns = config.get('incremental_update_columns', none) %}
    {% if incremental_update_columns is none %}
        {% set incremental_update_columns = dest_columns %}
    {% endif %}
    {% for column in incremental_update_columns -%}
        {{ adapter.quote(column.name) }} = DBT_INTERNAL_SOURCE.{{ adapter.quote(column.name) }}
        {%- if not loop.last %}, {%- endif %}
    {%- endfor %}
{% endif %}

@prratek (Contributor, Author) commented Feb 24, 2021

@prratek I wonder if the change needs to be made inside merge.sql (macro default__get_merge_sql: line 38)

{% if unique_key %}
when matched then update set
    {% set incremental_update_columns = config.get('incremental_update_columns', none) %}
    {% if incremental_update_columns is none %}
        {% set incremental_update_columns = dest_columns %}
    {% endif %}
    {% for column in incremental_update_columns -%}
        {{ adapter.quote(column.name) }} = DBT_INTERNAL_SOURCE.{{ adapter.quote(column.name) }}
        {%- if not loop.last %}, {%- endif %}
    {%- endfor %}
{% endif %}

Hmm, I'm not sure we'd want to modify default__get_merge_sql for something that isn't relevant to all adapters. Plus, do you know what macro actually gets called by [{{ adapter.dispatch('get_merge_sql')(...) }}](https://github.com/fishtown-analytics/dbt/blob/develop/core/dbt/include/global_project/macros/materializations/common/merge.sql#L4) for BigQuery and/or Snowflake? I imagine it's not always executing default__get_merge_sql.

Another possible approach could be to have get_merge_sql accept both dest_columns and update_columns and use that to vary behavior for BigQuery and Snowflake when relevant - something like the sketch below. But I'm also a little out of my depth now and could use some guidance. @jtcohen6?
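
For illustration, the dispatching macro might grow an extra argument, roughly like this (just a sketch; the update_columns argument is hypothetical):

{% macro get_merge_sql(target, source, unique_key, dest_columns, update_columns=none, predicates=none) -%}
    {# hypothetical: pass the configured subset through dispatch alongside the full column list #}
    {{ adapter.dispatch('get_merge_sql')(target, source, unique_key, dest_columns, update_columns, predicates) }}
{%- endmacro %}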

@prratek changed the title from "Gets columns to updated from config for BQ and Snowflake" to "Gets columns to update from config for BQ and Snowflake" Feb 24, 2021
@bramanscg commented

Fair enough. For our implementation, I am creating a separate macro "bigquery__get_merge_sql" and it's so far giving me the expected result.

{% macro bigquery__get_merge_sql(target, source, unique_key, dest_columns, predicates) -%}
{%- set predicates = [] if predicates is none else [] + predicates -%}
{%- set dest_cols_csv = get_quoted_csv(dest_columns | map(attribute="name")) -%}
{%- set sql_header = config.get('sql_header', none) -%}

{%- set incremental_updated_columns = config.get('incremental_updated_columns', none) -%}
{%- set incremental_updated_columns_exists = 'N' -%}
{% if incremental_updated_columns is none %}
     {% set incremental_updated_columns = dest_columns %}
{% else %}
    {%- set incremental_updated_columns_exists = 'Y' -%}
{% endif %}

{% if unique_key %}
    {% set unique_key_match %}
        DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}
    {% endset %}
    {% do predicates.append(unique_key_match) %}
{% else %}
    {% do predicates.append('FALSE') %}
{% endif %}

{{ sql_header if sql_header is not none }}

merge into {{ target }} as DBT_INTERNAL_DEST
    using {{ source }} as DBT_INTERNAL_SOURCE
    on {{ predicates | join(' and ') }}

{% if unique_key %}
when matched then update set
    {% for column in incremental_updated_columns -%}
        {% if incremental_updated_columns_exists == 'N' -%}
            {{ adapter.quote(column.name) }} = DBT_INTERNAL_SOURCE.{{ adapter.quote(column.name) }}
        {% else %}
            {{ adapter.quote(column) }} = DBT_INTERNAL_SOURCE.{{ adapter.quote(column) }}
        {% endif %}
        {%- if not loop.last %}, {%- endif %}
    {%- endfor %}
{% endif %}

when not matched then insert
    ({{ dest_cols_csv }})
values
    ({{ dest_cols_csv }})

{% endmacro %}

@jtcohen6 (Contributor) left a comment

Good point @bramanscg! We definitely want to use the full dest_columns when inserting new records, and the user-configured subset only when updating existing rows.

I think the right move here is updating default__get_merge_sql, which is used by both Snowflake and BigQuery—but not by any database that does not support merge, e.g. Postgres/Redshift.

This functionality is exclusive to the merge DML statement, and it's only relevant if a unique_key is defined. For instance, there's no effect for the Snowflake delete+insert or BigQuery insert_overwrite strategies. This feels obvious in hindsight, but it only just clicked for me. That makes me think a better name for this config would be merge_update_columns. What do you both think?

In any case, if we make the following changes, I think this will "just work":

$ git diff
diff --git a/core/dbt/include/global_project/macros/materializations/common/merge.sql b/core/dbt/include/global_project/macros/materializations/common/merge.sql
index b2f9c7f4..b3f1e0a4 100644
--- a/core/dbt/include/global_project/macros/materializations/common/merge.sql
+++ b/core/dbt/include/global_project/macros/materializations/common/merge.sql
@@ -18,6 +18,8 @@
 {% macro default__get_merge_sql(target, source, unique_key, dest_columns, predicates) -%}
     {%- set predicates = [] if predicates is none else [] + predicates -%}
     {%- set dest_cols_csv = get_quoted_csv(dest_columns | map(attribute="name")) -%}
+    {%- set update_columns = config.get('merge_update_columns',
+        default = dest_columns | map(attribute="name") | list) -%}
     {%- set sql_header = config.get('sql_header', none) -%}

     {% if unique_key %}
@@ -37,8 +39,8 @@

     {% if unique_key %}
     when matched then update set
-        {% for column in dest_columns -%}
-            {{ adapter.quote(column.name) }} = DBT_INTERNAL_SOURCE.{{ adapter.quote(column.name) }}
+        {% for column_name in update_columns -%}
+            {{ adapter.quote(column_name) }} = DBT_INTERNAL_SOURCE.{{ adapter.quote(column_name) }}
             {%- if not loop.last %}, {%- endif %}
         {%- endfor %}
     {% endif %}

Pro: get_merge_sql checks for the merge_update_columns config. We don't need to update the incremental materializations at all, and we don't need to change the signature of get_merge_sql at all.
Con: In our grand future vision of materializations, we want to move away from one-off macros constantly grabbing values off the config, and instead make the materialization the sole puller and pusher of all config values. We're not in that world just yet, and we're hardly digging ourselves a deeper hole here in the meantime.

@prratek Could you give this a go locally and see if it works for you with some test cases? If so, we can jam on how to write the integration tests for Snowflake and BigQuery.


@prratek marked this pull request as ready for review March 16, 2021 20:31
@jtcohen6 (Contributor) commented

Awesome work @prratek! As soon as you sign the CLA, I can give this a final review.

@prratek (Contributor, Author) commented Mar 17, 2021

Awesome work @prratek! As soon as you sign the CLA, I can give this a final review.

I did sign it, but after the check had already been marked as failed. Is there a way to get the bot to re-check?

@jtcohen6 (Contributor) commented

@cla-bot check


@cla-bot (bot) commented Mar 17, 2021

The cla-bot has been summoned, and re-checked this pull request!

@jtcohen6 (Contributor) commented

Oh, I think I see what happened: you submitted the form with your GitHub account name as @prratek instead of prratek. Sorry, that's a silly oversight by our bot! I'll see if I can get this fixed, but that might take a minute. In the meantime, would you mind resubmitting?

@prratek (Contributor, Author) commented Mar 17, 2021

@cla-bot check

@cla-bot (bot) commented Mar 17, 2021

The cla-bot has been summoned, and re-checked this pull request!

@cla-bot added the cla:yes label Mar 17, 2021
@prratek requested a review from jtcohen6 March 18, 2021 13:55
@jtcohen6 (Contributor) left a comment

@prratek Thanks for all the hard work on this!

@jtcohen6 merged commit 77c1071 into dbt-labs:develop Mar 22, 2021
@prratek (Contributor, Author) commented Mar 23, 2021

@prratek Thanks for all the hard work on this!

Thanks for your help along the way!

@prratek deleted the specify-cols-to-update branch March 23, 2021 03:47
@ramtej23 commented Jun 3, 2021

Can anyone suggest how to update only specific columns in dbt?

@jtcohen6 (Contributor) commented Jun 3, 2021

Hi @ramtej23, this feature will be included in the next version of dbt (v0.20.0): a new config merge_update_columns (docs)
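
For example (a minimal sketch; column names are illustrative):

{{ config(
    materialized = 'incremental',
    unique_key = 'id',
    merge_update_columns = ['email', 'updated_at']
) }}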

@ppo-38 commented Aug 24, 2021

Hi @jtcohen6, @prratek

I am interested in an option that does exactly the opposite of merge_update_columns.

My use case: when I run models, I always have a date that I don't want to update in merge statements because it is the creation date of the record. So I would like a merge_no_update_columns option on incremental models to preserve original creation dates when records are updated.

merge_no_update_columns would be easier to configure because if the table contains many columns, we only need to list the single excluded column in the parameter, and if the model gains new columns in the future, we don't have to modify the config.

It would be easy to add this new merge_no_update_columns option to the merge.sql macro:

{%- set no_update_columns = config.get('merge_no_update_columns', default = []) -%}
{# filter excluded columns up front so the trailing-comma logic stays correct #}
{%- set update_columns = dest_columns | rejectattr('name', 'in', no_update_columns) | list -%}

{% if unique_key %}
when matched then update set
    {%- for column in update_columns %}
        {{ adapter.quote(column.name) }} = DBT_INTERNAL_SOURCE.{{ adapter.quote(column.name) }}
        {%- if not loop.last %}, {%- endif %}
    {%- endfor %}
{% endif %}

However, it seems difficult to have these two options (merge_update_columns and merge_no_update_columns) in the same config. Do you have any ideas, and what was the use case for adding merge_update_columns?

Thank you

@jtcohen6 (Contributor) commented Aug 24, 2021

@ppo-38 Have you considered using a macro like dbt_utils.star to grab the list of columns you do want to update (everything except the ones to skip), and then passing those into merge_update_columns? So for instance:

{% set update_cols = dbt_utils.star(this, except = ['dont', 'update', 'these', 'columns']) %}

{{ config(
    materialized = 'incremental',
    merge_update_columns = update_cols
) }}

@ppo-38 commented Aug 24, 2021

@jtcohen6, thanks for your response - it's a very good idea!

With this solution, we have to create a new macro like star_array that returns an array of column names (because the original star macro only returns a string) - something like the sketch below.
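
A minimal sketch of such a macro (the name and implementation are hypothetical; it assumes adapter.get_columns_in_relation can resolve the relation at run time):

{% macro star_array(from, except=[]) %}
    {# hypothetical helper: like dbt_utils.star, but returns a list of names instead of a csv string #}
    {%- set columns = adapter.get_columns_in_relation(from) -%}
    {%- set names = [] -%}
    {%- for col in columns if col.name not in except -%}
        {%- do names.append(col.name) -%}
    {%- endfor -%}
    {{ return(names) }}
{% endmacro %}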

However, I'm new to Jinja syntax and I have an issue with this line: merge_update_columns = update_cols
The update_cols variable is empty in the target run model, even though I can print it and check that its value looks something like this: ['col1', 'col2']

So, what is the correct syntax to set merge_update_columns to the value of update_cols? I hope it's possible to assign the value in this config block! Thank you

@jtcohen6 (Contributor) commented

@ppo-38 Ah, you're right—I thought about this a bit more, and the solution I suggested above will not work. Because the star macro requires an introspective database query, it returns None at parse time, which is when dbt renders and stores config values.

So I think we're back to: you can certainly achieve this functionality by reimplementing the get_merge_sql macro in your local project, using a code snippet similar to the one you have above. I agree it's a bit confusing to define configs for both merge_update_columns and merge_no_update_columns.

@ppo-38 commented Aug 25, 2021

@jtcohen6 @prratek Finally, is it so confusing to have these two parameters in the same config? We could decide that merge_update_columns takes priority over merge_no_update_columns and say so in the docs (so if both parameters are defined, only merge_update_columns is used). Both options are genuinely useful, depending on the use case.

The new code would look like this (I have tested it and it works).

At the beginning of the default__get_merge_sql macro:

 {%- set update_columns = config.get('merge_update_columns', default = dest_columns | map(attribute="quoted") | list) -%}
 {%- set no_update_columns = config.get('merge_no_update_columns', default = []) -%}

Between "merge into" and "when not matched then insert" :

{% if unique_key %}
    when matched then update set
        {% if config.get('merge_update_columns') %}
            {%- for column_name in update_columns -%}
                {{ column_name }} = DBT_INTERNAL_SOURCE.{{ column_name }}
                {%- if not loop.last %}, {%- endif %}
            {%- endfor %}
        {% else %}
            {# filter excluded columns up front so the trailing-comma logic stays correct #}
            {%- set columns_to_set = dest_columns | rejectattr('name', 'in', no_update_columns) | list %}
            {%- for column in columns_to_set -%}
                {{ adapter.quote(column.name) }} = DBT_INTERNAL_SOURCE.{{ adapter.quote(column.name) }}
                {%- if not loop.last %}, {%- endif %}
            {%- endfor %}
        {% endif %}
{% endif %}
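
A model would then keep its creation date untouched with a config like this (illustrative column name):

{{ config(
    materialized = 'incremental',
    unique_key = 'id',
    merge_no_update_columns = ['created_at']
) }}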

Linked issue: Specify Columns to update on incremental Models (#1862)