[CT-602] [Feature] Support for more complex MERGE update rules #5215

Closed
cwelton opened this issue May 4, 2022 · 3 comments
Labels
enhancement (New feature or request), Team:Adapters (Issues designated for the adapter area of the code), wontfix (Not a bug or out of scope for dbt-core)

Comments

cwelton commented May 4, 2022

Is this your first time opening an issue?

Describe the Feature

Current incremental model MERGE semantics are limited.

Using merge_update_columns allows for statements of this form:

WHEN MATCHED THEN UPDATE SET 
   balance = NEW.balance

But it does not allow for statements of this form:

WHEN MATCHED THEN UPDATE SET 
   balance = OLD.balance + NEW.balance_delta

The latter can be important for many types of MERGE statements.
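
For illustration, here is roughly what the desired statement would look like in full, written as a Snowflake/BigQuery-style MERGE (the account_balances and ledger_deltas names, and the OLD/NEW aliases, are just placeholders for this example):

MERGE INTO account_balances AS OLD
USING ledger_deltas AS NEW
    ON OLD.account_id = NEW.account_id
WHEN MATCHED THEN UPDATE SET
    -- the update expression references both the existing row (OLD) and the incoming row (NEW)
    balance = OLD.balance + NEW.balance_delta
WHEN NOT MATCHED THEN
    INSERT (account_id, balance)
    VALUES (NEW.account_id, NEW.balance_delta);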

Describe alternatives you've considered

Option 1: Add an extra join to the definition of the incremental model.

This works, but it is substantially less efficient than producing an appropriate MERGE statement.
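
A rough sketch of this workaround, assuming a hypothetical ledger_deltas staging model and a loaded_at watermark column (names are placeholders, not the author's actual code):

{{ config(materialized='incremental', unique_key='account_id', incremental_strategy='merge') }}

with new_deltas as (
    select
        account_id,
        sum(balance_delta) as balance_delta,
        max(loaded_at)     as loaded_at
    from {{ ref('ledger_deltas') }}
    {% if is_incremental() %}
    -- only the entries that arrived since the last run
    where loaded_at > (select max(loaded_at) from {{ this }})
    {% endif %}
    group by account_id
)

select
    d.account_id,
    -- compute the new balance in the model query by joining back to the existing
    -- table, so the generated MERGE can stay a naive "replace the whole row"
    {% if is_incremental() -%}
    coalesce(existing.balance, 0) + d.balance_delta as balance,
    {%- else -%}
    d.balance_delta as balance,
    {%- endif %}
    d.loaded_at
from new_deltas as d
{% if is_incremental() %}
left join {{ this }} as existing
    on existing.account_id = d.account_id
{% endif %}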

Option 2: Create my own get_merge_sql macro

This also works, and is efficient, but the functionality is broadly useful and would be valuable to support natively in dbt.
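
The override itself would live in the project's macros/ directory, roughly along these lines (a skeleton only; the exact argument list of get_merge_sql depends on the dbt-core version, and project-level macros of the same name take precedence over the built-in one):

{% macro get_merge_sql(target, source, unique_key, dest_columns) %}
    {# custom merge SQL goes here, e.g. emitting
       balance = DBT_INTERNAL_DEST.balance + DBT_INTERNAL_SOURCE.balance_delta
       for configured columns, and the standard column list otherwise #}
{% endmacro %}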

Who will this benefit?

Incremental model builders who need more complicated MERGE behavior.

An example would be maintaining an account balance table from a double-entry bookkeeping ledger.

Are you interested in contributing this feature?

I have some working code, but I'm not sure whether there are better ways of doing it.

Anything else?

Example syntax of MERGE:

postgres: https://www.postgresql.org/message-id/attachment/23520/sql-merge.html
bigquery: https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#merge_statement
snowflake: https://docs.snowflake.com/en/sql-reference/sql/merge.html

All of them support column_name = expression, where the expression can reference both the source query and the target table.

cwelton added the enhancement (New feature or request) and triage labels May 4, 2022
github-actions bot changed the title from [Feature] Support for more complex MERGE update rules to [CT-602] [Feature] Support for more complex MERGE update rules May 4, 2022
jtcohen6 self-assigned this May 9, 2022
jtcohen6 added the Team:Adapters (Issues designated for the adapter area of the code) label May 9, 2022
jtcohen6 (Contributor) commented

@cwelton Thanks for opening the issue!

I do think the right answer here is option 2, creating your own get_merge_sql macro. I'll say a bit more about this below.

dbt holds a pretty strong belief that DDL + DML is boilerplate. It should be as "naive" as possible, and it should not be the place where business logic is defined and executed. It's next-to-impossible to "preview" your query results before actually updating the existing table. This is why I've opted for the join-based approach you mention, or a longer lookback window: I can actually run the SQL that returns the results, as a read-only query, and verify that they're correct before they are upserted/merged into the table. Even if that lookback/join is more expensive than performing the calculation within the update statement, it's much less expensive to reason about, for the human reading the code.
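
As a concrete (and entirely hypothetical) sketch of the lookback-window pattern: the business logic stays in the SELECT, and the whole thing can be run as a read-only query to preview results before the merge replaces the affected rows. Model and column names, and the Snowflake-style dateadd, are placeholders:

{{ config(materialized='incremental', unique_key='event_id', incremental_strategy='merge') }}

select
    event_id,
    account_id,
    amount,
    event_ts
from {{ ref('stg_events') }}
{% if is_incremental() %}
-- reprocess a trailing three-day window, so late-arriving or corrected rows
-- are recalculated here and the merge just replaces the affected rows
where event_ts >= dateadd('day', -3, (select max(event_ts) from {{ this }}))
{% endif %}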

The separation between transformation/business logic and materialization logic is, IMO, one of dbt-core's most powerful abstractions. It means that anyone who can write SQL that returns the right dataset, can ship that dataset into production, reliably and consistently. For that reason, I'm going to close the issue, as it's not the pattern I want to see us encouraging. But, as promised, I'll say more below :)


As a broader matter, we are not going to be able to support every possible configuration for every incremental processing situation with our built-in macros and configurations. That's okay, IMO. We should aim to:

  • support the most common and straightforward ones out of the box ("make the easy things easy")
  • enable users with advanced merging needs to have a clear override path

In the first category, we do support merge_update_columns as a config on BigQuery/Snowflake/Databricks—though even that pushes the limits of complexity a bit. Lots of people wanted it, it was fairly trivial to configure and to implement, and it solved a real need for many many folks.
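
For reference, that config looks like this on the supported adapters (model and column names are placeholders):

{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='account_id',
        merge_update_columns=['balance', 'updated_at']
    )
}}

Only the listed columns appear in the generated "when matched then update set" clause; other columns on matched rows are left untouched.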

The good news on the second bullet: this is work we've scoped to pick up very soon, and which we hope to ship in the next minor release of dbt-core (v1.2). As detailed in the "desirable outcomes" of #5089:

All "merge" macros (get_merge_sql, get_insert_delete_merge_sql, etc) share a function signature. They should support a common set of arguments, including predicates for custom extensions (see: #4546).

This should make it possible to get what you're after. I'm sure you've already written code in this vein:

models:
  - name: my_complex_incremental_model
    config:
      materialized: incremental
      incremental_strategy: merge
      custom_merge_update_behavior:
        - column_name: balance
          # "DEST" is "OLD", "SOURCE" is "NEW"
          expression: "DBT_INTERNAL_DEST.balance + DBT_INTERNAL_SOURCE.balance_delta"

...and then, inside your reimplementation of the merge macro, something like:

    {% set custom_merge_update_behavior = config.get('custom_merge_update_behavior') %}

    {% if unique_key %}
    when matched then update set
        {% for column_name in update_columns -%}
          {% if column_name in ... %}  {# pseudo code: this column has a custom expression configured #}
            {{ column_name }} = {{ custom_merge_update_behavior.expression }}
          {% else %}  {# standard behavior #}
            {{ column_name }} = DBT_INTERNAL_SOURCE.{{ column_name }}
          {% endif %}
          {%- if not loop.last %}, {%- endif %}
        {%- endfor %}
    {% endif %}

Knowing that the macro has a solidified signature, and predictable inputs and outputs, should give you that much more confidence about the custom reimplementation over the long run.


Aside: I don't think Postgres actually supports the merge statement!

The doc you linked is the one I find when I google for it, but it's for PostgreSQL 8.4devel, and I can't find any newer docs. Postgres 9.5 did add support for insert ... on conflict update, which would allow for the kind of custom merge-y behavior you're describing, where an old row is not directly replaced (set to new value, or deleted + re-inserted), but whereby the update itself recalculates the value from both the old + the new.
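
For example (table and column names are illustrative only):

INSERT INTO account_balances (account_id, balance)
VALUES (42, 100)
ON CONFLICT (account_id) DO UPDATE SET
    -- combine the existing row's value with the proposed row's value
    balance = account_balances.balance + EXCLUDED.balance;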


jtcohen6 added the wontfix (Not a bug or out of scope for dbt-core) label and removed the triage label May 13, 2022

ferki commented May 20, 2022

Aside: I don't think Postgres actually supports the merge statement!

@jtcohen6: at least not yet :) MERGE is expected to be part of the upcoming PostgreSQL 15 release (via PostgreSQL 15 Beta 1 Released!).

dongchris commented

Hi @jtcohen6, I'm interested in more functionality in the merge statement, particularly support for keep_oldest and keep_newest when an incoming row has the same merge keys as an existing row.

The current behavior is always_replace, where we keep the incoming row.

If we want to keep the existing row (never_replace), we can use merge_update_columns=[<merge_key>], so this works as well.

The last two, keep_newest and keep_oldest, are not currently supported; they would use a specified timestamp column to determine which row to keep.
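
As an illustration, keep_newest could be expressed by adding a predicate to the matched clause of the generated MERGE (all names here, including the updated_at timestamp column, are placeholders):

MERGE INTO account_balances AS DBT_INTERNAL_DEST
USING new_rows AS DBT_INTERNAL_SOURCE
    ON DBT_INTERNAL_DEST.account_id = DBT_INTERNAL_SOURCE.account_id
-- keep_newest: only replace the existing row when the incoming row is newer
WHEN MATCHED AND DBT_INTERNAL_SOURCE.updated_at > DBT_INTERNAL_DEST.updated_at
    THEN UPDATE SET
        balance = DBT_INTERNAL_SOURCE.balance,
        updated_at = DBT_INTERNAL_SOURCE.updated_at
WHEN NOT MATCHED THEN
    INSERT (account_id, balance, updated_at)
    VALUES (DBT_INTERNAL_SOURCE.account_id, DBT_INTERNAL_SOURCE.balance, DBT_INTERNAL_SOURCE.updated_at);

keep_oldest would simply flip the comparison (or omit the matched clause entirely).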

Let me know what you think!
