[Bug] Cartesian Join based deletion is causing performance problems when it hits a certain scale for microbatch models #10863

Open · Tracked by #10624
graciegoheen opened this issue Oct 16, 2024 · 0 comments
Labels: bug, microbatch
Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Cartesian join based deletion is causing data to spill to disk, which heavily bogs down performance.

The delete statement looks like:

delete from analytics_dev.dbt_aescay.my_model DBT_INTERNAL_TARGET
    using analytics_dev.dbt_aescay.my_model__dbt_tmp DBT_TMP_TARGET
    where (
    DBT_INTERNAL_TARGET.event_at >= TIMESTAMP '2024-10-14 00:00:00+00:00'
    and DBT_INTERNAL_TARGET.event_at < TIMESTAMP '2024-10-15 00:00:00+00:00'
    
    );

But we are not doing anything with my_model__dbt_tmp in the where clause, so the using clause amounts to a pure Cartesian join against the temp table.

We can simplify this logic and improve performance by instead doing:

delete from <existing> where <date range>;
insert into <existing> from <new data for same date range>;
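
A minimal sketch of that pattern, reusing the table names and batch boundaries from the generated statement above (the temp table is assumed to hold the freshly built data for this batch's date range):

-- delete the batch's date range from the target, then insert the rebuilt batch
delete from analytics_dev.dbt_aescay.my_model
    where event_at >= TIMESTAMP '2024-10-14 00:00:00+00:00'
    and event_at < TIMESTAMP '2024-10-15 00:00:00+00:00';

insert into analytics_dev.dbt_aescay.my_model
    select * from analytics_dev.dbt_aescay.my_model__dbt_tmp;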

One advantage of microbatch is that we know the exact boundaries (the time range) of every batch in advance, as with the "static" insert_overwrite strategy.

If we ever support "microbatch merge" models (i.e., updating batches by upserting on unique_key rather than replacing each batch wholesale), then we would want to join (using) on a unique_key match, like so:

delete from analytics_dev.dbt_aescay.my_model DBT_INTERNAL_TARGET
    using analytics_dev.dbt_aescay.my_model__dbt_tmp DBT_TMP_TARGET
    where DBT_INTERNAL_TARGET.event_id = DBT_TMP_TARGET.event_id
    and (
    DBT_INTERNAL_TARGET.event_at >= TIMESTAMP '2024-10-14 00:00:00+00:00'
    and DBT_INTERNAL_TARGET.event_at < TIMESTAMP '2024-10-15 00:00:00+00:00'
    
    );

But this shouldn't be the default assumption.

Expected Behavior

We should delete the using clause from the delete statement that dbt generates, since the where clause never references the temp table.
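
For the example above, the generated statement would then reduce to something like:

delete from analytics_dev.dbt_aescay.my_model DBT_INTERNAL_TARGET
    where (
    DBT_INTERNAL_TARGET.event_at >= TIMESTAMP '2024-10-14 00:00:00+00:00'
    and DBT_INTERNAL_TARGET.event_at < TIMESTAMP '2024-10-15 00:00:00+00:00'
    );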

Steps To Reproduce

See here.

Relevant log output

No response

Environment

- OS:
- Python:
- dbt:

Which database adapter are you using with dbt?

snowflake

Additional Context

No response
