
Include a unique identifier as part of the temporary relation name #2881

Closed
samwedge opened this issue Nov 10, 2020 · 6 comments
Labels
enhancement (New feature or request), stale (Issues that have gone stale)

Comments

@samwedge

samwedge commented Nov 10, 2020

Describe the feature

We schedule our DBT models using Airflow. Each day, our models run with a run date that lets us populate all data for that specific day. We also trigger runs with "old" run dates, either to backfill a table with its full history or to re-run DBT for ad-hoc days to fix a data issue.

When DBT runs, it creates a temporary table whose name comprises the model name with the suffix __dbt_tmp. This means that if we run DBT for two different dates simultaneously, the runs interfere with each other: one run can delete the other's temp table, or load the wrong data into the database. Including a date or, more generically, some sort of UUID as part of the temporary name would prevent this issue.
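To illustrate the collision, here is roughly what two overlapping runs would both issue (the dataset and model names here are made up):

```sql
-- Run A (run date 2020-11-09) and run B (run date 2020-11-10)
-- both build the exact same temp relation name:
create or replace table analytics.my_model__dbt_tmp as
select * from analytics.events where event_date = '2020-11-09';

-- Whichever run executes this second replaces the other run's temp
-- table before the first run has merged it into analytics.my_model.
```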

Describe alternatives you've considered

We currently work around this by being very careful when running jobs. We need to ensure that we don't run the same model twice simultaneously.
We have also, in some instances, rewritten SQL to load all data between a specified date and the current date, which removes the need for "backfilling". But this only works for smaller datasets, and it means the DBT run isn't repeatable (the data inserted depends on the data already present in the table).
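For example, that rewrite looks something like this (hypothetical table and column names):

```sql
-- Load everything from a fixed start date up to the current date in one
-- pass, so no separate per-day backfill runs are needed.
select id, event_date, value
from analytics.events
where event_date between '2020-01-01' and current_date()
```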

Additional context

We use BigQuery. I imagine this is an issue for all databases.

Who will this benefit?

This will provide a speed improvement for anyone running DBT on a day-by-day basis, since they will be able to run multiple dates simultaneously. It will also help reduce the potential for conflicts.

Are you interested in contributing this feature?

Absolutely happy to create a PR for this. I may need some assistance to ensure adding this feature does not cause/exacerbate issues such as the one highlighted here: #2869

@samwedge added the enhancement and triage labels Nov 10, 2020
@jtcohen6 removed the triage label Nov 10, 2020
@jtcohen6
Contributor

Thanks for the detailed writeup @samwedge. This varies quite a bit depending on the database you're using, since it sits at the intersection of transactions and temporary tables.

When DBT runs, it creates a temporary table whose name comprises the model name with the suffix __dbt_tmp.

To my knowledge, the only times dbt creates tables on BigQuery with the suffix __dbt_tmp are:

  • incremental models with dynamic insert_overwrite strategy (here)
  • snapshots (here)

Those could be "real" temp tables (a relatively new feature on BigQuery), created and used inside of a script. At the moment, for historical reasons, they're still artificial temp tables: ordinary tables with a 12-hour time-to-live.
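For reference, this is the general shape of a "real" temp table inside a BigQuery script; the table and column names below are invented, and this is not the SQL dbt currently generates:

```sql
begin
  -- The temp table exists only for the lifetime of this script, so two
  -- concurrent runs can never collide on its name.
  create temp table source_rows as
  select id, value
  from my_project.analytics.stg_events
  where event_date = '2020-11-10';

  merge my_project.analytics.my_model t
  using source_rows s
  on t.id = s.id
  when matched then update set value = s.value
  when not matched then insert (id, value) values (s.id, s.value);
end;
```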

In the standard table and view materializations, dbt-bigquery uses create or replace, skipping the creation of a tmp object entirely.
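In other words, the table materialization reduces to a single atomic statement like the following (hypothetical names):

```sql
-- One atomic DDL statement; no intermediate __dbt_tmp object is created.
create or replace table my_project.analytics.my_model as
select id, event_date, value
from my_project.analytics.stg_events;
```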

By comparison, on Postgres:

  • We need to truncate identifiers to accommodate suffixes because of stringent length requirements. This isn't an issue on BQ.
  • We can create and use temp tables inside of atomic transactions, and they automatically disappear soon after, avoiding any risk of conflict (see the sketch after this list). Transactions aren't available on BQ; the closest thing we have is scripting.
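The Postgres pattern looks roughly like this; a sketch of the general shape, not dbt's exact generated DDL:

```sql
begin;

-- The temp table is session-local and dropped at commit, so concurrent
-- runs can never collide on its name.
create temp table my_model__dbt_tmp on commit drop as
select * from analytics.stg_events;

insert into analytics.my_model
select * from my_model__dbt_tmp;

commit;
```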

Recommendation

By design, dbt is really not intended to have multiple concurrent runs mutating the same database objects, especially on databases without transactions. One run could be replacing an object at the very moment it is an active dependency of another run's query. Instead, dbt supports concurrency via threading and environments.

We currently work around this by being very careful when running jobs. We need to ensure that we don't run the same model twice simultaneously.

I get that this is a constraint, but I still think it's a useful one. I highly recommend that, as much as you can, you orchestrate your backfill runs in Airflow such that they do not overlap with daily runs in the same production environment.

Resolution

Rather than adding a unique identifier to the temp suffix, I think the real resolution here is to switch the incremental and snapshot materializations to use "real" temp tables, run inside of scripts. (The incremental materialization already uses scripting; this would require some reworking or BQ-specific reimplementation of the snapshot materialization.) To my mind, that would eliminate any possibility of conflict on BigQuery entirely.

Let me know if that's something you'd be interested in taking on!

@samwedge
Author

Hi @jtcohen6, thanks for the recommended resolution and clarification on when temp tables are used. Using "real" temp tables seems like the best solution. I'm definitely interested in contributing. I'll get everything building locally and take a look.

@danielerapati

Hi,
we definitely run into a related problem on Redshift.
We have a setup similar to @samwedge's, with different Airflow jobs running dbt run with different configurations, sharing a few models, and running at different times.
Rarely (and, annoyingly, often when jobs have failed for an unrelated reason and need to be manually restarted from Airflow), two jobs clash on the same model and hit the same __dbt_backup table/view (I think transactions prevent the same thing happening to the __dbt_tmp object).

A solution would be a macro that we can override to choose how the temporary and backup objects are named, similar to what we can already do for schemas or aliases.
The view materialization here and the table materialization here (and other materializations too) would then call that macro instead of hard-coding the suffix.
I noticed there is something similar called make_temp_relation, but my impression is that only the Snowflake plugin uses it, and in a slightly different setting.
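A rough sketch of what such an overridable naming macro might look like; the macro name and signature here are hypothetical, and the materializations would need to be updated to call it:

```sql
{# Hypothetical project-overridable macro: centralizes suffix naming so a
   project can customize it instead of relying on a hard-coded '__dbt_tmp'
   or '__dbt_backup'. #}
{% macro generate_suffixed_relation(base_relation, suffix) %}
    {% set identifier = base_relation.identifier ~ suffix %}
    {% do return(base_relation.incorporate(path={"identifier": identifier})) %}
{% endmacro %}
```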

@github-actions
Contributor

github-actions bot commented Mar 8, 2022

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions bot added the stale label Mar 8, 2022
@meisam-napster

Hi, running into the same issue now and wondering if there are any updates or solutions for this scenario? Will DBT support "real" temp tables for BigQuery, as opposed to creating tables with "__dbt_tmp" at the end, or is there another solution?

@Dikootje

Hi, running into the same issue now and wondering if there are any updates or solutions for this scenario? Will DBT support "real" temp tables for BigQuery, as opposed to creating tables with "__dbt_tmp" at the end, or is there another solution?

FYI: On Snowflake I've run into a similar issue. We now work around it by overriding the macros make_intermediate_relation and make_temp_relation and using the invocation_id (without the dashes) as a suffix instead.
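A minimal sketch of that kind of override, assuming a project-level macro shadows the built-in default (invocation_id is dbt's per-run UUID):

```sql
{% macro make_temp_relation(base_relation, suffix='__dbt_tmp') %}
    {# Append the per-run invocation_id (dashes stripped) so concurrent runs
       of the same model build differently named temp relations. #}
    {% set unique_suffix = suffix ~ '_' ~ (invocation_id | replace('-', '')) %}
    {% set tmp_identifier = base_relation.identifier ~ unique_suffix %}
    {% do return(base_relation.incorporate(path={"identifier": tmp_identifier})) %}
{% endmacro %}
```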
