
Include a unique identifier as part of the temporary relation name #2881

Closed
samwedge opened this issue Nov 10, 2020 · 6 comments
Labels
enhancement (New feature or request), stale (Issues that have gone stale)

Comments

@samwedge

samwedge commented Nov 10, 2020

Describe the feature

We schedule our DBT models using Airflow. Each day, our models run with a run date that lets us populate all data for that specific day. We also trigger runs with "old" run dates, either to backfill a table with its full history or to re-run DBT for ad-hoc days to fix a data issue.

When DBT runs, it creates a temporary table whose name comprises the model name with the suffix __dbt_tmp. This means that if we run DBT for two different dates simultaneously, the runs interfere with each other: one run can delete the other's temp table, or load the wrong data into the database. Including a date or, more generically, some sort of UUID as part of the temporary name would prevent this issue.
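To illustrate the collision, here is roughly what two overlapping runs would both issue (the dataset and model names here are made up):

```sql
-- Run A (run date 2020-11-09) and run B (run date 2020-11-10)
-- both build the exact same temp relation name:
create or replace table analytics.my_model__dbt_tmp as
select * from analytics.events where event_date = '2020-11-09';

-- Whichever run executes this second replaces the other run's temp
-- table before the first run has merged it into analytics.my_model.
```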

Describe alternatives you've considered

We currently work around this by being very careful when running jobs. We need to ensure that we don't run the same model twice simultaneously.
We have also, in some instances, rewritten SQL to load all data between a specified date and the current date, which removes the need for "backfilling". But this only works for smaller datasets, and it means the DBT run isn't repeatable (the data inserted depends on the data already present in the table).
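For example, that rewrite looks something like this (hypothetical table and column names):

```sql
-- Load everything from a fixed start date up to the current date in one
-- pass, so no separate per-day backfill runs are needed.
select id, event_date, value
from analytics.events
where event_date between '2020-01-01' and current_date()
```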

Additional context

We use BigQuery. I imagine this is an issue for all databases.

Who will this benefit?

This will provide a speed improvement for anyone running DBT on a day-by-day basis, since they will be able to run multiple dates simultaneously. It will also help reduce the potential for conflicts.

Are you interested in contributing this feature?

Absolutely happy to create a PR for this. I may need some assistance to ensure adding this feature does not cause/exacerbate issues such as the one highlighted here: #2869

@samwedge added the enhancement and triage labels Nov 10, 2020
@jtcohen6 removed the triage label Nov 10, 2020
@jtcohen6
Contributor

Thanks for the detailed writeup @samwedge. This varies quite a bit depending on the database you're using, since it sits at the intersection of transactions and temporary tables.

When DBT runs, it creates a temporary table whose name comprises the model name with the suffix __dbt_tmp.

To my knowledge, the only times dbt creates tables on BigQuery with the suffix __dbt_tmp are:

  • incremental models with dynamic insert_overwrite strategy (here)
  • snapshots (here)

Those could be "real" temp tables (a relatively new feature on BigQuery), created and used inside of a script. At the moment, for historical reasons, they're still artificial temp tables: ordinary tables with a 12-hour time-to-live.
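For reference, this is the general shape of a "real" temp table inside a BigQuery script; the table and column names below are invented, and this is not the SQL dbt currently generates:

```sql
begin
  -- The temp table exists only for the lifetime of this script, so two
  -- concurrent runs can never collide on its name.
  create temp table source_rows as
  select id, value
  from my_project.analytics.stg_events
  where event_date = '2020-11-10';

  merge my_project.analytics.my_model t
  using source_rows s
  on t.id = s.id
  when matched then update set value = s.value
  when not matched then insert (id, value) values (s.id, s.value);
end;
```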

In the standard table and view materializations, dbt-bigquery uses create or replace, skipping the creation of a tmp object entirely.
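In other words, the table materialization reduces to a single atomic statement like the following (hypothetical names):

```sql
-- One atomic DDL statement; no intermediate __dbt_tmp object is created.
create or replace table my_project.analytics.my_model as
select id, event_date, value
from my_project.analytics.stg_events;
```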

By comparison, on Postgres:

  • We need to truncate identifiers to accommodate suffixes because of stringent length requirements. This isn't an issue on BQ.
  • We can create and use temp tables inside of atomic transactions, and they automatically disappear soon after, avoiding any risk of conflict (see the sketch after this list). Transactions aren't available on BQ; the closest thing we have is scripting.
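The Postgres pattern looks roughly like this; a sketch of the general shape, not dbt's exact generated DDL:

```sql
begin;

-- The temp table is session-local and dropped at commit, so concurrent
-- runs can never collide on its name.
create temp table my_model__dbt_tmp on commit drop as
select * from analytics.stg_events;

insert into analytics.my_model
select * from my_model__dbt_tmp;

commit;
```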

Recommendation

By design, dbt is really not intended to have multiple concurrent runs mutating the same database objects, especially on databases without transactions. One run could be replacing an object at the very moment it is an active dependency of another run's query. Instead, dbt supports concurrency via threading and environments.

We currently work around this by being very careful when running jobs. We need to ensure that we don't run the same model twice simultaneously.

I get that this is a constraint, but I still think it's a useful one. I highly recommend that, as much as you can, you orchestrate your backfill runs in Airflow such that they do not overlap with daily runs in the same production environment.

Resolution

Rather than adding a unique identifier to the temp suffix, I think the real resolution here is to switch the incremental and snapshot materializations to use "real" temp tables, run inside of scripts. (The incremental materialization already uses scripting; this would require some reworking or BQ-specific reimplementation of the snapshot materialization.) To my mind, that would eliminate any possibility of conflict on BigQuery entirely.

Let me know if that's something you'd be interested in taking on!

@samwedge
Author

Hi @jtcohen6, thanks for the recommended resolution and clarification on when temp tables are used. Using "real" temp tables seems like the best solution. I'm definitely interested in contributing. I'll get everything building locally and take a look.

@danielerapati

Hi,
we definitely run into a related problem on Redshift.
We have a setup similar to @samwedge's, with different Airflow jobs running dbt run with different configurations, sharing a few models, and running at different times.
Rarely (and, annoyingly, often when jobs have failed for an unrelated reason and need to be manually restarted from Airflow), two jobs clash on the same model and hit the same __dbt_backup table/view (I think transactions prevent the same thing happening to the __dbt_tmp object).

A solution would be a macro that we can override to choose how the temporary and backup objects are named, similar to what we can already do for schemas or aliases.
The view materialization here and the table materialization here (and other materializations too) would then call that macro instead of hard-coding the suffix.
I noticed there is something similar called make_temp_relation, but my impression is that only the Snowflake plugin uses it, and in a slightly different setting.
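A rough sketch of what such an overridable naming macro might look like; the macro name and signature here are hypothetical, and the materializations would need to be updated to call it:

```sql
{# Hypothetical project-overridable macro: centralizes suffix naming so a
   project can customize it instead of relying on a hard-coded '__dbt_tmp'
   or '__dbt_backup'. #}
{% macro generate_suffixed_relation(base_relation, suffix) %}
    {% set identifier = base_relation.identifier ~ suffix %}
    {% do return(base_relation.incorporate(path={"identifier": identifier})) %}
{% endmacro %}
```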

@github-actions
Contributor

github-actions bot commented Mar 8, 2022

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions bot added the stale label Mar 8, 2022
@meisam-napster

Hi, running into the same issue now and wondering if there are any updates or solutions for this scenario? Will DBT support "real" temp tables for BigQuery, as opposed to creating tables with "__dbt_tmp" at the end, or is there another solution?

@Dikootje

Hi, running into the same issue now and wondering if there are any updates or solutions for this scenario? Will DBT support "real" temp tables for BigQuery, as opposed to creating tables with "__dbt_tmp" at the end, or is there another solution?

FYI: On Snowflake I've run into a similar issue. We now work around it by overriding the macros make_intermediate_relation and make_temp_relation and using the invocation_id (without the dashes) as a suffix instead.
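A minimal sketch of that kind of override, assuming a project-level macro shadows the built-in default (invocation_id is dbt's per-run UUID):

```sql
{% macro make_temp_relation(base_relation, suffix='__dbt_tmp') %}
    {# Append the per-run invocation_id (dashes stripped) so concurrent runs
       of the same model build differently named temp relations. #}
    {% set unique_suffix = suffix ~ '_' ~ (invocation_id | replace('-', '')) %}
    {% set tmp_identifier = base_relation.identifier ~ unique_suffix %}
    {% do return(base_relation.incorporate(path={"identifier": tmp_identifier})) %}
{% endmacro %}
```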
