Include a unique identifier as part of the temporary relation name #2881
Comments
Thanks for the detailed writeup @samwedge. This varies quite a bit depending on the database you're using, since it sits at the intersection of transactions and temporary tables.
To my knowledge, the only times dbt creates tables on BigQuery with the suffix `__dbt_tmp` are in the incremental and snapshot materializations. Those could be "real" temp tables (a relatively new feature on BigQuery), created and used inside of a script. At the moment, they're still artificial temp tables (12-hour time-to-live) for historical reasons. In the standard table and view materializations, dbt-bigquery uses `create or replace` DDL, so no temporary objects are needed. By comparison, on Postgres, dbt wraps each model's DDL in a transaction, so the `__dbt_tmp` and `__dbt_backup` relations it creates are swapped and dropped atomically within that transaction.
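For illustration only (this is not dbt's actual generated SQL, and all table and column names here are invented), a BigQuery script using a "real" temp table would look roughly like this:

```sql
-- Hypothetical sketch: a BigQuery script using a script-scoped temp table.
-- The temp table is visible only inside this script, so two concurrent
-- runs cannot collide on the relation name the way shared __dbt_tmp
-- tables can.
BEGIN
  CREATE TEMP TABLE my_model__dbt_tmp AS
  SELECT id, value
  FROM my_project.my_dataset.raw_events
  WHERE event_date = DATE '2020-11-01';

  MERGE my_project.my_dataset.my_model AS target
  USING my_model__dbt_tmp AS source
  ON target.id = source.id
  WHEN MATCHED THEN
    UPDATE SET value = source.value
  WHEN NOT MATCHED THEN
    INSERT (id, value) VALUES (source.id, source.value);
END;
```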
Recommendation

By design, dbt really is not intended to have multiple concurrent runs mutating the same database objects, especially on databases without transactions. You could be running a query to replace an object in one run while it's an active dependency for another run's query. Instead, dbt supports concurrency via threading and environments.
I get that this is a constraint, but I still think it's a useful one. I highly recommend that, as much as you can, you orchestrate your backfill runs in Airflow such that they do not overlap with daily runs in the same production environment.

Resolution

Rather than adding a unique identifier to the temp suffix, I think the real resolution here is to switch the incremental and snapshot materializations to use "real" temp tables, run inside of scripts. (The incremental materialization already uses scripting; this would require some reworking or BQ-specific reimplementation of the snapshot materialization.) To my mind, that would solve any possibility of conflict on BigQuery entirely. Let me know if that's something you'd be interested in taking on!
Hi @jtcohen6, thanks for the recommended resolution and clarification on when temp tables are used. Using "real" temp tables seems like the best solution. I'm definitely interested in contributing. I'll get everything building locally and take a look.
Hi, a possible solution would be a macro that we can override to choose how the temporary and backup objects are named, similar to what we can already do for schemas or aliases.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Hi, running into the same issue now and wondering if there are any updates or solutions for this scenario. Will DBT support "real" temp tables for BigQuery, as opposed to creating tables with `__dbt_tmp` at the end, or is there another solution?
FYI: on Snowflake I've run into a similar issue. We now work around this by overriding the macros that generate the temporary and backup relation names.
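For anyone exploring that route, here is a minimal sketch of such an override. It assumes dbt's built-in `make_temp_relation` macro as defined in recent dbt-core versions (the name and signature may differ in yours), and uses the run's `invocation_id` to make the suffix unique; treat it as an illustration, not a tested recipe.

```sql
{# Hedged sketch: override dbt's default temp-relation naming so that each
   dbt invocation gets its own suffix. A macro with this name in the root
   project shadows the global default. #}
{% macro make_temp_relation(base_relation, suffix='__dbt_tmp') %}
    {# invocation_id is a UUID identifying the current dbt run; strip the
       dashes so the suffix stays a valid identifier on most warehouses. #}
    {% set unique_suffix = suffix ~ '_' ~ (invocation_id | replace('-', '')) %}
    {% set tmp_identifier = base_relation.identifier ~ unique_suffix %}
    {{ return(base_relation.incorporate(path={"identifier": tmp_identifier})) }}
{% endmacro %}
```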
Describe the feature
We schedule our DBT models using Airflow. Each day, our models run with a run date that allows us to populate all data for that specific day. We also trigger runs with "old" run dates, which lets us run DBT with historic dates to backfill a table with its full history, or sometimes run it for ad-hoc days to fix a data issue.
When DBT runs, it creates a temporary table whose name comprises the model name with the suffix `__dbt_tmp`. This means that if we were to run DBT for two different dates simultaneously, the runs would interfere with each other, with one run deleting another run's temp table or loading the wrong data into the database. Including a date or, more generically, some sort of UUID in the temporary name would prevent this issue from occurring.

Describe alternatives you've considered
We currently work around this by being very careful when running jobs. We need to ensure that we don't run the same model twice simultaneously.
We have also, in some instances, rewritten SQL to load all data between a specified date and the current date, which precludes the need for backfilling. But this only works for smaller datasets, and it also means the DBT run isn't repeatable (the data inserted depends on the data already present in the table).
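As a concrete illustration of the collision described above (the model and source names here are invented), consider an incremental model parameterized by a run date:

```sql
-- Hypothetical incremental model parameterized by a run_date variable.
-- Two simultaneous invocations, e.g.
--   dbt run --vars 'run_date: 2020-11-01'
--   dbt run --vars 'run_date: 2019-03-15'
-- would both build a temp relation named <model>__dbt_tmp and so could
-- clobber each other's temp table mid-run.
{{ config(materialized='incremental', unique_key='id') }}

select id, event_date, value
from {{ source('raw', 'events') }}
where event_date = '{{ var("run_date") }}'
```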
Additional context
We use BigQuery. I imagine this is an issue for all databases.
Who will this benefit?
This will provide a speed improvement for anyone running DBT on a day-by-day basis, as they will be able to run multiple dates simultaneously. It will also help reduce the potential for conflicts.
Are you interested in contributing this feature?
Absolutely happy to create a PR for this. I may need some assistance to ensure adding this feature does not cause/exacerbate issues such as the one highlighted here: #2869