[CT-368] Race conditions in dbt seed. #112
Comments
Hi, @andrebaaij thanks for your detailed report and the steps to reproduce the bug. I can confirm that this is indeed possible today and is not the desired behavior. I believe your proposed solution of including |
Adding the Here are the two lines that we ultimately want in the same transaction. There are a couple of ways to go about this. We could implement this entirely within the snowflake adapter by duplicating some of this macro code in the snowflake adapter (acceptable), or we could make a change to the macro in dbt-core (preferred). |
Is this related in any way to the change introduced in dbt-labs/dbt-core#3510? Considering the comment at dbt-labs/dbt-core#3480 (comment), what would be the recommended behavior to avoid this kind of issue? For extra context, the issue explained in #135, which was merged with this one, includes a simpler scenario to test and validate. (cc @jtcohen6) |
@adamantike I think you're right. dbt-labs/dbt-core#3510 disabled transactions on Snowflake by default, since they were often unnecessary, ineffective, and expensive. This is a case where we still want one, and we should do that by wrapping the joined-together DML statements in explicit dbt-snowflake/dbt/include/snowflake/macros/adapters.sql Lines 251 to 265 in 24482ea
We just need
To @nathaniel-may's point, the quickest solution here would just be to copy-paste the default seed materialization into your project, and add two lines here:
{% call noop_statement('main', code ~ ' ' ~ rows_affected, code, rows_affected) %}
  begin;
  {{ create_table_sql }};
  -- dbt seed --
  {{ sql }};
  commit;
{% endcall %}
The preferable solution, and one we'd be willing to merge, would avoid duplicating unnecessary materialization logic from
In
{% macro get_csv_sql(create_or_truncate_sql, insert_sql) %}
  {{ adapter.dispatch('get_csv_sql', 'dbt')(create_or_truncate_sql, insert_sql) }}
{% endmacro %}
{% macro default__get_csv_sql(create_or_truncate_sql, insert_sql) %}
  {{ create_or_truncate_sql }};
  -- dbt seed --
  {{ insert_sql }}
{% endmacro %}
Within the default
{% call noop_statement('main', code ~ ' ' ~ rows_affected, code, rows_affected) %}
  {{ get_csv_sql(create_table_sql, seed) }};
{% endcall %}
Then within
{% macro snowflake__get_csv_sql(create_or_truncate_sql, insert_sql) %}
  {% set dml = snowflake_dml_explicit_transaction(default__get_csv_sql(create_or_truncate_sql, insert_sql)) %}
  {{ return(dml) }}
{% endmacro %} |
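To make the intended result of the proposal above concrete, here is a minimal Python sketch of what "wrap the joined statements in an explicit transaction" amounts to at the string level. The function name `wrap_in_explicit_transaction` and the example table `my_schema.my_seed` are hypothetical and exist only for illustration; this is not the dbt or dbt-snowflake implementation, and it says nothing about whether the wrapped script is sufficient on Snowflake (the thread revisits that point further down).

```python
from typing import Iterable


def wrap_in_explicit_transaction(statements: Iterable[str]) -> str:
    """Join SQL statements into one semicolon-delimited script inside begin/commit.

    Hypothetical helper for illustration only; not part of dbt.
    """
    # Normalize each statement: trim whitespace and any trailing semicolon.
    cleaned = [s.strip().rstrip(";").strip() for s in statements]
    cleaned = [s for s in cleaned if s]
    if not cleaned:
        raise ValueError("need at least one statement")
    lines = ["begin;"] + [f"{s};" for s in cleaned] + ["commit;"]
    return "\n".join(lines)


# Example inputs are assumptions standing in for the rendered seed SQL.
script = wrap_in_explicit_transaction([
    "truncate table my_schema.my_seed",
    "insert into my_schema.my_seed (id, name) values (1, 'a'), (2, 'b');",
])
print(script)
```

The point of sending both statements as one chunk is that they share a session, so the `begin` and `commit` can bracket them.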
This new macro, `get_csv_sql`, was suggested by @jtcohen6 at dbt-labs/dbt-snowflake#112 (comment). The underlying reason is that the `dbt-snowflake` package needs to run these queries inside a transaction, but with the current implementation the fix would require duplicating the entire default seed materialization logic in the package.
@adamantike Update on my comment above: See where it says So just wrapping the SQL returned by those macros won't be enough. We need to actually wrap the initial call to each of those macros in a shared transaction ( New idea: We're already wrapping and returning the entire default dbt-snowflake/dbt/include/snowflake/macros/materializations/seed.sql Lines 39 to 47 in ae31719
What if we tried making that:
{% materialization seed, adapter='snowflake' %}
  {% set original_query_tag = set_query_tag() %}
  {{ run_query('begin') }}
  {% set relations = materialization_seed_default() %}
  {% do unset_query_tag(original_query_tag) %}
  {{ run_query('commit') }}
  {{ return(relations) }}
{% endmaterialization %}
In the event that the seed doesn't yet exist, or is
I'm just not sure that the connection / session (and therefore the transaction) will properly persist across those queries, since they're being run separately, and not as one semicolon-delimited chunk. Update: I just tried locally with two concurrent |
Chatted with @matt-winkler a few weeks ago about potentially using |
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days. |
We should explore if |
@jtcohen6 have you found a way to do it atomically? is this issue solved? |
Describe the bug
When running `dbt seed` in multiple jobs, a race condition can and does occur. As a result, the table ends up with duplicate records.
The core issue is that the `truncate` and `insert` happen in different transactions.
Steps To Reproduce
Run multiple `dbt seed` commands for the same table at the same time. A race condition is bound to occur.
Expected behavior
If the truncate and insert were in one transaction, as they should be, this issue could not occur. The expected behaviour is no race conditions when seeding the same table from multiple `dbt seed` commands running at the same time.
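The duplication described above can be illustrated with a small deterministic simulation. This is only a sketch under stated assumptions: the table is modeled as an in-memory Python list, the two jobs are named A and B, and `SEED_ROWS` and `run_schedule` are hypothetical names. It does not talk to Snowflake and is not a reconstruction of the actual query log; it only shows why one possible interleaving of separately committed truncate and insert steps doubles the rows.

```python
from typing import List, Tuple

Row = Tuple[int, str]

# Stand-in for the contents of the seed CSV (assumed data).
SEED_ROWS: List[Row] = [(1, "a"), (2, "b")]


def run_schedule(schedule: List[str]) -> List[Row]:
    """Apply seed steps to a shared table in the given order.

    Each entry is "<job>:<step>", where step is "truncate" or "insert".
    Every step takes effect immediately, as if each statement were
    committed in its own transaction.
    """
    table: List[Row] = list(SEED_ROWS)  # the table was already seeded once
    for entry in schedule:
        _job, step = entry.split(":")
        if step == "truncate":
            table.clear()
        elif step == "insert":
            table.extend(SEED_ROWS)
        else:
            raise ValueError(f"unknown step: {step}")
    return table


# Interleaved: both jobs truncate before either one inserts.
interleaved = run_schedule(["A:truncate", "B:truncate", "A:insert", "B:insert"])

# Serialized: each job's truncate and insert run back to back, the ordering
# the issue expects when both steps belong to one transaction.
serialized = run_schedule(["A:truncate", "A:insert", "B:truncate", "B:insert"])

print(len(interleaved))  # 4: every seed row appears twice
print(len(serialized))   # 2: one copy of each seed row
```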
Screenshots and log output
I have added the following log output from our Snowflake instance; table names and session IDs have been anonymized:
System information
The output of `dbt --version`: dbt 1.0.3
The operating system you're using: dbt cloud