Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-368] Race conditions in dbt seed. #112

Open
andrebaaij opened this issue Mar 16, 2022 · 9 comments
Open

[CT-368] Race conditions in dbt seed. #112

andrebaaij opened this issue Mar 16, 2022 · 9 comments
Labels
bug Something isn't working help_wanted Extra attention is needed

Comments

@andrebaaij
Copy link

Describe the bug

When running dbt seed in multiple jobs, a race condition can and does occur, the following happens:

  • Seed session 1: Truncate table
  • Seed session 2: Truncate table
  • Seed session 1: insert seed
  • Seed session 2 insert seed

As a result, duplicate records.
The core issue is that the truncate and insert happen in different transactions.

Steps To Reproduce

Run multiple dbt seed commands for the same table at the same time. A race condition is bound to occur.

Expected behavior

If truncate and insert were in one transaction, as they should be, this issue would and could not occur. The expected behaviour is no race conditions when seeding the same table from multiple dbt seed commands running at the same time.

Screenshots and log output

I have added the following log out of our snowflake instance: table names and session ids have been made anonymous:

SESSION_ID QUERY_TYPE QUERY_TEXT START_TIME END_TIME
2 COMMIT COMMIT 2022-03-15 23:06:01.609000 -07:00 2022-03-15 23:06:02.085000 -07:00
1 COMMIT COMMIT 2022-03-15 23:06:01.496000 -07:00 2022-03-15 23:06:01.792000 -07:00
2 INSERT insert into table_seed_a... 2022-03-15 23:06:00.939000 -07:00 2022-03-15 23:06:01.533000 -07:00
2 BEGIN_TRANSACTION BEGIN 2022-03-15 23:06:00.783000 -07:00 2022-03-15 23:06:00.856000 -07:00
1 INSERT insert into table_seed_a... 2022-03-15 23:06:00.435000 -07:00 2022-03-15 23:06:01.414000 -07:00
2 COMMIT commit; 2022-03-15 23:06:00.289000 -07:00 2022-03-15 23:06:00.688000 -07:00
1 BEGIN_TRANSACTION BEGIN 2022-03-15 23:06:00.243000 -07:00 2022-03-15 23:06:00.348000 -07:00
2 TRUNCATE_TABLE truncate table_seed_a... 2022-03-15 23:06:00.034000 -07:00 2022-03-15 23:06:00.213000 -07:00
2 BEGIN_TRANSACTION begin; 2022-03-15 23:05:59.888000 -07:00 2022-03-15 23:05:59.956000 -07:00
1 COMMIT commit; 2022-03-15 23:05:59.698000 -07:00 2022-03-15 23:06:00.125000 -07:00
1 TRUNCATE_TABLE truncate table_seed_a... 2022-03-15 23:05:59.389000 -07:00 2022-03-15 23:05:59.618000 -07:00
1 BEGIN_TRANSACTION begin; 2022-03-15 23:05:59.251000 -07:00 2022-03-15 23:05:59.306000 -07:00

System information

The output of dbt --version:
dbt 1.0.3

The operating system you're using:
dbt cloud

@andrebaaij andrebaaij added bug Something isn't working triage labels Mar 16, 2022
@github-actions github-actions bot changed the title Race conditions in dbt seed. [CT-368] Race conditions in dbt seed. Mar 16, 2022
@nathaniel-may
Copy link
Contributor

Hi, @andrebaaij thanks for your detailed report and the steps to reproduce the bug. I can confirm that this is indeed possible today and is not the desired behavior. I believe your proposed solution of including truncate and insert in the same Snowflake transaction would solve the problem. Snowflake transactions only have the isolation level of READ COMMITTED (docs) but that is enough to resolve this particular issue.

@nathaniel-may
Copy link
Contributor

nathaniel-may commented Apr 19, 2022

Adding the help_wanted label in case anyone has the bandwidth to work on this before we can get to it.

Here are the two lines that we ultimately want in the same transaction.

There are a couple of ways to go about this. We could implement this entirely within the snowflake adapter by duplicating some of this macro code in the snowflake adapter (acceptable), or we could make a change to the macro in dbt-core (preferred).

@nathaniel-may nathaniel-may added the help_wanted Extra attention is needed label Apr 19, 2022
@adamantike
Copy link

Is this related in any way to the change introduced in dbt-labs/dbt-core#3510? Considering the comment at dbt-labs/dbt-core#3480 (comment), what would be the recommended behavior to avoid this kind of issues?

For extra context, the issue explained in #135, and merged with this one, includes a simpler scenario to test and validate.

(cc @jtcohen6)

@jtcohen6
Copy link
Contributor

@adamantike I think you're right. dbt-labs/dbt-core#3510 disabled transactions on Snowflake by default, since they were often unnecessary, ineffective, and expensive. This is a case where we still want one, and we should do that by wrapping the joined-together DML statements in explicit begin; + commit;. That's the purpose of snowflake_dml_explicit_transaction():

{% macro snowflake_dml_explicit_transaction(dml) %}
{#
Use this macro to wrap all INSERT, MERGE, UPDATE, DELETE, and TRUNCATE
statements before passing them into run_query(), or calling in the 'main' statement
of a materialization
#}
{% set dml_transaction -%}
begin;
{{ dml }};
commit;
{%- endset %}
{% do return(dml_transaction) %}
{% endmacro %}

We just need reset_csv_table and load_csv_rows to happen in the same transaction. This might be even easier than I thought, since they are being called right within the same statement after all.

To @nathaniel-may's point, the quickest solution here would just be to copy-paste the default seed materialization into your project, and add two lines here:

  {% call noop_statement('main', code ~ ' ' ~ rows_affected, code, rows_affected) %}
    begin;
    {{ create_table_sql }};
    -- dbt seed --
    {{ sql }};
    commit;
  {% endcall %}

The preferable solution, and one we'd be willing to merge, would avoid duplicating unnecessary materialization logic from dbt-core. I think that could look like:

  1. In dbt-core's global default version, wrap the seed steps (truncate + insert) in a new macro, get_csv_sql()
  2. In dbt-snowflake, wrap default__get_csv_sql() inside snowflake_dml_explicit_transaction(), so that both steps take place inside one explicit transaction

In dbt-core, seed/helpers:

{% macro get_csv_sql(create_or_truncate_sql, insert_sql) %}
    {{ adapter.dispatch('get_csv_sql', 'dbt')(create_or_truncate_sql, insert_sql) }}
{% endmacro %}

{% macro default__get_csv_sql(create_or_truncate_sql, insert_sql) %}
    {{ create_or_truncate_sql }};
    -- dbt seed --
    {{ insert_sql }}
{% endmacro %}

Within the default seed materialization:

  {% call noop_statement('main', code ~ ' ' ~ rows_affected, code, rows_affected) %}
    {{ get_csv_sql(create_table_sql, seed) }};
  {% endcall %}

Then within dbt-snowflake:

{% macro default__get_csv_sql(create_or_truncate_sql, insert_sql) %}
    {% set dml = snowflake_dml_explicit_transaction(default__get_csv_sql()) %}
    {{ return(dml) }}
{% endmacro %}

adamantike added a commit to adamantike/dbt-core that referenced this issue Apr 27, 2022
The implementation of this new macro `get_csv_sql` comes as a suggestion
by @jtcohen6 at
dbt-labs/dbt-snowflake#112 (comment).

The underlying reason is that the `dbt-snowflake` package needs to run
these queries inside a transaction, but with the current implementation,
the fix would require to duplicate the entire default seed
materialization logic in the package.
adamantike added a commit to adamantike/dbt-core that referenced this issue May 3, 2022
The implementation of this new macro `get_csv_sql` comes as a suggestion
by @jtcohen6 at
dbt-labs/dbt-snowflake#112 (comment).

The underlying reason is that the `dbt-snowflake` package needs to run
these queries inside a transaction, but with the current implementation,
the fix would require to duplicate the entire default seed
materialization logic in the package.
adamantike added a commit to adamantike/dbt-core that referenced this issue May 3, 2022
The implementation of this new macro `get_csv_sql` comes as a suggestion
by @jtcohen6 at
dbt-labs/dbt-snowflake#112 (comment).

The underlying reason is that the `dbt-snowflake` package needs to run
these queries inside a transaction, but with the current implementation,
the fix would require to duplicate the entire default seed
materialization logic in the package.
@jtcohen6
Copy link
Contributor

jtcohen6 commented May 3, 2022

@adamantike Update on my comment above: See where it says noop_statement instead of statement? That's a phony statement, just there to print the truncate + insert SQL. The actual truncate + insert are running here and here, that is, within the calls to the reset_csv_table and load_csv_rows macros themselves. Those macros do return SQL, but just for the purposes of printing it out.

So just wrapping the SQL returned by those macros won't be enough. We need to actually wrap the initial call to each of those macros in a shared transaction (begin; + commit;).

New idea: We're already wrapping and returning the entire default seed materialization here:

{% materialization seed, adapter='snowflake' %}
{% set original_query_tag = set_query_tag() %}
{% set relations = materialization_seed_default() %}
{% do unset_query_tag(original_query_tag) %}
{{ return(relations) }}
{% endmaterialization %}

What if we tried making that:

 {% materialization seed, adapter='snowflake' %} 
     {% set original_query_tag = set_query_tag() %} 
     
     {{ run_query('begin') }}
  
     {% set relations = materialization_seed_default() %} 
  
     {% do unset_query_tag(original_query_tag) %}
     
     {{ run_query('commit') }}
  
     {{ return(relations) }} 
 {% endmaterialization %} 

In the event that the seed doesn't yet exist, or is --full-refresh, the transaction will auto-commit on the basis of having run a DDL statement. But in the general case, it has no reason to—dbt will have effectively run these queries in order:

  • begin
  • truncate ...
  • insert ...
  • commit

I'm just not sure that the connection / session (and therefore the transaction) will properly persist across those queries, since they're being run separately, and not as one semicolon-delimited chunk.

Update: I just tried locally with two concurrent dbt seed invocations, and it doesn't seem to have prevented duplicate insertions. We may need to override more of these methods, or find a way to turn transactions conditionally on/off. It's a heavier lift than I initially thought.

@jtcohen6
Copy link
Contributor

Chatted with @matt-winkler a few weeks ago about potentially using INSERT OVERWRITE instead, which should have the effect of atomically combining truncate + insert.

@github-actions
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@Fleid
Copy link
Contributor

Fleid commented Apr 17, 2023

We should explore if INSERT [ OVERWRITE ] INTO <target_table> could fix this!
https://docs.snowflake.com/en/sql-reference/sql/insert#optional-parameters

@ismailsimsek
Copy link

@jtcohen6 have you found a way to do it atomically? is this issue solved?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help_wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

6 participants