[CT-368] Race conditions in dbt seed. #112

andrebaaij · 2022-03-16T14:45:25Z

Describe the bug

When running dbt seed in multiple jobs, a race condition can and does occur, the following happens:

Seed session 1: Truncate table
Seed session 2: Truncate table
Seed session 1: insert seed
Seed session 2 insert seed

As a result, duplicate records.
The core issue is that the truncate and insert happen in different transactions.

Steps To Reproduce

Run multiple dbt seed commands for the same table at the same time. A race condition is bound to occur.

Expected behavior

If truncate and insert were in one transaction, as they should be, this issue would and could not occur. The expected behaviour is no race conditions when seeding the same table from multiple dbt seed commands running at the same time.

Screenshots and log output

I have added the following log out of our snowflake instance: table names and session ids have been made anonymous:

SESSION_ID	QUERY_TYPE	QUERY_TEXT	START_TIME	END_TIME
2	COMMIT	COMMIT	2022-03-15 23:06:01.609000 -07:00	2022-03-15 23:06:02.085000 -07:00
1	COMMIT	COMMIT	2022-03-15 23:06:01.496000 -07:00	2022-03-15 23:06:01.792000 -07:00
2	INSERT	insert into table_seed_a...	2022-03-15 23:06:00.939000 -07:00	2022-03-15 23:06:01.533000 -07:00
2	BEGIN_TRANSACTION	BEGIN	2022-03-15 23:06:00.783000 -07:00	2022-03-15 23:06:00.856000 -07:00
1	INSERT	insert into table_seed_a...	2022-03-15 23:06:00.435000 -07:00	2022-03-15 23:06:01.414000 -07:00
2	COMMIT	commit;	2022-03-15 23:06:00.289000 -07:00	2022-03-15 23:06:00.688000 -07:00
1	BEGIN_TRANSACTION	BEGIN	2022-03-15 23:06:00.243000 -07:00	2022-03-15 23:06:00.348000 -07:00
2	TRUNCATE_TABLE	truncate table_seed_a...	2022-03-15 23:06:00.034000 -07:00	2022-03-15 23:06:00.213000 -07:00
2	BEGIN_TRANSACTION	begin;	2022-03-15 23:05:59.888000 -07:00	2022-03-15 23:05:59.956000 -07:00
1	COMMIT	commit;	2022-03-15 23:05:59.698000 -07:00	2022-03-15 23:06:00.125000 -07:00
1	TRUNCATE_TABLE	truncate table_seed_a...	2022-03-15 23:05:59.389000 -07:00	2022-03-15 23:05:59.618000 -07:00
1	BEGIN_TRANSACTION	begin;	2022-03-15 23:05:59.251000 -07:00	2022-03-15 23:05:59.306000 -07:00

System information

The output of dbt --version:
dbt 1.0.3

The operating system you're using:
dbt cloud

The text was updated successfully, but these errors were encountered:

nathaniel-may · 2022-04-19T15:56:01Z

Hi, @andrebaaij thanks for your detailed report and the steps to reproduce the bug. I can confirm that this is indeed possible today and is not the desired behavior. I believe your proposed solution of including truncate and insert in the same Snowflake transaction would solve the problem. Snowflake transactions only have the isolation level of READ COMMITTED (docs) but that is enough to resolve this particular issue.

nathaniel-may · 2022-04-19T16:19:16Z

Adding the help_wanted label in case anyone has the bandwidth to work on this before we can get to it.

Here are the two lines that we ultimately want in the same transaction.

There are a couple of ways to go about this. We could implement this entirely within the snowflake adapter by duplicating some of this macro code in the snowflake adapter (acceptable), or we could make a change to the macro in dbt-core (preferred).

adamantike · 2022-04-19T21:03:37Z

Is this related in any way to the change introduced in dbt-labs/dbt-core#3510? Considering the comment at dbt-labs/dbt-core#3480 (comment), what would be the recommended behavior to avoid this kind of issues?

For extra context, the issue explained in #135, and merged with this one, includes a simpler scenario to test and validate.

(cc @jtcohen6)

jtcohen6 · 2022-04-20T09:42:19Z

@adamantike I think you're right. dbt-labs/dbt-core#3510 disabled transactions on Snowflake by default, since they were often unnecessary, ineffective, and expensive. This is a case where we still want one, and we should do that by wrapping the joined-together DML statements in explicit begin; + commit;. That's the purpose of snowflake_dml_explicit_transaction():

dbt-snowflake/dbt/include/snowflake/macros/adapters.sql

Lines 251 to 265 in 24482ea

 {% macro snowflake_dml_explicit_transaction(dml) %} 

 {# 

 Use this macro to wrap all INSERT, MERGE, UPDATE, DELETE, and TRUNCATE 

 statements before passing them into run_query(), or calling in the 'main' statement 

 of a materialization 

 #} 

 {% set dml_transaction -%} 

 begin; 

 {{ dml }}; 

 commit; 

 {%- endset %} 

 {% do return(dml_transaction) %} 

 {% endmacro %}

We just need reset_csv_table and load_csv_rows to happen in the same transaction. This might be even easier than I thought, since they are being called right within the same statement after all.

To @nathaniel-may's point, the quickest solution here would just be to copy-paste the default seed materialization into your project, and add two lines here:

  {% call noop_statement('main', code ~ ' ' ~ rows_affected, code, rows_affected) %}
    begin;
    {{ create_table_sql }};
    -- dbt seed --
    {{ sql }};
    commit;
  {% endcall %}

The preferable solution, and one we'd be willing to merge, would avoid duplicating unnecessary materialization logic from dbt-core. I think that could look like:

In dbt-core's global default version, wrap the seed steps (truncate + insert) in a new macro, get_csv_sql()
In dbt-snowflake, wrap default__get_csv_sql() inside snowflake_dml_explicit_transaction(), so that both steps take place inside one explicit transaction

In dbt-core, seed/helpers:

{% macro get_csv_sql(create_or_truncate_sql, insert_sql) %}
    {{ adapter.dispatch('get_csv_sql', 'dbt')(create_or_truncate_sql, insert_sql) }}
{% endmacro %}

{% macro default__get_csv_sql(create_or_truncate_sql, insert_sql) %}
    {{ create_or_truncate_sql }};
    -- dbt seed --
    {{ insert_sql }}
{% endmacro %}

Within the default seed materialization:

  {% call noop_statement('main', code ~ ' ' ~ rows_affected, code, rows_affected) %}
    {{ get_csv_sql(create_table_sql, seed) }};
  {% endcall %}

Then within dbt-snowflake:

{% macro default__get_csv_sql(create_or_truncate_sql, insert_sql) %}
    {% set dml = snowflake_dml_explicit_transaction(default__get_csv_sql()) %}
    {{ return(dml) }}
{% endmacro %}

@jtcohen6

The implementation of this new macro `get_csv_sql` comes as a suggestion by @jtcohen6 at dbt-labs/dbt-snowflake#112 (comment). The underlying reason is that the `dbt-snowflake` package needs to run these queries inside a transaction, but with the current implementation, the fix would require to duplicate the entire default seed materialization logic in the package.

@jtcohen6

The implementation of this new macro `get_csv_sql` comes as a suggestion by @jtcohen6 at dbt-labs/dbt-snowflake#112 (comment). The underlying reason is that the `dbt-snowflake` package needs to run these queries inside a transaction, but with the current implementation, the fix would require to duplicate the entire default seed materialization logic in the package.

@jtcohen6

The implementation of this new macro `get_csv_sql` comes as a suggestion by @jtcohen6 at dbt-labs/dbt-snowflake#112 (comment). The underlying reason is that the `dbt-snowflake` package needs to run these queries inside a transaction, but with the current implementation, the fix would require to duplicate the entire default seed materialization logic in the package.

jtcohen6 · 2022-05-03T15:25:39Z

@adamantike Update on my comment above: See where it says noop_statement instead of statement? That's a phony statement, just there to print the truncate + insert SQL. The actual truncate + insert are running here and here, that is, within the calls to the reset_csv_table and load_csv_rows macros themselves. Those macros do return SQL, but just for the purposes of printing it out.

So just wrapping the SQL returned by those macros won't be enough. We need to actually wrap the initial call to each of those macros in a shared transaction (begin; + commit;).

New idea: We're already wrapping and returning the entire default seed materialization here:

dbt-snowflake/dbt/include/snowflake/macros/materializations/seed.sql

Lines 39 to 47 in ae31719

 {% materialization seed, adapter='snowflake' %} 

 {% set original_query_tag = set_query_tag() %} 

 {% set relations = materialization_seed_default() %} 

 {% do unset_query_tag(original_query_tag) %} 

 {{ return(relations) }} 

 {% endmaterialization %}

What if we tried making that:

 {% materialization seed, adapter='snowflake' %} 
     {% set original_query_tag = set_query_tag() %} 
     
     {{ run_query('begin') }}
  
     {% set relations = materialization_seed_default() %} 
  
     {% do unset_query_tag(original_query_tag) %}
     
     {{ run_query('commit') }}
  
     {{ return(relations) }} 
 {% endmaterialization %}

In the event that the seed doesn't yet exist, or is --full-refresh, the transaction will auto-commit on the basis of having run a DDL statement. But in the general case, it has no reason to—dbt will have effectively run these queries in order:

begin
truncate ...
insert ...
commit

I'm just not sure that the connection / session (and therefore the transaction) will properly persist across those queries, since they're being run separately, and not as one semicolon-delimited chunk.

Update: I just tried locally with two concurrent dbt seed invocations, and it doesn't seem to have prevented duplicate insertions. We may need to override more of these methods, or find a way to turn transactions conditionally on/off. It's a heavier lift than I initially thought.

jtcohen6 · 2022-07-13T11:04:38Z

Chatted with @matt-winkler a few weeks ago about potentially using INSERT OVERWRITE instead, which should have the effect of atomically combining truncate + insert.

github-actions · 2023-01-10T02:00:43Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

Fleid · 2023-04-17T13:49:32Z

We should explore if INSERT [ OVERWRITE ] INTO <target_table> could fix this!
https://docs.snowflake.com/en/sql-reference/sql/insert#optional-parameters

ismailsimsek · 2024-04-08T09:34:19Z

@jtcohen6 have you found a way to do it atomically? is this issue solved?

andrebaaij added bug Something isn't working triage labels Mar 16, 2022

github-actions bot changed the title ~~Race conditions in dbt seed.~~ [CT-368] Race conditions in dbt seed. Mar 16, 2022

nathaniel-may mentioned this issue Apr 19, 2022

[CT-480] seed: Non-atomic update of seed tables #135

Closed

nathaniel-may removed the triage label Apr 19, 2022

nathaniel-may added the help_wanted Extra attention is needed label Apr 19, 2022

This was referenced May 3, 2022

[CT-596] [Feature] Allow customization of seed materialization logic for table creation/insertion dbt-labs/dbt-core#5206

Closed

seed: Add new macro get_csv_sql dbt-labs/dbt-core#5207

Merged

jtcohen6 mentioned this issue May 4, 2022

Call begin/commit outside of main when running DML on incremental models #155

Closed

4 tasks

jtcohen6 mentioned this issue Jul 13, 2022

Move v1.2 adapter upgrade guide to separate page dbt-labs/docs.getdbt.com#1696

Closed

github-actions bot added the Stale label Jan 10, 2023

github-actions bot closed this as completed Jan 18, 2023

jtcohen6 mentioned this issue Apr 12, 2023

[ADAP-432] [Feature] Snowflake Adapter supports INSERT OVERWRITE #550

Open

3 tasks

Fleid removed the Stale label Apr 17, 2023

Fleid reopened this Apr 17, 2023

moxordo mentioned this issue Apr 19, 2023

[CT-1876] [CT-1875] [Feature] Transaction support for hooks in BigQuery dbt-labs/dbt-bigquery#474

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CT-368] Race conditions in dbt seed. #112

[CT-368] Race conditions in dbt seed. #112

andrebaaij commented Mar 16, 2022

nathaniel-may commented Apr 19, 2022

nathaniel-may commented Apr 19, 2022 •

edited

Loading

adamantike commented Apr 19, 2022

jtcohen6 commented Apr 20, 2022

jtcohen6 commented May 3, 2022 •

edited

Loading

jtcohen6 commented Jul 13, 2022

github-actions bot commented Jan 10, 2023

Fleid commented Apr 17, 2023

ismailsimsek commented Apr 8, 2024

[CT-368] Race conditions in dbt seed. #112

[CT-368] Race conditions in dbt seed. #112

Comments

andrebaaij commented Mar 16, 2022

Describe the bug

Steps To Reproduce

Expected behavior

Screenshots and log output

System information

nathaniel-may commented Apr 19, 2022

nathaniel-may commented Apr 19, 2022 • edited Loading

adamantike commented Apr 19, 2022

jtcohen6 commented Apr 20, 2022

jtcohen6 commented May 3, 2022 • edited Loading

jtcohen6 commented Jul 13, 2022

github-actions bot commented Jan 10, 2023

Fleid commented Apr 17, 2023

ismailsimsek commented Apr 8, 2024

nathaniel-may commented Apr 19, 2022 •

edited

Loading

jtcohen6 commented May 3, 2022 •

edited

Loading