Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snowflake: no transactions, except for DML #3510

Merged
merged 6 commits into from
Jul 27, 2021

Conversation

jtcohen6
Copy link
Contributor

@jtcohen6 jtcohen6 commented Jun 29, 2021

resolves #2748 resolves #3480

  • Never run begin, commit, or rollback on Snowflake
  • Set autcommit=True in all connections
  • Wrap DML statements in explicit begin + commit, affecting the following macros: snowflake__get_merge_sql, snowflake__get_delete_insert_merge_sql, snowflake__snapshot_merge_sql
  • Consolidate run_hooks in the table and incremental materializations—there are no inside/outside-transactional hooks, since there are no more transactions.

Questions

  • ❓ For simplicity, I wrote a macro snowflake_dml_explicit_transaction that wraps supplied DML in begin; + commit;. Should this be an adapter method instead, written in python?
  • ❓ During incremental runs, there is technically a transaction, since we're wrapping DML inside begin + commit. Users can still run within-transaction "pre-hook" logic via sql_header. Do we need to preserve a way to run within-transaction "post-hook" logic? This isn't something I see come up nearly as often.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot cla-bot bot added the cla:yes label Jun 29, 2021
@jtcohen6 jtcohen6 temporarily deployed to Postgres June 29, 2021 20:47 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Redshift June 29, 2021 20:47 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Redshift June 29, 2021 20:47 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Bigquery June 29, 2021 20:47 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Bigquery June 29, 2021 20:47 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Snowflake June 29, 2021 20:47 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Snowflake June 29, 2021 20:47 Inactive
@jtcohen6 jtcohen6 marked this pull request as ready for review June 29, 2021 23:05
@jtcohen6
Copy link
Contributor Author

I wasn't expecting all the integration tests to pass on the first go. Neat!

@drewbanin Could I tag you in for a review here? I think you're one of the few folks who has the full history of dbt x Snowflake x transactional logic. To be clear, the changes in this PR are based entirely on the explicit recommendations from Snowflake's partner solutions architect, whom we met with today: #3480 (comment) (cc @amychen1776)

Copy link
Contributor

@drewbanin drewbanin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtcohen6 I took a quick scan through here - really shocked at how small the diff is! That's great.

Some of this is germane to this PR, some of it is not, but do want to get the thoughts in my head out into the world:

  • IIRC the primary use-case for specifying pre/post hooks outside of a transaction is on Redshift where operations like vacuums could not be specified inside of a transaction. I think it's a-ok if we drop that distinction on Snowflake
  • The fact that we can create or replace table|view on Snowflake simplifies this problem significantly, I think. If Snowflake didn't have this construct, then I would be pretty worried about non-atomicity when dbt replaces a table or view.
    • In a similar vein, I am wondering about what happens when we replace a table with a view, or vice-versa. Is this handled in an atomic way for an external querier? Eg. if a BI tool issues a query between when a view is dropped and a table is recreated.. will that query fail? If I'm understanding the Snowflake docs correctly, I think the answer is YES
  • I wonder if issuing fewer explicit transactions is going to have an impact on dbt project runtime. I guess the latency associated with issuing those queries is amortized on big projects, but I would not be shocked if there is a positive impact on performance overall
  • I am glad to hear that the integration tests are all passing! My instinct is that we're not testing for atomicity on Snowflake in the same way that we do/did test on Redshift where this was a real problem for us historically. Can we imagine stress-testing this branch with a dbt running + concurrent queries against the tables being replaced? I'm happy to help with that if you think it's interesting

Kind of scattershot, but curious what you think! I am happy to slap a LGTM here since this is ultimately the kind of thing we'll want to extensively test internally/in a beta, but would also like to discuss some of these things first :)


{% set dml_transaction %}
begin;
{{ dml }} {{ ";" if not (dml | trim).endswith(";") }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really clever... worried that it might be too clever! If we wanted to be extra safe, we could issue three distinct queries for begin, the query, then commit. Would that defeat the purpose of these changes?

If we know that the supplied dml is always going to come from dbt internally, then this can be ok, but my preference would actually be either to:

  1. Specify the semicolon in the caller OR
  2. Always inject the semicolon here

you buy it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about the latter option, as a new argument for the statement macro: perhaps explicit_transaction=False. It's conceptually related to the existing argument auto_begin=True (which won't take any effect on Snowflake, given the changes in this PR).

Since a lot of this DML is not called directly, however, but rather passed into the materialization's main statement—which could be DML, could be DDL—this would require a bit more plumbing work.

So I'm happy with mandating that SQL passed into this macro shouldn't end in a semicolon, and have the macro always stick it on the end.


{% macro snowflake__get_delete_insert_merge_sql(target, source, unique_key, dest_columns) %}
{% set dml = default__get_delete_insert_merge_sql(target, source, unique_key, dest_columns) %}
{% do return(snowflake_dml_explicit_transaction(dml)) %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any problem with specifying code with an explicit begin/commit in addition to enabling autocommit? Surprised to see that this works! I would have expected that we would need to disable autocommit for these queries

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the things that's tricky about autocommit. You don't actually need to disable it, it just doesn't take effect if you supply explicit begin + commit. From docs:

Statements inside an explicit transaction are not affected by AUTOCOMMIT. For example, statements inside an explicit BEGIN TRANSACTION ... ROLLBACK are rolled back even if AUTOCOMMIT is TRUE.

@jtcohen6 jtcohen6 force-pushed the feature/snowflake-disable-transactions branch from d33e1b3 to aa6b79b Compare July 11, 2021 22:46
Copy link
Contributor Author

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@drewbanin Thanks for the quick review!

Is this handled in an atomic way for an external querier? Eg. if a BI tool issues a query between when a view is dropped and a table is recreated.. will that query fail? If I'm understanding the Snowflake docs correctly, I think the answer is YES

Yes—and I think this is the status quo behavior. When we switched to using create or replace table|view on Snowflake, whenever the old relation is of the wrong type, we simply drop it before creating it as the correct type, at risk of downtime and inability to rollback.

This did remind me to update the create_or_replace_view() macro as well, since we'd hacked that to support both transactions (on Snowflake) and not-transactions (on BigQuery). Now it can be even simpler.

Can we imagine stress-testing this branch with a dbt running + concurrent queries against the tables being replaced? I'm happy to help with that if you think it's interesting

I did some very low-key stress testing by creating 10 models, and querying them all unioned together in the Snowflake console while I simultaneously executed dbt run. Suffice to say, it works just fine when dbt is replacing views with views, or tables with tables. There's a few seconds of downtime if we're switching between views and tables. I made sure there's debug logging to reflect that.

We could take the BigQuery approach, which is to require that the user specify --full-refresh in order to drop a table and replace it with a view. (We don't require it for view → table, since that's presumably an "upgrade.")

this is ultimately the kind of thing we'll want to extensively test internally/in a beta

100% agree. I'd like to get this into v0.21.0-b1, to go out end of this week or early next week, and get it into as many beta testers' hands as possible, as early as possible.


{% macro snowflake__get_delete_insert_merge_sql(target, source, unique_key, dest_columns) %}
{% set dml = default__get_delete_insert_merge_sql(target, source, unique_key, dest_columns) %}
{% do return(snowflake_dml_explicit_transaction(dml)) %}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the things that's tricky about autocommit. You don't actually need to disable it, it just doesn't take effect if you supply explicit begin + commit. From docs:

Statements inside an explicit transaction are not affected by AUTOCOMMIT. For example, statements inside an explicit BEGIN TRANSACTION ... ROLLBACK are rolled back even if AUTOCOMMIT is TRUE.


{% set dml_transaction %}
begin;
{{ dml }} {{ ";" if not (dml | trim).endswith(";") }}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about the latter option, as a new argument for the statement macro: perhaps explicit_transaction=False. It's conceptually related to the existing argument auto_begin=True (which won't take any effect on Snowflake, given the changes in this PR).

Since a lot of this DML is not called directly, however, but rather passed into the materialization's main statement—which could be DML, could be DDL—this would require a bit more plumbing work.

So I'm happy with mandating that SQL passed into this macro shouldn't end in a semicolon, and have the macro always stick it on the end.

@jtcohen6 jtcohen6 temporarily deployed to Bigquery July 11, 2021 22:50 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Bigquery July 11, 2021 22:50 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Snowflake July 11, 2021 22:50 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Snowflake July 11, 2021 22:50 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Redshift July 11, 2021 22:50 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Redshift July 11, 2021 22:50 Inactive
@jtcohen6 jtcohen6 force-pushed the feature/snowflake-disable-transactions branch from aa6b79b to e5ce556 Compare July 21, 2021 21:00
@jtcohen6 jtcohen6 temporarily deployed to Bigquery July 21, 2021 21:00 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Bigquery July 21, 2021 21:00 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Snowflake July 21, 2021 21:00 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Snowflake July 21, 2021 21:00 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Redshift July 21, 2021 21:00 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Redshift July 21, 2021 21:00 Inactive
@jtcohen6 jtcohen6 force-pushed the feature/snowflake-disable-transactions branch from e5ce556 to 3fa9b97 Compare July 21, 2021 21:15
@jtcohen6 jtcohen6 temporarily deployed to Bigquery July 27, 2021 17:50 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Bigquery July 27, 2021 17:50 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Postgres July 27, 2021 18:04 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Bigquery July 27, 2021 18:04 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Bigquery July 27, 2021 18:04 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Redshift July 27, 2021 18:04 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Redshift July 27, 2021 18:04 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Postgres July 27, 2021 19:07 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Snowflake July 27, 2021 19:07 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Snowflake July 27, 2021 19:07 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Redshift July 27, 2021 19:07 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Redshift July 27, 2021 19:07 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Bigquery July 27, 2021 19:07 Inactive
@jtcohen6 jtcohen6 temporarily deployed to Bigquery July 27, 2021 19:07 Inactive
@jtcohen6 jtcohen6 requested a review from leahwicz July 27, 2021 19:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SQL select statements for source freshness and tests are using unnecessary DB transaction logic
3 participants