Snowflake: no transactions, except for DML #3510
Conversation
I wasn't expecting all the integration tests to pass on the first go. Neat! @drewbanin Could I tag you in for a review here? I think you're one of the few folks who has the full history of dbt x Snowflake x transactional logic. To be clear, the changes in this PR are based entirely on the explicit recommendations from Snowflake's partner solutions architect, whom we met with today: #3480 (comment) (cc @amychen1776)
@jtcohen6 I took a quick scan through here - really shocked at how small the diff is! That's great.
Some of this is germane to this PR, some of it is not, but do want to get the thoughts in my head out into the world:
- IIRC the primary use-case for specifying pre/post hooks outside of a transaction is on Redshift, where operations like `vacuum` could not be specified inside of a transaction. I think it's a-ok if we drop that distinction on Snowflake. The fact that we can `create or replace table|view` on Snowflake simplifies this problem significantly, I think. If Snowflake didn't have this construct, then I would be pretty worried about non-atomicity when dbt replaces a table or view.
- In a similar vein, I am wondering about what happens when we replace a table with a view, or vice versa. Is this handled in an atomic way for an external querier? E.g. if a BI tool issues a query between when a view is dropped and a table is recreated... will that query fail? If I'm understanding the Snowflake docs correctly, I think the answer is YES.
- I wonder if issuing fewer explicit transactions is going to have an impact on dbt project runtime. I guess the latency associated with issuing those queries is amortized on big projects, but I would not be shocked if there is a positive impact on performance overall
- I am glad to hear that the integration tests are all passing! My instinct is that we're not testing for atomicity on Snowflake in the same way that we do/did test on Redshift where this was a real problem for us historically. Can we imagine stress-testing this branch with a dbt running + concurrent queries against the tables being replaced? I'm happy to help with that if you think it's interesting
Kind of scattershot, but curious what you think! I am happy to slap a LGTM here since this is ultimately the kind of thing we'll want to extensively test internally/in a beta, but would also like to discuss some of these things first :)
```
{% set dml_transaction %}
    begin;
    {{ dml }} {{ ";" if not (dml | trim).endswith(";") }}
```
This is really clever... worried that it might be too clever! If we wanted to be extra safe, we could issue three distinct queries for `begin`, the query, then `commit`. Would that defeat the purpose of these changes?

If we know that the supplied DML is always going to come from dbt internally, then this can be ok, but my preference would actually be either to:
- Specify the semicolon in the caller, OR
- Always inject the semicolon here

You buy it?
I thought about the latter option, as a new argument for the `statement` macro: perhaps `explicit_transaction=False`. It's conceptually related to the existing argument `auto_begin=True` (which won't take effect on Snowflake, given the changes in this PR).

Since a lot of this DML is not called directly, however, but rather passed into the materialization's `main` statement—which could be DML, could be DDL—this would require a bit more plumbing work.

So I'm happy with mandating that SQL passed into this macro shouldn't end in a semicolon, and having the macro always stick one on the end.
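For illustration, the semicolon guard can be transliterated into plain Python. This is only a sketch of the logic under discussion; the function name is hypothetical, not the PR's actual Jinja code:

```python
def dml_explicit_transaction(dml: str) -> str:
    """Hypothetical transliteration of the Jinja macro's logic: wrap the
    supplied DML in an explicit transaction, appending a semicolon only if
    the trimmed statement doesn't already end with one."""
    body = dml if dml.strip().endswith(";") else dml + ";"
    return "begin;\n" + body + "\ncommit;"

# Both forms of the input produce identical wrapped SQL:
print(dml_explicit_transaction("insert into t values (1)"))
print(dml_explicit_transaction("insert into t values (1);"))
```

Always injecting the semicolon in the macro (rather than trusting the caller) keeps the contract simple: callers pass unterminated SQL, the macro terminates it.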
```
{% macro snowflake__get_delete_insert_merge_sql(target, source, unique_key, dest_columns) %}
    {% set dml = default__get_delete_insert_merge_sql(target, source, unique_key, dest_columns) %}
    {% do return(snowflake_dml_explicit_transaction(dml)) %}
```
Is there any problem with specifying code with an explicit `begin`/`commit` in addition to enabling autocommit? Surprised to see that this works! I would have expected that we would need to disable autocommit for these queries.
This is one of the things that's tricky about `autocommit`. You don't actually need to disable it; it just doesn't take effect if you supply an explicit `begin` + `commit`. From the docs:

> Statements inside an explicit transaction are not affected by AUTOCOMMIT. For example, statements inside an explicit BEGIN TRANSACTION ... ROLLBACK are rolled back even if AUTOCOMMIT is TRUE.
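The same principle can be demonstrated outside Snowflake. Here is a minimal sketch using Python's stdlib `sqlite3` with the connection in autocommit mode (`isolation_level=None`): statements inside an explicit `begin` ... `rollback` are still rolled back, even though autocommit is on.

```python
import sqlite3

# Autocommit mode: each statement commits immediately, unless it falls
# inside an explicitly opened transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("create table t (id integer)")

conn.execute("insert into t values (1)")   # autocommitted right away

conn.execute("begin")                      # explicit transaction opens
conn.execute("insert into t values (2)")
conn.execute("rollback")                   # undoes only the explicit txn

print([row[0] for row in conn.execute("select id from t")])  # [1]
```

The autocommitted row survives; the row inserted inside the explicit transaction is rolled back, mirroring the Snowflake behavior quoted above.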
@drewbanin Thanks for the quick review!
> Is this handled in an atomic way for an external querier? Eg. if a BI tool issues a query between when a view is dropped and a table is recreated.. will that query fail? If I'm understanding the Snowflake docs correctly, I think the answer is YES
Yes—and I think this is the status quo behavior. Since we switched to using `create or replace table|view` on Snowflake, whenever the old relation is of the wrong type, we simply drop it before creating it as the correct type, at the risk of downtime and the inability to roll back.
This did remind me to update the `create_or_replace_view()` macro as well, since we'd hacked that to support both transactions (on Snowflake) and not-transactions (on BigQuery). Now it can be even simpler.
> Can we imagine stress-testing this branch with a dbt running + concurrent queries against the tables being replaced? I'm happy to help with that if you think it's interesting
I did some very low-key stress testing by creating 10 models, and querying them all unioned together in the Snowflake console while I simultaneously executed `dbt run`. Suffice it to say, it works just fine when dbt is replacing views with views, or tables with tables. There are a few seconds of downtime if we're switching between views and tables. I made sure there's debug logging to reflect that.
We could take the BigQuery approach, which is to require that the user specify `--full-refresh` in order to drop a table and replace it with a view. (We don't require it for view → table, since that's presumably an "upgrade.")
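That BigQuery-style policy could be sketched as a small decision function. The names and return values here are hypothetical illustrations, not dbt's actual implementation:

```python
def replace_strategy(existing_type, target_type, full_refresh):
    """Hypothetical sketch of a relation-swap policy: atomic replace when
    the relation type is unchanged; allow view -> table (an "upgrade");
    require an explicit --full-refresh before the non-atomic
    table -> view swap."""
    if existing_type is None or existing_type == target_type:
        return "create or replace"              # atomic, no downtime
    if existing_type == "view" and target_type == "table":
        return "drop view, then create table"   # presumed upgrade, allowed
    if full_refresh:
        return "drop table, then create view"   # user opted in to downtime
    raise RuntimeError("run with --full-refresh to replace a table with a view")

print(replace_strategy("view", "view", False))
print(replace_strategy("table", "view", True))
```

The point of the guard is that only the type-changing swap is non-atomic, so only that path needs the user's explicit opt-in.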
> this is ultimately the kind of thing we'll want to extensively test internally/in a beta
100% agree. I'd like to get this into v0.21.0-b1, to go out end of this week or early next week, and get it into as many beta testers' hands as possible, as early as possible.
resolves #2748, resolves #3480

- No more `begin`, `commit`, or `rollback` on Snowflake
- `autocommit=True` in all connections
- DML is wrapped in an explicit `begin` + `commit`, affecting the following macros: `snowflake__get_merge_sql`, `snowflake__get_delete_insert_merge_sql`, `snowflake__snapshot_merge_sql`
- `run_hooks` in the `table` and `incremental` materializations—there are no inside/outside-transactional hooks, since there are no more transactions.

**Questions**

- New macro `snowflake_dml_explicit_transaction` that wraps supplied DML in `begin;` + `commit;`. Should this be an adapter method instead, written in python?
- …`begin` + `commit`. Users can still run within-transaction "pre-hook" logic via `sql_header`. Do we need to preserve a way to run within-transaction "post-hook" logic? This isn't something I see come up nearly as often.

**Checklist**

- [x] I have updated `CHANGELOG.md` and added information about my change to the "dbt next" section.