Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi-query statements: fix rows_affected + query comments #153

Merged
merged 4 commits into from
Jun 30, 2022

Conversation

jtcohen6
Copy link
Contributor

@jtcohen6 jtcohen6 commented May 4, 2022

resolves #140
resolves #147

Description

Context: Snowflake's Python client can't handle multi-query statements, i.e.

select 1 as id;
do another thing;
finally this;

As such, dbt-snowflake needs to split these up and send them off individually.

These queries remain tightly "batched," because they use the same threaded connection. I believe this represents the surest way we have to run multiple statements within the same transaction. As such, and on Snowflake's recommendation, we wrap DML statements in explicit transactions to ensure they succeed:

begin;
delete ...;
insert ...;
commit;

The problems:

  • The query comment (/* {...} */) only gets applied once, to the very first query
  • The "result" of that query batch only reflects the final commit, rather than the main event (insert)

The solutions:

  1. Add special-case handling for begin; + commit; to use dbt's built-in add_begin_query + add_commit_query (even though dbt-snowflake disables these by default). This has the effect of ignoring the results from these queries, so the last non-commit query has its result stored instead
  2. Apply query comments to each individual_query, after they've been split on ;. Needed to do this anyway, so we can properly catch begin; and commit; — why not fix this issue in the meantime, too!

Tests

I added a test case for incremental models + rows_affected (by extending the "basic" incremental test). I haven't added a test case to ensure that all queries contain the query comment, but I imagine it wouldn't be too tricky with run_dbt_and_capture.

Running locally

-- models/my_incremental_model.sql

{{ config(
  materialized = 'incremental',
  unique_key = 'id',
  incremental_strategy = 'delete+insert',
) }}

select 1 as id
union all
select 2 as id
union all
select 3 as id
$ dbt run --full-refresh && dbt run
10:16:00  Running with dbt=1.2.0-a1
10:16:00  Found 1 model, 0 tests, 0 snapshots, 0 analyses, 383 macros, 0 operations, 1 seed file, 1 source, 0 exposures, 0 metrics
10:16:00
10:16:05  Concurrency: 8 threads (target='dev')
10:16:05
10:16:05  1 of 1 START incremental model dbt_jcohen.my_incremental_model ................. [RUN]
10:16:07  1 of 1 OK created incremental model dbt_jcohen.my_incremental_model ............ [SUCCESS 1 in 2.23s]
10:16:07
10:16:07  Finished running 1 incremental model in 6.40s.
10:16:07
10:16:07  Completed successfully
10:16:07
10:16:07  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
10:16:09  Running with dbt=1.2.0-a1
10:16:09  Found 1 model, 0 tests, 0 snapshots, 0 analyses, 383 macros, 0 operations, 1 seed file, 1 source, 0 exposures, 0 metrics
10:16:09
10:16:13  Concurrency: 8 threads (target='dev')
10:16:13
10:16:13  1 of 1 START incremental model dbt_jcohen.my_incremental_model ................. [RUN]
10:16:18  1 of 1 OK created incremental model dbt_jcohen.my_incremental_model ............ [SUCCESS 3 in 5.02s]
10:16:18
10:16:18  Finished running 1 incremental model in 8.45s.
10:16:18
10:16:18  Completed successfully
10:16:18
10:16:18  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Note that second SUCCESS 3 (fixed), instead of SUCCESS 1 (current)!

In logs/dbt.log, query comments everywhere:

10:16:15.500668 [debug] [Thread-1  ]: On model.testy.my_incremental_model: /* {"app": "dbt", "dbt_version": "1.2.0a1", "profile_name": "sandbox-snowflake", "target_name": "dev", "node_id": "model.testy.my_incremental_model"} */
BEGIN
10:16:15.749855 [debug] [Thread-1  ]: SQL status: SUCCESS 1 in 0.25 seconds
10:16:15.750393 [debug] [Thread-1  ]: Using snowflake connection "model.testy.my_incremental_model"
10:16:15.750677 [debug] [Thread-1  ]: On model.testy.my_incremental_model: /* {"app": "dbt", "dbt_version": "1.2.0a1", "profile_name": "sandbox-snowflake", "target_name": "dev", "node_id": "model.testy.my_incremental_model"} */
delete from ad_hoc.dbt_jcohen.my_incremental_model
            where (
                id) in (
                select (id)
                from ad_hoc.dbt_jcohen.my_incremental_model__dbt_tmp
            );
10:16:16.511880 [debug] [Thread-1  ]: SQL status: SUCCESS 3 in 0.76 seconds
10:16:16.512596 [debug] [Thread-1  ]: Using snowflake connection "model.testy.my_incremental_model"
10:16:16.512984 [debug] [Thread-1  ]: On model.testy.my_incremental_model: /* {"app": "dbt", "dbt_version": "1.2.0a1", "profile_name": "sandbox-snowflake", "target_name": "dev", "node_id": "model.testy.my_incremental_model"} */
insert into ad_hoc.dbt_jcohen.my_incremental_model ("ID")
    (
        select "ID"
        from ad_hoc.dbt_jcohen.my_incremental_model__dbt_tmp
    );
10:16:17.125697 [debug] [Thread-1  ]: SQL status: SUCCESS 3 in 0.61 seconds
10:16:17.126288 [debug] [Thread-1  ]: Using snowflake connection "model.testy.my_incremental_model"
10:16:17.126555 [debug] [Thread-1  ]: On model.testy.my_incremental_model: /* {"app": "dbt", "dbt_version": "1.2.0a1", "profile_name": "sandbox-snowflake", "target_name": "dev", "node_id": "model.testy.my_incremental_model"} */
COMMIT
10:16:17.639290 [debug] [Thread-1  ]: SQL status: SUCCESS 1 in 0.51 seconds
10:16:17.660185 [debug] [Thread-1  ]: finished collecting timing info
10:16:17.660683 [debug] [Thread-1  ]: On model.testy.my_incremental_model: Close
10:16:18.081427 [info ] [Thread-1  ]: 1 of 1 OK created incremental model dbt_jcohen.my_incremental_model ............ [�[32mSUCCESS 3�[0m in 5.02s]

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-snowflake next" section.

@cla-bot cla-bot bot added the cla:yes label May 4, 2022
Copy link
Contributor Author

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failing tests are the same ones that are failing nightly on main right now: #154

Comment on lines +439 to +440
# don't apply the query comment here
# it will be applied after ';' queries are split
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This execute is identical to dbt.adapters.sql.connections.execute, except that we're going to wait to _add_query_comment until after the queries have been split

# Let their wish be granted!
# This also has the effect of ignoring "commit" in the RunResult for this model
# https://github.com/dbt-labs/dbt-snowflake/issues/147
if individual_query.lower() == "begin;":
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels a bit jank, but actually:

  • These are generally being added by our own internal macro, snowflake_dml_explicit_transaction, and we know exactly what it's adding
  • We've already stripped whitespace above
  • We use the Snowflake connector's official split_statements, which seems to return statements with the ; included.
  • I'm actually not sure why we're not also using their official remove_comments argument to split_statements, rather than our own janky logic here? Most recent work there was Fix bug #2731 on stripping query comments for snowflake dbt-core#2974
  • Worst case, we don't catch these as "real" begin + commit, just send them along as normal queries, and store their results accordingly

@jtcohen6
Copy link
Contributor Author

jtcohen6 commented Jun 2, 2022

Failing tests are the same ones that are failing nightly on main right now :) should be resolved by #157

@jtcohen6 jtcohen6 force-pushed the fix/multiline-query-comments-result branch from 5cae71f to 4e73187 Compare June 3, 2022 10:56
@jtcohen6 jtcohen6 marked this pull request as ready for review June 7, 2022 10:09
@jtcohen6 jtcohen6 added the ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering label Jun 15, 2022
Copy link
Contributor

@b-per b-per left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did a test on an incremental model and:

  • the CLI returned the number of rows inserted when it ran in incremental mode
  • the query comments showed on the 3 queries:
    • begin;
    • insert into ...;
    • commit;

Copy link
Contributor

@nathaniel-may nathaniel-may left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! I wish we didn't mutate our connection object with our queries so we could get a couple of nice clean unit tests in here, but that part isn't your fault and isn't worth changing in this PR.

Let's add a changelog entry and make sure we note that this change will change the logs for people trying to compare metadata between runs of prior versions.

@jtcohen6 jtcohen6 force-pushed the fix/multiline-query-comments-result branch from c58b01e to caebedd Compare June 30, 2022 15:04
@jtcohen6
Copy link
Contributor Author

Hm. Will try closing and reopening to trigger Snyk, before giving up and finding a way to merge anyway

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering
Projects
None yet
3 participants