Feature/cost effective bq incremental followup #2140
Conversation
Commits updated: 27d21f9 → 15cac5a
@drewbanin I really like this addition. Based on your latest changes, it looks like we're going to use "fake" temp tables across the board for the time being, in order to avoid breaking changes to snapshots. There's very little downside to using "fake" temp tables in the incremental materialization; worst case, dbt fails to drop the "temp" table, and someone is charged for 12 hours of storage of that table, or about $0.33/TB (based on a monthly storage cost of $20/TB). Do you think the switch to "true" temp tables is something we'll push to a different PR / future version of dbt?
hey @jtcohen6 - I would love to use proper temp tables on BigQuery in the future. If you're curious, check out the problematic snapshot code: we need to use the temp table in the same script where the table is created, but that's not how the snapshot materialization works today. I think I'd like to update the snapshot materialization in a future PR. I did add a line to drop the temp table at the end of the incremental materialization, so the table hopefully won't hang around for any longer than it needs to regardless!
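For readers following along, here is a minimal sketch of the "fake" temp table idea discussed above, assuming BigQuery's `expiration_timestamp` table option (the helper and relation name are hypothetical, not dbt's actual implementation): instead of a true temporary table, create an ordinary table that expires on its own, so a failed drop costs at most a few hours of storage.

```python
from datetime import datetime, timedelta, timezone

def fake_temp_table_ddl(relation: str, select_sql: str, hours: int = 12) -> str:
    # Build DDL for a short-lived "fake" temp table: a real table whose
    # expiration_timestamp guarantees cleanup even if the explicit drop fails.
    expires = datetime.now(timezone.utc) + timedelta(hours=hours)
    return (
        f"create or replace table {relation}\n"
        f"options (expiration_timestamp = timestamp '{expires:%Y-%m-%d %H:%M:%S}+00')\n"
        f"as ({select_sql})"
    )

# hypothetical usage: create the staging table, merge from it, then drop it
# explicitly as a best effort; the expiration is the backstop
print(fake_temp_table_ddl("`my_project.my_dataset.my_model__dbt_tmp`", "select 1 as id"))
```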
Commit: …dbt into feature/cost-effective-bq-incremental-followup
```diff
@@ -83,12 +80,12 @@ def exception_handler(self, sql):
         yield
     except google.cloud.exceptions.BadRequest as e:
-        message = "Bad request while running:\n{sql}"
+        message = "Bad request while running query"
         self.handle_error(e, message, sql)
```
The existing BQ debug logs were incredibly verbose, printing the same SQL and error messages three times over. These changes should only print out the SQL once (when it is executed) and should suppress the re-printing of the query in the BQ exception message.
```python
@dataclass
class PartitionConfig(JsonSchemaMixin):
```
I am LARPing as a person who knows how to use dataclasses. I'm sure there's some subtlety here... is this the right way to specify these types?
Yup!
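For anyone following along, a minimal sketch of the pattern being confirmed here, assuming hologram's `JsonSchemaMixin` (field names taken from the diff above): plain type annotations on the dataclass are how the JSON schema gets derived, and `from_dict`/`to_dict` come along for free.

```python
from dataclasses import dataclass

from hologram import JsonSchemaMixin

@dataclass
class PartitionConfig(JsonSchemaMixin):
    # annotated fields become schema properties; a field without a
    # default is required, a defaulted field is optional
    field: str
    data_type: str = 'date'

# round-trip through the mixin-provided helpers
config = PartitionConfig.from_dict({'field': 'collector_tstamp'})
print(config.data_type)   # 'date' (default applied)
print(config.to_dict())   # {'field': 'collector_tstamp', 'data_type': 'date'}
```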
@beckjake much has changed here since you last took a look. Can you review this again please?
Some PR feedback, mostly on dataclasses. I'd either go all the way on the partition config being a dataclass/hologram type (my preference) or make it a dict all the way through.
Other than those things, looks great!
```python
@dataclass
class PartitionConfig(Dict[str, Any], JsonSchemaMixin):
```
If I understand what's going on here correctly, instead of inheriting this from `Dict`, I think you should pass `partition_by.to_dict()` in the places where you're using this as a dict.
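A short sketch of the alternative being suggested, assuming `JsonSchemaMixin` provides `to_dict()` (values are illustrative): keep `PartitionConfig` a plain dataclass and convert only at the boundary where an actual dict is required.

```python
# instead of `class PartitionConfig(Dict[str, Any], JsonSchemaMixin)`,
# keep the dataclass plain (as sketched earlier in this thread)...
partition_by = PartitionConfig(field='collector_tstamp', data_type='timestamp')

# ...and convert at the call sites that genuinely need a dict,
# e.g. when handing the config to a Jinja macro
partition_by_dict = partition_by.to_dict()
```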
```diff
 {{ sql_header if sql_header is not none }}

 create or replace table {{ relation }}
-  {{ partition_by(raw_partition_by) }}
+  {{ partition_by(partition_by_dict) }}
```
Suggestion: use `partition_by.to_dict()` here as well.
```sql
{%- set raw_partition_by = config.get('partition_by', none) -%}
{%- set partition_by = adapter.parse_partition_by(raw_partition_by) -%}
{%- set cluster_by = config.get('cluster_by', none) -%}
{% if not adapter.is_replaceable(old_relation, partition_by, cluster_by) %}
```
Use `partition_by.to_dict()` here, or change `is_replaceable` to take a regular `PartitionConfig`.
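A hedged sketch of that second option (the signature and comparison logic are my guesses, not dbt's actual code): let `is_replaceable` accept the `PartitionConfig` directly and compare it against the existing table's metadata.

```python
from typing import List, Optional

import google.cloud.bigquery

def is_replaceable(table: Optional[google.cloud.bigquery.Table],
                   conf: Optional['PartitionConfig'],
                   cluster_by: Optional[List[str]]) -> bool:
    # A table can be replaced in place only when its partitioning and
    # clustering already match the model's configuration.
    if table is None:
        return True
    if conf is not None:
        tp = table.time_partitioning
        if tp is None or tp.field != conf.field:
            return False
    return (table.clustering_fields or None) == (cluster_by or None)
```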
```python
@dataclass
class PartitionConfig(Dict[str, Any], JsonSchemaMixin):
    field: str
    data_type: str
```
```diff
-    data_type: str
+    data_type: str = 'date'
```
```python
if raw_partition_by.get('field'):
    if raw_partition_by.get('data_type'):
        return cls(**raw_partition_by)
    else:  # assume date type as default
        return cls(**raw_partition_by, data_type='date')
else:
    dbt.exceptions.raise_compiler_error(
        'Config `partition_by` is missing required item `field`'
    )
```
This looks like it could just be:

```python
try:
    return cls.from_dict(raw_partition_by)
except hologram.ValidationError:
    dbt.exceptions.raise_compiler_error(
        'Config `partition_by` is missing required item `field`'
    )
```
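To spell out the hologram behavior this simplification leans on (my understanding of the library, so treat the specifics as assumptions): `from_dict` validates the input against the schema generated from the annotations, applies the `'date'` default, and raises `ValidationError` when a required key like `field` is missing.

```python
import hologram

# using the PartitionConfig dataclass sketched earlier in this thread

# succeeds: data_type falls back to its declared default of 'date'
PartitionConfig.from_dict({'field': 'created_at'})

# raises hologram.ValidationError: 'field' has no default, so the
# generated schema marks it as required
try:
    PartitionConfig.from_dict({'data_type': 'timestamp'})
except hologram.ValidationError:
    print('missing required item `field`')
```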
```python
inferred_partition_by = {
    'field': partition_by,
    'data_type': data_type
}
```
How about:

```python
inferred_partition_by = cls(field=partition_by, data_type=data_type)
dbt.deprecations.warn(
    'bq-partition-by-string',
    raw_partition_by=raw_partition_by,
    inferred_partition_by=inferred_partition_by
)
return inferred_partition_by
```

Setting up a kwargs dictionary just to expand it seems unnecessary!
Commits updated: 7d43037 → 6fb8c96
Also, make tests run
Commits updated: 6fb8c96 → 4068c66
I have one more tiny comment here, and then we should ship the heck out of this PR.
This does not only Look Good To Me; this Looks Great To Me. Word on the street is that this PR is going to cut into GCP's profit margin... so glad that these incremental models are finally going to be really efficient!!
Co-Authored-By: Drew Banin <drew@fishtownanalytics.com>
Reopening from #1971, fixes #1034

Description

- Provides a `_dbt_max_partition` field for use in model code
- Example incremental model: if `snowplow.event` is partitioned by `collector_tstamp`, then only the partitions containing new data in `snowplow.event` will be scanned
- The `merge` statement will only scan the partitions in the destination table that will be updated by the merge

Before this PR, an incremental model which copies a 100 MB source table would scan 200 MB of data, as it required one full table scan against both the source and destination tables every time it ran. After this PR, the same incremental model will process less than 2 MB of data (lookups on partitioning fields) for an incremental build with no data to merge; a sketch of the generated script shape follows below.
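To make that concrete, here is a hedged sketch of the script shape described above (table, column, and alias names are invented, and the SQL dbt actually generates comes from its materialization macros and may differ): BigQuery scripting lets the `_dbt_max_partition` variable prune partitions on both sides of the merge.

```python
# hypothetical illustration of the generated BigQuery script
bq_script = """
declare _dbt_max_partition timestamp default (
    select max(collector_tstamp) from `analytics.snowplow_sessions`
);

merge into `analytics.snowplow_sessions` as DBT_INTERNAL_DEST
using (
    -- model SQL: only new partitions of the source table are scanned
    select * from `analytics.snowplow_event`
    where collector_tstamp >= _dbt_max_partition
) as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.event_id = DBT_INTERNAL_DEST.event_id
    -- also prunes the destination partitions the merge must scan
    and DBT_INTERNAL_DEST.collector_tstamp >= _dbt_max_partition
when matched then update set payload = DBT_INTERNAL_SOURCE.payload
when not matched then insert (event_id, collector_tstamp, payload)
    values (DBT_INTERNAL_SOURCE.event_id,
            DBT_INTERNAL_SOURCE.collector_tstamp,
            DBT_INTERNAL_SOURCE.payload)
"""
```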
cc @jtcohen6
Remaining todo:
- `_dbt_max_partition` scripting variable

Checklist:
- I have updated CHANGELOG.md and added information about my change to the "dbt next" section.