Feature/cost effective bq incremental followup #2140
Conversation
Commits updated: 27d21f9 → 15cac5a
@drewbanin I really like this addition. Based on your latest changes, it looks like we're going to use "fake" temp tables across the board for the time being, in order to avoid breaking changes to snapshots. There's very little downside to using "fake" temp tables in the incremental materialization; worst case, dbt fails to drop the "temp" table, and someone is charged for 12 hours of storage of that table, or about $0.33/TB (based on a monthly storage cost of $20/TB). Do you think the switch to "true" temp tables is something we'll push to a different PR / future version of dbt?
hey @jtcohen6 - I would love to use proper temp tables on BigQuery in the future. If you're curious, check out the problematic snapshot code: we need to use the temp table in the same script where the table is created, but that's not how the snapshot materialization works today. I think I'd like to update the snapshot materialization in a future PR. I did add a line to drop the temp table at the end of the incremental materialization, so the table hopefully won't hang around for any longer than it needs to regardless!
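For readers following along, here is a minimal sketch of the "fake" temp table idea discussed above, assuming BigQuery's `expiration_timestamp` table option (the helper and relation name are hypothetical, not dbt's actual implementation): instead of a true temporary table, create an ordinary table that expires on its own, so a failed drop costs at most a few hours of storage.

```python
from datetime import datetime, timedelta, timezone

def fake_temp_table_ddl(relation: str, select_sql: str, hours: int = 12) -> str:
    # Build DDL for a short-lived "fake" temp table: a real table whose
    # expiration_timestamp guarantees cleanup even if the explicit drop fails.
    expires = datetime.now(timezone.utc) + timedelta(hours=hours)
    return (
        f"create or replace table {relation}\n"
        f"options (expiration_timestamp = timestamp '{expires:%Y-%m-%d %H:%M:%S}+00')\n"
        f"as ({select_sql})"
    )

# hypothetical usage: create the staging table, merge from it, then drop it
# explicitly as a best effort; the expiration is the backstop
print(fake_temp_table_ddl("`my_project.my_dataset.my_model__dbt_tmp`", "select 1 as id"))
```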
Commit: …dbt into feature/cost-effective-bq-incremental-followup
```diff
@@ -83,12 +80,12 @@ def exception_handler(self, sql):
         yield
     except google.cloud.exceptions.BadRequest as e:
-        message = "Bad request while running:\n{sql}"
+        message = "Bad request while running query"
         self.handle_error(e, message, sql)
```
The existing BQ debug logs were incredibly verbose, printing the same SQL and error messages three times over. These changes should only print out the SQL once (when it is executed) and should suppress the re-printing of the query in the BQ exception message.
```python
@dataclass
class PartitionConfig(JsonSchemaMixin):
```
I am LARPing as a person who knows how to use dataclasses. I'm sure there's some subtlety here... is this the right way to specify these types?
Yup!
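For anyone following along, a minimal sketch of the pattern being confirmed here, assuming hologram's `JsonSchemaMixin` (field names taken from the diff above): plain type annotations on the dataclass are how the JSON schema gets derived, and `from_dict`/`to_dict` come along for free.

```python
from dataclasses import dataclass

from hologram import JsonSchemaMixin

@dataclass
class PartitionConfig(JsonSchemaMixin):
    # annotated fields become schema properties; a field without a
    # default is required, a defaulted field is optional
    field: str
    data_type: str = 'date'

# round-trip through the mixin-provided helpers
config = PartitionConfig.from_dict({'field': 'collector_tstamp'})
print(config.data_type)   # 'date' (default applied)
print(config.to_dict())   # {'field': 'collector_tstamp', 'data_type': 'date'}
```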
@beckjake much has changed here since you last took a look. Can you review this again please?
Some PR feedback, mostly on dataclasses. I'd either go all the way on the partition config being a dataclass/hologram type (my preference) or make it a dict all the way through.
Other than those things, looks great!
```python
@dataclass
class PartitionConfig(Dict[str, Any], JsonSchemaMixin):
```
If I understand what's going on here correctly, instead of inheriting this from `Dict`, I think you should pass `partition_by.to_dict()` in the places where you're using this as a dict.
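A short sketch of the alternative being suggested, assuming `JsonSchemaMixin` provides `to_dict()` (values are illustrative): keep `PartitionConfig` a plain dataclass and convert only at the boundary where an actual dict is required.

```python
# instead of `class PartitionConfig(Dict[str, Any], JsonSchemaMixin)`,
# keep the dataclass plain (as sketched earlier in this thread)...
partition_by = PartitionConfig(field='collector_tstamp', data_type='timestamp')

# ...and convert at the call sites that genuinely need a dict,
# e.g. when handing the config to a Jinja macro
partition_by_dict = partition_by.to_dict()
```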
```diff
 {{ sql_header if sql_header is not none }}

 create or replace table {{ relation }}
-  {{ partition_by(raw_partition_by) }}
+  {{ partition_by(partition_by_dict) }}
```
Suggestion: use `partition_by.to_dict()` here as well.
```sql
{%- set raw_partition_by = config.get('partition_by', none) -%}
{%- set partition_by = adapter.parse_partition_by(raw_partition_by) -%}
{%- set cluster_by = config.get('cluster_by', none) -%}
{% if not adapter.is_replaceable(old_relation, partition_by, cluster_by) %}
```
Use `partition_by.to_dict()` here, or change `is_replaceable` to take a regular `PartitionConfig`.
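A hedged sketch of that second option (the signature and comparison logic are my guesses, not dbt's actual code): let `is_replaceable` accept the `PartitionConfig` directly and compare it against the existing table's metadata.

```python
from typing import List, Optional

import google.cloud.bigquery

def is_replaceable(table: Optional[google.cloud.bigquery.Table],
                   conf: Optional['PartitionConfig'],
                   cluster_by: Optional[List[str]]) -> bool:
    # A table can be replaced in place only when its partitioning and
    # clustering already match the model's configuration.
    if table is None:
        return True
    if conf is not None:
        tp = table.time_partitioning
        if tp is None or tp.field != conf.field:
            return False
    return (table.clustering_fields or None) == (cluster_by or None)
```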
```python
@dataclass
class PartitionConfig(Dict[str, Any], JsonSchemaMixin):
    field: str
    data_type: str
```
```diff
-    data_type: str
+    data_type: str = 'date'
```
```python
if raw_partition_by.get('field'):
    if raw_partition_by.get('data_type'):
        return cls(**raw_partition_by)
    else:  # assume date type as default
        return cls(**raw_partition_by, data_type='date')
else:
    dbt.exceptions.raise_compiler_error(
        'Config `partition_by` is missing required item `field`'
    )
```
This looks like it could just be:

```python
try:
    return cls.from_dict(raw_partition_by)
except hologram.ValidationError:
    dbt.exceptions.raise_compiler_error(
        'Config `partition_by` is missing required item `field`'
    )
```
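To spell out the hologram behavior this simplification leans on (my understanding of the library, so treat the specifics as assumptions): `from_dict` validates the input against the schema generated from the annotations, applies the `'date'` default, and raises `ValidationError` when a required key like `field` is missing.

```python
import hologram

# using the PartitionConfig dataclass sketched earlier in this thread

# succeeds: data_type falls back to its declared default of 'date'
PartitionConfig.from_dict({'field': 'created_at'})

# raises hologram.ValidationError: 'field' has no default, so the
# generated schema marks it as required
try:
    PartitionConfig.from_dict({'data_type': 'timestamp'})
except hologram.ValidationError:
    print('missing required item `field`')
```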
```python
inferred_partition_by = {
    'field': partition_by,
    'data_type': data_type
}
```
How about:

```python
inferred_partition_by = cls(field=partition_by, data_type=data_type)
dbt.deprecations.warn(
    'bq-partition-by-string',
    raw_partition_by=raw_partition_by,
    inferred_partition_by=inferred_partition_by
)
return inferred_partition_by
```

Setting up a kwargs dictionary just to expand it seems unnecessary!
Commits updated: 7d43037 → 6fb8c96
Also, make tests run
Commits updated: 6fb8c96 → 4068c66
I have one more tiny comment here, and then we should ship the heck out of this PR.
This does not only Look Good To Me; this Looks Great To Me. Word on the street is that this PR is going to cut into GCP's profit margin... so glad that these incremental models are finally going to be really efficient!!
Co-Authored-By: Drew Banin <drew@fishtownanalytics.com>
Reopening from #1971, fixes #1034

Description

- Provides a `_dbt_max_partition` field for use in model code
- Example incremental model: if `snowplow.event` is partitioned by `collector_tstamp`, then only the partitions containing new data in `snowplow.event` will be scanned
- The `merge` statement will only scan the partitions in the destination table that will be updated by the merge

Before this PR, an incremental model which copies a 100 MB source table would scan 200 MB of data, as it required one full table scan against both the source and destination tables every time it ran. After this PR, the same incremental model will process less than 2 MB of data (lookups on partitioning fields) for an incremental build with no data to merge; a sketch of the generated script shape follows below.
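To make that concrete, here is a hedged sketch of the script shape described above (table, column, and alias names are invented, and the SQL dbt actually generates comes from its materialization macros and may differ): BigQuery scripting lets the `_dbt_max_partition` variable prune partitions on both sides of the merge.

```python
# hypothetical illustration of the generated BigQuery script
bq_script = """
declare _dbt_max_partition timestamp default (
    select max(collector_tstamp) from `analytics.snowplow_sessions`
);

merge into `analytics.snowplow_sessions` as DBT_INTERNAL_DEST
using (
    -- model SQL: only new partitions of the source table are scanned
    select * from `analytics.snowplow_event`
    where collector_tstamp >= _dbt_max_partition
) as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.event_id = DBT_INTERNAL_DEST.event_id
    -- also prunes the destination partitions the merge must scan
    and DBT_INTERNAL_DEST.collector_tstamp >= _dbt_max_partition
when matched then update set payload = DBT_INTERNAL_SOURCE.payload
when not matched then insert (event_id, collector_tstamp, payload)
    values (DBT_INTERNAL_SOURCE.event_id,
            DBT_INTERNAL_SOURCE.collector_tstamp,
            DBT_INTERNAL_SOURCE.payload)
"""
```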
cc @jtcohen6
Remaining todo:
- `_dbt_max_partition` scripting variable

Checklist:
- I have updated CHANGELOG.md and added information about my change to the "dbt next" section.