BQ clustering can improve merge performance #2196
Comments
@drewbanin and I had a chance to talk about this and game-plan for 0.16.0. To simplify the release, we are going to:
PR and documentation to come!

This is also important for merge costs in BigQuery: #2136 (set_sql_header works well with 'table', but not 'incremental')

related: fhoffa/code_snippets#2 (review)

@fhoffa Good point. I don't know if @drewbanin is up for trying to sneak that fix into the next minor release (0.16.0). If not, it could come in the next patch release.

@fhoffa It seems that the behavior we're seeing in
Description
On BigQuery, running a simple merge statement—as the incremental materialization does on versions <=0.15.2—appears to scan significantly less data if the target table is clustered.

This makes some intuitive sense when the cluster key and the `unique_key` for merge equality are identical, and even some slight sense when they're correlated (e.g. the merge key is `event_id`, the clustering is `session_id`, and the former is contained within the latter). I'm seeing the benefit for all clustered tables, however, no matter which column(s) the table is clustered by.

This feels like a relatively recent BQ performance improvement, and—wow. While it throws a small wrench into our 0.16.0 rework of the BQ incremental materialization, it's also a very exciting discovery. Big thanks to @clausherther for his help on this!
Benchmarking
I've been doing a lot of work yesterday and today trying to benchmark query runtime and cost according to three variables: modeling, data volume, and incremental strategy. I'll have more to say about this in Discourse and hope to present some of my findings in tomorrow's office hours. Broadly:
- At small data volumes, a simple `merge` into a clustered table is faster and scans less data than the multi-step, scripted, partition-based approach implemented in #2140.
- As the target table increases in size (> 50 GB), the partition-based approach is still slower, but it's increasingly more cost-effective than any simple `merge`, clustered or not.
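One way to pull the cost side of that comparison (not necessarily how the numbers above were gathered) is BigQuery's jobs metadata view; the `region-us` qualifier and the one-day window below are assumptions for the sketch:

```sql
-- Sketch: bytes scanned and runtime for recent MERGE jobs in this project.
select
    creation_time,
    total_bytes_processed,
    total_slot_ms,
    timestamp_diff(end_time, start_time, millisecond) as runtime_ms
from `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
where creation_time > timestamp_sub(current_timestamp(), interval 1 day)
  and statement_type = 'MERGE'
order by creation_time desc
```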
Next steps

I think we should reimplement the simple `merge` as the default BigQuery incremental strategy and document the finding around clustered tables' improved performance.

We should also allow users to turn on the new partition-based scripting approach using a `partition_merge` strategy. That would give us three strategies total: `merge` (simple, default), `partition_merge`, and `insert_overwrite`.
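As a sketch of what opting into that strategy could look like in a model config, assuming the value rides on the existing `incremental_strategy` config key (`partition_merge` is only the name proposed here, and the columns are made up):

```sql
-- models/events.sql
-- `incremental_strategy = 'partition_merge'` is the value proposed in this issue,
-- not something that exists in a released dbt version; columns are illustrative.
{{
    config(
        materialized = 'incremental',
        incremental_strategy = 'partition_merge',
        unique_key = 'event_id',
        partition_by = {'field': 'event_ts', 'data_type': 'timestamp'},
        cluster_by = 'session_id'
    )
}}

select * from {{ ref('stg_events') }}

{% if is_incremental() %}
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```

Models that don't set `incremental_strategy` would keep getting the simple `merge`, so the clustered-table benefit stays the default behavior.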