
[CT-1134] Partitioning on datatime column with hourly granularity #298

Closed
psalms945 opened this issue Sep 7, 2022 · 2 comments
Labels: bug, bytes_processed, incremental, partitioning

Comments

@psalms945

Describe the bug

I've tried to set up partitioning on one of my BigQuery tables following the instructions at https://docs.getdbt.com/reference/resource-configs/bigquery-configs#partition-clause, but the "Bytes billed" cost of the merge statement is higher than I expect.

Expected behavior

I would expect to pay only for the partitions I specify to read and write.

Steps To Reproduce

This is my model:

{% set partitions_to_replace = [
    '"2022-07-01 00:00:00"',
    '"2022-07-01 01:00:00"'
] %}
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by={
        "field": "created",
        "data_type": "datetime",
        "granularity": "hour"
    },
    partitions = partitions_to_replace
)}}
WITH source AS (
    SELECT * FROM `myproject`.`dataset`.`source1`
)
SELECT *
FROM source
WHERE 1=1
{% if is_incremental() %}
    AND created in ({{ partitions_to_replace | join(',') }})
{% endif %}

The SQL dbt generates looks like this:

merge into `myproject`.`dataset`.`mart1` as DBT_INTERNAL_DEST
using (
  WITH source AS (
    SELECT * FROM `myproject`.`dataset`.`source1`
)
SELECT *
FROM source
WHERE 1=1
  AND created in ("2022-07-01 00:00:00","2022-07-01 01:00:00")
) as DBT_INTERNAL_SOURCE
  on FALSE
when not matched by source
and datetime_trunc(DBT_INTERNAL_DEST.created, hour) in ("2022-07-01 00:00:00", "2022-07-01 01:00:00") 
  then delete
when not matched 
then insert(`id`, `value`, `created`) values(`id`, `value`, `created`)

Screenshots and log output

09:10:45  Running with dbt=1.2.1
09:10:47  Found 1 model, 0 tests, 0 snapshots, 0 analyses, 287 macros, 0 operations, 0 seed files, 1 source, 0 exposures, 0 metrics
09:10:47
09:10:49  Concurrency: 1 threads (target='dev')
09:10:49
09:10:49  1 of 1 START incremental model dataset.mart1 .......................... [RUN]
09:10:58  1 of 1 OK created incremental model dataset.mart1 ..................... [MERGE (530.0 rows, 30.9 MB processed) in 9.48s]
09:10:58
09:10:58  Finished running 1 incremental model in 0 hours 0 minutes and 11.33 seconds (11.33s).

The size of the destination mart table is 30 MB, and the size of the partitions I specify should be less than 1 MB.

Additional context

I found that the datetime_trunc call is what causes the query to scan all 30 MB of the destination table.
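
As an illustration only (a sketch, not output from my run), the difference shows up when querying the destination table directly; given the behavior above, the first filter prunes to the two hourly partitions while the second scans the whole table:

-- Prunes: filters directly on the hourly-partitioned datetime column
SELECT *
FROM `myproject`.`dataset`.`mart1`
WHERE created IN (DATETIME "2022-07-01 00:00:00", DATETIME "2022-07-01 01:00:00");

-- Does not prune: wrapping the partition column in DATETIME_TRUNC forces a full scan
SELECT *
FROM `myproject`.`dataset`.`mart1`
WHERE DATETIME_TRUNC(created, HOUR) IN (DATETIME "2022-07-01 00:00:00", DATETIME "2022-07-01 01:00:00");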

My current workaround is (a config sketch follows this list):

  1. Round the datetime column of the source table to the hour before running dbt run
  2. Change the data_type setting from "datetime" to "date" after the first run
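
Roughly, the config after step 2 looks like the sketch below (only data_type changes relative to the model above; the granularity key is dropped here, and partitions_to_replace stays as defined in the original model):

{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {
        "field": "created",
        "data_type": "date"
    },
    partitions = partitions_to_replace
) }}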

The SQL dbt then generates looks like this:

merge into `myproject`.`dataset`.`mart1` as DBT_INTERNAL_DEST
using (
  WITH source AS (
    SELECT * FROM `myproject`.`dataset`.`source1`
)
SELECT *
FROM source
WHERE 1=1
  AND created in ("2022-07-01 00:00:00","2022-07-01 01:00:00")
) as DBT_INTERNAL_SOURCE
  on FALSE
when not matched by source
and DBT_INTERNAL_DEST.created in ("2022-07-01 00:00:00", "2022-07-01 01:00:00") 
  then delete
when not matched 
then insert(`id`, `value`, `created`) values(`id`, `value`, `created`)

By the way, is there a way to specify a time range of partitions rather than a list of partitions? That would be more flexible when I want to replace a continuous range of partitions.

@github-actions github-actions bot changed the title from "Partitioning on datatime column with hourly granularity" to "[CT-1134] Partitioning on datatime column with hourly granularity" Sep 7, 2022
@jtcohen6 jtcohen6 added the bug and triage labels and removed the triage label Sep 7, 2022
@jtcohen6
Contributor

jtcohen6 commented Sep 7, 2022

@AidenPTHuang This has come up before! I believe this is a bug in BigQuery. Slack thread from almost a year ago:

Everything you're doing looks right to me. I don't understand why BigQuery does not effectively prune partitions, and reduce query cost, while using DATETIME_TRUNC. This is supported for DATE_TRUNC and TIMESTAMP_TRUNC. It's not something I've been able to find any documentation or rationale for...

In the meantime, if you are able to switch the data type from datetime to timestamp, I think that should work the way you'd expect.

It sounds like that's the same basic workaround you've taken here, too.
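
For anyone else hitting this, a minimal sketch of what the datetime-to-timestamp switch could look like in the config (assuming the created column can be stored or cast as timestamp, and keeping the rest of the model from the original post unchanged):

{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {
        "field": "created",
        "data_type": "timestamp",
        "granularity": "hour"
    },
    partitions = partitions_to_replace
) }}

With a timestamp column, the generated delete predicate should use timestamp_trunc, which (per the quote above) BigQuery can prune.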

By the way, is there a way to specify a time range of partitions rather than a list of partitions? That would be more flexible when I want to replace a continuous range of partitions.

Related: #16

@psalms945
Author

@jtcohen6 Thanks for the reply. I changed to use timestamp instead of datetime, and it's working as I expected.

@dbeatty10 dbeatty10 added the partitioning, incremental, and bytes_processed labels May 30, 2024