
[CT-1134] Partitioning on datatime column with hourly granularity #298

Closed
psalms945 opened this issue Sep 7, 2022 · 2 comments
Labels: bug, bytes_processed, incremental, partitioning

Comments

@psalms945

Describe the bug

I've tried to set up partitioning on one of my BigQuery tables following the instructions at https://docs.getdbt.com/reference/resource-configs/bigquery-configs#partition-clause, but the "Bytes billed" cost of the merge statement is higher than I expect.

Expected behavior

I would expect to pay only for the partitions I specify to read and write.

Steps To Reproduce

This is my model:

{% set partitions_to_replace = [
    '"2022-07-01 00:00:00"',
    '"2022-07-01 01:00:00"'
] %}
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by={
        "field": "created",
        "data_type": "datetime",
        "granularity": "hour"
    },
    partitions = partitions_to_replace
)}}
WITH source AS (
    SELECT * FROM `myproject`.`dataset`.`source1`
)
SELECT *
FROM source
WHERE 1=1
{% if is_incremental() %}
    AND created in ({{ partitions_to_replace | join(',') }})
{% endif %}

The SQL dbt generates looks like this:

merge into `myproject`.`dataset`.`mart1` as DBT_INTERNAL_DEST
using (
  WITH source AS (
    SELECT * FROM `myproject`.`dataset`.`source1`
)
SELECT *
FROM source
WHERE 1=1
  AND created in ("2022-07-01 00:00:00","2022-07-01 01:00:00")
) as DBT_INTERNAL_SOURCE
  on FALSE
when not matched by source
and datetime_trunc(DBT_INTERNAL_DEST.created, hour) in ("2022-07-01 00:00:00", "2022-07-01 01:00:00") 
  then delete
when not matched 
then insert(`id`, `value`, `created`) values(`id`, `value`, `created`)

Screenshots and log output

09:10:45  Running with dbt=1.2.1
09:10:47  Found 1 model, 0 tests, 0 snapshots, 0 analyses, 287 macros, 0 operations, 0 seed files, 1 source, 0 exposures, 0 metrics
09:10:47
09:10:49  Concurrency: 1 threads (target='dev')
09:10:49
09:10:49  1 of 1 START incremental model dataset.mart1 .......................... [RUN]
09:10:58  1 of 1 OK created incremental model dataset.mart1 ..................... [MERGE (530.0 rows, 30.9 MB processed) in 9.48s]
09:10:58
09:10:58  Finished running 1 incremental model in 0 hours 0 minutes and 11.33 seconds (11.33s).

The size of the destination mart table is 30 MB, and the size of the partitions I specify should be less than 1 MB.

Additional context

I found that the datetime_trunc call is what causes the query to scan all 30 MB of the destination table.
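
As an illustration only (a sketch, not output from my run), the difference shows up when querying the destination table directly; given the behavior above, the first filter prunes to the two hourly partitions while the second scans the whole table:

-- Prunes: filters directly on the hourly-partitioned datetime column
SELECT *
FROM `myproject`.`dataset`.`mart1`
WHERE created IN (DATETIME "2022-07-01 00:00:00", DATETIME "2022-07-01 01:00:00");

-- Does not prune: wrapping the partition column in DATETIME_TRUNC forces a full scan
SELECT *
FROM `myproject`.`dataset`.`mart1`
WHERE DATETIME_TRUNC(created, HOUR) IN (DATETIME "2022-07-01 00:00:00", DATETIME "2022-07-01 01:00:00");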

My current workaround is (a config sketch follows this list):

  1. Round the datetime column of the source table to the hour before running dbt run
  2. Change the data_type setting from "datetime" to "date" after the first run
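
Roughly, the config after step 2 looks like the sketch below (only data_type changes relative to the model above; the granularity key is dropped here, and partitions_to_replace stays as defined in the original model):

{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {
        "field": "created",
        "data_type": "date"
    },
    partitions = partitions_to_replace
) }}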

The SQL dbt then generates looks like this:

merge into `myproject`.`dataset`.`mart1` as DBT_INTERNAL_DEST
using (
  WITH source AS (
    SELECT * FROM `myproject`.`dataset`.`source1`
)
SELECT *
FROM source
WHERE 1=1
  AND created in ("2022-07-01 00:00:00","2022-07-01 01:00:00")
) as DBT_INTERNAL_SOURCE
  on FALSE
when not matched by source
and DBT_INTERNAL_DEST.created in ("2022-07-01 00:00:00", "2022-07-01 01:00:00") 
  then delete
when not matched 
then insert(`id`, `value`, `created`) values(`id`, `value`, `created`)

By the way, is there a way to specify a time range of partitions rather than a list of partitions? That would be more flexible when I want to replace a continuous range of partitions.

@github-actions github-actions bot changed the title from "Partitioning on datatime column with hourly granularity" to "[CT-1134] Partitioning on datatime column with hourly granularity" Sep 7, 2022
@jtcohen6 jtcohen6 added the bug and triage labels and removed the triage label Sep 7, 2022
@jtcohen6
Contributor

jtcohen6 commented Sep 7, 2022

@AidenPTHuang This has come up before! I believe this is a bug in BigQuery. Slack thread from almost a year ago:

Everything you're doing looks right to me. I don't understand why BigQuery does not effectively prune partitions, and reduce query cost, while using DATETIME_TRUNC. This is supported for DATE_TRUNC and TIMESTAMP_TRUNC. It's not something I've been able to find any documentation or rationale for...

In the meantime, if you are able to switch the data type from datetime to timestamp, I think that should work the way you'd expect.

It sounds like that's the same basic workaround you've taken here, too.
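
For anyone else hitting this, a minimal sketch of what the datetime-to-timestamp switch could look like in the config (assuming the created column can be stored or cast as timestamp, and keeping the rest of the model from the original post unchanged):

{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {
        "field": "created",
        "data_type": "timestamp",
        "granularity": "hour"
    },
    partitions = partitions_to_replace
) }}

With a timestamp column, the generated delete predicate should use timestamp_trunc, which (per the quote above) BigQuery can prune.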

By the way, is there a way to specify a time range of partitions rather than a list of partitions? That would be more flexible when I want to replace a continuous range of partitions.

Related: #16

@psalms945
Author

@jtcohen6 Thanks for the reply. I changed to use timestamp instead of datetime, and it's working as I expected.

@dbeatty10 dbeatty10 added the partitioning, incremental, and bytes_processed labels May 30, 2024