Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow require_partition_filter=true for BigQuery table creation #1843

Closed
hui-zheng opened this issue Oct 19, 2019 · 8 comments · Fixed by #2928
Closed

Allow require_partition_filter=true for BigQuery table creation #1843

hui-zheng opened this issue Oct 19, 2019 · 8 comments · Fixed by #2928
Labels
bigquery enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors!
Milestone

Comments

@hui-zheng
Copy link

hui-zheng commented Oct 19, 2019

Describe the feature

I would like to have enable require_partition_filter=true in table and incremental mode.
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#table_option_list

for example, we could config model as

{{
  config (
    materialized='table',
    partition_by='DATE(stamp)',
    partition_by_required='true',
  )
}}

Currently, in 0.14.3, bigquery/adapters.sql, only persist_docs and temporary options is enable.
Could we make bigquery_table_options macro below flexible to take in any other options, so that we could pass a generic key/value list in model config to create table options
{% macro bigquery_table_options(persist_docs, temporary) %}

for example, we could config model as

{{
  config (
    materialized='table',
    partition_by='DATE(stamp)',
    options = '{ partition_by_required=true, partition_expiration_days=7 }'
  )
}}

Additional context

This feature is BigQuery-specific.

Who will this benefit?

This feature will benefit anyone who use BigQuery partition tables, and users of BigQuery in general.

@hui-zheng hui-zheng added enhancement New feature or request triage labels Oct 19, 2019
@drewbanin drewbanin added bigquery and removed triage labels Oct 19, 2019
@drewbanin
Copy link
Contributor

drewbanin commented Oct 19, 2019

cool idea @hui-zheng! See also #1829 -- both of these issues require dbt to set options appropriately in the create table as statements that are run for table/incremental models.

I think we could support this pretty readily:

  1. Add require_partition_filter as a "known" config in the plugin code
  2. Add some logic in the bigquery specific plugin macros which checks for a require_partition_filter config, adding it to the list of supplied options

And that's it! What do you think about that?

@hui-zheng
Copy link
Author

yes, I agree with this approach.

@soltanianalytics
Copy link

soltanianalytics commented Oct 28, 2019

I've just done a minimal-effort implementation by overwriting the essential macros in my own macros folder and just want to add what I found out on the way: You will also need to touch the merge statements as they will not execute with require_partition_filter=true unless you specify the filter in the on statement of the merge. In other words, my common_get_merge_sql now looks something like this:

{% macro common_get_merge_sql(target, source, unique_key, dest_columns) -%}
...
    merge into {{ target }} as DBT_INTERNAL_DEST
    using {{ source }} as DBT_INTERNAL_SOURCE
    {% if unique_key %}
        on DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}
        {# The if statement below is new #}
        {% if config.get('part_filter_required', True) %}
          and DBT_INTERNAL_DEST.{{ config.get('part_date_column', default='time') }} > {{some_logic}}
        {% endif %}
    {% else %}
        on FALSE
    {% endif %}
...
{%- endmacro %}

This is (purposefully) hacky but works for me, I'm not recommending going this route, just wanted to make you aware of the limitation. Note that you will also need to supply the argument as a literal, something like SELECT MAX(column) FROM {{this}} as {{some_logic}} is not going to work. I solved it by calling a macro which calls statements which get me a datetime literal that works for my purposes.

The compiled and ready to run sql then looks something like this:

    merge into `[project]`.`[dataset].`[relation]` as DBT_INTERNAL_DEST
    using (
SELECT
	GENERATE_UUID() AS id
	, time
...
FROM `[project]`.`[source_schema]`.`[source_relation]`
WHERE time > '2019-10-26 20:55:40' 
GROUP BY
    2, ...
    ) as DBT_INTERNAL_SOURCE
        on DBT_INTERNAL_SOURCE.id = DBT_INTERNAL_DEST.id
          AND DBT_INTERNAL_DEST.time > '2019-10-26 20:55:40'
    when matched then update set
        id = DBT_INTERNAL_SOURCE.id,time = DBT_INTERNAL_SOURCE.time,...
    when not matched then insert
        (id, time, ...)
    values
        (id, time, ...)

Edit 2019-11-04: I wanted to add that if the merge joins on FALSE because there is no unique_key specified, there is also no need to include the partition column in the on statement. In other words, this would also work:

merge into `[project]`.`[dataset].`[relation]` as DBT_INTERNAL_DEST
    using (
SELECT
	GENERATE_UUID() AS id
	, time
...
FROM `[project]`.`[source_schema]`.`[source_relation]`
WHERE time > '2019-10-26 20:55:40' 
GROUP BY
    2, ...
    ) as DBT_INTERNAL_SOURCE
        on FALSE
    when matched then update set
        id = DBT_INTERNAL_SOURCE.id,time = DBT_INTERNAL_SOURCE.time,...
    when not matched then insert
        (id, time, ...)
    values
        (id, time, ...)

So the one thing that doesn't work is having both a unique_key and required partitioning filter.

@drewbanin
Copy link
Contributor

@soltanianalytics thanks so much for the heads up! The logic you implemented reminds me a lot of #1034. I think this is a problem we'll generally need to solve on BQ sometime soon - it's great to know that these two pieces of functionality are related.

@drewbanin
Copy link
Contributor

We changed some of the relevant code around. The eventual implementation will need to:

  1. Add an adapter-specific config here
  2. Return the require_partiton_filter option for table creation here

@profoundpanda
Copy link

Hey @drewbanin just wanted to check if this something that will be supported sometime soon? Would be a super useful config variable to have available.

@drewbanin drewbanin added the good_first_issue Straightforward + self-contained changes, good for new contributors! label Jul 9, 2020
@jtcohen6
Copy link
Contributor

jtcohen6 commented Jul 9, 2020

Agree that this would be a great first issue for an external contributor!

One small thing to note about tables that require partition filters, very much outside the scope of the actual code change required here:

Built-in schema tests will not work on these models. You'll need to instead use tests like unique_where, not_null_where, relationships_where. It could be cool to play around with behavior like:

models:
  - name: my_model
    columns:
      - name: id
        tests:
          - dbt_utils.unique_where:
              where: "{{ config.get('partition_by') ~ '= current_date' if config.get('partition_by_required') else '1=1' }}"

@yu-iskw
Copy link
Contributor

yu-iskw commented Dec 3, 2020

If no one hasn't worked on it yet, I will send a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bigquery enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors!
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants