[Feature] Could micro-batch strategy solve most of my windowing problems for incremental models? #11059

Open
3 tasks done
jschintz-nytimes opened this issue Nov 27, 2024 · 0 comments
Labels
enhancement New feature or request triage

Comments

@jschintz-nytimes
Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Microbatch is fantastic! For an org with large-ish datasets, the feature makes moving data easy and straightforward.

But the hardest step for any incremental model has always been figuring out what to do with windowing logic while still keeping model pipelines performant.

Microbatch should handle windowing logic.

Describe alternatives you've considered

Should there just be a blog post explaining how users can custom-fit their microbatch strategies?

Here's an example where the microbatch model is customized to:

  • Get a 15-day window for every day if the date is within the past 2 weeks. This handles late-arriving data.
  • Get a 7-day window for 1 day a week if the date is older than 2 weeks ago.

```sql
{# Custom microbatch selection for windowing #}
{# Get the start time for a microbatch & a cutoff date 15 days ago #}
{%- set microbatch_day -%}
    {{ model.config.__dbt_internal_microbatch_event_time_start or modules.datetime.date(1899, 1, 1) }}
{%- endset -%}
{% set fifteen_days_ago = (modules.datetime.date.today() - modules.datetime.timedelta(days=15)) %}
{# In BigQuery, dayofweek = 2 is Monday (1 is Sunday) #}
{% set only_run_mondays %}
    and EXTRACT(dayofweek FROM DATE('{{ microbatch_day }}')) = 2
{% endset %}
with
    filter_microbatch__user_daily_agent_id_active_days_tbl as (

        select *
        from {{ ref('user_daily_agent_id_active_days_tbl').render() }}
        -- Override default microbatch behavior with a 7 day window.
        -- This processes the full week for each lookback window, which is wasteful.
        where date(_pt) >= date_trunc(date('{{ microbatch_day }}'), week(monday)) - 7
              and date(_pt) < date('{{ microbatch_day }}')
              {# Performance improvement: only run the partition on 1 day of the week if it is older than 15 days ago. #}
              {{ only_run_mondays if fifteen_days_ago > modules.datetime.date.fromisoformat(microbatch_day) }}
    )

select
    {{ dbt_utils.generate_surrogate_key(['fiscal_week', 'agent_id']) }} as uid,
    fiscal_week,
    fiscal_week as _pt,
    agent_id,
    max(1) as is_active_this_week,
    count(distinct _pt) as active_days,
    sum(is_active_on_news_today) as active_days_on_news
from filter_microbatch__user_daily_agent_id_active_days_tbl
group by all
```
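
To make the selection rule easier to reason about outside of dbt, here is a rough Python sketch of what the Jinja and `where` clause above appear to do, as I read them. The function names `should_process` and `source_window` are made up for illustration and are not part of dbt. The sketch assumes the batch start can be treated as a plain date.

```python
from datetime import date, timedelta

# Mirrors `fifteen_days_ago` in the model above (assumed cutoff).
RECENT_DAYS = 15


def should_process(batch_day: date, today: date) -> bool:
    """Return True if the batch for `batch_day` would select any rows.

    Recent batches always run; batches older than the cutoff
    only run when the batch day is a Monday.
    """
    cutoff = today - timedelta(days=RECENT_DAYS)
    if batch_day >= cutoff:
        return True
    # Python's weekday() == 0 is Monday, matching dayofweek = 2 in BigQuery.
    return batch_day.weekday() == 0


def source_window(batch_day: date) -> tuple[date, date]:
    """Return the half-open [start, end) range of source dates that are read.

    start = Monday of the batch day's week, minus 7 days
    end   = the batch day itself (exclusive)
    """
    week_start = batch_day - timedelta(days=batch_day.weekday())
    return week_start - timedelta(days=7), batch_day
```

For example, with `today = date(2024, 11, 27)`, a batch for Tuesday 2024-11-05 is skipped, while the batch for Monday 2024-11-04 runs and reads exactly the previous full week (2024-10-28 up to, but not including, 2024-11-04).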

Who will this benefit?

The dbt community has been talking about windowing and incremental models since 2018, and look, Tristan even weighed in!

Any user of the microbatch strategy will probably benefit. Most models eventually mature to use window functions.

Are you interested in contributing this feature?

Sure!

Anything else?

No response

jschintz-nytimes added the enhancement and triage labels on Nov 27, 2024