[EPIC] Incremental Model Improvements - Microbatch #10624
Comments
Just for my understanding: is it right that this issue seeks to address technical (performance/load) issues in models that take just a single ref as their source (or, if they have other sources as well, we assume those to be stale)? I am looking for ways to support incremental processing of multi-table join models (e.g. https://discourse.getdbt.com/t/template-for-complex-incremental-models/10054, but I've seen many more similar help requests on community forums). To be sure, such features will not be in scope, right?
@MaartenN1234 I'm not sure that I fully understand the question being asked. For my clarity, is the question whether this new functionality will support more than one input to an incremental model? If so, the answer is yes! For example, say we turn the jaffle-shop customers model into an incremental microbatch model. It'd look like the following:
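To make that concrete, here is a minimal sketch of such a model, assuming the microbatch strategy with an `event_time`-style column config plus the `batch_size` and `begin` properties tracked in this epic; the jaffle-shop column names (`ordered_at`, `customer_name`) are illustrative, and the sketch assumes `stg_orders` declares an event-time column while `stg_customers` does not:

```sql
-- models/customers.sql — illustrative microbatch sketch, not the exact example from the thread
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='ordered_at',   -- column used to slice this model into time batches
        batch_size='day',          -- one batch per day
        begin='2020-01-01'         -- earliest date to build from when the table does not exist yet
    )
}}

-- Inputs whose upstream models declare an event-time column get an automatic WHERE
-- filter for the current batch's window; inputs without one are read in full.
select
    customers.customer_id,
    customers.customer_name,
    count(orders.order_id) as count_lifetime_orders,
    max(orders.ordered_at) as ordered_at
from {{ ref('stg_customers') }} as customers
left join {{ ref('stg_orders') }} as orders
    on customers.customer_id = orders.customer_id
group by 1, 2
```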
If the models
The critical requirement for me is that matching rows (on the join condition) in the two sources are not necessarily created in the same batch. So when the filter is applied to the sources independently, the results will be wrong (e.g. if we load one more order, we lose all previous orders from the aggregate; or when the customer data is updated while no new orders for that customer are to be processed, the update will not be propagated). To get it right, it should become somewhat like this: one needs to incorporate the join clause and the aggregation into the change detection (see the sketch below).
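A rough sketch of the pattern being asked for, assuming a conventional `is_incremental()` model rather than the microbatch strategy, with hypothetical `updated_at` columns on both staging models and a `loaded_at` watermark on the target (none of these names come from this issue). The change detection spans the join: first collect the customer keys touched on either side since the last run, then recompute the aggregate over full history for just those keys:

```sql
{{ config(materialized='incremental', unique_key='customer_id') }}

with affected_customers as (
    -- A customer is affected if either their customer row or any of their
    -- orders changed since the last successful run of this model.
    select customer_id
    from {{ ref('stg_customers') }}
    {% if is_incremental() %}
    where updated_at > (select max(loaded_at) from {{ this }})
    {% endif %}

    union

    select customer_id
    from {{ ref('stg_orders') }}
    {% if is_incremental() %}
    where updated_at > (select max(loaded_at) from {{ this }})
    {% endif %}
)

-- Recompute the aggregate over *all* orders, but only for affected customers.
select
    customers.customer_id,
    customers.customer_name,
    count(orders.order_id) as count_lifetime_orders,
    current_timestamp      as loaded_at
from {{ ref('stg_customers') }} as customers
left join {{ ref('stg_orders') }} as orders
    on customers.customer_id = orders.customer_id
where customers.customer_id in (select customer_id from affected_customers)
group by 1, 2
```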
Incremental models in dbt are a materialization strategy designed to efficiently update your data warehouse tables by transforming and loading only new or changed data since the last run. Instead of processing your entire dataset every time, incremental models append or update only the new rows, significantly reducing the time and resources required for your data transformations.
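For context, a minimal example of this existing pattern, assuming a hypothetical upstream model `stg_events` with an `event_ts` timestamp column:

```sql
{{ config(materialized='incremental') }}

select *
from {{ ref('stg_events') }}

{% if is_incremental() %}
-- On incremental runs, only pick up rows newer than what is already in the target table.
where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```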
Even with all the benefits of incremental models as they exist today, there are limitations with this approach, such as:
- Reprocessing historical data (e.g. in `full-refresh` mode) is done in "one big" SQL statement: it can time out, and if it fails you end up needing to retry already successful partitions, etc.
- Controlling which time window gets processed (e.g. for backfills or late-arriving data) relies on hand-rolled `vars` and Jinja logic.
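As an illustration of the second point, a common hand-rolled workaround today (not taken from this issue) is to drive the processing window from `vars` supplied on the command line, e.g. `dbt run --vars '{start_date: "2024-01-01", end_date: "2024-02-01"}'`, with hypothetical `stg_events` / `event_ts` names:

```sql
{{ config(materialized='incremental') }}

select *
from {{ ref('stg_events') }}
-- The window boundaries come from CLI vars, so backfills require the caller
-- to compute and pass the right dates by hand.
where event_ts >= '{{ var("start_date") }}'
  and event_ts <  '{{ var("end_date") }}'
```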
In this project we're aiming to make incremental models easier to implement and more efficient to run.
P0s - Core Framework
- `--event-start-time` and `--event-end-time` as CLI flags for specifying microbatch start and end times #10635
- `batch_size` as a top-level property of a model #10637
- `WHERE` filter for reading data of microbatch incremental builds from properties #10638
- `--event-start-time` and `--event-end-time` values if not provided #10636
- `lookback` as top level model property #10662
- `ref` and `source` should have microbatch `WHERE` filter when appropriate #10639
- `dbt retry` for microbatch models #10729
- `begin` config for microbatch incremental models #10701
- `full_refresh` config during batch creation #10785
- `dbt retry` scenarios #10800
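Taken together, these sub-issues imply a model config roughly like the sketch below. The property names (`batch_size`, `lookback`, `begin`) come from the issue titles above, while the model and column names are hypothetical and the exact final spelling of each option may differ from what eventually ships:

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='session_started_at',  -- hypothetical timestamp column
        batch_size='day',                 -- granularity of each batch (#10637)
        lookback=3,                       -- also reprocess the trailing batches (#10662)
        begin='2023-01-01'                -- earliest point to build from (#10701)
    )
}}

select * from {{ ref('stg_sessions') }}  -- hypothetical upstream model
```

Per #10635 and #10636, a run could then be bounded explicitly with the proposed `--event-start-time` and `--event-end-time` flags, or fall back to derived values when they are not provided.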
P0s - Adapters
Beta bugs
- `0` results in consistently incomplete batches #10867
- `event_column` is of type `date` #10868
- `partial success` #10999
P1s
- `--event-end-time` should require `--event-start-time` and vice versa #10874
P2s
- Make `begin` on microbatch incremental models optional by calculating min of mins #10702