
Microbatch: batched execution #10677

Merged: 39 commits into main from microbatch-chunked-backfill, Sep 18, 2024

Conversation

MichelleArk (Contributor) commented Sep 6, 2024

Resolves #10700

Problem

For microbatch models, it should be possible to:

  • Split the execution of a microbatch model into batches, regardless of whether it is a full-refresh or incremental run
  • Run a single adapter (main) query per batch (i.e., one query per partition)
  • Fail gracefully if a single batch fails

Solution

  • Move start/end time computation from the provider to MicrobatchBuilder
  • Build the start/end times and the list of batches during ModelRun execution for microbatch models
  • Execute each batch (see the sketch after this list):
    • Populate jinja context vars so refs are recompiled for each batch
    • Re-compile the model and update jinja context vars such as is_incremental and should_full_refresh
    • Create a run result per batch
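
To make the batching and per-batch failure isolation concrete, here is a minimal, self-contained sketch of the loop described above. None of these names (BatchResult, build_batches, run_batches, run_one) come from dbt-core; they are illustrative stand-ins, and the real implementation compiles and runs a materialization per batch rather than calling a plain function:

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Optional, Tuple

@dataclass
class BatchResult:
    start: datetime
    end: datetime
    success: bool
    error: Optional[str] = None

def build_batches(
    start: datetime, end: datetime, step: timedelta
) -> List[Tuple[datetime, datetime]]:
    # Split [start, end) into contiguous (batch_start, batch_end) windows.
    batches = []
    cursor = start
    while cursor < end:
        batches.append((cursor, min(cursor + step, end)))
        cursor += step
    return batches

def run_batches(
    start: datetime,
    end: datetime,
    step: timedelta,
    run_one: Callable[[datetime, datetime], None],
) -> List[BatchResult]:
    # One "main query" per batch; a failing batch is recorded, not fatal.
    results = []
    for batch_start, batch_end in build_batches(start, end, step):
        try:
            run_one(batch_start, batch_end)
            results.append(BatchResult(batch_start, batch_end, success=True))
        except Exception as exc:  # fail gracefully and continue with the next batch
            results.append(
                BatchResult(batch_start, batch_end, success=False, error=str(exc))
            )
    return results

# Example: three daily batches, each executed independently.
results = run_batches(
    datetime(2024, 9, 1), datetime(2024, 9, 4), timedelta(days=1),
    run_one=lambda s, e: print(f"running batch {s:%Y-%m-%d} to {e:%Y-%m-%d}"),
)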

Checklist

  • I have read the contributing guide and understand what's expected of me.
  • I have run this code in development, and it appears to resolve the stated issue.
  • This PR includes tests, or tests are not required or relevant for this PR.
  • This PR has no interface changes (e.g., macros, CLI, logs, JSON artifacts, config files, adapter interface, etc.) or this PR has already received feedback and approval from Product or DX.
  • This PR includes type annotations for new and modified functions.

🎩 Example batch-level failure:
[screenshot: batch-level failure console output, 2024-09-16]

cla-bot added the cla:yes label Sep 6, 2024
github-actions bot commented Sep 6, 2024

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

codecov bot commented Sep 6, 2024

Codecov Report

Attention: Patch coverage is 98.63014% with 2 lines in your changes missing coverage. Please review.

Project coverage is 88.95%. Comparing base (c6b8f7e) to head (737ae3d).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10677      +/-   ##
==========================================
+ Coverage   88.90%   88.95%   +0.05%     
==========================================
  Files         180      181       +1     
  Lines       22856    22959     +103     
==========================================
+ Hits        20319    20423     +104     
+ Misses       2537     2536       -1     
Flag          Coverage Δ
integration   86.14% <88.35%> (+0.03%) ⬆️
unit          62.37% <58.21%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown.

Components          Coverage Δ
Unit Tests          62.37% <58.21%> (-0.06%) ⬇️
Integration Tests   86.14% <88.35%> (+0.03%) ⬆️

QMalcolm force-pushed the event-time-ref-filtering branch from e4138c5 to 3a6c739 September 11, 2024 16:58
Base automatically changed from event-time-ref-filtering to main September 12, 2024 22:16
MichelleArk changed the title from "rough in: chunked backfills" to "Microbatch: batched execution" Sep 14, 2024
MichelleArk force-pushed the microbatch-chunked-backfill branch from c6b2ccc to 71a526c September 14, 2024 02:32
MichelleArk force-pushed the microbatch-chunked-backfill branch from 71a526c to a31e703 September 14, 2024 02:58
MichelleArk (Contributor, Author) commented Sep 16, 2024

🤷 Clicking into the failing codecov check, things look okay! Perhaps the GH response is stale somehow.

[screenshot: codecov check details, 2024-09-16]

MichelleArk marked this pull request as ready for review September 16, 2024 05:40
MichelleArk requested a review from a team as a code owner September 16, 2024 05:40
QMalcolm (Contributor) left a review:

Requesting changes and am gonna follow up by making said changes 😅

Comment on lines +403 to +410
microbatch_builder = MicrobatchBuilder(
model=model,
is_incremental=self._is_incremental(model),
event_time_start=getattr(self.config.args, "EVENT_TIME_START", None),
event_time_end=getattr(self.config.args, "EVENT_TIME_END", None),
)
end = microbatch_builder.build_end_time()
start = microbatch_builder.build_start_time(end)
QMalcolm (Contributor) commented:

When I was reading through MicrobatchBuilder, I found it odd that build_end_time and build_start_time weren't private functions that __init__ calls to default event_time_start/event_time_end. With this bit of code here, I still think so: the start/end are always going to be specific to a given MicrobatchBuilder instance. Perhaps as a fast follow to this PR we should investigate whether it can be reduced to that. The code here would then become:

...
microbatch_builder = MicrobatchBuilder(
  model=model,
  is_incremental=self._is_incremental(model),
  event_time_start=getattr(self.config.args, "EVENT_TIME_START", None),
  event_time_end=getattr(self.config.args, "EVENT_TIME_END", None),
)
batches = microbatch_builder.build_batches()
...

The alternative would be to make the MicrobatchBuilder class less specific to a model, which also seems like a valid approach. Right now, though, we're somewhere in the middle: a class that is specific to a model, but with methods we're expected to call whose return values should never change for the associated model.
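
For illustration, here is a hypothetical sketch of the suggested shape, where defaulting happens once in __init__ and callers only see build_batches(). The method names mirror the ones discussed above, but the bodies are simplified stand-ins, not the actual dbt-core logic:

from datetime import datetime, timedelta
from typing import List, Optional, Tuple

class MicrobatchBuilder:
    def __init__(
        self,
        model,
        is_incremental: bool,
        event_time_start: Optional[datetime],
        event_time_end: Optional[datetime],
    ) -> None:
        self.model = model
        self.is_incremental = is_incremental
        # Defaulting happens once, privately, instead of via public methods
        # the caller must remember to invoke in the right order.
        self.event_time_end = event_time_end or self._build_end_time()
        self.event_time_start = event_time_start or self._build_start_time(
            self.event_time_end
        )

    def _build_end_time(self) -> datetime:
        return datetime.utcnow()  # simplified stand-in

    def _build_start_time(self, end: datetime) -> datetime:
        return end - timedelta(days=1)  # simplified stand-in

    def build_batches(self) -> List[Tuple[datetime, datetime]]:
        # One (start, end) pair per batch; single-batch stand-in here.
        return [(self.event_time_start, self.event_time_end)]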

Comment on lines +453 to +454
# TODO: Remove. This is a temporary method. We're working with adapters on
# a strategy to ensure we can access the `is_incremental` logic without drift
QMalcolm (Contributor) commented:

We should follow up on this with @mikealfare about the possible timeline.

Comment on lines +379 to +392
if (
os.environ.get("DBT_EXPERIMENTAL_MICROBATCH")
and model.config.materialized == "incremental"
and model.config.incremental_strategy == "microbatch"
):
batch_results = self._execute_microbatch_materialization(
model, manifest, context, materialization_macro
)
else:
result = MacroGenerator(
materialization_macro, context, stack=context["context_macro_stack"]
)()
for relation in self._materialization_relations(result, model):
self.adapter.cache_added(relation.incorporate(dbt_created=True))
QMalcolm (Contributor) commented:

I know we talked about this work, and I absolutely understand why we're doing it this way. Something feels smelly about it though, and I can't exactly put my finger on it. My best guess is that we're entering a conditional, exiting it, and then essentially re-entering it on lines 396-399. For instance, line 399 should never be hit if we enter line 384, but because of the split conditionals this isn't immediately apparent. I wonder if we should call into two separate private functions just before the try on line 378, with only one or the other ever called depending on whether we're doing microbatch work.
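
A hypothetical sketch of that restructuring, with the condition evaluated once and exactly one private path taken. Only the environment check and _execute_microbatch_materialization come from the snippet above; the other names and the pared-down signatures are invented for the example:

import os

class ModelRunner:
    def _is_microbatch(self, model) -> bool:
        # Same condition as the snippet above, extracted into one place.
        return bool(
            os.environ.get("DBT_EXPERIMENTAL_MICROBATCH")
            and model.config.materialized == "incremental"
            and model.config.incremental_strategy == "microbatch"
        )

    def _execute_microbatch_materialization(self, model):
        ...  # per-batch loop lives here (illustrative stub)

    def _execute_standard_materialization(self, model):
        ...  # single materialization-macro call lives here (illustrative stub)

    def execute_model(self, model):
        # Decide once, dispatch to exactly one path; the two branches
        # can no longer interleave.
        if self._is_microbatch(model):
            return self._execute_microbatch_materialization(model)
        return self._execute_standard_materialization(model)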

status=RunStatus.Success,
timing=[],
thread_id=threading.current_thread().name,
# TODO -- why isn't this getting propagated to logs?
QMalcolm (Contributor) commented:

The execution_time isn't making it to the logs? 🤔 That's odd....

QMalcolm (Contributor) left a review:

My remaining open comments can be addressed at a later date. Let's move forward 🚀

QMalcolm merged commit 8fe5ea1 into main Sep 18, 2024 (61 of 62 checks passed)
QMalcolm deleted the microbatch-chunked-backfill branch September 18, 2024 16:46
Successfully merging this pull request may close these issues:

  • Materialize updates for microbatch incremental models in batches