Handle concurrent job ordering better #813
Comments
@drewbanin agree that we should do this. Just out of curiosity, when does it stall?
I thought there was a situation where dbt could run models, but didn't, because they were in the next "run level". Do you know if that's still the case? I haven't been able to dig too far into it yet.
Yeah, if there are two totally independent subgraphs where one subgraph has a very slow model. I read "stall" as meaning it stops the run, but you're just saying it makes the run much slower. Makes sense.
We've seen this exact same issue: a model's dependencies are complete and there is an open thread, yet the model doesn't run until much later.
Hey guys, wondering what your latest thoughts are on this? We are in the process of splitting up our models into separate runs and finding that solution less than ideal because we are replicating logic that we've already encoded in dbt. |
Hey @nsnyder3 - I lost track of your comment! This isn't currently on our near-term roadmap, but it would definitely be good to improve how this works! The operative code is here. You can see that there is a nested for-loop here. The outer for loop is responsible for grabbing a "run level" from the graph. This current paradigm involves statically determining batches of nodes that can be run all at once, as a series of "run levels":
Here, the elements in level 0 have no ancestor dependencies. Level 1's dependencies are all contained in level 0, and level 2's dependencies are all contained in levels 0 and 1. We actually calculate this "run level" by building the DAG and recording the "depth" of each node in the DAG. This is some of the earliest code that was written in dbt (by yours truly 🎩). The benefit is that it was easy to implement; the drawback is that each level must be fully completed before the next level can begin. Given a graph like the following (the original diagram isn't preserved here; roughly: nodes 1, 2, and 3 have no parents, node 3 is very slow, nodes 4 and 5 depend on nodes 1 and 2, and node 6 depends on nodes 4 and 5):
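To make the level-based batching concrete, here is a minimal sketch (not dbt's actual implementation) that computes each node's "depth" in the DAG and groups nodes into run levels. The `deps` graph below is reconstructed from the discussion, so the exact shape is an assumption:

```python
# A minimal sketch (NOT dbt's real code) of depth-based "run levels":
# a node's level is the longest path from any root, so a level can
# only start after every prior level has fully finished.
from collections import defaultdict

def run_levels(deps):
    """deps maps node -> set of parent nodes it depends on."""
    memo = {}

    def depth(node):
        if node not in memo:
            parents = deps.get(node, set())
            memo[node] = 0 if not parents else 1 + max(depth(p) for p in parents)
        return memo[node]

    levels = defaultdict(list)
    for node in deps:
        levels[depth(node)].append(node)
    return [sorted(levels[i]) for i in range(len(levels))]

# Hypothetical graph matching the example above: node_3 is slow, but
# nodes 4 and 5 depend only on nodes 1 and 2. dbt still waits for ALL
# of level 0 (including node_3) before starting level 1.
deps = {
    "node_1": set(), "node_2": set(), "node_3": set(),
    "node_4": {"node_1"}, "node_5": {"node_2"},
    "node_6": {"node_4", "node_5"},
}
print(run_levels(deps))
# [['node_1', 'node_2', 'node_3'], ['node_4', 'node_5'], ['node_6']]
```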
If you were to run this graph with 8 threads, dbt would quickly finish running nodes 1 & 2, but then would block on node 3. In this scenario, dbt could run nodes 4 & 5, and conceivably node 6, before node 3 even finishes! So, I think the answer here is to flip control around: instead of dbt statically assigning batches of nodes to threads, idle threads should pull the next runnable node from the graph as dependencies complete. This is going to be tricky, as we'll need to make sure the threads coordinate with each other; we don't want two threads to pick up the same node, for instance. This is all very doable, and I'm excited to see how this would improve performance in both the best and average cases. I think it will be a big improvement for degenerate-shaped graphs (like the one above), and it should generally decrease runtimes in more typical cases as well. I'm not certain when we'll be able to pick this up, but if you're interested in taking a crack at it, I'd be very happy to describe how I think the code should be changed to fix this issue!
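The pull-based model described above might be sketched like this. All names are illustrative (not dbt's real internals); it assumes the graph is acyclic and that `execute` never raises. A condition variable coordinates the threads so no two workers claim the same node:

```python
# Hedged sketch of "threads pull runnable nodes" scheduling.
# Assumes: deps is a DAG, execute(node) never raises.
import threading

def run_graph(deps, execute, num_threads=8):
    """deps: node -> set of parent nodes; execute(node) runs one model."""
    cond = threading.Condition()
    done, claimed = set(), set()

    def next_ready():
        # A node is runnable once all parents are done; sorting by name
        # keeps the choice deterministic when several nodes are ready.
        ready = sorted(n for n in deps
                       if n not in done and n not in claimed and deps[n] <= done)
        return ready[0] if ready else None

    def worker():
        while True:
            with cond:
                node = next_ready()
                while node is None and len(done) < len(deps):
                    cond.wait()            # some node is still running
                    node = next_ready()
                if node is None:
                    return                 # every node is complete
                claimed.add(node)          # no other thread can take it now
            execute(node)                  # run outside the lock
            with cond:
                claimed.discard(node)
                done.add(node)
                cond.notify_all()          # children may now be runnable

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Note that executing the node happens outside the lock, so a slow model never blocks other threads from claiming newly runnable work, which is exactly the improvement being proposed.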
I'm going to profile some example code to see some approximate performance benefits to implementing this change in the general case. The final implementation requires thread coordination (noted above) which we shouldn't take lightly. I'll follow up here with perf timing results to help inform eventual prioritization. |
Handle concurrent job ordering via a graph iterator (#813)
Presently, the run pipeline stalls in certain scenarios. dbt computes a static list of "run levels", and each run level is completed synchronously. Instead, dbt should begin at the root nodes and then iteratively pick nodes to run based on the set of completed nodes.
Additionally, this ordering should be deterministic. I think we can just order by the node name to break ties.
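One simple way to get the deterministic, name-ordered tie-breaking suggested above is to keep runnable nodes in a min-heap keyed by name (a sketch, not a prescription for the actual implementation):

```python
# Hedged sketch: a min-heap of runnable node names means the next node
# popped is always the lexicographically smallest one that is ready.
import heapq

ready = []
for node in ("orders", "customers", "events"):
    heapq.heappush(ready, node)

print(heapq.heappop(ready))  # customers
```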