refactor: DSLRunner now uses a pull-based execution model #273

shreyashankar · 2025-01-09T08:37:59Z

Summary

This PR refactors the DSLRunner to use a pull-based execution model, implemented through a new OpContainer class. Operations now run lazily, only when a downstream node requests their output, and the pipeline is represented as an explicit DAG of operations. This makes execution easier to follow, debug, and extend.

We also ran pre-commit hooks.

Previous Implementation Challenges

The old implementation struggled with several fundamental limitations that made it difficult to work with and extend. Operations were eagerly evaluated, forcing us to load and process all data upfront even when only a subset was needed. The linear step-by-step execution model made it challenging to handle operations with multiple inputs like equijoins, and the complex state management made it hard to reason about pipeline behavior. Most importantly, without a clear view of the execution graph, implementing optimizations or debugging pipeline issues was unnecessarily difficult.

Benefits of Pull-Based Architecture

The new pull-based model executes pipelines lazily through a DAG of operation containers. Each operation is encapsulated in its own container with explicit parent-child relationships, which makes the execution flow more intuitive: you request data from the final node, and it pulls what it needs from its parents. Because operations execute only when their results are actually needed, resources are used more efficiently and intermediate results can be cached.
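To make the lazy-evaluation and caching behavior concrete, here is a minimal sketch. It is not the code from this PR: `LazyNode`, `run_count`, and the toy operations are hypothetical names chosen for illustration, and the caching policy (keep the first result forever) is an assumption.

```python
from typing import Any, Callable, List, Optional


class LazyNode:
    """Hypothetical stand-in for an operation container (not the PR's class)."""

    def __init__(self, name: str, fn: Callable[..., Any],
                 parents: Optional[List["LazyNode"]] = None) -> None:
        self.name = name
        self.fn = fn
        self.parents = parents or []
        self.run_count = 0      # how many times fn actually executed
        self._done = False      # True once a result has been cached
        self._cache: Any = None

    def next(self) -> Any:
        # Execute only on first request; later requests reuse the cached result.
        if not self._done:
            inputs = [parent.next() for parent in self.parents]
            self._cache = self.fn(*inputs)
            self.run_count += 1
            self._done = True
        return self._cache


load = LazyNode("load", lambda: [1, 2, 3, 4])
double = LazyNode("double", lambda rows: [r * 2 for r in rows], [load])
unused = LazyNode("unused", lambda rows: [r + 100 for r in rows], [load])

result = double.next()   # pulls from load, then runs double
again = double.next()    # served from the cache, nothing re-executes
```

In this sketch `unused` never runs because nothing requests its output, and `load` runs once even though `double` is requested twice.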

Implementation Details

The core of the new design is the OpContainer class, which manages an operation's configuration and dependencies while implementing the pull-based execution through its next() method. When building a pipeline, operations are connected into a DAG that clearly shows data dependencies. The execution flows backward through the graph - when the final node is asked for data, it recursively requests data from its parents until reaching the initial data loading operations.
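The backward, recursive flow described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the class name `OpContainerSketch`, the `config` keys (`type`, `rows`, `key`, `value`), and the in-memory operations are all hypothetical. Only the `next()` method name and the idea of pulling from parents come from the description.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

Row = Dict[str, Any]


@dataclass
class OpContainerSketch:
    """Illustrative only; field names and config keys are assumptions."""
    name: str
    config: Dict[str, Any]
    parents: List["OpContainerSketch"] = field(default_factory=list)

    def next(self) -> List[Row]:
        op_type = self.config["type"]
        if op_type == "scan":
            # Leaf node: stands in for an initial data loading operation.
            return list(self.config["rows"])
        # Recursively pull from every parent before running this operation.
        inputs = [parent.next() for parent in self.parents]
        if op_type == "filter":
            key, value = self.config["key"], self.config["value"]
            return [row for row in inputs[0] if row[key] == value]
        if op_type == "equijoin":
            left, right = inputs  # two parents: left and right branches
            key = self.config["key"]
            return [{**l, **r} for l in left for r in right if l[key] == r[key]]
        raise ValueError(f"unknown operation type: {op_type}")


users = OpContainerSketch("users", {"type": "scan", "rows": [
    {"uid": 1, "name": "ana"}, {"uid": 2, "name": "bo"}]})
orders = OpContainerSketch("orders", {"type": "scan", "rows": [
    {"uid": 1, "item": "pen"}, {"uid": 1, "item": "ink"}, {"uid": 3, "item": "cup"}]})
joined = OpContainerSketch("join", {"type": "equijoin", "key": "uid"}, [users, orders])
final = OpContainerSketch("only_ink", {"type": "filter", "key": "item", "value": "ink"}, [joined])

output = final.next()  # the request walks back through join to both scans
```

Because a node can hold any number of parents, a multi-input operation such as an equijoin needs no special casing in the runner; it simply pulls from two branches.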

Query Plan Visualization

A major improvement in this PR is the addition of a rich query plan visualization. The new print_query_plan() method renders a detailed view of the execution graph, with operations color-coded by step and clear indication of dependencies between operations. For each operation, it shows the operation type, name, and output schema, making it much easier to understand and debug complex pipelines. The visualization uses indentation and arrows to show parent-child relationships, with special handling for equijoin operations that display both left and right input branches.
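As a rough idea of how such a plan could be rendered, the sketch below walks a DAG from the final node back to its inputs and emits one indented line per operation. It is not the actual print_query_plan() implementation: `PlanNode`, `render_plan`, and the line format are hypothetical, and color-coding by step is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PlanNode:
    """Hypothetical plan node; the real containers may look different."""
    op_type: str
    name: str
    output_schema: Dict[str, str] = field(default_factory=dict)
    parents: List["PlanNode"] = field(default_factory=list)


def render_plan(node: PlanNode, depth: int = 0, label: str = "") -> List[str]:
    """Return one line per operation, indented by distance from the final node."""
    indent = "    " * depth
    arrow = "-> " if depth else ""
    schema = ", ".join(f"{k}: {v}" for k, v in node.output_schema.items())
    lines = [f"{indent}{arrow}{label}[{node.op_type}] {node.name} {{{schema}}}"]
    for i, parent in enumerate(node.parents):
        branch = ""
        if node.op_type == "equijoin":
            # Equijoins have two inputs, so label each branch explicitly.
            branch = "left: " if i == 0 else "right: "
        lines.extend(render_plan(parent, depth + 1, branch))
    return lines


users = PlanNode("scan", "users", {"uid": "int"})
orders = PlanNode("scan", "orders", {"uid": "int", "item": "str"})
join = PlanNode("equijoin", "join_orders", {"uid": "int", "item": "str"}, [users, orders])
summary = PlanNode("map", "summarize", {"summary": "str"}, [join])

plan_lines = render_plan(summary)
print("\n".join(plan_lines))
```

Each line shows the operation type, name, and output schema, and the equijoin's two inputs appear as labeled left and right branches beneath it.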

Next steps: Optimizer Refactoring

We should refactor the optimizer to leverage the OpContainer DAG structure directly.

shreyashankar added 5 commits January 8, 2025 22:58

partial commit

fb26cdf

refactor: dslrunner is now a pull based execution model

186c827

refactor: dslrunner is now a pull based execution model

18cae16

refactor: optimizer is now using the new pull based execution model

74e2ab9

refactor: optimizer is now using the new pull based execution model

c5bac35
