refactor: DSLRunner now uses a pull-based execution model #273
+2,756
−849
Summary
This PR refactors DSLRunner to use a pull-based execution model, implemented through a new OpContainer class. The goal is a pipeline execution architecture that is more efficient, easier to maintain, and easier to extend.
We also ran pre-commit hooks.
Previous Implementation Challenges
The old implementation struggled with several fundamental limitations that made it difficult to work with and extend. Operations were eagerly evaluated, forcing us to load and process all data upfront even when only a subset was needed. The linear step-by-step execution model made it challenging to handle operations with multiple inputs like equijoins, and the complex state management made it hard to reason about pipeline behavior. Most importantly, without a clear view of the execution graph, implementing optimizations or debugging pipeline issues was unnecessarily difficult.
Benefits of Pull-Based Architecture
The new pull-based model implements lazy evaluation through a DAG of operation containers. Each operation is encapsulated in its own container with explicit parent-child relationships, which makes the execution flow more intuitive: you request data from the final node, and it pulls what it needs from its parents. Because evaluation is lazy, operations execute only when their results are actually needed, which avoids unnecessary work and allows intermediate results to be cached.
Implementation Details
The core of the new design is the OpContainer class, which manages an operation's configuration and dependencies and implements pull-based execution through its next() method. When building a pipeline, operations are connected into a DAG that makes data dependencies explicit. Execution flows backward through the graph: when the final node is asked for data, it recursively requests data from its parents until it reaches the initial data loading operations.
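To illustrate the idea, here is a minimal sketch of a pull-based container. It is not the code in this PR: only the names OpContainer and next() come from the description above, while the constructor signature, the `fn` callable, the `run_count` counter, and the caching flag are assumptions made for the example.

```python
from typing import Any, Callable, List, Optional


class OpContainer:
    """Simplified, hypothetical stand-in for the PR's OpContainer."""

    def __init__(
        self,
        name: str,
        fn: Callable[..., Any],
        parents: Optional[List["OpContainer"]] = None,
    ) -> None:
        self.name = name
        self.fn = fn                      # the operation itself (assumed shape)
        self.parents = list(parents or [])
        self.run_count = 0                # illustration only: counts executions
        self._has_run = False
        self._cache: Any = None

    def next(self) -> Any:
        """Return this node's output, pulling from parents on first use."""
        if self._has_run:
            return self._cache            # reuse the cached intermediate result
        inputs = [parent.next() for parent in self.parents]
        self._cache = self.fn(*inputs)
        self._has_run = True
        self.run_count += 1
        return self._cache


def equijoin(left, right):
    # Toy two-input operation: match users to orders on id == user_id.
    return [{**l, **r} for l in left for r in right if l["id"] == r["user_id"]]


users = OpContainer(
    "load_users",
    lambda: [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bo"}],
)
orders = OpContainer(
    "load_orders",
    lambda: [
        {"user_id": 1, "total": 30},
        {"user_id": 1, "total": 5},
        {"user_id": 2, "total": 12},
    ],
)
big_orders = OpContainer(
    "filter_big",
    lambda rows: [r for r in rows if r["total"] >= 10],
    [orders],
)
joined = OpContainer("join", equijoin, [users, big_orders])
# Never requested below, so it never executes.
unused = OpContainer("unused_summary", lambda rows: len(rows), [orders])

result = joined.next()   # pulls users, then big_orders, which pulls orders
again = joined.next()    # served from the cache; nothing re-executes
```

Nothing runs when the containers are constructed. Work starts only at `joined.next()`, and the `unused` branch is never evaluated because no node downstream of it is asked for data.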
Query Plan Visualization
A major improvement in this PR is the addition of a rich query plan visualization. The new print_query_plan() method renders a detailed view of the execution graph, with operations color-coded by step and clear indication of dependencies between operations. For each operation, it shows the operation type, name, and output schema, making it much easier to understand and debug complex pipelines. The visualization uses indentation and arrows to show parent-child relationships, with special handling for equijoin operations that display both left and right input branches.
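The rendering described above could be approximated as follows. This is a sketch under stated assumptions, not the actual print_query_plan() implementation: `PlanNode` and `format_query_plan` are hypothetical names, color-coding by step is omitted, and the exact output layout of the real method may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PlanNode:
    """Hypothetical plan node holding the fields the visualization shows."""

    op_type: str
    name: str
    output_schema: Dict[str, str] = field(default_factory=dict)
    parents: List["PlanNode"] = field(default_factory=list)


def format_query_plan(node: PlanNode, indent: int = 0, label: str = "") -> List[str]:
    """Return one line per operation, indenting parents under their child."""
    pad = "  " * indent
    arrow = "└─> " if indent else ""
    prefix = f"[{label}] " if label else ""
    schema = ", ".join(f"{k}: {v}" for k, v in node.output_schema.items())
    lines = [f"{pad}{arrow}{prefix}{node.op_type}({node.name}) -> {{{schema}}}"]

    # Equijoins get explicit left/right labels on their two input branches.
    if node.op_type == "equijoin" and len(node.parents) == 2:
        labels = ["left", "right"]
    else:
        labels = [""] * len(node.parents)

    for parent, parent_label in zip(node.parents, labels):
        lines.extend(format_query_plan(parent, indent + 1, parent_label))
    return lines


users = PlanNode("load", "users", {"id": "int", "name": "str"})
orders = PlanNode("load", "orders", {"user_id": "int", "total": "int"})
big_orders = PlanNode(
    "filter", "big_orders", {"user_id": "int", "total": "int"}, [orders]
)
join = PlanNode(
    "equijoin",
    "match_orders",
    {"id": "int", "name": "str", "total": "int"},
    [users, big_orders],
)

plan_lines = format_query_plan(join)
print("\n".join(plan_lines))
```

The walk starts at the final node and descends into parents, mirroring the direction in which data is pulled at execution time.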
Next steps: Optimizer Refactoring
We should refactor the optimizer to leverage the OpContainer DAG structure directly.