refactor: DSLRunner now uses a pull-based execution model #273

Open
wants to merge 5 commits into
base: main

shreyashankar
Collaborator

@shreyashankar shreyashankar commented Jan 9, 2025

Summary

This PR refactors the DSLRunner to use a pull-based execution model, implemented through a new OpContainer class. The aim is a pipeline execution architecture that evaluates operations lazily, tracks dependencies between operations explicitly, and is easier to maintain and extend.

We also ran pre-commit hooks.

Previous Implementation Challenges

The old implementation struggled with several fundamental limitations that made it difficult to work with and extend. Operations were eagerly evaluated, forcing us to load and process all data upfront even when only a subset was needed. The linear step-by-step execution model made it challenging to handle operations with multiple inputs like equijoins, and the complex state management made it hard to reason about pipeline behavior. Most importantly, without a clear view of the execution graph, implementing optimizations or debugging pipeline issues was unnecessarily difficult.

Benefits of Pull-Based Architecture

The new pull-based model implements lazy evaluation through a DAG of operation containers. Each operation is encapsulated in its own container with explicit parent-child relationships, which makes the execution flow easier to follow: you request data from the final node, and it pulls what it needs from its parents. Operations execute only when their results are actually needed, which avoids unnecessary work and lets intermediate results be cached and reused.
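To illustrate the lazy evaluation and caching described above, here is a minimal sketch. It is not the code in this PR: `LazyNode`, `pull`, and `run_count` are hypothetical names chosen for illustration, and the caching policy (keep the full result in memory after the first pull) is an assumption.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, int]


class LazyNode:
    """Hypothetical stand-in for an operation container (not the PR's class)."""

    def __init__(
        self,
        name: str,
        fn: Callable[..., List[Record]],
        parents: Optional[List["LazyNode"]] = None,
    ) -> None:
        self.name = name
        self.fn = fn
        self.parents = parents or []
        self._cache: Optional[List[Record]] = None
        self.run_count = 0  # how many times fn actually executed

    def pull(self) -> List[Record]:
        # Only compute on the first request; later requests reuse the cache.
        if self._cache is None:
            inputs = [parent.pull() for parent in self.parents]
            self._cache = self.fn(*inputs)
            self.run_count += 1
        return self._cache


load = LazyNode("load", lambda: [{"x": 1}, {"x": 2}, {"x": 3}])
double = LazyNode("double", lambda rows: [{"x": r["x"] * 2} for r in rows], [load])
unused = LazyNode("unused", lambda rows: [{"x": -r["x"]} for r in rows], [load])

result = double.pull()  # runs load, then double
again = double.pull()   # served from the cache, nothing re-executes
```

In this sketch `unused` never runs because nothing requests its output, and `load` runs once even though two nodes list it as a parent.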

Implementation Details

The core of the new design is the OpContainer class, which manages an operation's configuration and dependencies and implements pull-based execution through its next() method. When building a pipeline, operations are connected into a DAG that makes data dependencies explicit. Execution flows backward through the graph: when the final node is asked for data, it recursively requests data from its parents until it reaches the initial data loading operations.
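The backward flow can be sketched roughly as follows. This is an illustrative simplification, not the OpContainer in this PR: the class name `SketchOpContainer`, its constructor arguments, the `equijoin` helper, and the sample data are all assumptions. Only the idea of a `next()` method that recursively asks parents for data comes from the description above.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]


class SketchOpContainer:
    """Hypothetical container: one operation plus links to its parents."""

    def __init__(
        self,
        name: str,
        op_type: str,
        run: Callable[..., List[Record]],
        parents: Optional[List["SketchOpContainer"]] = None,
    ) -> None:
        self.name = name
        self.op_type = op_type
        self.run = run
        self.parents = parents or []

    def next(self) -> List[Record]:
        # Recursively pull from every parent, then apply this operation.
        # Nodes without parents are the data loading operations.
        inputs = [parent.next() for parent in self.parents]
        return self.run(*inputs)


def equijoin(left: List[Record], right: List[Record]) -> List[Record]:
    # Simple hash join on an assumed "id" key.
    index: Dict[object, List[Record]] = {}
    for row in right:
        index.setdefault(row["id"], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l["id"], [])]


users = SketchOpContainer(
    "load_users", "scan",
    lambda: [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}],
)
orders = SketchOpContainer(
    "load_orders", "scan",
    lambda: [{"id": 1, "total": 30}, {"id": 1, "total": 12}],
)
joined = SketchOpContainer("join_users_orders", "equijoin", equijoin, [users, orders])
large = SketchOpContainer(
    "filter_large", "filter",
    lambda rows: [r for r in rows if r["total"] > 20],
    [joined],
)

output = large.next()  # pulls through filter -> equijoin -> both scans
```

Because a container can hold any number of parents, a two-input operation such as an equijoin fits the same interface as a single-input one.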

Query Plan Visualization

A major improvement in this PR is the addition of a rich query plan visualization. The new print_query_plan() method renders a detailed view of the execution graph, with operations color-coded by step and clear indication of dependencies between operations. For each operation, it shows the operation type, name, and output schema, making it much easier to understand and debug complex pipelines. The visualization uses indentation and arrows to show parent-child relationships, with special handling for equijoin operations that display both left and right input branches.
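As a rough idea of how such a rendering can work, the sketch below walks a DAG from the final node and emits indented lines with arrows, labelling the two inputs of an equijoin. It is a plain-text approximation under stated assumptions: `PlanNode` and `render_plan` are hypothetical names, color coding is omitted, and the output format is invented for illustration rather than copied from print_query_plan().

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PlanNode:
    """Hypothetical plan node: just the fields needed for display."""

    name: str
    op_type: str
    output_schema: Dict[str, str] = field(default_factory=dict)
    parents: List["PlanNode"] = field(default_factory=list)


def render_plan(node: PlanNode, depth: int = 0, label: str = "") -> List[str]:
    indent = "  " * depth
    arrow = "" if depth == 0 else "<- "
    schema = ", ".join(f"{k}: {v}" for k, v in node.output_schema.items())
    lines = [f"{indent}{arrow}{label}[{node.op_type}] {node.name} ({schema})"]

    # Equijoins get explicit left/right labels for their two input branches.
    if node.op_type == "equijoin" and len(node.parents) == 2:
        labels = ["left: ", "right: "]
    else:
        labels = [""] * len(node.parents)

    for parent, parent_label in zip(node.parents, labels):
        lines.extend(render_plan(parent, depth + 1, parent_label))
    return lines


users = PlanNode("load_users", "scan", {"id": "int"})
orders = PlanNode("load_orders", "scan", {"id": "int", "total": "float"})
join = PlanNode("join", "equijoin", {"id": "int", "total": "float"}, [users, orders])

plan = render_plan(join)
print("\n".join(plan))
```

The traversal starts at the final node and recurses into parents, so the printed order mirrors the direction in which data is pulled.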

image

Next steps: Optimizer Refactoring

We should refactor the optimizer to leverage the OpContainer DAG structure directly.
