Skip to content

Unify scan & take paths #5823

@westonpace

Description

@westonpace

Currently the scan and take APIs are rather different. All scan pathways create a scanner, create a datafusion plan, and execute the plan. Take pathways bypass the datafusion plan entirely and just access the readers directly. There is also a TakeExec which enables take-like functionality in a datafusion plan (since takes need to happen as part of scans sometimes). The TakeExec then goes into the dataset and uses the lower-level take functionality. This is all quite confusing.

One downside of the approach is that I feel we have duplicated a lot of our projection logic and have had to fix issues like #5722 . In general, projection inside of take has lagged behind projections from scan (it took a while before we allowed projection in take for example).

It would be nice if the take paths still created a datafusion plan (just with a TakeExec) and executed it. This lines up with other work like the work to make sure that merge insert goes through a datafusion plan. It will also help (I think) reduce some of the duplication.

That being said, there is some risk that the overhead of plan creation would add a lot of work to take. Take operations typically have much smaller data sizes than scan operations and so the overhead can be noticeable. For example, in #5532 we observe that the simple cost of creating a DF plan (instead of running the physical exprs directly) introduces unacceptable overhead. So if we were to go down this road we would need to be cautious.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions