Unify scan & take paths

Currently the scan and take APIs are rather different.  All scan pathways create a scanner, create a datafusion plan, and execute the plan.  Take pathways bypass the datafusion plan entirely and just access the readers directly.  There is also a TakeExec which enables take-like functionality in a datafusion plan (since takes need to happen as part of scans sometimes).  The TakeExec then goes into the dataset and uses the lower-level take functionality.  This is all quite confusing.

One downside of the approach is that I feel we have duplicated a lot of our projection logic and have had to fix issues like https://github.com/lance-format/lance/pull/5722 .  In general, projection inside of take has lagged behind projections from scan (it took a while before we allowed projection in take for example).

It would be nice if the take paths still created a datafusion plan (just with a TakeExec) and executed it.  This lines up with other work like the work to make sure that merge insert goes through a datafusion plan.  It will also help (I think) reduce some of the duplication.

That being said, there is some risk that the overhead of plan creation would add a lot of work to take.  Take operations typically have much smaller data sizes than scan operations and so the overhead can be noticeable.  For example, in https://github.com/lance-format/lance/pull/5532 we observe that the simple cost of creating a DF plan (instead of running the physical exprs directly) introduces unacceptable overhead.  So if we were to go down this road we would need to be cautious.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unify scan & take paths #5823

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Unify scan & take paths #5823

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions