-
Notifications
You must be signed in to change notification settings - Fork 537
Description
Currently the scan and take APIs are rather different. All scan pathways create a scanner, create a datafusion plan, and execute the plan. Take pathways bypass the datafusion plan entirely and just access the readers directly. There is also a TakeExec which enables take-like functionality in a datafusion plan (since takes need to happen as part of scans sometimes). The TakeExec then goes into the dataset and uses the lower-level take functionality. This is all quite confusing.
One downside of the approach is that I feel we have duplicated a lot of our projection logic and have had to fix issues like #5722 . In general, projection inside of take has lagged behind projections from scan (it took a while before we allowed projection in take for example).
It would be nice if the take paths still created a datafusion plan (just with a TakeExec) and executed it. This lines up with other work like the work to make sure that merge insert goes through a datafusion plan. It will also help (I think) reduce some of the duplication.
That being said, there is some risk that the overhead of plan creation would add a lot of work to take. Take operations typically have much smaller data sizes than scan operations and so the overhead can be noticeable. For example, in #5532 we observe that the simple cost of creating a DF plan (instead of running the physical exprs directly) introduces unacceptable overhead. So if we were to go down this road we would need to be cautious.