[Epic] A Collection of Sort Based Optimizations #10313
Comments
Would this ticket be an appropriate place to add tickets related to pushing down sorts to federated query engines? I know that this was discussed previously (e.g. #7871), and it seems that writing a custom optimizer is the current way to handle that. I will need to do this soon (federated sort pushdown), and it initially wasn't clear to me how to make this work in DataFusion. I can volunteer to write some docs on how to do this once I have an implementation that works.
I added #7871 to the list above, thank you. Yes, I think this would be a good place to discuss it.
That would be great, thanks @phillipleblanc. Right now, once […] What we don't have is any way to have the optimizer tell a […] I wonder if we could add something to
trait ExecutionPlan {
    ...
    /// Return other possible orders that this ExecutionPlan could return
    /// (the DataFusion optimizer will use this information to potentially push Sorts
    /// into the node)
    fn pushable_sorts(&self) -> Result<Option<PotentialSortOrders>> {
        Ok(None)
    }

    /// Return a node like this one except that its output is sorted according to exprs
    fn resorted(&self) -> Result<Option<Arc<dyn ExecutionPlan>>> {
        Ok(None)
    }
}
And then add a new optimizer pass that tries to push sorts into the plan nodes that report they can provide sorted data 🤔
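To make the proposed pass concrete, here is a minimal, self-contained sketch of the idea. It does not use DataFusion's real API: `PlanNode`, `SortKey`, `RemoteScan`, `LocalScan`, `Sort`, and `push_down_sort` are all hypothetical names invented for illustration, and `resorted` takes the sort keys as an argument, which is an assumption on top of the signature proposed above.

```rust
use std::sync::Arc;

/// Hypothetical sort key: a column name plus a direction.
#[derive(Debug, Clone, PartialEq)]
struct SortKey {
    column: String,
    descending: bool,
}

/// Toy stand-in for a physical plan node (NOT DataFusion's `ExecutionPlan`).
trait PlanNode: std::fmt::Debug {
    fn name(&self) -> String;

    /// Return a node like this one whose output is sorted by `keys`,
    /// or `None` if this node cannot produce that order itself.
    fn resorted(&self, _keys: &[SortKey]) -> Option<Arc<dyn PlanNode>> {
        None
    }
}

/// Hypothetical federated scan that can ask the remote engine to sort.
#[derive(Debug)]
struct RemoteScan {
    table: String,
    order_by: Vec<SortKey>,
}

impl PlanNode for RemoteScan {
    fn name(&self) -> String {
        format!("RemoteScan({}, order_by={})", self.table, self.order_by.len())
    }

    fn resorted(&self, keys: &[SortKey]) -> Option<Arc<dyn PlanNode>> {
        Some(Arc::new(RemoteScan {
            table: self.table.clone(),
            order_by: keys.to_vec(),
        }))
    }
}

/// Hypothetical scan with no ability to sort its own output.
#[derive(Debug)]
struct LocalScan;

impl PlanNode for LocalScan {
    fn name(&self) -> String {
        "LocalScan".to_string()
    }
}

/// Explicit sort operator, kept only when the input cannot sort itself.
#[derive(Debug)]
struct Sort {
    keys: Vec<SortKey>,
    input: Arc<dyn PlanNode>,
}

impl PlanNode for Sort {
    fn name(&self) -> String {
        format!("Sort -> {}", self.input.name())
    }
}

/// The pass for a single Sort: push the sort into the input when the input
/// reports it can provide that order, otherwise keep an explicit Sort.
fn push_down_sort(keys: Vec<SortKey>, input: Arc<dyn PlanNode>) -> Arc<dyn PlanNode> {
    match input.resorted(&keys) {
        Some(sorted_input) => sorted_input,
        None => Arc::new(Sort { keys, input }),
    }
}
```

In this sketch, calling `push_down_sort` on a `RemoteScan` returns a new `RemoteScan` carrying the sort keys, while calling it on a `LocalScan` leaves an explicit `Sort` on top. A real pass would also have to walk the whole plan tree and respect operators between the sort and the scan.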
After digging into and understanding how the […] My realization essentially comes down to (please correct me if this is incorrect): DataFusion is a library that provides both query planning ([…] The […]
Use case
Many analytic systems store their data with some particular sort order, and the query engine can often take advantage of this sort order to both reduce memory usage and improve performance.
Specific examples in DataFusion include SortMergeJoin, EnforceSorting, and replace_with_order_preserving_variants.
This information is currently encoded in ExecutionPlan::maintains_input_order, ExecutionPlan::required_input_ordering, and PlanProperties.
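As a rough illustration of the kind of analysis involved, the sketch below decides whether an explicit sort is needed by checking if the ordering an operator requires is a prefix of the ordering its input already provides. `OrderKey`, `ordering_satisfied`, and `needs_sort` are hypothetical names, not DataFusion APIs, and the prefix rule is a simplification assumed for this example; a real engine's analysis is more involved.

```rust
/// Hypothetical ordering key: a column name plus a direction.
#[derive(Debug, Clone, PartialEq, Eq)]
struct OrderKey {
    column: &'static str,
    descending: bool,
}

/// Data sorted by `provided` is also sorted by any leading prefix of it,
/// so `required` is satisfied when it is such a prefix.
fn ordering_satisfied(provided: &[OrderKey], required: &[OrderKey]) -> bool {
    provided.starts_with(required)
}

/// An explicit sort is only needed when the existing order is not enough.
fn needs_sort(provided: &[OrderKey], required: &[OrderKey]) -> bool {
    !ordering_satisfied(provided, required)
}
```

For example, input sorted by `(a ASC, b ASC)` satisfies a requirement of `(a ASC)`, so no sort is needed, but it does not satisfy `(b ASC)` or `(a DESC)`.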
The same underlying analysis is often required for streaming (where determining what to emit is modeled as a sorted stream, for example on date_trunc(ts) of a stream sorted by timestamp).
Describe the solution you'd like
This epic has a list of optimizations / improvements that further take sortedness into account. Here are some related issues:
split_file_groups_by_statistics by default #10336
ProgressiveEval operator for optimize SortPreservingMerge #10488
UnionExec without losing ordering #10314
SortPreservingMerge that doesn't actually compare sort keys if the key ranges are ordered #10316
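The last item suggests a fast path worth sketching: when sorted inputs cover non-overlapping key ranges, their rows can be concatenated in range order without comparing individual sort keys. The code below is only an illustration of that idea on plain `Vec<i64>` runs. `merge_sorted_runs` is a hypothetical helper, and it is not how DataFusion implements or plans to implement this.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge already-sorted runs into one sorted vector.
/// Fast path: if the runs' value ranges do not overlap, concatenate them.
/// Fallback: a standard k-way merge that compares keys via a min-heap.
fn merge_sorted_runs(runs: Vec<Vec<i64>>) -> Vec<i64> {
    // Drop empty runs so every remaining run has a first and last element.
    let mut runs: Vec<Vec<i64>> = runs.into_iter().filter(|r| !r.is_empty()).collect();

    // Order runs by their minimum (the first element, since each run is sorted).
    runs.sort_by_key(|r| r[0]);

    // Ranges are ordered when each run's maximum is <= the next run's minimum.
    let ranges_ordered = runs.windows(2).all(|w| w[0][w[0].len() - 1] <= w[1][0]);
    if ranges_ordered {
        // No per-row key comparisons are needed: just concatenate.
        return runs.into_iter().flatten().collect();
    }

    // Fallback k-way merge: heap entries are (value, run index, position in run).
    let mut heap = BinaryHeap::new();
    for (i, run) in runs.iter().enumerate() {
        heap.push(Reverse((run[0], i, 0usize)));
    }
    let mut out = Vec::new();
    while let Some(Reverse((value, i, pos))) = heap.pop() {
        out.push(value);
        if let Some(&next) = runs[i].get(pos + 1) {
            heap.push(Reverse((next, i, pos + 1)));
        }
    }
    out
}
```

In a real engine the minimum and maximum of each input would more likely come from file or partition statistics than from inspecting the data, which is what makes the fast path cheap to detect.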