-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A More General Approach for Optimizing Projections in Physical Plans #9111
Comments
(copying my comment from elsewhere) FWIW there is a version of pushdown in InfluxDB (written by @crepererum ) that may serve as a useful reference https://github.com/influxdata/influxdb/blob/main/iox_query/src/physical_optimizer/projection_pushdown.rs |
I have a two questions:
The idea of supporting ProjectionPushdown for userdefined plans sounds a good idea to me 👍 . If that is indeed the usecase adding a test showing it doesn't work today and works after your changes is 🆗
I assume you are referring to this code: Maybe you could add an API like this: trait ExecutionPlan {
...
/// Return a copy of this plan that only produces the specified outputs, if possible, in the specified order.
/// the projection is a BTreeMap as it is unique and in increasing order (as some nodes like HashAggregateExec
/// can't reorder their outputs
///
/// For a plan plans such as `ProjectionExec`, projecting a subset of columns will reduce the
/// expression list.
///
/// for some plans such as `HashAggregateExec`, projection may not be possible (e.g. it is not possible to remove
/// group columns from the output)
///
/// By default, returns Ok(None)
fn try_project(projection: BTreeSet<usize>) -> Result<Option<Arc<dyn Self>> { Ok(None) }
... 🤔 |
Another problem occurs when a projection is inserted into the input of some operator to decrease the load, however; that projection insertion may cause to change the indices of some columns at the output operators. In such a case, all operators object to a column index change must be rewritten.
The new version API requirements are a little different. I re-request your opinions when the PR is ready. |
I was thinking the Perhaps you can avoid having to invent a new algorithm |
If I remember correctly |
Right -- I think the difference is that in the Physical realm, projection pushdown could maintain the mapping of "is output column N used" where N is the index. Then the trick will be to implement the transformation for each type of ExecutionPlan to both:
With thse two functions I think you could implement a pretty simple pushdown rule |
Yes, as you said, in short, the rule consists of tracking the required columns across the plan and rewriting the plans by new expressions with updated indices. |
Is it like FunctionalDependence in DFSchema ? |
I couldn't understand the way you think there is a similarity, can you give more detail? |
Is your feature request related to a problem or challenge?
In the current version of
ProjectionPushdown
, there are some algorithmic limitations and it is not very friendly to be extendable. To solve this optimization in theoretical limits within the strategy of "pushing down" is not possible. We will need another approach. Also any custom plan should be easily integrable with this optimization.Describe the solution you'd like
The new rule aims achieving the most effective use of projections in plans. It will ensures that query plans are free from unnecessary projections and that no unused columns are propagated unnecessarily between plans. The rule is designed to enhance query performance by:
The optimization is conducted in two phases:
Top-down Phase:
a) Merge it with its input projection if merge is beneficial.
b) Remove the projection if it is redundant.
c) Narrow the Projection if possible.
d) The projection can be nested into the source.
e) Do nothing, otherwise.
a) Schema needs pruning. Insert the necessary projections to the children.
b) All fields are required. Do nothing.
Bottom-up Phase:
This pass is required because modifying a plan node can change the column indices used by output nodes. When such a change occurs, we store the old and new indices of the columns in the node's state. We then proceed from the leaves to the root, updating the indices of columns in the plans by referencing these mapping records. After the top-down phase, also some unnecessary projections may emerge. When projections check its input schema mapping, it can remove itself and assign new schema mapping to the new node which was the projection's input formerly.
The designed node structure is:
To summarize, with two state variables for each plan node (one for transferring the required columns and one for the notifying changes of column indices), and with two passes (it is actually one pass, the bottom-up pass will be done implicitly during the attachment of transformed children to the self node), we will have a future-proof projection optimizer rule.
Describe alternatives you've considered
No response
Additional context
I am currently working on this issue. I will plan to open a PR for suggestions, especially on how to update the ExecutionPlan API to get rid of if else structure of all plans. It will be ready likely next week. It's not expected to significantly alter our current plans, but it will be a solid step towards optimizing potential outcomes following other existing optimizations and future ones.
The text was updated successfully, but these errors were encountered: