Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking: better plan for projection expression by using output indices #1922

Closed
st1page opened this issue Apr 19, 2022 · 5 comments
Closed
Assignees
Labels

Comments

@st1page
Copy link
Contributor

st1page commented Apr 19, 2022

Background

in our project operator, there is 2 kind of expressions

  1. calculation expressions, like Add(input_ref(0), input_ref(1)
  2. projection expression, those only one input_ref expression.
    and this issue will talk about how to reduce the projection expression in project operators.

for 2 reason

  1. they are really different during optimizing, the projection expression might be able to reduce the column of the input operators. so we want to split them into two project operators. for example, we'd like to do calculations distributedly, but want to push down the column projection
  2. and during optimizing, such as when unnesting or join reordering, we will construct lots of projection operators with the 2nd expression which is really trouble.

Solution

To eliminate these logical projection nodes we introduce output_index.

col_prune v2

To preserve the information of column order, we introduce column prune v2. We should change the definition fn prune_col(&self, required_cols: &FixedBitSet) -> PlanRef; to fn prune_col(&self, required_cols: &[usize]) -> PlanRef; . Therefore we can prune columns while reordering the columns.

Output index

A field output_index: Vec<usize> will be added on LogicalAgg, LogicalJoin, LogicalHopWindow, and any logical plan node which needs extra projection before. It will represent the operator's output column index based on our current schema of PlanNode.
for example

SELECT t2.a, t1.a from t1 join t2 on t1.b = t2.b

in our current implementation, we will get a join plan node with schema [t1.a, t1.b, t2.a, t2.b], and a project on the join with expressions [input_ref(2), input_ref(0)].
And with the output_index there will only be a join plan node with output_index [2, 0]

Implementation considerations

When adding this field to the plan nodes, we should carefully re-think those already existing functions.

  • the schema of the plan node(output index will change the schema of the plan node)
  • the property(dist, order, pk...) derive in the new()
  • the o2i_mapping and other similar functions
  • rewrite_with_input, rewrite_for_stream.

The output_index is natural in our chunk-based vectorized query execution, so we can add the field on the proto and implement it in executors in the future. but now we can just add a stream/batch project node when converting the logical plan to a stream/batch plan.

@Enter-tainer
Copy link
Contributor

Enter-tainer commented May 23, 2022

Tracking

Column prune

Output Indices

@jon-chuang
Copy link
Contributor

jon-chuang commented May 23, 2022

Hmm, my guess is that it would be better to keep both the original schema and the new schema if we have output_indices not_none, so that we can keep prune_col's idempotency.

@Enter-tainer
Copy link
Contributor

Enter-tainer commented May 23, 2022

Could I ask what would be the schema of a logical node with output_index? Would it be restricted and reordered as well? I believe may actually not be better to not do so, as it will make prune_col lose its idempotency.

IMO the schema will be changed as well. This is based on the idea of "hiding" projections inside these nodes. And add these projections back, when converting logical nodes to stream/batch nodes. I'm currently working on LogicalHopWindow and still exploring where to put output_indices. I put output_indices and the original schema in LogicalHopWindow itself. The reason why I do not put it in PlanBase is because only logical nodes need output_indices, stream/batch nodes do not have it.

BTW, I'm not sure if we are on the same page, but I think column prune does not have idempotency. For example, if we call col_column([1, 3, 5]) on some logical node twice, the second time will fail because we simply do not have column 3 and column 5 --- they have been pruned in the first call and we only have 3 columns now. hmm I'm not sure what idempotency is in column prune. Can you explain more about this?

Anyway, I think it is worth a discussion on whether to keep the schema or not. cc @st1page what's your idea?

@st1page
Copy link
Contributor Author

st1page commented May 23, 2022

And add these projections back, when converting logical nodes to stream/batch nodes

and for some operators such as hash_join, we can even implement the behaviour in the executor

@st1page
Copy link
Contributor Author

st1page commented May 23, 2022

when the prune_col of input is called, it is expected to change the schema as the prune_col requiring

@Enter-tainer Enter-tainer changed the title feat(optimizer): better plan for projection Expression Tracking: better plan for projection expression by using output indices May 25, 2022
@st1page st1page closed this as completed Aug 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants