Tracking: better plan for projection expression by using output indices #1922

st1page · 2022-04-19T02:47:59Z

Background

In our project operator, there are two kinds of expressions:

calculation expressions, like Add(input_ref(0), input_ref(1))
projection expressions, which consist of only a single input_ref expression.
This issue is about how to reduce the projection expressions in project operators.

There are two reasons:

The two kinds behave very differently during optimization: a projection expression may let us reduce the columns of the input operators, so we want to split them into two project operators. For example, we would like to do calculations in a distributed manner but still push down the column projection.
During optimization, for example when unnesting or join reordering, we construct many project operators that contain only the second kind of expression, which is troublesome.

Solution

To eliminate these logical projection nodes we introduce output_index.

col_prune v2

To preserve column order information, we introduce column prune v2. We should change the definition from fn prune_col(&self, required_cols: &FixedBitSet) -> PlanRef; to fn prune_col(&self, required_cols: &[usize]) -> PlanRef;. A bit set can only say which columns are required, while a slice of indices also says in which order, so we can prune and reorder columns in one step.
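A minimal sketch of what the slice-based signature buys us, using a toy schema instead of the real plan nodes. ToySchema and its fields are hypothetical names for illustration only; the actual prune_col works on PlanRef.

```rust
/// Toy stand-in for a plan node's schema: just the column names, in order.
/// Hypothetical type for illustration; the real method returns a `PlanRef`.
#[derive(Debug, Clone, PartialEq)]
struct ToySchema {
    columns: Vec<String>,
}

impl ToySchema {
    /// v2-style pruning: `required_cols` lists column indices in the order
    /// the caller wants them, so one call both prunes and reorders.
    /// Panics if an index is out of range for this schema.
    fn prune_col(&self, required_cols: &[usize]) -> ToySchema {
        let columns = required_cols
            .iter()
            .map(|&i| self.columns[i].clone())
            .collect();
        ToySchema { columns }
    }
}
```

With columns [a, b, c, d], calling prune_col(&[2, 0]) yields [c, a]. A bit set with bits 0 and 2 set could only describe the set {a, c}, not this order.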

Output index

A field output_index: Vec<usize> will be added to LogicalAgg, LogicalJoin, LogicalHopWindow, and any other logical plan node that previously needed an extra projection. It represents the operator's output columns as indices into the node's current PlanNode schema.
For example:

SELECT t2.a, t1.a from t1 join t2 on t1.b = t2.b

In our current implementation, we get a join plan node with schema [t1.a, t1.b, t2.a, t2.b], and a project on top of the join with expressions [input_ref(2), input_ref(0)].
With output_index, there will only be a join plan node with output_index [2, 0].
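The join example can be sketched as follows. ToyJoin, internal_schema and schema() are hypothetical names; this only shows how the visible schema could be derived from the indices, not how LogicalJoin is actually laid out.

```rust
/// Hypothetical, simplified join node: only the parts relevant to output indices.
struct ToyJoin {
    /// Full join schema: left input columns followed by right input columns.
    internal_schema: Vec<&'static str>,
    /// Indices into `internal_schema`, listed in output order.
    output_indices: Vec<usize>,
}

impl ToyJoin {
    /// The schema seen by the parent operator: the internal schema
    /// restricted and reordered by `output_indices`.
    fn schema(&self) -> Vec<&'static str> {
        self.output_indices
            .iter()
            .map(|&i| self.internal_schema[i])
            .collect()
    }
}
```

For internal_schema [t1.a, t1.b, t2.a, t2.b] and output_indices [2, 0], schema() returns [t2.a, t1.a], which is what the separate project node produced before.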

Implementation considerations

When adding this field to the plan nodes, we should carefully rethink the existing functions:

the schema of the plan node (the output index changes the schema of the plan node)
the properties (dist, order, pk, ...) derived in new()
o2i_mapping and other similar functions
rewrite_with_input and rewrite_for_stream

output_index fits naturally into our chunk-based vectorized query execution, so in the future we can add the field to the proto and implement it in the executors. For now, we can just add a stream/batch project node when converting the logical plan to a stream/batch plan.
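Two of the points above can be sketched with plain index vectors. Both helpers are hypothetical and not part of the existing code: is_identity is one possible check for whether the conversion step needs to add a project node at all, and i2o_mapping is the partial inverse of the output indices (the output-to-input direction is just output_indices itself).

```rust
/// Returns true when `output_indices` selects every internal column in its
/// original order, in which case no extra project node would be needed.
fn is_identity(output_indices: &[usize], internal_len: usize) -> bool {
    output_indices.len() == internal_len
        && output_indices.iter().enumerate().all(|(pos, &idx)| pos == idx)
}

/// Input-to-output mapping: for each internal column, its position in the
/// output, or `None` if the column is not output at all.
/// Assumes every index is in range; if a column is listed twice, the last
/// occurrence wins.
fn i2o_mapping(output_indices: &[usize], internal_len: usize) -> Vec<Option<usize>> {
    let mut mapping: Vec<Option<usize>> = vec![None; internal_len];
    for (out_pos, &in_idx) in output_indices.iter().enumerate() {
        mapping[in_idx] = Some(out_pos);
    }
    mapping
}
```

For the join above with output_indices [2, 0] over four internal columns, is_identity is false, so a project with [input_ref(2), input_ref(0)] would be added back during conversion, and i2o_mapping gives [Some(1), None, Some(0), None].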


Enter-tainer · 2022-05-23T02:33:27Z

Tracking

Column prune

Change the function signature to fn prune_col(&self, required_cols: &[usize]) -> PlanRef; (refactor(optimizer): column prune refactor #2095)
Ensure every logical node correctly handles column reordering when performing column prune (test(optimizer): ensure column prune correctly handle reorder #2603, refactor(optimizer): make LogicalHopWindow column order aware #2667)

Output Indices

Add output indices for LogicalHopWindow (feat(optimizer): add output_indices for LogicalHopWindow #2769)
Add output indices for LogicalJoin (feat(optimizer): Add output indices to LogicalJoin #2748)
Add output indices for LogicalAgg
Add output indices for LogicalFilter
Add output indices to hop_window executor (feat(executor): add output indices to HopWindow executor #2922)
Add output indices to join executor

jon-chuang · 2022-05-23T07:46:57Z

Hmm, my guess is that it would be better to keep both the original schema and the new schema when output_indices is not None, so that we can keep prune_col's idempotency.

Enter-tainer · 2022-05-23T08:04:45Z

Could I ask what would be the schema of a logical node with output_index? Would it be restricted and reordered as well? I believe it may actually be better not to do so, as it will make prune_col lose its idempotency.

IMO the schema will be changed as well. This is based on the idea of "hiding" projections inside these nodes and adding them back when converting logical nodes to stream/batch nodes. I'm currently working on LogicalHopWindow and still exploring where to put output_indices. For now I put output_indices and the original schema in LogicalHopWindow itself. I do not put them in PlanBase because only logical nodes need output_indices; stream/batch nodes do not have them.

BTW, I'm not sure if we are on the same page, but I think column prune is not idempotent. For example, if we call prune_col([1, 3, 5]) on some logical node twice, the second call will fail because columns 3 and 5 no longer exist: they were renumbered by the first call and we only have 3 columns now. Hmm, I'm not sure what idempotency means for column prune. Can you explain more about this?
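To make the point concrete, here is a small sketch on a plain list of column names. checked_prune_col is a hypothetical helper that returns None instead of panicking when an index is out of range; it is not the real prune_col.

```rust
/// Hypothetical checked variant of column pruning on a list of column names.
/// Returns `None` if any required index does not exist in `columns`.
fn checked_prune_col(
    columns: &[&'static str],
    required_cols: &[usize],
) -> Option<Vec<&'static str>> {
    required_cols
        .iter()
        .map(|&i| columns.get(i).copied())
        .collect()
}
```

On six columns c0..c5, the first call with [1, 3, 5] returns [c1, c3, c5]. Applying [1, 3, 5] again to that result returns None, because indices 3 and 5 are out of range for a 3-column schema.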

Anyway, I think it is worth a discussion on whether to keep the schema or not. cc @st1page what's your idea?

st1page · 2022-05-23T08:18:07Z

And add these projections back, when converting logical nodes to stream/batch nodes

And for some operators, such as hash_join, we can even implement this behaviour in the executor.

st1page · 2022-05-23T08:20:50Z

When prune_col is called on an input, the input is expected to change its schema as that prune_col call requires.

st1page added the type/feature and component/optimizer (Query optimization) labels Apr 19, 2022

Enter-tainer self-assigned this Apr 21, 2022

Enter-tainer mentioned this issue Apr 22, 2022

refactor: replace bitset with vector when performing column pruning #2054

Closed

st1page mentioned this issue May 11, 2022

Implement LogicalMultiJoin #2425

Closed

Enter-tainer mentioned this issue May 13, 2022

refactor(optimizer): column prune refactor #2095

Merged

2 tasks

st1page mentioned this issue May 13, 2022

frontend: Support Join Reorder to avoid Cross Join in TPC-H Q2/Q8/Q9 #1866

Closed

Enter-tainer mentioned this issue May 17, 2022

test(optimizer): ensure column prune correctly handle reorder #2603

Merged

2 tasks

skyzh mentioned this issue May 23, 2022

optimizer: merge LogicalProject into LogicalMultiJoin #2728

Closed

jon-chuang mentioned this issue May 23, 2022

feat(optimizer): Add output indices to LogicalJoin #2748

Merged

Enter-tainer changed the title from "feat(optimizer): better plan for projection Expression" to "Tracking: better plan for projection expression by using output indices" May 25, 2022

This was referenced May 26, 2022

feat(optimizer): add output_indices for LogicalHopWindow #2769

Merged

feat(executor): add output indices to HopWindow executor #2922

Merged

Enter-tainer mentioned this issue Jun 8, 2022

feat: add output_indices to join executors #3047

Merged

4 tasks

st1page closed this as completed Aug 9, 2022
