Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve Filter pushdown to Join #5770

Merged
merged 1 commit into from
Apr 1, 2023
Merged

improve Filter pushdown to Join #5770

merged 1 commit into from
Apr 1, 2023

Conversation

mingmwang
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

Improve some TPCH query performance, simply the generate logical plan and physical plan.

What changes are included in this PR?

  1. Convert filters to join filters for Inner Join
  2. Avoid duplicated filters
  3. Fixed unstable physical HashJoin plan

tpch-q7, tpch-q17, tpch-q19, tpch-q20 are impacted by this PR.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate optimizer Optimizer rules labels Mar 29, 2023
@mingmwang
Copy link
Contributor Author

@yahoNanJing
Please help me to review

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the plan changes and the code carefully -- nice work @mingmwang

| | Aggregate: groupBy=[[]], aggr=[[SUM(lineitem.l_extendedprice)]] |
| | Projection: lineitem.l_extendedprice |
| | Inner Join: part.p_partkey = __scalar_sq_1.l_partkey Filter: CAST(lineitem.l_quantity AS Decimal128(30, 15)) < CAST(__scalar_sq_1.__value AS Decimal128(30, 15)) |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a better plan because the redundant

Filter: part.p_partkey = lineitem.l_partkey AND lineitem.l_partkey = part.p_partkey

, which is already done by the earlier joins, is removed, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also pushes the filter

|               |       Filter: CAST(lineitem.l_quantity AS Decimal128(30, 15)) < CAST(__scalar_sq_1.__value AS Decimal128(30, 15)) AND __scalar_sq_1.l_partkey = lineitem.l_partkey              |
``

Into the Join which seems like a win to me (avoid generating output)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a better plan because the redundant

Filter: part.p_partkey = lineitem.l_partkey AND lineitem.l_partkey = part.p_partkey

, which is already done by the earlier joins, is removed, right?

Yes, the duplicated filters are removed. Actually why the original plan include duplicate filters is because the push_down_filter rule infers additional filters and try to pushdown them down. If they can not be pushed down, those inferred filters are added back to the Filters, this is unnecessary, need to differ the inferred filters and the original filters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for the explanation.

let cols = expr.to_columns()?;

// Collect left & right field indices
// Collect left & right field indices, the field indices are sorted in ascending order
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the sort order is important for later stages, can you make a note about the rationale (so the comment explains why the sorting is important, in addition to noting the output is sorted)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do.

@andygrove
Copy link
Member

@mingmwang Could you share any performance numbers for the improvements for the affected queries?

@mingmwang
Copy link
Contributor Author

@mingmwang Could you share any performance numbers for the improvements for the affected queries?

Sure, will do. unfortunately, the performance improvement is just a little. For q17, the major bottleneck is still the Aggregation.

Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job to me. Thanks @mingmwang .

@jackwener
Copy link
Member

jackwener commented Mar 31, 2023

This PR remind me. I also notice some optimization.

We can do a EPIC list tasks about optimizer to collect those optimization like #5546.

such as predicate move around ......

@Dandandan Dandandan merged commit 5bc0051 into apache:main Apr 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants