-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-19712][SQL] Move PullupCorrelatedPredicates and RewritePredicateSubquery after OptimizeSubqueries #23211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…query after OptimizeSubqueries
|
Test build #99628 has finished for PR 23211 at commit
|
|
retest this please |
|
Test build #99636 has finished for PR 23211 at commit
|
|
I generated the TPC-DS plans to compare the differences after this patch to help review: |
|
@wangyum Thanks.. Can you please tell me how you generate this ? Also, is it possible to get runtimes of these queries to see if there are any regressions ? |
|
This file generated by TPCDSQueryOptimizerTracker.scala. runtimes can generated by TPCDSQueryBenchmark.scala. |
| } | ||
|
|
||
| def hasScalarSubquery(e: Seq[Expression]): Boolean = { | ||
| e.find(hasScalarSubquery(_)).isDefined |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.exists(hasScalarSubquery)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Sure.
| if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList)) { | ||
| if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) || | ||
| ScalarSubquery.hasScalarSubquery(p1.projectList) || | ||
| ScalarSubquery.hasScalarSubquery(p2.projectList)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did we allow it before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan.. Let me get back to you on this, need to debug again :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan One failing test that i needed to address with this change is in subquerysuite.
select a, (select sum(b) from l l2 where l2.a <=> l1.a) sum_b from l l1")
One main reason is, the Filter ops with outer references were pulled up before optimizeSubqueries rule. So by the time other optimization rules kick in (like pushDownPredicate etc), it does not see outer references. But with the change in the PR, they are present. So another way to handle this is to change pushdownPredicate rule to make sure the filter clauses with outer references are not moved down. May be thats better way to handle it and keep CollapseProject as it is.
|
|
||
| // Similar to the above Filter over Project | ||
| // LeftSemi/LeftAnti over Project | ||
| case join @ Join(p @ Project(pList, grandChild), rightOp, LeftSemiOrAnti(joinType), joinCond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we create a new rule PushdownLeftSemaOrAntiJoin?
|
to make the PR smaller, can we add an individual rule |
|
@cloud-fan Just to make sure, so we want this new rule and associated tests to verify the pushdown of left semi/anti joins. We would keep the subquery rewrite at the same place first i.e not move it up in the new PR, correct ? |
|
Yes, since |
|
@cloud-fan Thanks Wenchen. It makes sense. I will work in creating a smaller pr first. |
…ect, Aggregate, Window, Union etc. ## What changes were proposed in this pull request? This PR adds support for pushing down LeftSemi and LeftAnti joins below operators such as Project, Aggregate, Window, Union etc. This is the initial piece of work that will be needed for the subsequent work of moving the subquery rewrites to the beginning of optimization phase. The larger PR is [here](#23211) . This PR addresses the comment at [link](#23211 (comment)). ## How was this patch tested? Added a new test suite LeftSemiAntiJoinPushDownSuite. Closes #23750 from dilipbiswal/SPARK-19712-pushleftsemi. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
|
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it! |
Currently predicate subqueries (IN/EXISTS) are converted to Joins at the end of optimizer in RewritePredicateSubquery. This change moves the rewrite close to beginning of optimizer. The original idea was to keep the subquery expressions in Filter form so that we can push them down as deep as possible. One disadvantage is that, after the subqueries are rewritten in join form, they are not subjected to further optimizations. In this change, we convert the subqueries to join form early in the rewrite phase and then add logic to push the left-semi and left-anti joins down like we do for normal filter ops. I can think of the following advantages :
(P.S Thanks to Natt for his original work in here. I have based this pr on his work)
How was this patch tested?
A new suite LeftSemiOrAntiPushDownSuite is added. Existing subquery suite should verify the results and any potential regressions.