[SPARK-19712][SQL] Move PullupCorrelatedPredicates and RewritePredicateSubquery after OptimizeSubqueries #23211

dilipbiswal · 2018-12-03T22:25:56Z

Currently, predicate subqueries (IN/EXISTS) are converted to joins at the end of the optimizer, in RewritePredicateSubquery. This change moves the rewrite close to the beginning of the optimizer. The original idea was to keep the subquery expressions in Filter form so that they could be pushed down as deep as possible. The disadvantage is that once the subqueries are rewritten into join form, they are not subject to further optimization. With this change, we convert the subqueries to join form early in the rewrite phase and add logic to push left-semi and left-anti joins down, as we already do for normal Filter operators. I can think of the following advantages:

We will produce consistent optimized plans for subqueries written in SQL, with the DataFrame APIs, or as queries that use left semi/anti joins directly.
It will hopefully make the next phase of de-correlation easier when we open up more de-correlation cases. In that case, I think it would be beneficial to expose the rewritten queries to all the other optimization rules.
We can now hopefully get rid of the PullupCorrelatedPredicates rule and combine it with RewritePredicateSubquery. I haven't tried it; I will take it on in a follow-up.
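
To make the rewrite described above concrete, here is a minimal sketch in plain Scala collections (no Spark dependency) of why an EXISTS predicate behaves like a left-semi join and NOT EXISTS like a left-anti join. The names `SemiAntiSketch`, `leftSemi`, `leftAnti`, `Outer`, and `Inner` are hypothetical and only model the semantics; they are not Catalyst APIs, and the sketch ignores the null-aware handling that NOT IN requires.

```scala
// Toy model of left-semi / left-anti join semantics over in-memory rows.
// All names here are illustrative assumptions, not Spark APIs.
object SemiAntiSketch {
  final case class Outer(a: Int, b: Int)
  final case class Inner(c: Int)

  // Left-semi join: keep each left row that has at least one match on the right.
  def leftSemi[A, B](left: Seq[A], right: Seq[B])(cond: (A, B) => Boolean): Seq[A] =
    left.filter(l => right.exists(r => cond(l, r)))

  // Left-anti join: keep each left row that has no match on the right.
  def leftAnti[A, B](left: Seq[A], right: Seq[B])(cond: (A, B) => Boolean): Seq[A] =
    left.filter(l => !right.exists(r => cond(l, r)))

  val outerRows: Seq[Outer] = Seq(Outer(1, 10), Outer(2, 20), Outer(3, 30))
  val innerRows: Seq[Inner] = Seq(Inner(1), Inner(3), Inner(3))

  // SELECT * FROM outer o WHERE EXISTS (SELECT 1 FROM inner i WHERE i.c = o.a)
  val existsFilter: Seq[Outer] = outerRows.filter(o => innerRows.exists(_.c == o.a))

  // The same query expressed as a left-semi join on o.a = i.c.
  val semiJoin: Seq[Outer] = leftSemi(outerRows, innerRows)((o, i) => o.a == i.c)

  // NOT EXISTS corresponds to a left-anti join on the same condition.
  val antiJoin: Seq[Outer] = leftAnti(outerRows, innerRows)((o, i) => o.a == i.c)
}
```

Note that duplicate matches on the right side (`Inner(3)` appears twice) do not duplicate left rows, which is what distinguishes a semi join from an inner join.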

(P.S. Thanks to Natt for his original work here. I have based this PR on his work.)

How was this patch tested?

A new suite, LeftSemiOrAntiPushDownSuite, is added. The existing subquery suites should verify the results and catch any potential regressions.

SparkQA · 2018-12-04T00:11:24Z

Test build #99628 has finished for PR 23211 at commit f4bb126.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dilipbiswal · 2018-12-04T00:28:36Z

retest this please

SparkQA · 2018-12-04T04:00:35Z

Test build #99636 has finished for PR 23211 at commit f4bb126.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2018-12-05T06:15:14Z

I generated the TPC-DS plans to compare the differences after this patch to help review:
wangyum@7e7a1fe#diff-1a4e6beba801fa647e1dcbd61ed7e5bf

dilipbiswal · 2018-12-05T06:34:28Z

@wangyum Thanks. Can you please tell me how you generated this? Also, is it possible to get the runtimes of these queries to see if there are any regressions?

wangyum · 2018-12-06T10:17:29Z

This file was generated by TPCDSQueryOptimizerTracker.scala. Runtimes can be generated with TPCDSQueryBenchmark.scala.

cloud-fan · 2018-12-10T06:16:27Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala

+  }
+
+  def hasScalarSubquery(e: Seq[Expression]): Boolean = {
+    e.find(hasScalarSubquery(_)).isDefined


e.exists(hasScalarSubquery)

@cloud-fan Sure.
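
For readers following the suggestion above, a minimal sketch of why `exists` can replace `find(...).isDefined`. The predicate `isNegative` and the object name `ExistsVsFind` are hypothetical stand-ins; in the diff that role is played by `hasScalarSubquery` applied to a single expression.

```scala
// Illustrative only: Int stands in for Expression, isNegative for the real predicate.
object ExistsVsFind {
  def isNegative(n: Int): Boolean = n < 0

  // Form used in the diff: find the first match, then check whether one was found.
  def viaFind(xs: Seq[Int]): Boolean = xs.find(isNegative(_)).isDefined

  // Suggested form: same result, and it states the intent directly.
  def viaExists(xs: Seq[Int]): Boolean = xs.exists(isNegative)
}
```

Both forms stop at the first matching element; `exists` simply avoids building the intermediate `Option`.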

cloud-fan · 2018-12-10T06:47:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

-      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList)) {
+      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
+        ScalarSubquery.hasScalarSubquery(p1.projectList) ||
+        ScalarSubquery.hasScalarSubquery(p2.projectList)) {


why did we allow it before?

@cloud-fan Let me get back to you on this; I need to debug it again :-)

@cloud-fan One failing test that I needed to address with this change is in SubquerySuite:

select a, (select sum(b) from l l2 where l2.a <=> l1.a) sum_b from l l1

The main reason is that Filter operators with outer references used to be pulled up before the OptimizeSubqueries rule ran, so by the time other optimization rules (such as PushDownPredicate) kicked in, they did not see outer references. With the change in this PR, the outer references are still present. Another way to handle this is to change the PushDownPredicate rule so that filter clauses with outer references are not moved down. Maybe that is the better way to handle it, keeping CollapseProject as it is.
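
As a rough illustration of the alternative floated above (not what this PR implements), a guard of this kind could split a filter condition into conjuncts and hold back any conjunct that mentions an outer reference. The expression types and helper names below (`Expr`, `OuterCol`, `hasOuterRef`, `partitionPushable`) are a hypothetical toy model, not Catalyst classes.

```scala
// Toy expression tree; every name here is an assumption made for illustration.
object OuterRefGuardSketch {
  sealed trait Expr
  final case class Col(name: String) extends Expr
  final case class OuterCol(name: String) extends Expr // column of the enclosing query
  final case class Lit(value: Int) extends Expr
  final case class NullSafeEq(left: Expr, right: Expr) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr

  // True if the expression mentions a column of the enclosing (outer) query.
  def hasOuterRef(e: Expr): Boolean = e match {
    case OuterCol(_)      => true
    case Col(_) | Lit(_)  => false
    case NullSafeEq(l, r) => hasOuterRef(l) || hasOuterRef(r)
    case And(l, r)        => hasOuterRef(l) || hasOuterRef(r)
  }

  // Flatten a conjunction into its individual conjuncts.
  def splitConjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other     => Seq(other)
  }

  // Returns (conjuncts that may be pushed down, conjuncts that must stay in place).
  def partitionPushable(cond: Expr): (Seq[Expr], Seq[Expr]) =
    splitConjuncts(cond).partition(c => !hasOuterRef(c))
}
```

In this model, a condition such as `l2.a <=> l1.a` from the failing query would be represented as `NullSafeEq(Col("a"), OuterCol("a"))` and would be kept where it is.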

cloud-fan · 2018-12-10T06:48:54Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala


+    // Similar to the above Filter over Project
+    // LeftSemi/LeftAnti over Project
+    case join @ Join(p @ Project(pList, grandChild), rightOp, LeftSemiOrAnti(joinType), joinCond)


Shall we create a new rule PushdownLeftSemiOrAntiJoin?

cloud-fan · 2018-12-10T06:51:39Z

To make the PR smaller, can we add an individual rule PushdownLeftSemiOrAntiJoin first?

dilipbiswal · 2018-12-10T19:39:33Z

@cloud-fan Just to make sure: we want this new rule, with associated tests, to verify the pushdown of left semi/anti joins. In the new PR we would first keep the subquery rewrite in the same place, i.e. not move it up, correct?

cloud-fan · 2018-12-11T01:51:47Z

Yes, since PushdownLeftSemiOrAntiJoin rule is useful without subquery.

dilipbiswal · 2018-12-11T05:23:18Z

@cloud-fan Thanks, Wenchen. That makes sense. I will work on creating a smaller PR first.

…ect, Aggregate, Window, Union etc.

## What changes were proposed in this pull request?

This PR adds support for pushing down LeftSemi and LeftAnti joins below operators such as Project, Aggregate, Window, Union etc. This is the initial piece of work that will be needed for the subsequent work of moving the subquery rewrites to the beginning of optimization phase. The larger PR is [here](#23211). This PR addresses the comment at [link](#23211 (comment)).

## How was this patch tested?

Added a new test suite LeftSemiAntiJoinPushDownSuite.

Closes #23750 from dilipbiswal/SPARK-19712-pushleftsemi.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
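
To illustrate the kind of pushdown that commit message describes, here is a small sketch over plain Scala collections showing that a left-semi join placed above a Union or a Project gives the same rows as one pushed below it. The names (`SemiJoinPushdownSketch`, `Rec`, `leftSemi`, `onId`) are hypothetical, and the sketch assumes the join condition only uses columns that the lower operator passes through unchanged; it does not model the other conditions a real optimizer rule has to check.

```scala
// Toy model: Seq[Rec] stands in for a relation; names are illustrative assumptions.
object SemiJoinPushdownSketch {
  final case class Rec(id: Int, name: String)

  // Left-semi join: keep each left row that has at least one match on the right.
  def leftSemi[A, B](left: Seq[A], right: Seq[B])(cond: (A, B) => Boolean): Seq[A] =
    left.filter(l => right.exists(r => cond(l, r)))

  val t1: Seq[Rec] = Seq(Rec(1, "a"), Rec(2, "b"))
  val t2: Seq[Rec] = Seq(Rec(3, "c"), Rec(4, "d"))
  val keys: Seq[Int] = Seq(2, 3, 5)
  val onId: (Rec, Int) => Boolean = (rec, k) => rec.id == k

  // Union: join above the union vs. join pushed into each branch.
  val aboveUnion: Seq[Rec] = leftSemi(t1 ++ t2, keys)(onId)
  val belowUnion: Seq[Rec] = leftSemi(t1, keys)(onId) ++ leftSemi(t2, keys)(onId)

  // Project: the condition only uses `id`, which the projection keeps,
  // so the join can be evaluated before the projection.
  val aboveProject: Seq[Int] = leftSemi(t1.map(_.id), keys)(_ == _)
  val belowProject: Seq[Int] = leftSemi(t1, keys)(onId).map(_.id)
}
```

Pushing the join down lets it discard rows earlier, so the operators above it see less data.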

github-actions · 2020-01-03T00:08:50Z

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

[SPARK-19712] Move PullupCorrelatedPredicates and RewritePredicateSub…

f4bb126

…query after OptimizeSubqueries

cloud-fan mentioned this pull request Dec 10, 2018

[SPARK-26293][SQL] Cast exception when having python udf in subquery #23248

Closed

cloud-fan reviewed Dec 10, 2018

dilipbiswal mentioned this pull request Feb 9, 2019

[SPARK-19712][SQL] Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc. #23750

Closed

dilipbiswal mentioned this pull request Apr 9, 2019

[SPARK-19712][SQL] Pushdown LeftSemi/LeftAnti below join #24331

Closed

dongjoon-hyun added the SQL label Jun 14, 2019

github-actions bot added the Stale label Jan 3, 2020

github-actions bot closed this Jan 4, 2020
