@dilipbiswal
Contributor

@dilipbiswal dilipbiswal commented Dec 3, 2018

Currently, predicate subqueries (IN/EXISTS) are converted to joins at the end of the optimizer, in RewritePredicateSubquery. This change moves the rewrite close to the beginning of the optimizer. The original idea was to keep the subquery expressions in Filter form so that they could be pushed down as deep as possible. The disadvantage is that once the subqueries are rewritten into join form, they are not subject to any further optimization. With this change, we convert the subqueries to join form early in the rewrite phase and add logic to push left-semi and left-anti joins down, as we already do for ordinary Filter operators. I can think of the following advantages:

  1. We will produce consistent optimized plans for subqueries whether they are written in SQL, through the DataFrame API, or directly as left semi/anti joins.
  2. It should make the next phase of de-correlation easier when we open up more cases. There, it would be beneficial to expose the rewritten queries to all the other optimization rules, I think.
  3. We can hopefully get rid of the PullupCorrelatedPredicates rule and combine it with RewritePredicateSubquery. I haven't tried this yet and will take it up in a follow-up.
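
To make the Filter form and the join form concrete, here is a minimal sketch in plain Scala collections, without Spark. The row types `Outer` and `Inner`, the sample data, and the helper names are all hypothetical. It only illustrates that an EXISTS filter keeps the same rows as a left-semi join on the same condition, and NOT EXISTS the same rows as a left-anti join. It is not the rewrite performed by RewritePredicateSubquery.

```scala
object SemiAntiSketch {
  // Toy row types for illustration only; these are not Spark classes.
  final case class Outer(a: Int, b: Int)
  final case class Inner(a: Int)

  val outerRows: Seq[Outer] = Seq(Outer(1, 10), Outer(2, 20), Outer(3, 30))
  val innerRows: Seq[Inner] = Seq(Inner(2), Inner(2), Inner(4))

  // Filter form: ... WHERE EXISTS (SELECT 1 FROM inner i WHERE i.a = o.a)
  val existsFilter: Seq[Outer] =
    outerRows.filter(o => innerRows.exists(_.a == o.a))

  // Filter form: ... WHERE NOT EXISTS (SELECT 1 FROM inner i WHERE i.a = o.a)
  val notExistsFilter: Seq[Outer] =
    outerRows.filterNot(o => innerRows.exists(_.a == o.a))

  // Join form: a hash-style left-semi join on the key `a`.
  // Each left row is emitted at most once, however many right rows match.
  def leftSemiOnA(left: Seq[Outer], right: Seq[Inner]): Seq[Outer] = {
    val keys = right.map(_.a).toSet
    left.filter(row => keys.contains(row.a))
  }

  // Join form: a left-anti join keeps the left rows with no match.
  def leftAntiOnA(left: Seq[Outer], right: Seq[Inner]): Seq[Outer] = {
    val keys = right.map(_.a).toSet
    left.filterNot(row => keys.contains(row.a))
  }

  val semiJoin: Seq[Outer] = leftSemiOnA(outerRows, innerRows)
  val antiJoin: Seq[Outer] = leftAntiOnA(outerRows, innerRows)
}
```

The sketch deliberately uses EXISTS / NOT EXISTS rather than IN / NOT IN, since NOT IN has null-aware semantics that a plain anti join does not capture.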

(P.S. Thanks to Natt for his original work here. I have based this PR on his work.)

How was this patch tested?

A new suite, LeftSemiOrAntiPushDownSuite, is added. The existing subquery suites should verify the results and catch any regressions.

@SparkQA

SparkQA commented Dec 4, 2018

Test build #99628 has finished for PR 23211 at commit f4bb126.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please

@SparkQA

SparkQA commented Dec 4, 2018

Test build #99636 has finished for PR 23211 at commit f4bb126.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member

wangyum commented Dec 5, 2018

I generated the TPC-DS plans to compare the differences after this patch to help review:
wangyum@7e7a1fe#diff-1a4e6beba801fa647e1dcbd61ed7e5bf

@dilipbiswal
Contributor Author

@wangyum Thanks. Can you please tell me how you generated this? Also, is it possible to get the runtimes of these queries to see if there are any regressions?

@wangyum
Member

wangyum commented Dec 6, 2018

This file was generated by TPCDSQueryOptimizerTracker.scala. Runtimes can be generated by TPCDSQueryBenchmark.scala.

}

def hasScalarSubquery(e: Seq[Expression]): Boolean = {
e.find(hasScalarSubquery(_)).isDefined
Contributor

e.exists(hasScalarSubquery)
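
The suggestion relies on `exists` being equivalent to `find(...).isDefined` while skipping the intermediate `Option`. A small sketch of that equivalence, assuming a toy expression tree (`Expr`, `ScalarSub`, `Lit`, and `Add` are hypothetical stand-ins, not Catalyst's `Expression` classes):

```scala
object ExistsVsFind {
  // Hypothetical toy expression tree, standing in for Catalyst expressions.
  sealed trait Expr
  case object ScalarSub extends Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(children: Seq[Expr]) extends Expr

  // True if the expression is, or contains, a scalar subquery node.
  def hasScalarSubquery(e: Expr): Boolean = e match {
    case ScalarSub     => true
    case Lit(_)        => false
    case Add(children) => hasScalarSubquery(children)
  }

  // Original form from the diff: builds an Option, then tests it.
  def hasScalarSubqueryViaFind(es: Seq[Expr]): Boolean =
    es.find(hasScalarSubquery(_)).isDefined

  // Suggested form: same result, stated directly.
  def hasScalarSubquery(es: Seq[Expr]): Boolean =
    es.exists(hasScalarSubquery)

  val withSubquery: Seq[Expr]    = Seq(Lit(1), Add(Seq(Lit(2), ScalarSub)))
  val withoutSubquery: Seq[Expr] = Seq(Lit(1), Add(Seq(Lit(2), Lit(3))))
}
```

Both forms stop at the first matching element, so the change is about readability rather than behavior.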

Contributor Author

@cloud-fan Sure.

if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList)) {
if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
ScalarSubquery.hasScalarSubquery(p1.projectList) ||
ScalarSubquery.hasScalarSubquery(p2.projectList)) {
Contributor

@cloud-fan cloud-fan Dec 10, 2018

Why did we allow it before?

Contributor Author

@cloud-fan Let me get back to you on this; I need to debug again :-)

Contributor Author

@cloud-fan One failing test that I needed to address with this change is in SubquerySuite:

select a, (select sum(b) from l l2 where l2.a <=> l1.a) sum_b from l l1

The main reason is that, previously, the Filter operators with outer references were pulled up before the OptimizeSubqueries rule ran. So by the time other optimization rules (such as PushDownPredicate) kicked in, they did not see any outer references. With the change in this PR, the outer references are still present. Another way to handle this is to change the PushDownPredicate rule so that filter clauses with outer references are not moved down. Maybe that is the better way to handle it, keeping CollapseProject as it is.
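
The alternative mentioned above, keeping conjuncts with outer references in place while letting the rest move down, could look roughly like the following. This is a toy model and not Catalyst code: the `Expr` hierarchy, `splitConjuncts`, `hasOuterRef`, and the extra `l2.b > 0` conjunct are all assumptions made for illustration.

```scala
object OuterRefPushdownSketch {
  // Hypothetical toy expressions; not Catalyst's Expression classes.
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class OuterRef(name: String) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class NullSafeEq(left: Expr, right: Expr) extends Expr
  final case class GreaterThan(left: Expr, right: Expr) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr

  // Flatten a tree of ANDs into its individual conjuncts.
  def splitConjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other     => Seq(other)
  }

  // True if the expression mentions a column of the outer query.
  def hasOuterRef(e: Expr): Boolean = e match {
    case OuterRef(_)       => true
    case Attr(_) | Lit(_)  => false
    case NullSafeEq(l, r)  => hasOuterRef(l) || hasOuterRef(r)
    case GreaterThan(l, r) => hasOuterRef(l) || hasOuterRef(r)
    case And(l, r)         => hasOuterRef(l) || hasOuterRef(r)
  }

  // Returns (conjuncts that must stay, conjuncts that may be pushed down).
  def partitionForPushdown(cond: Expr): (Seq[Expr], Seq[Expr]) =
    splitConjuncts(cond).partition(hasOuterRef)

  // l2.a <=> l1.a AND l2.b > 0, where l1.a comes from the outer query.
  // The second conjunct is invented for this example.
  val cond: Expr = And(
    NullSafeEq(Attr("l2.a"), OuterRef("l1.a")),
    GreaterThan(Attr("l2.b"), Lit(0)))

  val (stay, pushable) = partitionForPushdown(cond)
}
```

In this model only `l2.b > 0` is eligible to move below other operators; the correlated `<=>` conjunct stays where the subquery rewrite can still find it.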


// Similar to the above Filter over Project
// LeftSemi/LeftAnti over Project
case join @ Join(p @ Project(pList, grandChild), rightOp, LeftSemiOrAnti(joinType), joinCond)
Contributor

Shall we create a new rule, PushdownLeftSemiOrAntiJoin?

@cloud-fan
Contributor

To make the PR smaller, can we add the individual rule PushdownLeftSemiOrAntiJoin first?

@dilipbiswal
Contributor Author

dilipbiswal commented Dec 10, 2018

@cloud-fan Just to make sure: we want this new rule and the associated tests to verify the pushdown of left semi/anti joins, and we would keep the subquery rewrite where it is for now, i.e. not move it up in the new PR. Correct?

@cloud-fan
Contributor

Yes, since PushdownLeftSemiOrAntiJoin rule is useful without subquery.
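
As a rough illustration of why such a rule makes sense on its own, here is a sketch in plain Scala collections of a left-semi join moved below a projection. The row types, the projection `x = a + 1`, and the data are hypothetical, and the sketch assumes the projection is deterministic so that the join condition can be rewritten in terms of the projection's input. It is not the Catalyst rule.

```scala
object PushSemiThroughProject {
  // Hypothetical toy rows; not Spark classes.
  final case class T(a: Int, b: Int) // input of the projection
  final case class P(x: Int)         // output of the projection: x = a + 1
  final case class R(k: Int)         // right side of the semi join

  val t: Seq[T] = Seq(T(1, 10), T(2, 20), T(3, 30))
  val r: Seq[R] = Seq(R(3), R(4), R(4))

  // Project [a + 1 AS x]
  def project(rows: Seq[T]): Seq[P] = rows.map(row => P(row.a + 1))

  // Before: LeftSemi join on (x = k) sits above the Project.
  val before: Seq[P] =
    project(t).filter(p => r.exists(_.k == p.x))

  // After: the join is pushed below the Project, with the alias x
  // replaced by its definition, so the condition becomes (a + 1 = k).
  val after: Seq[P] =
    project(t.filter(row => r.exists(_.k == row.a + 1)))
}
```

Both plans keep the same rows; the pushed-down form filters before projecting, which is the benefit ordinary Filter pushdown already provides.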

@dilipbiswal
Contributor Author

@cloud-fan Thanks, Wenchen. That makes sense. I will work on creating a smaller PR first.

cloud-fan pushed a commit that referenced this pull request Mar 4, 2019
…ect, Aggregate, Window, Union etc.

## What changes were proposed in this pull request?
This PR adds support for pushing down LeftSemi and LeftAnti joins below operators such as Project, Aggregate, Window, Union etc. This is the initial piece of work that will be needed for the subsequent work of moving the subquery rewrites to the beginning of the optimization phase.

The larger PR is [here](#23211). This PR addresses the comment at [link](#23211 (comment)).
## How was this patch tested?
Added a new test suite LeftSemiAntiJoinPushDownSuite.

Closes #23750 from dilipbiswal/SPARK-19712-pushleftsemi.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@github-actions

github-actions bot commented Jan 3, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 3, 2020
@github-actions github-actions bot closed this Jan 4, 2020