[SPARK-23957][SQL] Sorts in subqueries are redundant and can be removed #21853

dilipbiswal · 2018-07-23T21:41:59Z

What changes were proposed in this pull request?

Thanks to @henryr for the original idea at #21049

Description from the original PR:
Subqueries (at least in SQL) have 'bag of tuples' semantics. Ordering
them is therefore redundant (unless combined with a limit).

This patch removes the top sort operators from the subquery plans.

This closes #21049.
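
As a rough illustration of the idea described above (a minimal sketch on a toy plan type, not Spark's Catalyst classes and not the code in this patch), stripping a sort from the root of a subquery plan could look like the following. `Plan`, `Relation`, `Project`, `Sort`, `Limit`, and `RemoveTopLevelSort` are hypothetical names introduced only for this example.

```scala
// Toy logical-plan algebra. These case classes are illustrative stand-ins,
// not Spark's Catalyst operators.
sealed trait Plan
case class Relation(name: String) extends Plan
case class Project(columns: Seq[String], child: Plan) extends Plan
case class Sort(order: Seq[String], child: Plan) extends Plan
case class Limit(n: Int, child: Plan) extends Plan

object RemoveTopLevelSort {
  // Removes a single Sort sitting at the root of a subquery plan,
  // looking through any Project nodes above it.
  // Anything else, including a Limit over a Sort, is returned unchanged,
  // because under a Limit the ordering decides which rows survive.
  def apply(plan: Plan): Plan = plan match {
    case Sort(_, child)       => child
    case Project(cols, child) => Project(cols, apply(child))
    case other                => other
  }
}
```

Under these assumptions, `RemoveTopLevelSort(Sort(Seq("c1"), Relation("t2")))` yields `Relation("t2")`, while a plan rooted at `Limit` keeps its sort.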

How was this patch tested?

Added test cases in SubquerySuite to cover `IN`, `EXISTS`, and scalar subqueries.


gatorsmile · 2018-07-23T22:09:39Z

cc @maryannxue

SparkQA · 2018-07-24T01:33:50Z

Test build #93463 has finished for PR 21853 at commit 191c0eb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-07-24T02:47:54Z

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

+    }
+  }
 }
+


super nit: remove this blank line

maropu · 2018-07-24T02:48:46Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

   * Optimize all the subqueries inside expression.
   */
  object OptimizeSubqueries extends Rule[LogicalPlan] {
+    private def removeTopLevelSorts(plan: LogicalPlan): LogicalPlan = {


nit: removeTopLevelSort? (I think this func removes a single sort on the top?)

maropu · 2018-07-24T02:49:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

        val Subquery(newPlan) = Optimizer.this.execute(Subquery(s.plan))
-        s.withNewPlan(newPlan)
+        // At this point we have an optimized subquery plan that we are going to attach
+        // to this subquery expression. Here we can safely remove any top level sorts


super nit: any top level sort?

maropu · 2018-07-24T03:11:30Z

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

+           |                      FROM   t2
+           |                      WHERE t2.c1 = t1.c1
+           |                      ORDER  BY t2.c2) t2
+           |              ORDER  BY t2.c1)


super nit: add one space before ORDER (Also, could you check the other indents in the SQL queries below again?)

maropu · 2018-07-24T03:13:43Z

LGTM except for minor comments

maropu · 2018-07-24T03:14:29Z

Also, could you add Closes #21049 in the description?

SparkQA · 2018-07-24T07:05:02Z

Test build #93477 has finished for PR 21853 at commit a86cb9f.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-07-24T07:06:32Z

retest this please

SparkQA · 2018-07-24T11:17:55Z

Test build #93485 has finished for PR 21853 at commit a86cb9f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-07-25T03:45:04Z

This is not a perfect fix. Ideally, we would make it more general and remove all unnecessary sorts during query planning. Still, this optimization is nice to have in Spark 2.4, since the sorts removed by this PR are not rare.

gatorsmile · 2018-07-25T03:58:27Z

LGTM

Thanks! Merged to master.

dilipbiswal · 2018-07-25T03:58:28Z

Thank you very much @gatorsmile and @maropu

…oin/Aggregation

### What changes were proposed in this pull request?

This is somewhat of a complement to #21853. A `Sort` without a `Limit` operator in a `Join` subquery is useless; the same holds in a `GroupBy` when the aggregation function is order-irrelevant, such as `count` or `sum`. This PR tries to remove this kind of `Sort` operator in the SQL optimizer.

### Why are the changes needed?

For example, `select count(1) from (select a from test1 order by a)` is equal to `select count(1) from (select a from test1)`, and `select * from (select a from test1 order by a) t1 join (select b from test2) t2 on t1.a = t2.b` is equal to `select * from (select a from test1) t1 join (select b from test2) t2 on t1.a = t2.b`.

Removing the useless `Sort` operator can improve performance.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added a new UT, `RemoveSortInSubquerySuite.scala`.

Closes #26011 from WangGuangxin/remove_sorts.

Authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
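
The two rewrites named in the commit message above (dropping a `Sort` without `Limit` beneath a join, and beneath an aggregation whose functions ignore input order) can be sketched on a toy plan type. This is a hedged illustration only: `Node`, `Table`, `OrderBy`, `Take`, `Agg`, `InnerJoin`, `EliminateSorts`, and the `orderIrrelevant` allow-list are hypothetical names, and the sketch does not reproduce the rule added in #26011.

```scala
// Toy plan nodes for illustration; not Spark's Catalyst operators.
sealed trait Node
case class Table(name: String) extends Node
case class OrderBy(keys: Seq[String], child: Node) extends Node
case class Take(n: Int, child: Node) extends Node
case class Agg(functions: Seq[String], child: Node) extends Node
case class InnerJoin(left: Node, right: Node) extends Node

object EliminateSorts {
  // Assumed allow-list: aggregates whose result does not depend on input order.
  private val orderIrrelevant = Set("count", "sum")

  // Drops OrderBy nodes at the root of `node`. It stops at the first
  // non-OrderBy node, so a sort below a Take (limit) is preserved.
  private def dropSorts(node: Node): Node = node match {
    case OrderBy(_, child) => dropSorts(child)
    case other             => other
  }

  def apply(node: Node): Node = node match {
    case Agg(fns, child) if fns.forall(orderIrrelevant.contains) =>
      Agg(fns, dropSorts(apply(child)))
    case Agg(fns, child)      => Agg(fns, apply(child))
    case InnerJoin(l, r)      => InnerJoin(dropSorts(apply(l)), dropSorts(apply(r)))
    case OrderBy(keys, child) => OrderBy(keys, apply(child))
    case Take(n, child)       => Take(n, apply(child))
    case t: Table             => t
  }
}
```

With these definitions, the toy equivalent of `select count(1) from (select a from test1 order by a)`, written as `Agg(Seq("count"), OrderBy(Seq("a"), Table("test1")))`, rewrites to `Agg(Seq("count"), Table("test1"))`, whereas a sort guarded by `Take` is left in place.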

[SPARK-23957] Sorts in subqueries are redundant and can be removed

191c0eb


dilipbiswal added 2 commits July 23, 2018 21:33

Code review

4f70245

style

a86cb9f

asfgit closed this in afb0627 Jul 25, 2018

WangGuangxin mentioned this pull request Oct 14, 2019

[SPARK-29343][SQL] Eliminate sorts without limit in the subquery of Join/Aggregation #26011

Closed
