Skip to content

Conversation

@dilipbiswal
Copy link
Contributor

@dilipbiswal dilipbiswal commented Jul 23, 2018

What changes were proposed in this pull request?

Thanks to @henryr for the original idea at #21049

Description from the original PR :
Subqueries (at least in SQL) have 'bag of tuples' semantics. Ordering
them is therefore redundant (unless combined with a limit).

This patch removes the top sort operators from the subquery plans.

This closes #21049.

How was this patch tested?

Added test cases in SubquerySuite to cover in, exists and scalar subqueries.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@gatorsmile
Copy link
Member

cc @maryannxue

@SparkQA
Copy link

SparkQA commented Jul 24, 2018

Test build #93463 has finished for PR 21853 at commit 191c0eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: remove this blank line

* Optimize all the subqueries inside expression.
*/
object OptimizeSubqueries extends Rule[LogicalPlan] {
private def removeTopLevelSorts(plan: LogicalPlan): LogicalPlan = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: removeTopLevelSort? (I think this func removes a single sort on the top?)

val Subquery(newPlan) = Optimizer.this.execute(Subquery(s.plan))
s.withNewPlan(newPlan)
// At this point we have an optimized subquery plan that we are going to attach
// to this subquery expression. Here we can safely remove any top level sorts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: any top level sort?

| FROM t2
| WHERE t2.c1 = t1.c1
| ORDER BY t2.c2) t2
| ORDER BY t2.c1)
Copy link
Member

@maropu maropu Jul 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: add one space before ORDER (Also, could you check the other indents in the SQL queries below again?)

@maropu
Copy link
Member

maropu commented Jul 24, 2018

LGTM except for minor comments

@maropu
Copy link
Member

maropu commented Jul 24, 2018

Also, could you add Closes #21049 in the description?

@SparkQA
Copy link

SparkQA commented Jul 24, 2018

Test build #93477 has finished for PR 21853 at commit a86cb9f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Jul 24, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Jul 24, 2018

Test build #93485 has finished for PR 21853 at commit a86cb9f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

gatorsmile commented Jul 25, 2018

Ideally, this is not a perfect fix. We can make it more general to remove all the unnecessary sorts during the query planning. However, this optimization is still nice-to-have in Spark 2.4 since the sorts removed by this PR are not rare.

@asfgit asfgit closed this in afb0627 Jul 25, 2018
@gatorsmile
Copy link
Member

LGTM

Thanks! Merged to master.

@dilipbiswal
Copy link
Contributor Author

Thank you very much @gatorsmile and @maropu

cloud-fan pushed a commit that referenced this pull request Nov 4, 2019
…oin/Aggregation

### What changes were proposed in this pull request?
This is somewhat a complement of #21853.
The `Sort` without `Limit` operator in `Join` subquery is useless, it's the same case in `GroupBy` when the aggregation function is order irrelevant, such as `count`, `sum`.
This PR try to remove this kind of `Sort` operator in `SQL Optimizer`.

### Why are the changes needed?
For example,  `select count(1) from (select a from test1 order by a)` is equal to `select count(1) from (select a from test1)`.
'select * from (select a from test1 order by a) t1 join (select b from test2) t2 on t1.a = t2.b' is equal to `select * from (select a from test1) t1 join (select b from test2) t2 on t1.a = t2.b`.

Remove useless `Sort` operator can improve performance.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Adding new UT `RemoveSortInSubquerySuite.scala`

Closes #26011 from WangGuangxin/remove_sorts.

Authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants