[SPARK-23957][SQL] Sorts in subqueries are redundant and can be removed #21853
Conversation
cc @maryannxue

Test build #93463 has finished for PR 21853 at commit
Diff context:

```scala
      }
    }
  }

```
|
super nit: remove this blank line
Diff context:

```scala
 * Optimize all the subqueries inside expression.
 */
object OptimizeSubqueries extends Rule[LogicalPlan] {
  private def removeTopLevelSorts(plan: LogicalPlan): LogicalPlan = {
```
nit: removeTopLevelSort? (I think this func removes a single sort on the top?)
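For context, a minimal self-contained sketch of what a helper like this might look like (illustrative only; the object name and the exact cases are assumptions, not necessarily the PR's final code):

```scala
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, Sort}

object RemoveTopLevelSortSketch {
  // Strip a single Sort at the root of an (already optimized) subquery plan.
  // A Sort below a Limit never matches here, so limited sorts survive.
  def removeTopLevelSort(plan: LogicalPlan): LogicalPlan = plan match {
    case Sort(_, _, child) => child
    // Looking through a Project is safe: a projection cannot reorder rows.
    case Project(fields, child) => Project(fields, removeTopLevelSort(child))
    case other => other
  }
}
```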
Diff context:

```scala
        val Subquery(newPlan) = Optimizer.this.execute(Subquery(s.plan))
        s.withNewPlan(newPlan)
        // At this point we have an optimized subquery plan that we are going to attach
        // to this subquery expression. Here we can safely remove any top level sorts
```
super nit: any top level sort?
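Putting the pieces together, the call site might look roughly like this (a fragment-level sketch assuming the `removeTopLevelSort` helper shown earlier; the rule actually merged may differ in detail):

```scala
// Inside OptimizeSubqueries#apply: optimize each subquery plan, then drop
// any sort left at its top, since subqueries have bag-of-tuples semantics.
plan transformAllExpressions {
  case s: SubqueryExpression =>
    val Subquery(newPlan) = Optimizer.this.execute(Subquery(s.plan))
    // The optimized plan is about to be attached to the enclosing
    // expression; a top-level sort can no longer affect the result.
    s.withNewPlan(removeTopLevelSort(newPlan))
}
```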
Diff context:

```scala
          | FROM t2
          | WHERE t2.c1 = t1.c1
          | ORDER BY t2.c2) t2
          | ORDER BY t2.c1)
```
super nit: add one space before ORDER (Also, could you check the other indents in the SQL queries below again?)
LGTM except for minor comments

Also, could you add …

Test build #93477 has finished for PR 21853 at commit

retest this please

Test build #93485 has finished for PR 21853 at commit
Admittedly, this is not a perfect fix. We could make it more general and remove all the unnecessary sorts during query planning. However, this optimization is still nice to have in Spark 2.4, since the sorts removed by this PR are not rare.
LGTM Thanks! Merged to master.

Thank you very much @gatorsmile and @maropu
Referenced by a later commit: …oin/Aggregation

### What changes were proposed in this pull request?
This is somewhat a complement of #21853. A `Sort` without a `Limit` operator in a `Join` subquery is useless; the same holds in a `GroupBy` when the aggregation function is order-irrelevant, such as `count` or `sum`. This PR tries to remove this kind of `Sort` operator in the SQL Optimizer.

### Why are the changes needed?
For example, `select count(1) from (select a from test1 order by a)` is equal to `select count(1) from (select a from test1)`, and `select * from (select a from test1 order by a) t1 join (select b from test2) t2 on t1.a = t2.b` is equal to `select * from (select a from test1) t1 join (select b from test2) t2 on t1.a = t2.b`. Removing the useless `Sort` operator can improve performance.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Adding new UT `RemoveSortInSubquerySuite.scala`

Closes #26011 from WangGuangxin/remove_sorts.

Authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
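A rough sketch of the rule that follow-up describes might look like the following (hypothetical names; the actual code in #26011 may differ, and a production version would also have to verify that the aggregate functions are truly order-irrelevant):

```scala
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical sketch: drop an order-only Sort sitting directly under a
// Join child or an Aggregate child. Sorts under a Limit are left alone,
// because this pattern only matches a Sort at the immediate root.
object RemoveSortInSubquerySketch extends Rule[LogicalPlan] {
  private def stripSort(plan: LogicalPlan): LogicalPlan = plan match {
    case Sort(_, _, child) => child
    case other => other
  }

  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case j: Join =>
      j.copy(left = stripSort(j.left), right = stripSort(j.right))
    case a: Aggregate =>
      // Caveat: only valid when the aggregates are order-irrelevant
      // (e.g. count, sum); this sketch does not check that.
      a.copy(child = stripSort(a.child))
  }
}
```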
What changes were proposed in this pull request?
Thanks to @henryr for the original idea at #21049
Description from the original PR:
Subqueries (at least in SQL) have 'bag of tuples' semantics. Ordering
them is therefore redundant (unless combined with a limit).
This patch removes the top sort operators from the subquery plans.
This closes #21049.
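Concretely, the effect can be illustrated as follows (illustrative table and column names, not from the PR):

```scala
// Illustrative only: with the rule in place, both queries below should
// produce the same optimized plan, because the ORDER BY inside the
// IN-subquery cannot change which values it yields (t1/t2/c1/c2 are
// hypothetical tables/columns).
val withSort    = spark.sql("SELECT * FROM t1 WHERE c1 IN (SELECT c1 FROM t2 ORDER BY c2)")
val withoutSort = spark.sql("SELECT * FROM t1 WHERE c1 IN (SELECT c1 FROM t2)")
```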
How was this patch tested?
Added test cases in SubquerySuite to cover in, exists and scalar subqueries.
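A test along these lines could check that no `Sort` node remains in the optimized plan (a sketch of the general shape, not the suite's actual code; table and column names are hypothetical):

```scala
import org.apache.spark.sql.catalyst.plans.logical.Sort

// Sketch: after optimization, no Sort should remain for a subquery
// whose ORDER BY is not combined with a limit.
val df = spark.sql(
  "SELECT * FROM t1 WHERE c1 IN (SELECT c1 FROM t2 ORDER BY c2)")
val sorts = df.queryExecution.optimizedPlan.collect { case s: Sort => s }
assert(sorts.isEmpty, "expected top-level sort in subquery to be removed")
```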
Please review http://spark.apache.org/contributing.html before opening a pull request.