[SPARK-16955][SQL] Using ordinals in ORDER BY and GROUP BY causes an analysis error #14546

dongjoon-hyun · 2016-08-08T21:57:25Z

What changes were proposed in this pull request?

Spark supports ordinals in GROUP BY and ORDER BY. However, using both in the same query raises an exception. The root cause is that the ResolveAggregateFunctions rule removes the ordinals before ResolveOrdinalInOrderByAndGroupBy is applied.

Before

scala> sql("select a, count(*) from (select 1 as a) tmp group by 1 order by 1")
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to Group by position: `1` exceeds the size of the select list `0`

After

scala> sql("select a, count(*) from (select 1 as a) tmp group by 1 order by 1").explain
== Physical Plan ==
*HashAggregate(keys=[1#9], functions=[count(1)])
+- Exchange hashpartitioning(1#9, 200)
   +- *HashAggregate(keys=[1 AS 1#9], functions=[partial_count(1)])
      +- Scan OneRowRelation[]

scala> sql("select a, count(*) from (select 1 as a) tmp group by 1 order by a").explain
== Physical Plan ==
*HashAggregate(keys=[1#23], functions=[count(1)])
+- Exchange hashpartitioning(1#23, 200)
   +- *HashAggregate(keys=[1 AS 1#23], functions=[partial_count(1)])
      +- Scan OneRowRelation[]

How was this patch tested?

Pass the Jenkins tests with new test cases.

SparkQA · 2016-08-08T23:59:59Z

Test build #63384 has finished for PR 14546 at commit 12f24dc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-08-09T02:42:18Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Please add a comment here. Thanks!

Thank you, @gatorsmile .

SparkQA · 2016-08-09T02:49:50Z

Test build #63398 has finished for PR 14546 at commit 12bd144.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-09T03:15:30Z

Test build #63401 has finished for PR 14546 at commit 7386ed6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-08-09T05:08:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

The conf conf.orderByOrdinal controls whether integer values are analyzed as positions, but the current fix ignores it. Could you fix that? Also, please add a test case that covers both settings, true and false.

For the false case, you meant to check ResolveAggregateFunctions functionality, right?

SparkQA · 2016-08-09T06:00:03Z

Test build #63408 has finished for PR 14546 at commit 7c3d732.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-09T07:39:02Z

Test build #63418 has finished for PR 14546 at commit 1ca8d59.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-09T07:46:54Z

Test build #63419 has finished for PR 14546 at commit 1dc193a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-08-09T14:54:07Z

Hi, @yhuai .
Could you review this PR?

hvanhovell · 2016-08-10T09:36:48Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

I have the feeling that this guard is wrong. This disables this entire clause if conf.orderByOrdinal is false. Shouldn't it be: !conf.orderByOrdinal || sortOrder.forall(x => IntegerIndex.unapply(x.child).isEmpty)

Aha, I see what I missed. You're right. I will fix it that way.
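
To make the difference between the two guards concrete, here is a minimal standalone sketch. It assumes the guard under review had the form `conf.orderByOrdinal && ...`, which is only inferred from the comment above since the diff is not shown in this thread. `Expr`, `IntLit`, `Attr`, and `GuardSketch` are hypothetical stand-ins, not Catalyst classes.

```scala
// Hypothetical stand-ins for Catalyst expressions; not Spark's real classes.
sealed trait Expr
case class IntLit(value: Int) extends Expr // e.g. the "1" in ORDER BY 1
case class Attr(name: String) extends Expr // e.g. the "a" in ORDER BY a

object GuardSketch {
  // True when none of the sort keys is an integer literal.
  def noIntegerKeys(keys: Seq[Expr]): Boolean =
    keys.forall {
      case IntLit(_) => false
      case _         => true
    }

  // ASSUMED shape of the guard under review: always false when the conf is off,
  // so the clause it protects never runs in that case.
  def reviewedGuard(orderByOrdinal: Boolean, keys: Seq[Expr]): Boolean =
    orderByOrdinal && noIntegerKeys(keys)

  // Guard suggested in the review: always true when the conf is off;
  // when the conf is on, true only if no integer literal is among the keys.
  def suggestedGuard(orderByOrdinal: Boolean, keys: Seq[Expr]): Boolean =
    !orderByOrdinal || noIntegerKeys(keys)

  def main(args: Array[String]): Unit = {
    val cases = Seq(Seq[Expr](IntLit(1)), Seq[Expr](Attr("a")))
    for (conf <- Seq(true, false); keys <- cases) {
      val r = reviewedGuard(conf, keys)
      val s = suggestedGuard(conf, keys)
      println(s"orderByOrdinal=$conf keys=$keys reviewed=$r suggested=$s")
    }
  }
}
```

The two guards agree when the conf is true and differ only when it is false, which is the case the review comment is about.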

SparkQA · 2016-08-10T12:49:20Z

Test build #63527 has finished for PR 14546 at commit f88f9d0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-10T13:17:59Z

Test build #63528 has finished for PR 14546 at commit c3262d6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-08-11T00:05:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

remove is a transitive verb.

Eliminate the useless position numbers?

gatorsmile · 2016-08-11T00:08:59Z

LGTM except minor comments. cc @cloud-fan @hvanhovell

dongjoon-hyun · 2016-08-11T04:16:16Z

Thank you for the review, @gatorsmile.

SparkQA · 2016-08-11T06:22:24Z

Test build #63579 has finished for PR 14546 at commit 32c639c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

clockfly · 2016-08-11T10:28:34Z

I believe this doesn't fix all the cases.

How about

sql("select count(*), a from (select 1 as a) tmp group by 2 having a > 0").show

dongjoon-hyun · 2016-08-11T20:19:40Z

Thank you, @clockfly . I'll check that!

dongjoon-hyun · 2016-08-11T20:52:56Z

Hi, @clockfly .
It is a different issue, since this PR aims to solve the case where both GROUP BY and ORDER BY are present.

However, this PR solves that problem too. Your case throws an exception on current master, but with this PR:

scala> sql("select count(*), a from (select 1 as a) tmp group by 2 having a > 0").show
+--------+---+
|count(1)|  a|
+--------+---+
|       1|  1|
+--------+---+

dongjoon-hyun · 2016-08-11T20:55:38Z

Anyway, I rebased the branch to resolve the conflict and checked your case after rebasing, so you can check out the branch and see the result without conflicts.

SparkQA · 2016-08-11T23:00:56Z

Test build #63631 has finished for PR 14546 at commit 2689755.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

clockfly · 2016-08-11T23:37:55Z

@dongjoon-hyun The exception was muted by line:
https://github.com/apache/spark/pull/14546/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R1257

If you add a log message, you will find that it still throws an exception like:

org.apache.spark.sql.AnalysisException: GROUP BY position 2 is not in select list (valid range is [1, 1]); line 1 pos 53
...

clockfly · 2016-08-11T23:46:43Z

I think the root cause is that the Aggregate operator is treated as resolved even if it has group by ordinals.

For example:

'Filter ('a > 0)
   +- Aggregate [2], [count(1) AS count(1)#83L, a#81]
        +- SubqueryAlias tmp
            +- Project [1 AS a#81]
                 +- OneRowRelation$

Aggregate is treated as resolved even if it has a group by ordinal "2".

Then, it tries to resolve the Filter by putting the Filter as an aggregation expression:

!'Aggregate [2], [('a > 0) AS havingCondition#84] 
 +- SubqueryAlias tmp
    +- Project [1 AS a#81]
       +- OneRowRelation$

This plan is already wrong: we are asking for ordinal "2", but there is only one aggregation expression, [('a > 0) AS havingCondition#84].
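
The mismatch can be reproduced in isolation with a small sketch. It models only the position lookup, using plain strings for the aggregation expressions. `OrdinalLookupSketch` and `resolveGroupByPosition` are hypothetical names, not Spark code, and the error text simply mirrors the message quoted earlier in this thread.

```scala
object OrdinalLookupSketch {
  // Resolve a 1-based GROUP BY position against a list of aggregation expressions.
  // Returns the matching expression, or an error message when out of range.
  def resolveGroupByPosition(position: Int, selectList: Seq[String]): Either[String, String] =
    if (position >= 1 && position <= selectList.size) {
      Right(selectList(position - 1))
    } else {
      Left(s"GROUP BY position $position is not in select list " +
        s"(valid range is [1, ${selectList.size}])")
    }

  def main(args: Array[String]): Unit = {
    // Original aggregate list has two expressions, so position 2 is valid.
    println(resolveGroupByPosition(2, Seq("count(1)", "a")))
    // After the HAVING condition replaces the aggregate list, only one remains.
    println(resolveGroupByPosition(2, Seq("('a > 0) AS havingCondition")))
  }
}
```

The ordinal is only meaningful against the original aggregation list, so any rule that swaps that list before the ordinal is substituted makes the lookup fail.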

clockfly · 2016-08-11T23:50:46Z

A similar case happens with ORDER BY. We don't need an "order by ordinal" to reproduce the analysis error.

'Sort ('a)
   +- Aggregate [2], [count(1) AS count(1)#83L, a#81]
        +- SubqueryAlias tmp
            +- Project [1 AS a#81]
                 +- OneRowRelation$

Aggregate is treated as resolved even if it has a group by ordinal "2".

Then, it tries to resolve the Sort by putting the SortOrder expression of Sort as an aggregation expression:

!'Aggregate [2], ['a] 
 +- SubqueryAlias tmp
    +- Project [1 AS a#81]
       +- OneRowRelation$

This plan is wrong because we are asking for ordinal "2", but there is only one aggregation expression, ['a].

clockfly · 2016-08-11T23:53:01Z

I think a proper fix would be to mark ordinals as unresolved, whether they appear in a GROUP BY or an ORDER BY expression.

Then we can make sure that ResolveAggregateFunctions and other analyzer rules don't assume the ordinals are resolved and perform premature analysis.
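
As a rough illustration of the idea, the sketch below uses a miniature plan model in which an ordinal is a node that reports itself as unresolved. All names here (`Node`, `Column`, `OrdinalMarker`, `Agg`, `UnresolvedOrdinalSketch`) are hypothetical and are not Catalyst's API; the sketch only shows how a pending ordinal could keep the enclosing aggregate unresolved until it is substituted.

```scala
// Hypothetical miniature plan model; names are illustrative, not Catalyst's API.
sealed trait Node { def resolved: Boolean }
case class Column(name: String) extends Node { val resolved = true }
// An ordinal that has not been substituted yet always counts as unresolved.
case class OrdinalMarker(position: Int) extends Node { val resolved = false }
case class Agg(groupBy: Seq[Node], output: Seq[Node]) extends Node {
  // The aggregate is resolved only when every child expression is resolved.
  def resolved: Boolean = groupBy.forall(_.resolved) && output.forall(_.resolved)
}

object UnresolvedOrdinalSketch {
  // Replace each in-range ordinal marker with the output expression it points to.
  def substituteOrdinals(agg: Agg): Agg =
    agg.copy(groupBy = agg.groupBy.map {
      case OrdinalMarker(p) if p >= 1 && p <= agg.output.size => agg.output(p - 1)
      case other => other
    })

  // Stand-in for a later rule that only fires on resolved aggregates, so it
  // cannot rewrite the output list while an ordinal is still pending.
  def laterRuleApplies(agg: Agg): Boolean = agg.resolved

  def main(args: Array[String]): Unit = {
    val agg = Agg(Seq(OrdinalMarker(2)), Seq(Column("count(1)"), Column("a")))
    println(s"before substitution: later rule applies = ${laterRuleApplies(agg)}")
    val substituted = substituteOrdinals(agg)
    println(s"after substitution: later rule applies = ${laterRuleApplies(substituted)}")
  }
}
```

With this shape, rule ordering no longer matters for correctness: a rule that requires a resolved aggregate simply waits until the ordinal has been replaced.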

clockfly · 2016-08-12T07:03:14Z

@dongjoon-hyun I have implemented the idea in #14616
Maybe you can take a look to see what I mean.

yhuai · 2016-08-12T15:38:19Z

@dongjoon-hyun It seems this issue has been fixed as a by-product of #14595. How about we close this? Also, feel free to look at @clockfly's follow-up PR.

dongjoon-hyun · 2016-08-12T17:02:42Z

Yep. I confirmed that it was nicely resolved at 6bf20cd.
Thank you for the review, @yhuai, @clockfly, and @gatorsmile.

gatorsmile reviewed Aug 9, 2016

hvanhovell reviewed Aug 10, 2016

gatorsmile reviewed Aug 11, 2016

dongjoon-hyun closed this Aug 12, 2016

dongjoon-hyun deleted the SPARK-16955-ORDINAL branch January 17, 2018 09:41
