Skip to content

Conversation

@dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Aug 8, 2016

What changes were proposed in this pull request?

Spark supports ordinal in GROUP BY and ORDER BY. However, if we use both at the same time, it causes exceptions. The root cause was that ResolveAggregateFunctions rule removed the ordinals before ResolveOrdinalInOrderByAndGroupBy applied.

Before

scala> sql("select a, count(*) from (select 1 as a) tmp group by 1 order by 1")
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to Group by position: `1` exceeds the size of the select list `0`

After

scala> sql("select a, count(*) from (select 1 as a) tmp group by 1 order by 1").explain
== Physical Plan ==
*HashAggregate(keys=[1#9], functions=[count(1)])
+- Exchange hashpartitioning(1#9, 200)
   +- *HashAggregate(keys=[1 AS 1#9], functions=[partial_count(1)])
      +- Scan OneRowRelation[]

scala> sql("select a, count(*) from (select 1 as a) tmp group by 1 order by a").explain
== Physical Plan ==
*HashAggregate(keys=[1#23], functions=[count(1)])
+- Exchange hashpartitioning(1#23, 200)
   +- *HashAggregate(keys=[1 AS 1#23], functions=[partial_count(1)])
      +- Scan OneRowRelation[]

How was this patch tested?

Pass the Jenkins with new test cases.

@SparkQA
Copy link

SparkQA commented Aug 8, 2016

Test build #63384 has finished for PR 14546 at commit 12f24dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment here. Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @gatorsmile .

@SparkQA
Copy link

SparkQA commented Aug 9, 2016

Test build #63398 has finished for PR 14546 at commit 12bd144.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 9, 2016

Test build #63401 has finished for PR 14546 at commit 7386ed6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@gatorsmile gatorsmile Aug 9, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a conf conf.orderByOrdinal to control whether the integer values are analyzed as positions. Thus, the current fix ignores this conf. Could you fix it? Also added a test case to ensure both options are covered. That is, true and false

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the false case, you meant to check ResolveAggregateFunctions functionality, right?

@SparkQA
Copy link

SparkQA commented Aug 9, 2016

Test build #63408 has finished for PR 14546 at commit 7c3d732.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 9, 2016

Test build #63418 has finished for PR 14546 at commit 1ca8d59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 9, 2016

Test build #63419 has finished for PR 14546 at commit 1dc193a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @yhuai .
Could you review this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the feeling that this guard is wrong. This disables this entire clause if conf.orderByOrdinal is false. Shouldn't it be: !conf.orderByOrdinal || sortOrder.forall(x => IntegerIndex.unapply(x.child).isEmpty)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, I see what I missed. You're right. I will fix like that.

@SparkQA
Copy link

SparkQA commented Aug 10, 2016

Test build #63527 has finished for PR 14546 at commit f88f9d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 10, 2016

Test build #63528 has finished for PR 14546 at commit c3262d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove is a transitive verb.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eliminate the useless position numbers?

@gatorsmile
Copy link
Member

gatorsmile commented Aug 11, 2016

LGTM except minor comments. cc @cloud-fan @hvanhovell

@dongjoon-hyun
Copy link
Member Author

Thank you for review, @gatorsmile .

@SparkQA
Copy link

SparkQA commented Aug 11, 2016

Test build #63579 has finished for PR 14546 at commit 32c639c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@clockfly
Copy link
Contributor

I believe this doesn't fix all the cases.

How about

sql("select count(*), a from (select 1 as a) tmp group by 2 having a > 0").show

@dongjoon-hyun
Copy link
Member Author

Thank you, @clockfly . I'll check that!

@dongjoon-hyun
Copy link
Member Author

Hi, @clockfly .
It is different issue since this PR aims to solve the case of both GROUP BY and ORDER BY existence.

However, this PR solve that problem too. Your case makes exceptions in current master. But in this PR,

scala> sql("select count(*), a from (select 1 as a) tmp group by 2 having a > 0").show
+--------+---+
|count(1)|  a|
+--------+---+
|       1|  1|
+--------+---+

@dongjoon-hyun
Copy link
Member Author

dongjoon-hyun commented Aug 11, 2016

Anyway, I rebased the branch to resolve the conflict. I checked your case after rebasing. So, you can checkout and see the result without conflicts.

@SparkQA
Copy link

SparkQA commented Aug 11, 2016

Test build #63631 has finished for PR 14546 at commit 2689755.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@clockfly
Copy link
Contributor

clockfly commented Aug 11, 2016

@dongjoon-hyun The exception was muted by line:
https://github.com/apache/spark/pull/14546/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R1257

If you add some log message, you will find it still throws exception like:

org.apache.spark.sql.AnalysisException: GROUP BY position 2 is not in select list (valid range is [1, 1]); line 1 pos 53
...

@clockfly
Copy link
Contributor

clockfly commented Aug 11, 2016

I think the root cause is that the Aggregate operator is treated as resolved if even it has group by ordinals.

For example:

'Filter ('a > 0)
   +- Aggregate [2], [count(1) AS count(1)#83L, a#81]
        +- SubqueryAlias tmp
            +- Project [1 AS a#81]
                 +- OneRowRelation$

Aggregate is treated as resolved even if it has a group by ordinal "2".

Then, it tries to resolve the Filter by putting the Filter as an aggregation expression:

!'Aggregate [2], [('a > 0) AS havingCondition#84] 
 +- SubqueryAlias tmp
    +- Project [1 AS a#81]
       +- OneRowRelation$

Actually this plan is already wrong. As we are asking for ordinal "2", but actually there is only one
aggregation expression [('a > 0) AS havingCondition#84]

@clockfly
Copy link
Contributor

Similar case happens to order by. We don't need "order by ordinal" to reproduce the Analysis error.

'Sort ('a)
   +- Aggregate [2], [count(1) AS count(1)#83L, a#81]
        +- SubqueryAlias tmp
            +- Project [1 AS a#81]
                 +- OneRowRelation$

Aggregate is treated as resolved even if it has a group by ordinal "2".

Then, it tries to resolve the Sort by putting the SortOrder expression of Sort as a aggregation expression:

!'Aggregate [2], ['a] 
 +- SubqueryAlias tmp
    +- Project [1 AS a#81]
       +- OneRowRelation$

This plan is wrong because we are asking for ordinal "2", but actually there is only one
aggregation expression ['a]

@clockfly
Copy link
Contributor

I think a proper fix will be marking ordinal unresolved, the ordinal can exists in group by or order by expression.

Then we can make sure the ResolveAggregateFunctions and other analyzer rules doesn't assume
the ordinals are resolved, and do pre-mature Analysis.

@clockfly
Copy link
Contributor

clockfly commented Aug 12, 2016

@dongjoon-hyun I have implemented the idea in #14616
May be you can take a look to see what I mean.

@yhuai
Copy link
Contributor

yhuai commented Aug 12, 2016

@dongjoon-hyun Seems this issue has been fixed as a by-product of #14595. How about we close this? Also, feel free to look at @clockfly's follow-up pr.

@dongjoon-hyun
Copy link
Member Author

Yep. I confirmed that it was nicely resolved at 6bf20cd .
Thank you for review, @yhuai and @clockfly , @gatorsmile .

@dongjoon-hyun dongjoon-hyun deleted the SPARK-16955-ORDINAL branch January 17, 2018 09:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants