[SPARK-10777] [SQL] Resolve Aliases in the Group By clause #10967
@gatorsmile @yhuai @marmbrus @cloud-fan: Hello all, I tried to run the failing query with PR 10678 from SPARK-12705 and still got the same failure.

Actually, for this JIRA's problem, I can reproduce it without using ORDER BY or a window function. All it takes is a SELECT of an aliased column together with an aggregate function, and a GROUP BY that refers to the alias.

The query looks like this:

```sql
SELECT a r, sum(b) s FROM testData2 GROUP BY r
```

(If I replace `r` in the GROUP BY with `a`, it works.)
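For reference, the failure above can be reproduced in a spark-shell session roughly like the following (a sketch using the Spark 1.6-style `SQLContext` API; the `testData2` data here is a stand-in I made up to mirror the test suite's table, not the suite's actual definition):

```scala
// Sketch of a spark-shell reproduction; `sc` is the shell's SparkContext.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Stand-in for the test suite's testData2: two Int columns, a and b.
val testData2 = Seq((1, 1), (1, 2), (2, 1), (2, 2)).toDF("a", "b")
testData2.registerTempTable("testData2")

// Fails in analysis: the alias r in the GROUP BY cannot be resolved.
sqlContext.sql("SELECT a r, sum(b) s FROM testData2 GROUP BY r").show()

// Works: grouping by the underlying column instead of the alias.
sqlContext.sql("SELECT a r, sum(b) s FROM testData2 GROUP BY a").show()
```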
I think this JIRA is different from Xiao's JIRA.

For this JIRA, it looks like the alias in the GROUP BY clause (`r`) can't be resolved by the `ResolveReferences` rule.
Currently, `ResolveReferences` only handles an `Aggregate` specially when its aggregate expressions contain stars; every other `Aggregate` falls into the generic `case q: LogicalPlan`, which tries to resolve each unresolved attribute against the node's children. In our case the GROUP BY contains the alias `r`, but the child is a `LogicalRDD` whose only columns are `a` and `b`, which is why `r` can't be found in the child.
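For context, the generic fallback in `ResolveReferences` (in `org.apache.spark.sql.catalyst.analysis.Analyzer`) looks roughly like the sketch below; this is a simplified paraphrase, not the exact source:

```scala
// Simplified paraphrase of the generic case in ResolveReferences.
case q: LogicalPlan =>
  q transformExpressionsUp {
    case u @ UnresolvedAttribute(nameParts) =>
      // Resolution only consults the children's output attributes.
      // For our query the child is LogicalRDD [a#4, b#5], so the alias r
      // is not found there and the attribute stays unresolved.
      withPosition(u) {
        q.resolveChildren(nameParts, resolver).getOrElse(u)
      }
  }
```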
Here is what the plan looks like in the debugger:
```
plan = {Aggregate@9173} "'Aggregate ['r], [a#4 AS r#43,(sum(cast(b#5 as bigint)),mode=Complete,isDistinct=false) AS s#44L]\n+- Subquery testData2\n   +- LogicalRDD [a#4,b#5], MapPartitionsRDD[5] at beforeAll at BeforeAndAfterAll.scala:187\n"
  groupingExpressions = {$colon$colon@9176} "::" size = 1
    (0) = {UnresolvedAttribute@9190} "'r"
  aggregateExpressions = {$colon$colon@9177} "::" size = 2
    (0) = {Alias@9110} "a#4 AS r#43"
    (1) = {Alias@9196} "(sum(cast(b#5 as bigint)),mode=Complete,isDistinct=false) AS s#44L"
  child = {Subquery@7456} "Subquery testData2\n+- LogicalRDD [a#4,b#5], MapPartitionsRDD[5] at beforeAll at BeforeAndAfterAll.scala:187\n"
    alias = {String@9201} "testData2"
    child = {LogicalRDD@9202} "LogicalRDD [a#4,b#5], MapPartitionsRDD[5] at beforeAll at BeforeAndAfterAll.scala:187\n"
      analyzed = false
      resolved = true
      cleanArgs = null
      org$apache$spark$Logging$$log = null
      bitmap$0 = 1
      schema = null
      bitmap$0 = false
      origin = {Origin@9203} "Origin(Some(1),Some(27))"
      containsChild = {Set$Set1@9204} "Set$Set1" size = 1
      bitmap$0 = true
    resolved = false
    bitmap$0 = true
  _analyzed = false
  resolved = false
```
The proposed fix is to add another case for `Aggregate`: if there is an unresolved attribute in the `groupingExpressions`, and all the attributes in the `aggregateExpressions` are resolved, we first search for the unresolved attribute among the `aggregateExpressions` (that is, resolve it against the select-list aliases).
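A rough sketch of what that extra case could look like (illustrative only; the guard conditions and the use of `resolver` are my assumptions, not the final patch):

```scala
// Illustrative sketch of the proposed case, not the actual patch.
// If the grouping expressions contain unresolved attributes but the
// aggregate (select-list) expressions are fully resolved, try to map
// each unresolved grouping attribute onto a select-list alias first.
case agg @ Aggregate(groupingExprs, aggExprs, child)
    if aggExprs.forall(_.resolved) && groupingExprs.exists(!_.resolved) =>
  val newGrouping = groupingExprs.map {
    case u: UnresolvedAttribute =>
      // Look for an alias in the select list whose name matches,
      // e.g. `a AS r` for GROUP BY r; fall back to the original
      // attribute so other rules can still report the failure.
      aggExprs.collectFirst {
        case Alias(childExpr, name) if resolver(name, u.name) => childExpr
      }.getOrElse(u)
    case other => other
  }
  agg.copy(groupingExpressions = newGrouping)
```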
Thanks for reviewing.