[SPARK-11433] [SQL] Cleanup the subquery name after eliminating subquery #9385
Should I use NamedExpression to replace AttributeReference?
|
Jenkins, okay to test. |
|
Logically when we remove |
|
Hi, @cloud-fan and @dbtsai So far, I have only observed these strange ghost qualifier values when reading the optimized logical plan tree; my query did not trigger any issue. Based on my understanding, the usage of qualifiers is still limited in the current code base. It could become an issue when we support more complex SQL syntax and functions. Thus, I submitted this pull request to resolve it. Of course, I will continue to pay attention to this issue in the future. Thanks, Xiao Li |
Could we simplify this to a map and remove the :: Nil we have in the two sub cases? It seems we always return a single-element list in every case, so a map should be fine.
Thank you! I did the change based on your suggestion. : )
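For readers following the suggestion above: when every branch of a flatMap returns a single-element list built with :: Nil, the call is equivalent to a plain map. The sketch below illustrates this with a hypothetical Attr type; it is not the PR's code and not Spark's AttributeReference.

```scala
object FlatMapVsMap {
  // Hypothetical stand-in for an attribute; not Spark's AttributeReference.
  final case class Attr(name: String, qualifiers: Seq[String])

  val attrs: Seq[Attr] = Seq(Attr("a", Seq("mytable")), Attr("b", Nil))

  // Before: every case wraps its single result in a list,
  // so flatMap only unwraps it again.
  val viaFlatMap: Seq[Attr] = attrs.flatMap {
    case a if a.qualifiers.nonEmpty => a.copy(qualifiers = Nil) :: Nil
    case a                          => a :: Nil
  }

  // After: map yields the same elements without the intermediate lists.
  val viaMap: Seq[Attr] = attrs.map {
    case a if a.qualifiers.nonEmpty => a.copy(qualifiers = Nil)
    case a                          => a
  }
}
```

Both values hold the same sequence, so the change is purely a simplification.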
|
@cloud-fan @dbtsai, Jenkins did not start the testing. Could you let Jenkins test it? Thank you! |
|
I think there is some issue in Jenkins. |
|
Jenkins, add to whitelist |
|
Test build #44981 has finished for PR 9385 at commit
|
|
@dbtsai Thank you! Please let me know if you need any extra code change. |
|
Thanks for your contribution, but I'm tempted to not make this change unless there is actually a bug. We are eliminating the subqueries because they will impact optimization and planning. However, keeping the qualifiers around could actually be useful if we want to give better error messages. |
|
@marmbrus I already hit this issue when resolving https://issues.apache.org/jira/browse/SPARK-8658. That means, when comparing two AttributeReferences, we should not compare their qualifiers. That looks like a strange fix, right? |
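To make the comparison question concrete, here is a minimal sketch using a simplified, hypothetical Attr class (Spark's real AttributeReference has more fields, and the helper name sameRef is an assumption made only for illustration). It contrasts equality over all fields with equality that ignores qualifiers.

```scala
object QualifierEquality {
  // Hypothetical, simplified attribute; not Spark's AttributeReference.
  final case class Attr(name: String, exprId: Long, qualifiers: Seq[String] = Nil) {
    // Comparison that ignores qualifiers: only name and expression id matter.
    def sameRef(other: Attr): Boolean =
      name == other.name && exprId == other.exprId
  }

  // The same attribute, once with a leftover subquery name and once without.
  val withStaleQualifier: Attr = Attr("a", 2L, Seq("mytable"))
  val afterElimination: Attr   = Attr("a", 2L)

  // Case-class equality compares every field, including qualifiers.
  val strictEqual: Boolean = withStaleQualifier == afterElimination

  // Ignoring qualifiers treats both values as the same attribute.
  val looseEqual: Boolean = withStaleQualifier.sameRef(afterElimination)
}
```

In this model a leftover qualifier is enough to make two references to the same column compare as different, which is the situation described in the comment above.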
|
If you ran into problems adding qualifiers to |
|
Hi, @marmbrus After digging into the root cause of the Expand failures, I found we still need a deeper cleanup of subquery names in the plan tree after elimination. Let me use the following example to explain what happened in Expand. This query works well if we do not compare the qualifiers when comparing two AttributeReferences. However, if #9216 is merged, our current subquery elimination will cause an incorrect query result.
val sqlDF = sql("select a, b, sum(a) from mytable group by a, b with rollup").explain(true)
Before subquery elimination, the subquery name "mytable" is shown in both upper layers (Aggregate and Expand).
Aggregate [a#2,b#3,grouping__id#5], [a#2,b#3,sum(cast(a#2 as bigint)) AS _c2#4L]
 Expand [0,1,3], [a#2,b#3], grouping__id#5
  Subquery mytable
   Project [_1#0 AS a#2,_2#1 AS b#3]
    LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
After subquery elimination, the subquery name "mytable" is not removed from these two upper layers.
Aggregate [a#2,b#3,grouping__id#5], [a#2,b#3,sum(cast(a#2 as bigint)) AS _c2#4L]
 Expand [0,1,3], [a#2,b#3], grouping__id#5
  Project [_1#0 AS a#2,_2#1 AS b#3]
   LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
In SparkStrategies, we create an array of Projections for the child projection of
case e @ logical.Expand(_, _, _, child) =>
  execution.Expand(e.projections, e.output, planLater(child)) :: Nil
Let me post the incorrect physical plan.
TungstenAggregate(key=[a#2,b#3,grouping__id#12], functions=[(sum(cast(a#2 as bigint)),mode=Final,isDistinct=false)], output=[a#2,b#3,_c2#11L])
 TungstenExchange hashpartitioning(a#2,b#3,grouping__id#12,5)
  TungstenAggregate(key=[a#2,b#3,grouping__id#12], functions=[(sum(cast(a#2 as bigint)),mode=Partial,isDistinct=false)], output=[a#2,b#3,grouping__id#12,currentSum#15L])
   Expand [List(a#2, b#3, 0),List(a#2, b#3, 1),List(a#2, b#3, 3)], [a#2,b#3,grouping__id#12]
    LocalTableScan [a#2,b#3], [[1,2],[2,4]]
For your convenience, below is the correct one (if we do not compare qualifiers in the
TungstenAggregate(key=[a#2,b#3,grouping__id#12], functions=[(sum(cast(a#2 as bigint)),mode=Final,isDistinct=false)], output=[a#2,b#3,_c2#11L])
 TungstenExchange hashpartitioning(a#2,b#3,grouping__id#12,5)
  TungstenAggregate(key=[a#2,b#3,grouping__id#12], functions=[(sum(cast(a#2 as bigint)),mode=Partial,isDistinct=false)], output=[a#2,b#3,grouping__id#12,currentSum#15L])
   Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)], [a#2,b#3,grouping__id#12]
    LocalTableScan [a#2,b#3], [[1,2],[2,4]]
My current fix does not fix this issue yet. |
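The difference between the two Expand nodes in this comment can be reproduced with a small model. The sketch below is an assumption-laden simplification, not Spark's Expand implementation: Attr, NullLit, IntLit and the projections helper are hypothetical names. It builds one projection per rollup bitmask, replacing each group-by attribute whose bit is not set with null, and shows how the result changes depending on whether attribute matching takes qualifiers into account.

```scala
object RollupExpandSketch {
  // Hypothetical, simplified expression types; not Spark's Catalyst classes.
  sealed trait Expr
  final case class Attr(name: String, exprId: Long, qualifiers: Seq[String] = Nil) extends Expr
  case object NullLit extends Expr
  final case class IntLit(value: Int) extends Expr

  // One projection per bitmask. Bit i set means group-by expression i is kept;
  // otherwise the matching child output is replaced by null.
  // `same` decides whether a group-by attribute matches a child output attribute.
  def projections(
      bitmasks: Seq[Int],
      groupBy: Seq[Attr],
      childOutput: Seq[Attr],
      same: (Attr, Attr) => Boolean): Seq[Seq[Expr]] =
    bitmasks.map { mask =>
      val nonSelected = groupBy.zipWithIndex.collect {
        case (g, i) if ((mask >> i) & 1) == 0 => g
      }
      val row: Seq[Expr] = childOutput.map { out =>
        if (nonSelected.exists(g => same(g, out))) NullLit else out
      }
      // The last column carries the grouping id (the bitmask itself).
      row :+ IntLit(mask)
    }

  // Child output after subquery elimination carries no qualifier,
  // while the group-by expressions still carry the stale "mytable".
  val childOutput: Seq[Attr] = Seq(Attr("a", 2L), Attr("b", 3L))
  val groupBy: Seq[Attr] =
    Seq(Attr("a", 2L, Seq("mytable")), Attr("b", 3L, Seq("mytable")))

  // Matching on all fields (qualifiers included): nothing is ever nulled out.
  val strict: Seq[Seq[Expr]] =
    projections(Seq(0, 1, 3), groupBy, childOutput, _ == _)

  // Matching on expression id only: nulls appear where the bit is not set.
  val ignoringQualifiers: Seq[Seq[Expr]] =
    projections(Seq(0, 1, 3), groupBy, childOutput, (x, y) => x.exprId == y.exprId)
}
```

In this model, strict has the shape of the incorrect Expand (every row keeps both a and b), while ignoringQualifiers has the shape of the correct one.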
|
I'm sorry, I don't see where this |
|
Hi, @marmbrus Originally, I thought qualifiers were part of identifiers, like the schema name in a traditional RDBMS. Based on your explanation, this is not true. I made a code change. Please check whether the latest changes make sense. I just merged the latest master (#9216). Unfortunately, it triggered another failure in CachedTableSuite. Will try to see if we can use Thank you for your time. |
|
@marmbrus CachedTableSuite failed due to the same reason. We did not clean up the subquery names. Thus, it is unable to give a correct result when deciding if Exchange is needed. I did the fix by using Now, all the test cases passed. Thanks. |
|
Can we close this now that #9216 is merged? |
|
Sure. Close it. Thank you for your time! |
This fix removes the subquery name from the qualifiers after eliminating the subquery.
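As a rough illustration of that idea, the sketch below uses a tiny hypothetical plan model (Relation, Subquery, Project and Attr are stand-ins, not Spark's Catalyst classes, and this is not the code from the PR). It drops Subquery nodes and, in the same pass, removes the eliminated aliases from the qualifiers of attributes referenced higher in the tree.

```scala
object CleanupQualifiersSketch {
  // Hypothetical, simplified attribute and plan nodes.
  final case class Attr(name: String, exprId: Long, qualifiers: Seq[String] = Nil)

  sealed trait Plan
  final case class Relation(output: Seq[Attr]) extends Plan
  final case class Subquery(alias: String, child: Plan) extends Plan
  final case class Project(projectList: Seq[Attr], child: Plan) extends Plan

  def eliminate(plan: Plan): Plan = {
    // Collect the aliases of all Subquery nodes that will be removed.
    def aliases(p: Plan): Set[String] = p match {
      case Subquery(alias, child) => aliases(child) + alias
      case Project(_, child)      => aliases(child)
      case _: Relation            => Set.empty
    }
    val removed = aliases(plan)

    // Drop Subquery nodes and strip the removed aliases from qualifiers.
    def strip(p: Plan): Plan = p match {
      case Subquery(_, child) => strip(child)
      case Project(list, child) =>
        val cleaned = list.map(a => a.copy(qualifiers = a.qualifiers.filterNot(removed)))
        Project(cleaned, strip(child))
      case r: Relation => r
    }
    strip(plan)
  }
}
```

For example, a Project over Subquery("mytable", ...) whose attribute a carries the qualifier "mytable" comes out as a Project directly over the relation, with an empty qualifier list on a.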