[SPARK-11433] [SQL] Cleanup the subquery name after eliminating subquery #9385

gatorsmile · 2015-10-30T22:26:05Z

This fix is to remove the subquery name in qualifiers after eliminating subquery.

gatorsmile · 2015-10-30T22:41:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Should I use NamedExpression to replace AttributeReference?

dbtsai · 2015-10-30T22:57:05Z

Jenkins, okay to test.

cloud-fan · 2015-10-31T06:25:43Z

Logically when we remove Subquery, we should remove related qualifiers, however, can you construct a case that hit problems because of not removing qualifiers?

gatorsmile · 2015-10-31T07:33:21Z

Hi, @cloud-fan and @dbtsai

So far, I just observed this strange ghosting qualifiers values when I read the optimized logical tree, but my query did not trigger any issue.

Based on my understanding, usage of qualifiers is still limited in the current code base. It could be a potential issue when we support more complex SQL syntax/functions. Thus, I submitted this pull request to resolve this issue.

Of course, I will continue to pay attention to this issue in the future.

Thanks,

Xiao Li

holdenk · 2015-11-02T09:11:27Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Could we simplify this to a map and remove the :: Nil we have in the two sub cases? since it seems we are always returning a single element list for every case so it should be ok as a map.

Thank you! I did the change based on your suggestion. : )

gatorsmile · 2015-11-04T00:08:20Z

@cloud-fan @dbtsai , Jenkins did not start the testing. Could you let Jenkins to test it?

Thank you!

dbtsai · 2015-11-04T00:50:17Z

I think there is some issue in Jenkins.

dbtsai · 2015-11-04T00:50:43Z

Jenkins, add to whitelist

SparkQA · 2015-11-04T02:48:18Z

Test build #44981 has finished for PR 9385 at commit a26763d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2015-11-04T07:38:16Z

@dbtsai Thank you!

Please let me know if you need any extra code change.

marmbrus · 2015-11-07T01:20:23Z

Thanks for your contribution, but I'm tempted to not make this change unless there is actually a bug. We are eliminating the subqueries because they will impact optimization and planning. However, keeping the qualifiers around could actually be useful if we want to give better error messages.

gatorsmile · 2015-11-07T03:23:17Z

@marmbrus I already hit this issue when resolving https://issues.apache.org/jira/browse/SPARK-8658. That means, when comparing two AttributeReferences, we should not compare their qualifiers. That looks a strange fix, right?

marmbrus · 2015-11-07T06:50:38Z

Expression instances have two different kinds of equality:

equals full equality, i.e. capitalization, qualifiers, etc. This is important for example to know if a tree transformation has reached fixed point.
semanticEquals equivalent in execution (i.e. will return the same result)

If you ran into problems adding qualifiers to equals then I think it is likely that that code should actually be using semanticEquals.

gatorsmile · 2015-11-07T07:43:41Z

@marmbrus Thanks!

I will try to change equals to semanticEquals in the pull request #9216. Then, you can decide if this is a right solution.

gatorsmile · 2015-11-11T06:00:12Z

Hi, @marmbrus

After digging the root reason why Expand cases failed, I found we still need a deeper clean of subquery names in the plan tree after elimination.

Let me use the following example to explain what happened in Expand. This query works well if we do not compare the qualifiers when comparing two AttributeReferences. However, if merging #9216, our current subquery elimination will cause an incorrect query result.

val sqlDF = sql("select a, b, sum(a) from mytable group by a, b with rollup").explain(true)

Before subquery elimination, the subquery name "mytable" is shown in all the two upper layers (Aggregate and Expand).

Aggregate [a#2,b#3,grouping__id#5], [a#2,b#3,sum(cast(a#2 as bigint)) AS _c2#4L]
 Expand [0,1,3], [a#2,b#3], grouping__id#5
  Subquery mytable
   Project [_1#0 AS a#2,_2#1 AS b#3]
    LocalRelation [_1#0,_2#1], [[1,2],[2,4]]

After subquery elimination, the subquery name "mytable" is not removed in these two upper layers Aggregate and Expand.

Aggregate [a#2,b#3,grouping__id#5], [a#2,b#3,sum(cast(a#2 as bigint)) AS _c2#4L]
 Expand [0,1,3], [a#2,b#3], grouping__id#5
  Project [_1#0 AS a#2,_2#1 AS b#3]
   LocalRelation [_1#0,_2#1], [[1,2],[2,4]]

In SparkStrategies, we create an array of Projections for the child projection of Expand.

      case e @ logical.Expand(_, _, _, child) =>
        execution.Expand(e.projections, e.output, planLater(child)) :: Nil

e.projections calls the function expand(). Inside the function expand(), I do not think we should use semanticEquals there.

Let me post the incorrect physical plan

TungstenAggregate(key=[a#2,b#3,grouping__id#12], functions=[(sum(cast(a#2 as bigint)),mode=Final,isDistinct=false)], output=[a#2,b#3,_c2#11L])
 TungstenExchange hashpartitioning(a#2,b#3,grouping__id#12,5)
  TungstenAggregate(key=[a#2,b#3,grouping__id#12], functions=[(sum(cast(a#2 as bigint)),mode=Partial,isDistinct=false)], output=[a#2,b#3,grouping__id#12,currentSum#15L])
   Expand [List(a#2, b#3, 0),List(a#2, b#3, 1),List(a#2, b#3, 3)], [a#2,b#3,grouping__id#12]
    LocalTableScan [a#2,b#3], [[1,2],[2,4]]

For you convenience, below is the correct one (if we do not compare qualifiers in the equal method of AttributeReference):

TungstenAggregate(key=[a#2,b#3,grouping__id#12], functions=[(sum(cast(a#2 as bigint)),mode=Final,isDistinct=false)], output=[a#2,b#3,_c2#11L])
 TungstenExchange hashpartitioning(a#2,b#3,grouping__id#12,5)
  TungstenAggregate(key=[a#2,b#3,grouping__id#12], functions=[(sum(cast(a#2 as bigint)),mode=Partial,isDistinct=false)], output=[a#2,b#3,grouping__id#12,currentSum#15L])
   Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)], [a#2,b#3,grouping__id#12]
    LocalTableScan [a#2,b#3], [[1,2],[2,4]]

My current fix does not fix this issue yet.

marmbrus · 2015-11-12T21:23:07Z

I'm sorry, I don't see where this expand() function you are talking about is, or why it should not use semanticEquals. The whole point of semanticEquals is "are these two attributes referring to the same data", and in many cases this is actually what you want instead of structural equality (structural equality matters, for example, when you are checking to see if a tree has reached fixed point. otherwise we could not use rules to make changes to Attributes that are cosmetic in nature).

gatorsmile · 2015-11-14T02:33:41Z

Hi, @marmbrus

Originally, I thought quantifiers are part of identifiers, like schema name in traditional RDBMS. Based on your explanation, this is not true.

I did a code change. Please check if the latest changes make sense. semanticEquals is used. Now, all the test cases passed. gatorsmile@8e72b17

Just did a merge to the latest master. #9216. Unfortunately, it triggered another failure in CachedTableSuite. Will try to see if we can use semanticEquals too.

Thank you for your time.

gatorsmile · 2015-11-14T18:29:22Z

@marmbrus CachedTableSuite failed due to the same reason. We did not clean up the subquery names. Thus, it is unable to give a correct result when deciding if Exchange is needed.

I did the fix by using semanticEquals. Please check if the changes are appropriate. #9216

Now, all the test cases passed. Thanks.

marmbrus · 2015-11-16T23:29:05Z

Can we close this now that #9216 is merged?

gatorsmile · 2015-11-17T02:15:01Z

Sure. Close it. Thank you for your time!

cleanup the subquery name after eliminating subquery

db69ccf

gatorsmile reviewed Oct 30, 2015
View reviewed changes

holdenk reviewed Nov 2, 2015
View reviewed changes

flatmap->map

a26763d

gatorsmile closed this Nov 10, 2015

gatorsmile reopened this Nov 10, 2015

gatorsmile closed this Nov 17, 2015

gatorsmile deleted the eliminateSubQ branch November 17, 2015 02:17

[SPARK-11433] [SQL] Cleanup the subquery name after eliminating subquery #9385

[SPARK-11433] [SQL] Cleanup the subquery name after eliminating subquery #9385

Uh oh!

Conversation

gatorsmile commented Oct 30, 2015

Uh oh!

gatorsmile Oct 30, 2015

Choose a reason for hiding this comment

Uh oh!

dbtsai commented Oct 30, 2015

Uh oh!

cloud-fan commented Oct 31, 2015

Uh oh!

gatorsmile commented Oct 31, 2015

Uh oh!

holdenk Nov 2, 2015

Choose a reason for hiding this comment

Uh oh!

gatorsmile Nov 4, 2015

Choose a reason for hiding this comment

Uh oh!

gatorsmile commented Nov 4, 2015

Uh oh!

dbtsai commented Nov 4, 2015

Uh oh!

dbtsai commented Nov 4, 2015

Uh oh!

SparkQA commented Nov 4, 2015

Uh oh!

gatorsmile commented Nov 4, 2015

Uh oh!

marmbrus commented Nov 7, 2015

Uh oh!

gatorsmile commented Nov 7, 2015

Uh oh!

marmbrus commented Nov 7, 2015

Uh oh!

gatorsmile commented Nov 7, 2015

Uh oh!

gatorsmile commented Nov 11, 2015

Uh oh!

marmbrus commented Nov 12, 2015

Uh oh!

gatorsmile commented Nov 14, 2015

Uh oh!

gatorsmile commented Nov 14, 2015

Uh oh!

marmbrus commented Nov 16, 2015

Uh oh!

gatorsmile commented Nov 17, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants