[SPARK-24870][SQL]Cache can't work normally if there are case letters in SQL #21823

eatoncys · 2018-07-20T02:18:18Z

What changes were proposed in this pull request?

Modified plan canonicalization so that it is case-insensitive with respect to attribute names.
Before this PR, the cache did not work correctly if the SQL referred to the same column with different letter case,
for example:
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")

sql("select key, sum(case when Key > 0 then 1 else 0 end) as positiveNum " +
  "from src group by key").cache().createOrReplaceTempView("src_cache")
sql(
  s"""select a.key
       from
       (select key from src_cache where positiveNum = 1)a
       left join
       (select key from src_cache )b
       on a.key=b.key
    """).explain

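The physical plan of the SQL is shown in this screenshot: https://user-images.githubusercontent.com/26834091/42979518-3decf0fa-8c05-11e8-9837-d5e4c334cb1f.png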

The subquery "select key from src_cache where positiveNum = 1" on the left side of the join can use the cached data, but the subquery "select key from src_cache" on the right side of the join cannot.

How was this patch tested?

Added a new test.

eatoncys · 2018-07-20T02:19:55Z

cc @cloud-fan @gatorsmile

cloud-fan · 2018-07-20T05:05:35Z

do you know why it's happening? It's super weird that select key from src_cache where positiveNum = 1 can hit the cache but select key from src_cache can not.

eatoncys · 2018-07-20T06:00:02Z

@cloud-fan Because the word 'Key' in the cached SQL "select key, sum(case when Key > 0 then 1 else 0 end) as positiveNum" is written with an uppercase letter, and the field positiveNum is used in the SQL 'select key from src_cache where positiveNum = 1'.
It is not used in the SQL 'select key from src_cache', so the SQL analyzer gets the field 'key' from the original table 'src', which is lowercase.
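
The mismatch described above can be illustrated with a toy model. This is only a sketch: `Attr`, `Plan`, and `sameResultNaive` are hypothetical stand-ins and not Spark classes. It assumes the cached plan keeps the spelling 'Key' while the re-resolved plan takes 'key' from the table schema.

```scala
// Toy model only: Attr and Plan are hypothetical stand-ins, not Spark classes.
object CaseMismatchSketch {
  final case class Attr(name: String, exprId: Long)
  final case class Plan(output: Seq[Attr])

  // Naive comparison: attribute names are compared exactly as written.
  def sameResultNaive(a: Plan, b: Plan): Boolean = a.output == b.output

  // Plan as cached: the attribute keeps the spelling used in the query ("Key").
  val cached: Plan = Plan(Seq(Attr("Key", 1L)))
  // Plan re-resolved against the table schema, which spells the column "key".
  val reResolved: Plan = Plan(Seq(Attr("key", 1L)))

  def main(args: Array[String]): Unit = {
    // The plans differ only in letter case, yet the naive comparison rejects them.
    println(sameResultNaive(cached, reResolved)) // false
  }
}
```

A cache lookup built on such a comparison would miss for the second plan even though both plans compute the same result.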

SparkQA · 2018-07-20T06:16:38Z

Test build #93312 has finished for PR 21823 at commit 2b2a5a3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-07-20T06:41:14Z

so the SQL analyzer gets the field 'key' from the original table 'src', which is lowercase.

shouldn't we always do it?

eatoncys · 2018-07-20T06:45:16Z

@cloud-fan Converting 'Key' to lowercase is done by the ResolveReferences rule:

eatoncys · 2018-07-20T07:07:31Z

@cloud-fan
case j @ Join(left, right, _, _) if !j.duplicateResolved =>
j.copy(right = dedupRight(left, right))
dedupRight generates a new logical plan for the right child, which gets 'key' from the original table 'src'; the left child is not rewritten this way.

cloud-fan · 2018-07-20T07:25:33Z

can we fix this in dedupRight?

eatoncys · 2018-07-20T07:32:27Z

@cloud-fan Fixing this in dedupRight is OK, but there may be other operations like dedupRight that change the case of the name.

eatoncys · 2018-07-20T07:38:51Z

@cloud-fan why not fix this in doCanonicalize? I think it is better to fix it in doCanonicalize, but I'm not very sure.

cloud-fan · 2018-07-20T08:19:37Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala

        // normalize the epxrId too.
        id += 1
-        ar.withExprId(ExprId(id)).canonicalized
+        ar.withExprId(ExprId(id)).withName(ar.name.toLowerCase(Locale.ROOT)).canonicalized


shall we just erase the attribute name like alias?

I think it is OK; the attribute name was also erased in Spark 2.0.2.
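
For comparison, the two normalization strategies discussed here (lowercasing the name with `Locale.ROOT` versus erasing it) can be sketched on a toy attribute type. `Attr`, `lowerCased`, and `erased` are hypothetical names used for illustration only. The last line shows why a fixed locale matters when lowercasing: under a Turkish locale, Java lowercases "I" to a dotless "ı".

```scala
import java.util.Locale

// Toy model only: Attr is a hypothetical stand-in, not Spark's AttributeReference.
object NameNormalizationSketch {
  final case class Attr(name: String, exprId: Long)

  // Strategy 1: lowercase the name using a fixed, locale-independent rule.
  def lowerCased(a: Attr): Attr = a.copy(name = a.name.toLowerCase(Locale.ROOT))

  // Strategy 2: erase the name so that only exprId identifies the attribute.
  def erased(a: Attr): Attr = a.copy(name = "")

  def main(args: Array[String]): Unit = {
    val upper = Attr("Key", 1L)
    val lower = Attr("key", 1L)
    println(lowerCased(upper) == lowerCased(lower)) // true
    println(erased(upper) == erased(lower))         // true
    // Locale-sensitive lowercasing can differ from Locale.ROOT:
    println("I".toLowerCase(Locale.forLanguageTag("tr")) == "i") // false
  }
}
```

Both strategies make the two spellings compare equal; erasing the name removes the name from the comparison entirely.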

cloud-fan · 2018-07-20T08:54:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala

        val ordinal = input.indexOf(ar.exprId)
        if (ordinal == -1) {
-          ar
+          ar.withName("")


let's leave it. We don't even normalize the exprId here.

cloud-fan · 2018-07-20T08:56:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala

        // normalize the epxrId too.
        id += 1
-        ar.withExprId(ExprId(id)).canonicalized
+        ar.withExprId(ExprId(id)).withName("").canonicalized


oh wait. I think we've already erased the name, in Expression#canonicalized

cloud-fan · 2018-07-20T08:57:02Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala

+          ar.withName("")
        } else {
-          ar.withExprId(ExprId(ordinal))
+          ar.withExprId(ExprId(ordinal)).withName("")


I think we just need to add a .canonicalized at the end.
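
A minimal sketch of the idea, on a toy model rather than the real QueryPlan code: renumber each exprId by its position in the input, then canonicalize, which here erases the name. `Attr`, `normalize`, and the toy `canonicalized` are hypothetical. Where exactly `.canonicalized` is applied in the actual patch is not shown in this thread, so the sketch simply applies it to every attribute.

```scala
// Toy model only: these types are hypothetical stand-ins, not Spark classes.
object NormalizeExprIdSketch {
  final case class Attr(name: String, exprId: Long) {
    // Stand-in for the name erasure attributed to Expression#canonicalized above.
    def canonicalized: Attr = copy(name = "")
  }

  // Replace each exprId with its ordinal in `input`, then canonicalize.
  // Attributes not found in `input` keep their exprId (the ordinal == -1 branch).
  def normalize(attrs: Seq[Attr], input: Seq[Long]): Seq[Attr] =
    attrs.map { ar =>
      val ordinal = input.indexOf(ar.exprId)
      val renumbered = if (ordinal == -1) ar else ar.copy(exprId = ordinal.toLong)
      renumbered.canonicalized
    }

  def main(args: Array[String]): Unit = {
    val left = normalize(Seq(Attr("Key", 10L)), Seq(10L))
    val right = normalize(Seq(Attr("key", 42L)), Seq(42L))
    // Different spellings and different exprIds normalize to the same value.
    println(left == right) // true
  }
}
```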

cloud-fan · 2018-07-20T08:59:32Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala

    assert(!range.where(arrays1).sameResult(range.where(arrays3)))
  }
+
+  test("Canonicalized result is not case-insensitive") {


let's move it to SameResultSuite; also, let's pick a simpler test, like using a Project with one column instead of an Aggregate.

OK, modified, thanks.

SparkQA · 2018-07-20T11:34:02Z

Test build #93337 has finished for PR 21823 at commit b5b2a1b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-07-20T11:36:50Z

sql/core/src/test/scala/org/apache/spark/sql/execution/SameResultSuite.scala

+  test("Canonicalized result is not case-insensitive") {
+    val a = AttributeReference("A", IntegerType)()
+    val b = AttributeReference("B", IntegerType)()
+    val planUppercase = Project(Seq(a, b), LocalRelation(a))


we should create valid plans... Project(Seq(a), LocalRelation(a, b))
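
The validity point can be sketched with a toy check: a projection should only reference attributes that its child produces. `Attr`, `Relation`, `Project`, and `isValid` below are hypothetical illustrations and not the Catalyst classes of the same names.

```scala
// Toy model only: these types are hypothetical stand-ins, not Catalyst classes.
object ValidProjectSketch {
  final case class Attr(name: String, exprId: Long)
  final case class Relation(output: Seq[Attr])
  final case class Project(projectList: Seq[Attr], child: Relation) {
    // A projection is well-formed only if every projected attribute comes from the child.
    def isValid: Boolean = projectList.forall(p => child.output.contains(p))
  }

  val a: Attr = Attr("A", 1L)
  val b: Attr = Attr("B", 2L)

  // Shape used in the first version of the test: b is not produced by the child.
  val invalid: Project = Project(Seq(a, b), Relation(Seq(a)))
  // Shape suggested in the review: the child produces a and b, the projection keeps a.
  val valid: Project = Project(Seq(a), Relation(Seq(a, b)))

  def main(args: Array[String]): Unit = {
    println(invalid.isValid) // false
    println(valid.isValid)   // true
  }
}
```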

cloud-fan · 2018-07-20T11:37:00Z

LGTM

SparkQA · 2018-07-20T12:46:47Z

Test build #93333 has finished for PR 21823 at commit 1aefcb3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-20T12:48:32Z

Test build #93332 has finished for PR 21823 at commit c01cf89.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-20T14:04:49Z

Test build #93338 has finished for PR 21823 at commit 86c7ed6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-21T06:11:52Z

Test build #93376 has finished for PR 21823 at commit f2091a4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

eatoncys · 2018-07-23T10:13:45Z

Can we merge it to master? @cloud-fan @gatorsmile

gatorsmile · 2018-07-23T17:06:02Z

sql/core/src/test/scala/org/apache/spark/sql/execution/SameResultSuite.scala

    assert(df3.queryExecution.executedPlan.sameResult(df4.queryExecution.executedPlan))
  }
+
+  test("Canonicalized result is not case-insensitive") {


Canonicalized result is not case-insensitive -> Canonicalized result is case-insensitive

Modified, thanks.

SparkQA · 2018-07-24T05:12:44Z

Test build #93467 has finished for PR 21823 at commit f3a7963.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-07-24T06:04:50Z

Thanks! Merged to master.

Author: 10129659 <chen.yanshan@zte.com.cn>
Closes apache#21823 from eatoncys/canonicalized.
(cherry picked from commit 13a67b0)
RB=1413356 BUG=LIHADOOP-40154 G=superfriends-reviewers R=fli,mshen,yezhou,edlu A=edlu

Cache can't work normally if there are case letters in SQL

2b2a5a3

cloud-fan reviewed Jul 20, 2018

eatoncys added 2 commits July 20, 2018 16:36

erase the attribute name

c01cf89

erase the attribute name

1aefcb3

cloud-fan reviewed Jul 20, 2018

eatoncys added 2 commits July 20, 2018 17:36

using canonicalized

b5b2a1b

using canonicalized

86c7ed6

cloud-fan reviewed Jul 20, 2018

project

f2091a4

gatorsmile reviewed Jul 23, 2018

Canonicalized result is case-insensitive

f3a7963

asfgit closed this in 13a67b0 Jul 24, 2018
