Skip to content

Conversation

@eatoncys
Copy link
Contributor

What changes were proposed in this pull request?

Modified the canonicalized to not case-insensitive.
Before the PR, cache can't work normally if there are case letters in SQL,
for example:
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")

sql("select key, sum(case when Key > 0 then 1 else 0 end) as positiveNum " +
  "from src group by key").cache().createOrReplaceTempView("src_cache")
sql(
  s"""select a.key
       from
       (select key from src_cache where positiveNum = 1)a
       left join
       (select key from src_cache )b
       on a.key=b.key
    """).explain

The physical plan of the sql is:
image

The subquery "select key from src_cache where positiveNum = 1" on the left of join can use the cache data, but the subquery "select key from src_cache" on the right of join cannot use the cache data.

How was this patch tested?

new added test

@eatoncys
Copy link
Contributor Author

cc @cloud-fan @gatorsmile

@cloud-fan
Copy link
Contributor

do you know why it's happening? It's super weird that select key from src_cache where positiveNum = 1 can hit the cache but select key from src_cache can not.

@eatoncys
Copy link
Contributor Author

eatoncys commented Jul 20, 2018

@cloud-fan Because the word 'Key' in the sql of cache "select key, sum(case when Key > 0 then 1 else 0 end) as positiveNum" is Uppercase, and the field positiveNum is used in sql 'select key from src_cache where positiveNum = 1 ';
But not used in sql 'select key from src_cache', so the sql analyzer get the filed of 'key' from the original table 'src', which is lowercase.

@SparkQA
Copy link

SparkQA commented Jul 20, 2018

Test build #93312 has finished for PR 21823 at commit 2b2a5a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

so the sql analyzer get the filed of 'key' from the original table 'src', which is lowercase.

shouldn't we always do it?

@eatoncys
Copy link
Contributor Author

@cloud-fan Cast 'Key' to lower case is done by rule of ResolveReferences:
image

@eatoncys
Copy link
Contributor Author

@cloud-fan
case j @ Join(left, right, _, _) if !j.duplicateResolved =>
j.copy(right = dedupRight(left, right))
dedupRight generate a new logical plan for the right child, which get the 'key' from the original table 'src', but left not.

@cloud-fan
Copy link
Contributor

can we fix this in dedupRight?

@eatoncys
Copy link
Contributor Author

@cloud-fan fix this in dedupRight is Ok, but maybe there are other operations like dedupRight to change the case of the word.

@eatoncys
Copy link
Contributor Author

@cloud-fan why not fix this in doCanonicalize? I think it is better to fix it in doCanonicalize, but I'm not very sure.

// normalize the epxrId too.
id += 1
ar.withExprId(ExprId(id)).canonicalized
ar.withExprId(ExprId(id)).withName(ar.name.toLowerCase(Locale.ROOT)).canonicalized
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we just erase the attribute name like alias?

Copy link
Contributor Author

@eatoncys eatoncys Jul 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is Ok, and it erase the attribute name in spark version 2.0.2

val ordinal = input.indexOf(ar.exprId)
if (ordinal == -1) {
ar
ar.withName("")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's leave it. We don't even normalize the exprId here.

// normalize the epxrId too.
id += 1
ar.withExprId(ExprId(id)).canonicalized
ar.withExprId(ExprId(id)).withName("").canonicalized
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh wait. I think we've already erased the name, in Expression#canonicalized

ar.withName("")
} else {
ar.withExprId(ExprId(ordinal))
ar.withExprId(ExprId(ordinal)).withName("")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we just need to add a .canonicalized at the end.

assert(!range.where(arrays1).sameResult(range.where(arrays3)))
}

test("Canonicalized result is not case-insensitive") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's move it to SameResultSuite, also let's pick a simpler test, like using a Project with one columns instead of Aggregate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok,modified,thanks.

@SparkQA
Copy link

SparkQA commented Jul 20, 2018

Test build #93337 has finished for PR 21823 at commit b5b2a1b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

test("Canonicalized result is not case-insensitive") {
val a = AttributeReference("A", IntegerType)()
val b = AttributeReference("B", IntegerType)()
val planUppercase = Project(Seq(a, b), LocalRelation(a))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should create valid plans... Project(Seq(a), LocalRelation(a, b))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok,thanks.

@cloud-fan
Copy link
Contributor

LGTM

@SparkQA
Copy link

SparkQA commented Jul 20, 2018

Test build #93333 has finished for PR 21823 at commit 1aefcb3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 20, 2018

Test build #93332 has finished for PR 21823 at commit c01cf89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 20, 2018

Test build #93338 has finished for PR 21823 at commit 86c7ed6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 21, 2018

Test build #93376 has finished for PR 21823 at commit f2091a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@eatoncys
Copy link
Contributor Author

Can we merge it to master? @cloud-fan @gatorsmile

assert(df3.queryExecution.executedPlan.sameResult(df4.queryExecution.executedPlan))
}

test("Canonicalized result is not case-insensitive") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Canonicalized result is not case-insensitive -> Canonicalized result is case-insensitive

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified, thanks.

@SparkQA
Copy link

SparkQA commented Jul 24, 2018

Test build #93467 has finished for PR 21823 at commit f3a7963.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Thanks! Merged to master.

@asfgit asfgit closed this in 13a67b0 Jul 24, 2018
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…s in SQL

Modified the canonicalized to not case-insensitive.
Before the PR, cache can't work normally if there are case letters in SQL,
for example:
     sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")

    sql("select key, sum(case when Key > 0 then 1 else 0 end) as positiveNum " +
      "from src group by key").cache().createOrReplaceTempView("src_cache")
    sql(
      s"""select a.key
           from
           (select key from src_cache where positiveNum = 1)a
           left join
           (select key from src_cache )b
           on a.key=b.key
        """).explain

The physical plan of the sql is:
![image](https://user-images.githubusercontent.com/26834091/42979518-3decf0fa-8c05-11e8-9837-d5e4c334cb1f.png)

The subquery "select key from src_cache where positiveNum = 1" on the left of join can use the cache data, but the subquery "select key from src_cache" on the right of join cannot use the cache data.

new added test

Author: 10129659 <chen.yanshan@zte.com.cn>

Closes apache#21823 from eatoncys/canonicalized.

(cherry picked from commit 13a67b0)

RB=1413356
BUG=LIHADOOP-40154
G=superfriends-reviewers
R=fli,mshen,yezhou,edlu
A=edlu
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants