Skip to content

Conversation

@kevinyu98
Copy link
Contributor

During executing PromoteStrings rule, if one side of binaryComparison is StringType and the other side is not StringType, the current code will promote(cast) the StringType to DoubleType, and if the StringType doesn't contain the numbers, it will get null value. So if it is doing <=> (NULL-safe equal) with Null, it will not filter anything, caused the problem reported by this jira.

I proposal to the changes through this PR, can you review my code changes ?

This problem only happen for <=>, other operators works fine.

scala> val filteredDF = df.filter(df("column") > (new Column(Literal(null))))
filteredDF: org.apache.spark.sql.DataFrame = [column: string]

scala> filteredDF.show
+------+
|column|
+------+
+------+

scala> val filteredDF = df.filter(df("column") === (new Column(Literal(null))))
filteredDF: org.apache.spark.sql.DataFrame = [column: string]

scala> filteredDF.show
+------+
|column|
+------+
+------+

scala> df.registerTempTable("DF")

scala> sqlContext.sql("select * from DF where 'column' = NULL")
res27: org.apache.spark.sql.DataFrame = [column: string]

scala> res27.show
+------+
|column|
+------+
+------+

@cloud-fan
Copy link
Contributor

ok to test

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this rule looks weird to me, how about casting to tightest common type of left and right? cc @marmbrus @yhuai @nongli

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems @liancheng added this change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, actually I only simplified the original rule with conciser pattern matching in PR #6537 (here). Tracked down the history and it turned out that this rule had already been there ever since the very first commit of Spark SQL by @marmbrus :) (here).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal here was to mimic hive's type coercion rules. I think if you create a compatibility test like SELECT "0001" = 1 this rule is required (if its not then we could consider changing this).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hive do support SELECT "0001" = 1, however, I think this rule is too simple, how about using findTightestCommonTypeToString?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, this rule will fire first and change the type to DoubleType.
btw I think it's a bad smell to have conflict rules, we should improve it and make sure it only handles cases that missed by ImplicitTypeCasts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan : do you want me to open a new jira to look into this? The new jira/pr will focus on the rules in PromoteStrings and ImplicitTypeCasts, as you suggested to reduce the redundant rules in PromoteStrings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinyu98 I do not think that is really a problem for now. I think we do not need a jira for that right now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinyu98 please hold off until you find something is broken by this and we have to fix it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yhuai @cloud-fan : sure, I will not do that. I will try to run more testing to see if anything is broken.

@SparkQA
Copy link

SparkQA commented Nov 16, 2015

Test build #45967 has finished for PR 9720 at commit bb705ca.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 16, 2015

Test build #45974 has finished for PR 9720 at commit 5a5be06.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

LGTM, thanks for working on this @kevinyu98

@kevinyu98
Copy link
Contributor Author

@cloud-fan and @marmbrus @yhuai @nongli @liancheng : thanks for reviewing the fix.

@yhuai
Copy link
Contributor

yhuai commented Nov 17, 2015

Merging to master and branch 1.6.

asfgit pushed a commit that referenced this pull request Nov 17, 2015
…son between NullType and StringType

During executing PromoteStrings rule, if one side of binaryComparison is StringType and the other side is not StringType, the current code will promote(cast) the StringType to DoubleType, and if the StringType doesn't contain the numbers, it will get null value. So if it is doing <=> (NULL-safe equal) with Null, it will not filter anything, caused the problem reported by this jira.

I proposal to the changes through this PR, can you review my code changes ?

This problem only happen for <=>, other operators works fine.

scala> val filteredDF = df.filter(df("column") > (new Column(Literal(null))))
filteredDF: org.apache.spark.sql.DataFrame = [column: string]

scala> filteredDF.show
+------+
|column|
+------+
+------+

scala> val filteredDF = df.filter(df("column") === (new Column(Literal(null))))
filteredDF: org.apache.spark.sql.DataFrame = [column: string]

scala> filteredDF.show
+------+
|column|
+------+
+------+

scala> df.registerTempTable("DF")

scala> sqlContext.sql("select * from DF where 'column' = NULL")
res27: org.apache.spark.sql.DataFrame = [column: string]

scala> res27.show
+------+
|column|
+------+
+------+

Author: Kevin Yu <qyu@us.ibm.com>

Closes #9720 from kevinyu98/working_on_spark-11447.

(cherry picked from commit e01865a)
Signed-off-by: Yin Huai <yhuai@databricks.com>
@asfgit asfgit closed this in e01865a Nov 17, 2015
clehene pushed a commit to clehene/spark that referenced this pull request Dec 2, 2015
…son between NullType and StringType

During executing PromoteStrings rule, if one side of binaryComparison is StringType and the other side is not StringType, the current code will promote(cast) the StringType to DoubleType, and if the StringType doesn't contain the numbers, it will get null value. So if it is doing <=> (NULL-safe equal) with Null, it will not filter anything, caused the problem reported by this jira.

I proposal to the changes through this PR, can you review my code changes ?

This problem only happen for <=>, other operators works fine.

scala> val filteredDF = df.filter(df("column") > (new Column(Literal(null))))
filteredDF: org.apache.spark.sql.DataFrame = [column: string]

scala> filteredDF.show
+------+
|column|
+------+
+------+

scala> val filteredDF = df.filter(df("column") === (new Column(Literal(null))))
filteredDF: org.apache.spark.sql.DataFrame = [column: string]

scala> filteredDF.show
+------+
|column|
+------+
+------+

scala> df.registerTempTable("DF")

scala> sqlContext.sql("select * from DF where 'column' = NULL")
res27: org.apache.spark.sql.DataFrame = [column: string]

scala> res27.show
+------+
|column|
+------+
+------+

Author: Kevin Yu <qyu@us.ibm.com>

Closes apache#9720 from kevinyu98/working_on_spark-11447.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants