[SPARK-28189][SQL] Use semanticEquals in Dataset drop method for attributes comparison #25055

Tonix517 · 2019-07-05T00:10:38Z

What changes were proposed in this pull request?

In the Dataset drop(col: Column) method, attributes were compared with equals instead of semanticEquals, which caused abnormal case-sensitivity behavior. When attributes of a LogicalPlan are checked for equality, semanticEquals should be used instead.
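To make the difference concrete, here is a minimal, self-contained sketch that does not depend on Spark. `Attr` and `DropModel` are hypothetical stand-ins invented for illustration, not Spark's `AttributeReference` or `Dataset`. The sketch assumes that a column looked up as "KEY" under case-insensitive analysis keeps the identity of `key` but carries the spelling the user typed, and that the canonical form ignores that spelling.

```scala
// Simplified stand-in for a resolved attribute; NOT Spark's AttributeReference.
final case class Attr(name: String, exprId: Long) {
  // Assumption of this model: the canonical form erases the user-typed name
  // and keeps only the identity of the column.
  def canonicalized: Attr = copy(name = "none")

  def semanticEquals(other: Attr): Boolean =
    canonicalized == other.canonicalized
}

object DropModel {
  // Structural comparison, as before the patch: "KEY" and "key" differ,
  // so nothing is filtered out.
  def dropByEquals(output: Seq[Attr], target: Attr): Seq[Attr] =
    output.filter(attr => attr != target)

  // Canonical comparison, as after the patch: same identity means same column.
  def dropBySemanticEquals(output: Seq[Attr], target: Attr): Seq[Attr] =
    output.filter(attr => !attr.semanticEquals(target))
}
```

With `output = Seq(Attr("key", 1L), Attr("value", 2L))` and `target = Attr("KEY", 1L)`, `dropByEquals` keeps both columns while `dropBySemanticEquals` keeps only `value`, which mirrors the behavior reported in the ticket.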

A similar PR I referred to: #22713 created by @mgaido91

How was this patch tested?

- Added new unit test case in DataFrameSuite
- ./build/sbt "testOnly org.apache.spark.sql.*"
- The python code from ticket reporter at https://issues.apache.org/jira/browse/SPARK-28189

maropu · 2019-07-05T00:51:37Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

  }

+  test("drop column using drop with column reference with case-insensitive names") {
+    val col1 = testData("KEY")


Could you check this on both cases: spark.sql.caseSensitive=true/false?

+1 for @maropu 's comment.

With that flag set to true, an exception is thrown: org.apache.spark.sql.AnalysisException: Cannot resolve column name "KEY" among (key, value);. That looks like the correct behavior in that case. Do I still need to cover it in this test case and make sure the exception is thrown and caught?

Yep. For a case-sensitivity issue, we need test coverage for both.

maropu · 2019-07-05T00:52:18Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

    val attrs = this.logicalPlan.output
    val colsAfterDrop = attrs.filter { attr =>
-      attr != expression
+      !attr.semanticEquals(expression)


Is there no other place that has the same issue?

I went through Dataset.scala and didn't find a similar issue. However, there might be the same problem in other places in our SQL code.

There are other places. Please see #21449.

ok, thanks. I think it's ok to only target dataset.drop in this pr.

+1 for @maropu 's opinion.

maropu · 2019-07-05T01:00:44Z

ok to test

SparkQA · 2019-07-05T04:14:21Z

Test build #107251 has finished for PR 25055 at commit 7438f90.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Tonix517 · 2019-07-05T05:22:13Z

retest this please

SparkQA · 2019-07-05T07:05:01Z

Test build #107260 has finished for PR 25055 at commit ef4122d.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

tugangkai · 2019-07-05T11:59:32Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

    val attrs = this.logicalPlan.output
    val colsAfterDrop = attrs.filter { attr =>
-      attr != expression
+      !attr.semanticEquals(expression)


Tonix517 · 2019-07-05T18:44:30Z

retest this please

dongjoon-hyun · 2019-07-05T18:46:43Z

Thank you for the update, @Tonix517 .

SparkQA · 2019-07-05T21:48:07Z

Test build #107289 has finished for PR 25055 at commit 1dc9aa7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM. Thank you for your first contribution, @Tonix517 .
Thank you, @maropu and @mgaido91 , too.
Merged to master.

Tonix517 · 2019-07-07T04:54:20Z

@dongjoon-hyun @maropu @mgaido91 thank you guys very much for helping with my first commit - looking forward to working with you more :)

maropu · 2019-07-07T05:26:31Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

+    // With SQL config caseSensitive ON, AnalysisException should be thrown
+    withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
+      val e = intercept[AnalysisException] {
+        testData("KEY")


Wait, it seems this test is not related to this PR?
Did you just want to do something like this?

    test("SPARK-28189 drop column using drop with column reference with case-insensitive names") {
      var caseInsensitiveCol: Column = null
      withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
        caseInsensitiveCol = testData("KEY")
      }
      Seq("true", "false").foreach { isCaseSensitive =>
        withSQLConf(SQLConf.CASE_SENSITIVE.key -> isCaseSensitive) {
          val df = testData.drop(caseInsensitiveCol)
          checkAnswer(df, testData.selectExpr("value"))
          assert(df.schema.map(_.name) === Seq("value"))
        }
      }
    }

@maropu , in your example, the following will fail in the same way because testData only has key and value.

    withSQLConf(SQLConf.CASE_SENSITIVE.key -> isCaseSensitive) {
      caseInsensitiveCol = testData("KEY")

For me, the test case looked okay because caseSensitive mode doesn't allow that kind of unmatched column from the beginning.

Oh, sorry, my bad. That line wasn't needed, and I updated the example above.
Anyway, I'm just a bit worried about the behaviour change in the code flow below:

    scala> val testData = Seq(("a", 1)).toDF("key", "value")
    scala> sql("SET spark.sql.caseSensitive=false")
    scala> val caseInsensitiveCol = testData("KEY")
    scala> sql("SET spark.sql.caseSensitive=true")
    scala> testData.drop(caseInsensitiveCol).show()

    // v2.4.3
    +---+-----+
    |key|value|
    +---+-----+
    |  a|    1|
    +---+-----+

    // master w/this pr
    +-----+
    |value|
    +-----+
    |    1|
    +-----+

First, it's not a correct(?) use-case to switch the case sensitivity between declaring the variable and using the variable.

Second, I merged this to master only, because of a similar concern. :)
Since this is a bug fix, I don't think we need documentation about this. And 3.0.0 is a good place for behavior changes that result from a bug fix.

ok, thanks for the check!

Do you have any suggestion for your concern? Then, please share it with us~ You're welcome.

NVM. I'm a bit worried not about use cases but about the test coverage.

Yes. The second part of this test case (SQLConf.CASE_SENSITIVE.key -> "true") is weird to me.

Anyway, we can keep it unchanged. Ideally, we do not need the second part.

ok, I made a pr to drop that part: #25216
anyway, thanks for the check!

gatorsmile · 2019-07-21T03:05:06Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

    val attrs = this.logicalPlan.output
    val colsAfterDrop = attrs.filter { attr =>
-      attr != expression
+      !attr.semanticEquals(expression)


    def semanticEquals(other: Expression): Boolean =
      deterministic && other.deterministic && canonicalized == other.canonicalized

Why should the comparison depend on deterministic when all we want is to drop the column?

nvm. The output only contains Attribute
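The point can be checked with a small model. The sketch below is an illustration only: `Expr`, `AttrRef` and `Rand` are hypothetical types that copy the shape of the `semanticEquals` definition quoted above, not Spark's own classes. It assumes an attribute is always deterministic, since it merely refers to an existing column, so for a plan's output the check reduces to comparing canonical forms.

```scala
// Hypothetical mini expression hierarchy; NOT Spark's Expression classes.
sealed trait Expr {
  def deterministic: Boolean
  def canonicalized: Expr

  // Same shape as the definition quoted in the review comment.
  def semanticEquals(other: Expr): Boolean =
    deterministic && other.deterministic && canonicalized == other.canonicalized
}

// An attribute only refers to an existing column, so this model treats it
// as always deterministic. Its canonical form ignores the spelling of the name.
final case class AttrRef(name: String, exprId: Long) extends Expr {
  val deterministic: Boolean = true
  def canonicalized: Expr = AttrRef("none", exprId)
}

// A non-deterministic expression (think of a random-number call) is never
// semantically equal to anything in this model, not even to itself.
final case class Rand(seed: Long) extends Expr {
  val deterministic: Boolean = false
  def canonicalized: Expr = this
}
```

In this model `AttrRef("KEY", 1L).semanticEquals(AttrRef("key", 1L))` is true while `Rand(42L).semanticEquals(Rand(42L))` is false, so the `deterministic` guard only matters for expressions that cannot appear in the filtered output list.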

gatorsmile

LGTM

…ibutes comparison

## What changes were proposed in this pull request?

In Dataset drop(col: Column) method, the `equals` comparison method was used instead of `semanticEquals`, which caused the problem of abnormal case-sensitivity behavior. When attributes of LogicalPlan are checked for equality, `semanticEquals` should be used instead.

A similar PR I referred to: apache#22713 created by mgaido91

## How was this patch tested?

- Added new unit test case in DataFrameSuite
- ./build/sbt "testOnly org.apache.spark.sql.*"
- The python code from ticket reporter at https://issues.apache.org/jira/browse/SPARK-28189

Closes apache#25055 from Tonix517/SPARK-28189.

Authored-by: Tony Zhang <tony.zhang@uber.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

[SPARK-28189] Use semanticEquals in Dataset drop method

7438f90

maropu reviewed Jul 5, 2019

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

dongjoon-hyun reviewed Jul 5, 2019

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

dongjoon-hyun changed the title ~~[SPARK-28189] Use semanticEquals in Dataset drop method for attributes comparison~~ [SPARK-28189][SQL] Use semanticEquals in Dataset drop method for attributes comparison Jul 5, 2019

dongjoon-hyun added the SQL label Jul 5, 2019

[SPARK-28189][SQL] Use semanticEquals in Dataset drop method

ef4122d

tugangkai reviewed Jul 5, 2019

[SPARK-28189][SQL] Use semanticEquals in Dataset drop method

1dc9aa7

dongjoon-hyun approved these changes Jul 7, 2019

dongjoon-hyun closed this in 20469d4 Jul 7, 2019

Tonix517 deleted the SPARK-28189 branch July 7, 2019 04:56

maropu reviewed Jul 7, 2019

gatorsmile reviewed Jul 21, 2019

cloud-fan mentioned this pull request Jan 31, 2020

backport [SPARK-27747][SPARK-27816][SPARK-28344] #27417

Closed
