Tonix517 commented Jul 5, 2019

What changes were proposed in this pull request?

In the Dataset drop(col: Column) method, attributes were compared with equals instead of semanticEquals, which caused incorrect case-sensitivity behavior: a column referenced with a different letter case (for example testData("KEY") for a column named key) was not dropped. When attributes of a LogicalPlan are checked for equality, semanticEquals should be used instead.
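
To make the difference concrete, here is a minimal plain-Scala sketch. `Attr` and `DropModel` are hypothetical stand-ins for illustration, not Spark classes. The sketch assumes that a resolved attribute keeps the spelling the user typed while sharing an expression ID with the plan's output attribute, and that the canonical form ignores the name.

```scala
// Toy model only: `Attr` is a hypothetical stand-in, not Spark's attribute class.
final case class Attr(name: String, exprId: Long) {
  // Assumption: canonicalization erases the cosmetic name and keeps the ID.
  def canonicalized: Attr = copy(name = "none")
  def semanticEquals(other: Attr): Boolean = canonicalized == other.canonicalized
}

object DropModel {
  // Stand-in for logicalPlan.output.
  val output: Seq[Attr] = Seq(Attr("key", 1L), Attr("value", 2L))

  // Assumed result of resolving "KEY" case-insensitively: same ID, user's spelling.
  val resolved: Attr = Attr("KEY", 1L)

  // Structural equality also compares the name, so nothing is filtered out.
  val withEquals: Seq[String] =
    output.filter(attr => attr != resolved).map(_.name)

  // Semantic equality ignores the name, so `key` is filtered out.
  val withSemanticEquals: Seq[String] =
    output.filter(attr => !attr.semanticEquals(resolved)).map(_.name)

  def main(args: Array[String]): Unit = {
    println(withEquals)         // List(key, value)
    println(withSemanticEquals) // List(value)
  }
}
```

The two filters mirror the removed and added lines of the patch; everything else in the sketch is scaffolding.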

A similar PR I referred to: #22713 created by @mgaido91

How was this patch tested?

  • Added a new unit test case in DataFrameSuite
  • ./build/sbt "testOnly org.apache.spark.sql.*"
  • The Python code from the ticket reporter at https://issues.apache.org/jira/browse/SPARK-28189

}

test("drop column using drop with column reference with case-insensitive names") {
val col1 = testData("KEY")

Member

Could you check this on both cases: spark.sql.caseSensitive=true/false?

Member

+1 for @maropu 's comment.

Author

With that flag set to true, an exception is thrown: org.apache.spark.sql.AnalysisException: Cannot resolve column name "KEY" among (key, value);. That looks like the correct behavior in that case. Do I still need to cover it in this test case and make sure the exception is thrown and caught?
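
A minimal plain-Scala sketch of why the lookup itself fails in case-sensitive mode, before drop is ever reached. `ResolveModel`, `resolve`, and the `caseSensitive` parameter are hypothetical names for illustration; Spark's real resolution logic is more involved and throws an AnalysisException rather than returning a `Left`.

```scala
object ResolveModel {
  // Stand-in for the column names of testData.
  val columns: Seq[String] = Seq("key", "value")

  // Hypothetical helper: look up a column name under the given case-sensitivity setting.
  def resolve(name: String, caseSensitive: Boolean): Either[String, String] = {
    val matches: String => Boolean =
      if (caseSensitive) (c: String) => c == name
      else (c: String) => c.equalsIgnoreCase(name)
    columns.find(matches).toRight(
      s"""Cannot resolve column name "$name" among (${columns.mkString(", ")})""")
  }

  def main(args: Array[String]): Unit = {
    println(resolve("KEY", caseSensitive = false)) // Right(key)
    println(resolve("KEY", caseSensitive = true))  // Left(Cannot resolve column name "KEY" among (key, value))
  }
}
```

In this model the case-sensitive lookup never produces a column to pass to drop, which matches the observation that the exception comes from testData("KEY") itself.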

Member

Yep. For case-sensitivity issues, we need test coverage for both.

  val attrs = this.logicalPlan.output
  val colsAfterDrop = attrs.filter { attr =>
-   attr != expression
+   !attr.semanticEquals(expression)

Member

Is there no other place that has the same issue?

Author

I went through Dataset.scala and didn't find a similar issue. However, there might be the same problem in other places in our SQL code.

Contributor

There are other places. Please see #21449.

Member

ok, thanks. I think it's ok to only target Dataset.drop in this PR.

Member

+1 for @maropu 's opinion.

Member

maropu commented Jul 5, 2019

ok to test

dongjoon-hyun changed the title from "[SPARK-28189] Use semanticEquals in Dataset drop method for attributes comparison" to "[SPARK-28189][SQL] Use semanticEquals in Dataset drop method for attributes comparison" on Jul 5, 2019

SparkQA commented Jul 5, 2019

Test build #107251 has finished for PR 25055 at commit 7438f90.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Author

Tonix517 commented Jul 5, 2019

retest this please


SparkQA commented Jul 5, 2019

Test build #107260 has finished for PR 25055 at commit ef4122d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

  val attrs = this.logicalPlan.output
  val colsAfterDrop = attrs.filter { attr =>
-   attr != expression
+   !attr.semanticEquals(expression)

nice

Author

Tonix517 commented Jul 5, 2019

retest this please

@dongjoon-hyun
Member

Thank you for the update, @Tonix517.


SparkQA commented Jul 5, 2019

Test build #107289 has finished for PR 25055 at commit 1dc9aa7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

dongjoon-hyun left a comment

+1, LGTM. Thank you for your first contribution, @Tonix517 .
Thank you, @maropu and @mgaido91 , too.
Merged to master.

Author

Tonix517 commented Jul 7, 2019

@dongjoon-hyun @maropu @mgaido91 thank you very much for helping with my first commit. Looking forward to working with you more :)

Tonix517 deleted the SPARK-28189 branch on July 7, 2019 04:56

// With SQL config caseSensitive ON, AnalysisException should be thrown
withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
val e = intercept[AnalysisException] {
testData("KEY")

Member

maropu Jul 7, 2019

Wait, it seems this test is not related to this PR?
Did you want to do something like this instead?

  test("SPARK-28189 drop column using drop with column reference with case-insensitive names") {
    var caseInsensitiveCol: Column = null
    withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
      caseInsensitiveCol = testData("KEY")
    }

    Seq("true", "false").foreach { isCaseSensitive =>
      withSQLConf(SQLConf.CASE_SENSITIVE.key -> isCaseSensitive) {
        val df = testData.drop(caseInsensitiveCol)
        checkAnswer(df, testData.selectExpr("value"))
        assert(df.schema.map(_.name) === Seq("value"))
      }
    }
  }

Member

? @maropu . In your example, the following will fail in the same way because testData only has key and value.

      withSQLConf(SQLConf.CASE_SENSITIVE.key -> isCaseSensitive) {
        caseInsensitiveCol = testData("KEY")

For me, the test case looked okay because caseSensitive mode doesn't allow that kind of unmatched column from the beginning.

Member

Oh, sorry, my bad. That line wasn't needed, and I've updated the example above.
Anyway, I'm just a bit worried about the behaviour change in the code flow below:

scala> val testData = Seq(("a", 1)).toDF("key", "value")

scala> sql("SET spark.sql.caseSensitive=false")
scala> val caseInsensitiveCol = testData("KEY")

scala> sql("SET spark.sql.caseSensitive=true")
scala> testData.drop(caseInsensitiveCol).show()
// v2.4.3
+---+-----+
|key|value|
+---+-----+
|  a|    1|
+---+-----+

// master w/this pr
+-----+
|value|
+-----+
|    1|
+-----+

Member

dongjoon-hyun Jul 7, 2019

  • First, it's not a correct(?) use case to switch the case sensitivity between declaring the variable and using it.
  • Second, I merged this to master only, out of a similar concern. :)
    Since this is a bug fix, I don't think we need documentation for it. And 3.0.0 is a good place to land this kind of behavior change caused by a bug fix.

Member

ok, thanks for the check!

Member

Do you have any suggestion to address your concern? If so, please share it with us~ You're welcome.

Member

NVM. I'm a bit worried not about use cases but about the test coverage.

Member

Yes. The second part of this test case (SQLConf.CASE_SENSITIVE.key -> "true") looks weird to me.

Member

Anyway, we can keep it unchanged. Ideally, we do not need the second part.

Member

ok, I made a PR to drop that part: #25216
Anyway, thanks for the check!

  val attrs = this.logicalPlan.output
  val colsAfterDrop = attrs.filter { attr =>
-   attr != expression
+   !attr.semanticEquals(expression)

Member

  def semanticEquals(other: Expression): Boolean =
    deterministic && other.deterministic && canonicalized == other.canonicalized

Why should the comparison depend on deterministic when all we want is to drop a column?

Member

nvm. The output only contains Attribute
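
The point can be illustrated with a small plain-Scala sketch. `Expr`, `AttrRef`, `RandLike`, and `DeterministicModel` are hypothetical stand-ins, not Spark classes; only the shape of semanticEquals is taken from the definition quoted above. The sketch assumes attributes are always deterministic, so the deterministic guard never blocks the comparison when both sides are attributes.

```scala
// Toy model only: hypothetical stand-ins, not Spark's Expression hierarchy.
sealed trait Expr {
  def deterministic: Boolean
  def canonicalized: Expr
  // Same shape as the definition quoted in this thread.
  def semanticEquals(other: Expr): Boolean =
    deterministic && other.deterministic && canonicalized == other.canonicalized
}

// Assumption: attributes are deterministic; the name is erased when canonicalizing.
final case class AttrRef(name: String, exprId: Long) extends Expr {
  def deterministic: Boolean = true
  def canonicalized: Expr = copy(name = "none")
}

// A non-deterministic expression, e.g. something like a random-number generator.
final case class RandLike(seed: Long) extends Expr {
  def deterministic: Boolean = false
  def canonicalized: Expr = this
}

object DeterministicModel {
  // Two attributes with the same ID match regardless of spelling.
  val attrsMatch: Boolean = AttrRef("key", 1L).semanticEquals(AttrRef("KEY", 1L))

  // A non-deterministic expression is never semantically equal, even to an identical copy.
  val randMatchesItself: Boolean = RandLike(42L).semanticEquals(RandLike(42L))

  def main(args: Array[String]): Unit = {
    println(attrsMatch)        // true
    println(randMatchesItself) // false
  }
}
```

The guard would only matter if a non-deterministic expression appeared on either side of the comparison, which this model, like the comment above, takes not to happen for plan output.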

Member

gatorsmile left a comment

LGTM

rshkv pushed a commit to palantir/spark that referenced this pull request Jan 28, 2021
…ibutes comparison

## What changes were proposed in this pull request?

In Dataset drop(col: Column) method, the `equals` comparison method was used instead of `semanticEquals`, which caused the problem of abnormal case-sensitivity behavior. When attributes of LogicalPlan are checked for equality, `semanticEquals` should be used instead.

A similar PR I referred to: apache#22713 created by mgaido91

## How was this patch tested?

- Added new unit test case in DataFrameSuite
- ./build/sbt "testOnly org.apache.spark.sql.*"
- The python code from ticket reporter at https://issues.apache.org/jira/browse/SPARK-28189

Closes apache#25055 from Tonix517/SPARK-28189.

Authored-by: Tony Zhang <tony.zhang@uber.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
7 participants