-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-28189][SQL] Use semanticEquals in Dataset drop method for attributes comparison #25055
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } | ||
|
|
||
| test("drop column using drop with column reference with case-insensitive names") { | ||
| val col1 = testData("KEY") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you check this on both cases: spark.sql.caseSensitive=true/false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for @maropu 's comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With that flag set as true, an exception was thrown out org.apache.spark.sql.AnalysisException: Cannot resolve column name "KEY" among (key, value);. It looks to be the correct behavior in that case. do I still need to check it in this test case and make sure such exception is thrown and caught?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. For case-sensitive issue, we need to have a test coverage for both.
| val attrs = this.logicalPlan.output | ||
| val colsAfterDrop = attrs.filter { attr => | ||
| attr != expression | ||
| !attr.semanticEquals(expression) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there no palce having the same issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went through Dataset.scala - didn't find similar issue. However there might be the same problems in other places in our SQL code..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are other places. Please see #21449.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, thanks. I think its ok to only target dataset.drop in this pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for @maropu 's opinion.
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
Outdated
Show resolved
Hide resolved
|
ok to test |
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
Outdated
Show resolved
Hide resolved
|
Test build #107251 has finished for PR 25055 at commit
|
|
retest this please |
|
Test build #107260 has finished for PR 25055 at commit
|
| val attrs = this.logicalPlan.output | ||
| val colsAfterDrop = attrs.filter { attr => | ||
| attr != expression | ||
| !attr.semanticEquals(expression) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
|
retest this please |
|
Thank you for update, @Tonix517 . |
|
Test build #107289 has finished for PR 25055 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
@dongjoon-hyun @maropu @mgaido91 thank you guys very much on helping my first commit - looking forward to working with you more :) |
| // With SQL config caseSensitive ON, AnalysisException should be thrown | ||
| withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") { | ||
| val e = intercept[AnalysisException] { | ||
| testData("KEY") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait, it seems this test is not related to this pr?
you just wanted to do like this?
test("SPARK-28189 drop column using drop with column reference with case-insensitive names") {
var caseInsensitiveCol: Column = null
withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
caseInsensitiveCol = testData("KEY")
}
Seq("true", "false").foreach { isCaseSensitive =>
withSQLConf(SQLConf.CASE_SENSITIVE.key -> isCaseSensitive) {
val df = testData.drop(caseInsensitiveCol)
checkAnswer(df, testData.selectExpr("value"))
assert(df.schema.map(_.name) === Seq("value"))
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
? @maropu . In your example, the following will fail in the same way because testData only has key and value.
withSQLConf(SQLConf.CASE_SENSITIVE.key -> isCaseSensitive) {
caseInsensitiveCol = testData("KEY")
For me, the test case looked okay because caseSensitive mode doesn't allow that kind of unmatched column from the beginning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, sorry and my bad. That line wasn't needed and I updated the example above.
Anyway, I just a bit worry about the behaivour change in the code flow below;
scala> val testData = Seq(("a", 1)).toDF("key", "value")
scala> sql("SET spark.sql.caseSensitive=false")
scala> val caseInsensitiveCol = testData("KEY")
scala> sql("SET spark.sql.caseSensitive=true")
scala> testData.drop(caseInsensitiveCol).show()
// v2.4.3
+---+-----+
|key|value|
+---+-----+
| a| 1|
+---+-----+
// master w/this pr
+-----+
|value|
+-----+
| 1|
+-----+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- First, it's not a correct(?) use-case to switch the case sensitivity between declaring the variable and using the variable.
- Second, I merged this to
masteronly with the similar concern. :)
Since this is a bug fix, I think we don't need a documentation about this. And, 3.0.0 is a good place to change those behavior change due to the bug fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, thanks for the check!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any suggestion for your concern? Then, please share it with us~ You're welcome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NVM. I just a bit worry about not use cases but the test coverage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. This test case of the second part (SQLConf.CASE_SENSITIVE.key -> "true") is weird to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, we can keep it unchanged. Ideally, we do not need the second part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I made a pr to drop that part: #25216
anyway, thanks for the check!
| val attrs = this.logicalPlan.output | ||
| val colsAfterDrop = attrs.filter { attr => | ||
| attr != expression | ||
| !attr.semanticEquals(expression) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def semanticEquals(other: Expression): Boolean =
deterministic && other.deterministic && canonicalized == other.canonicalized
What is the reason the comparison should be related to the deterministic when we want to drop it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm. The output only contains Attribute
gatorsmile
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…ibutes comparison ## What changes were proposed in this pull request? In Dataset drop(col: Column) method, the `equals` comparison method was used instead of `semanticEquals`, which caused the problem of abnormal case-sensitivity behavior. When attributes of LogicalPlan are checked for equality, `semanticEquals` should be used instead. A similar PR I referred to: apache#22713 created by mgaido91 ## How was this patch tested? - Added new unit test case in DataFrameSuite - ./build/sbt "testOnly org.apache.spark.sql.*" - The python code from ticket reporter at https://issues.apache.org/jira/browse/SPARK-28189 Closes apache#25055 from Tonix517/SPARK-28189. Authored-by: Tony Zhang <tony.zhang@uber.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
In Dataset drop(col: Column) method, the
equalscomparison method was used instead ofsemanticEquals, which caused the problem of abnormal case-sensitivity behavior. When attributes of LogicalPlan are checked for equality,semanticEqualsshould be used instead.A similar PR I referred to: #22713 created by @mgaido91
How was this patch tested?