-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-22249][SQL] isin with empty list throws exception on cached DataFrame #19494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -102,6 +102,7 @@ case class InMemoryTableScanExec( | |
| case IsNull(a: Attribute) => statsFor(a).nullCount > 0 | ||
| case IsNotNull(a: Attribute) => statsFor(a).count - statsFor(a).nullCount > 0 | ||
|
|
||
| case In(_: AttributeReference, list: Seq[Expression]) if list.isEmpty => Literal.FalseLiteral | ||
| case In(a: AttributeReference, list: Seq[Expression]) if list.forall(_.isInstanceOf[Literal]) => | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please change it to
or using exists. |
||
| list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] && | ||
| l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _) | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -429,4 +429,19 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext { | |
| checkAnswer(agg_without_cache, agg_with_cache) | ||
| } | ||
| } | ||
|
|
||
| test("SPARK-22249: IN should work also with cached DataFrame") { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This test case is just an end-to-end test. This test will still pass if the optimizer has a change. We also need a unit test case.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do you mean this test case is not enough and I should add a test case to check the proper behavior of the optimizer after the change? |
||
| val df = spark.range(10).cache() | ||
| // with an empty list | ||
| assert(df.filter($"id".isin()).count() == 0) | ||
| // with a non-empty list | ||
| assert(df.filter($"id".isin(2)).count() == 1) | ||
| assert(df.filter($"id".isin(2, 3)).count() == 2) | ||
| df.unpersist() | ||
| val dfNulls = spark.range(10).selectExpr("null as id").cache() | ||
| // with null as value for the attribute | ||
| assert(dfNulls.filter($"id".isin()).count() == 0) | ||
| assert(dfNulls.filter($"id".isin(2, 3)).count() == 0) | ||
| dfNulls.unpersist() | ||
| } | ||
| } | ||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be moved to optimizer rules. It will cover both cached and non-cached cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, this is wrong. If the left value is null, it should return null instead of false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, really. OK, yeah we need to change this. It can be reverted too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this is only for filters, does it make any difference null or false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under this special context of filtering partitions, this
Inwith empty list will result in a false literal in the end, no matter the attribute is null or not. We don't possibly have some expressions likeIsNull(In(a, Nil))as the filter predicate for now.