-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-22249][SQL] isin with empty list throws exception on cached DataFrame #19494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should just be simplified to list.exists(...)
| case In(a: AttributeReference, list: Seq[Expression]) if list.forall(_.isInstanceOf[Literal]) => | ||
| list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] && | ||
| l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _) | ||
| l.asInstanceOf[Literal] <= statsFor(a).upperBound) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still looks more complex and less efficient than
list.exists(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] && l.asInstanceOf[Literal] <= statsFor(a).upperBound)
or better
val stats = statsFor(a)
list.exists { l =>
val literal = l.asInstanceOf[Literal]
stats.lowerBound <= literal && literal <= stats.upperBound
}
The point being that you should be able to short-circuit evaluation here.
Or have I missed something basic like that these aren't Booleans?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly, they aren't Booleans. They are Expressions. And this is a PartialFunction which must return an Expression. Thus I can't think of a different (more efficient) way of doing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. How does .contains(true) work then? or did that not work?
I suppose all I mean is that we should write something that works on an empty list (returns false?) and also short-circuits (stops when anything is true). Is that possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was a mistake, sorry. It returned always false.
I see what you mean, but in this piece of code we are only building the Expression and we are not evaluating it. Thus it is not possible to short-circuit, because the Expression must be built entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I understand. Rather than build an extra term into the expression every time, why not special-case an empty list? return Literal(false) if empty, otherwise return the same thing as before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a bit more elegant and concise as code style, but your idea can be more efficient (though I am not sure how much overhead is introduced by a False evaluation). Then I am updating this PR according to your suggestion, thanks.
|
@srowen I also updated the UT to check all the possible cases. |
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems OK to me. @dongjoon-hyun do you think this is OK? I think you wrote the original lines.
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
ok to test |
|
Test build #82780 has finished for PR 19494 at commit
|
|
retest this please |
|
Test build #82789 has finished for PR 19494 at commit
|
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM too
…taFrame ## What changes were proposed in this pull request? As pointed out in the JIRA, there is a bug which causes an exception to be thrown if `isin` is called with an empty list on a cached DataFrame. The PR fixes it. ## How was this patch tested? Added UT. Author: Marco Gaido <marcogaido91@gmail.com> Closes #19494 from mgaido91/SPARK-22249. (cherry picked from commit 8148f19) Signed-off-by: Sean Owen <sowen@cloudera.com>
|
Merged to master/2.2. It didn't cherry pick cleanly to earlier branches but if it's needed there and someone opens a backport PR I can merge that too |
| case IsNotNull(a: Attribute) => statsFor(a).count - statsFor(a).nullCount > 0 | ||
|
|
||
| case In(_: AttributeReference, list: Seq[Expression]) if list.isEmpty => Literal.FalseLiteral | ||
| case In(a: AttributeReference, list: Seq[Expression]) if list.forall(_.isInstanceOf[Literal]) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change it to
if list.forall(_.isInstanceOf[Literal]) && list.nonEmpty()
or using exists.
| case IsNull(a: Attribute) => statsFor(a).nullCount > 0 | ||
| case IsNotNull(a: Attribute) => statsFor(a).count - statsFor(a).nullCount > 0 | ||
|
|
||
| case In(_: AttributeReference, list: Seq[Expression]) if list.isEmpty => Literal.FalseLiteral |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be moved to optimizer rules. It will cover both cached and non-cached cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, this is wrong. If the left value is null, it should return null instead of false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, really. OK, yeah we need to change this. It can be reverted too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this is only for filters, does it make any difference null or false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under this special context of filtering partitions, this In with empty list will result in a false literal in the end, no matter the attribute is null or not. We don't possibly have some expressions like IsNull(In(a, Nil)) as the filter predicate for now.
| } | ||
| } | ||
|
|
||
| test("SPARK-22249: IN should work also with cached DataFrame") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case is just an end-to-end test. This test will still pass if the optimizer has a change. We also need a unit test case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean this test case is not enough and I should add a test case to check the proper behavior of the optimizer after the change?
|
Sorry for my late review. @mgaido91 Could you submit a follow-up PR to address my comments. Thanks! |
|
I will not revert this PR but please submit the fix ASAP. |
|
I will, thanks for your suggestions @gatorsmile. |
|
I created the followup PR: #19522. Thanks. |
…n the optimizer ## What changes were proposed in this pull request? This PR addresses the comments by gatorsmile on [the previous PR](#19494). ## How was this patch tested? Previous UT and added UT. Author: Marco Gaido <marcogaido91@gmail.com> Closes #19522 from mgaido91/SPARK-22249_FOLLOWUP. (cherry picked from commit 1f25d86) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
…n the optimizer ## What changes were proposed in this pull request? This PR addresses the comments by gatorsmile on [the previous PR](apache#19494). ## How was this patch tested? Previous UT and added UT. Author: Marco Gaido <marcogaido91@gmail.com> Closes apache#19522 from mgaido91/SPARK-22249_FOLLOWUP.
…taFrame ## What changes were proposed in this pull request? As pointed out in the JIRA, there is a bug which causes an exception to be thrown if `isin` is called with an empty list on a cached DataFrame. The PR fixes it. ## How was this patch tested? Added UT. Author: Marco Gaido <marcogaido91@gmail.com> Closes apache#19494 from mgaido91/SPARK-22249. (cherry picked from commit 8148f19) Signed-off-by: Sean Owen <sowen@cloudera.com>
…n the optimizer ## What changes were proposed in this pull request? This PR addresses the comments by gatorsmile on [the previous PR](apache#19494). ## How was this patch tested? Previous UT and added UT. Author: Marco Gaido <marcogaido91@gmail.com> Closes apache#19522 from mgaido91/SPARK-22249_FOLLOWUP. (cherry picked from commit 1f25d86) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
What changes were proposed in this pull request?
As pointed out in the JIRA, there is a bug which causes an exception to be thrown if
isinis called with an empty list on a cached DataFrame. The PR fixes it.How was this patch tested?
Added UT.