[SPARK-22249][FOLLOWUP][SQL] Check if list of value for IN is empty in the optimizer #19522
case IsNotNull(a: Attribute) => statsFor(a).count - statsFor(a).nullCount > 0

case In(_: AttributeReference, list: Seq[Expression]) if list.isEmpty => Literal.FalseLiteral
// We rely on the optimizations in org.apache.spark.sql.catalyst.optimizer.OptimizeIn
We should not rely on the optimizer to fix bugs.
We need to fix line 107 anyway.
object OptimizeIn extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsDown {
      case expr @ In(v, _) if expr.isListEmpty =>
Update the comment in the rule.
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsDown {
      case expr @ In(v, _) if expr.isListEmpty =>
        If(IsNull(v), Literal.create(null, BooleanType), FalseLiteral)
Use Coalesce?
Sorry, but I don't understand your suggestion: Coalesce returns the first non-null value. Here we should return null when the value is null, and false otherwise. I can't think of a function that does this.
If v is not nullable, we should return false.
True. The current conversion does not help performance. We just need to convert it to false if we know the left side is not nullable.
BTW, we should submit a separate PR for this optimizer change.
We need to backport the fix to 2.2.
But if we don't change the plan here, then maybe it's worth keeping the initial change in buildFilter, so that it returns false there without actually evaluating the filter itself, which is not needed in that case. What do you think?
Should I also create a JIRA for the optimizer change then?
Based on the SQL standard, the original fix is wrong. More importantly, the fix does not bring any noticeable performance improvement, because buildFilter is only used for partition pruning. In the future, we might enhance it for more advanced statistics-based filter inference. For example, foldable expressions could be evaluated earlier, and this code change could cause a regression.
Yes. Please open a new JIRA for the optimizer enhancement.
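The semantics discussed in this thread can be sketched without Spark. The following is a minimal model, not the Catalyst implementation: SQL's three-valued boolean is represented as `Option[Boolean]` (with `None` standing for NULL), and `inEmptyList`, `simplifyInEmptyList`, `Column`, `InList`, and `FalseLit` are hypothetical names chosen for illustration.

```scala
object EmptyInSemantics {
  // SQL three-valued boolean: Some(true), Some(false), or None for NULL.
  type SqlBoolean = Option[Boolean]

  // Mirrors If(IsNull(v), null, false): `v IN ()` is NULL when v is NULL,
  // and false otherwise.
  def inEmptyList[A](value: Option[A]): SqlBoolean =
    if (value.isEmpty) None else Some(false)

  // Tiny expression model (hypothetical; these are not Catalyst classes).
  sealed trait Expr
  final case class Column(name: String, nullable: Boolean) extends Expr
  final case class InList(value: Column, list: Seq[Expr]) extends Expr
  case object FalseLit extends Expr

  // Fold `v IN ()` to false only when the left side is known to be
  // non-nullable; every other expression is returned unchanged.
  def simplifyInEmptyList(expr: Expr): Expr = expr match {
    case InList(v, list) if list.isEmpty && !v.nullable => FalseLit
    case other => other
  }
}
```

Leaving the nullable case untouched follows the point made above: rewriting it into an explicit null check does not make evaluation cheaper, whereas the non-nullable case collapses to a constant.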
case In(_: AttributeReference, list: Seq[Expression]) if list.isEmpty => Literal.FalseLiteral
case In(a: AttributeReference, list: Seq[Expression]) if list.forall(_.isInstanceOf[Literal]) =>
case In(a: AttributeReference, list: Seq[Expression])
Do we still need `a`?
Yes, it is used in the body of the case.
case In(_: AttributeReference, list: Seq[Expression]) if list.isEmpty => Literal.FalseLiteral
case In(a: AttributeReference, list: Seq[Expression]) if list.forall(_.isInstanceOf[Literal]) =>
case In(a: AttributeReference, list: Seq[Expression])
  if list.forall(_.isInstanceOf[Literal]) && list.nonEmpty =>
Could we add a unit test case for buildFilter? You might need a new test suite here.
I will, as soon as we decide which is the right behavior, thanks.
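One plausible reason the `list.nonEmpty` guard in the hunk above matters: the body of that case is not shown here, but assuming it combines one bounds check per literal with `reduce(_ || _)`, an empty list would make `reduce` throw, since Scala's `reduce` has no value to return for an empty collection. The sketch below models this with plain integers; `ColumnStats`, `mightContainUnguarded`, and `mightContain` are hypothetical stand-ins, not Spark classes or methods.

```scala
object PartitionPruningSketch {
  // Hypothetical per-partition statistics for a single integer column.
  final case class ColumnStats(lowerBound: Int, upperBound: Int)

  // Unguarded version: one bounds check per list element, OR-ed together.
  // Calling this with an empty list throws, because reduce needs at least
  // one element.
  def mightContainUnguarded(stats: ColumnStats, list: Seq[Int]): Boolean =
    list.map(l => stats.lowerBound <= l && l <= stats.upperBound).reduce(_ || _)

  // Guarded version: the bounds check only applies to a non-empty list.
  // For an empty list this sketch stays conservative and keeps the partition
  // (an assumption), leaving the IN predicate to be evaluated by the filter.
  def mightContain(stats: ColumnStats, list: Seq[Int]): Boolean =
    if (list.nonEmpty) mightContainUnguarded(stats, list) else true
}
```

A unit test along the lines requested above could assert exactly these three situations: a list with a value inside the bounds, a list entirely outside the bounds, and an empty list that must not raise.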
@gatorsmile, thanks. I updated the PR according to your comments, so it should be OK now. I am creating a new JIRA for the changes to the optimizer.
ok to test
Test build #82873 has finished for PR 19522 at commit
Test build #3952 has finished for PR 19522 at commit
Thanks! Merged to master/2.2
…n the optimizer

## What changes were proposed in this pull request?
This PR addresses the comments by gatorsmile on [the previous PR](#19494).

## How was this patch tested?
Previous UT and added UT.

Author: Marco Gaido <marcogaido91@gmail.com>
Closes #19522 from mgaido91/SPARK-22249_FOLLOWUP.
(cherry picked from commit 1f25d86)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
…ue and empty list

## What changes were proposed in this pull request?
For performance reasons, we should resolve an IN operation on an empty list as false in the optimization phase, as discussed in apache#19522.

## How was this patch tested?
Added UT

cc gatorsmile

Author: Marco Gaido <marcogaido91@gmail.com>
Author: Marco Gaido <mgaido@hortonworks.com>
Closes apache#19523 from mgaido91/SPARK-22301.
What changes were proposed in this pull request?
This PR addresses the comments by @gatorsmile on the previous PR.
How was this patch tested?
Previous UT and added UT.