[SPARK-22249][FOLLOWUP][SQL] Check if list of value for IN is empty in the optimizer #19522

mgaido91 · 2017-10-17T21:47:44Z

What changes were proposed in this pull request?

This PR addresses the comments by @gatorsmile on the previous PR (#19494).

How was this patch tested?

Existing unit tests, plus newly added ones.


gatorsmile · 2017-10-17T21:55:39Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala

    case IsNotNull(a: Attribute) => statsFor(a).count - statsFor(a).nullCount > 0

-    case In(_: AttributeReference, list: Seq[Expression]) if list.isEmpty => Literal.FalseLiteral
+    // We rely on the optimizations in org.apache.spark.sql.catalyst.optimizer.OptimizeIn


We should not rely on the optimizer to fix bugs.

We need to fix line 107 anyway.

gatorsmile · 2017-10-17T21:56:04Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

 object OptimizeIn extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsDown {
+      case expr @ In(v, _) if expr.isListEmpty =>


Update the comment in the rule.

gatorsmile · 2017-10-17T22:01:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsDown {
+      case expr @ In(v, _) if expr.isListEmpty =>
+        If(IsNull(v), Literal.create(null, BooleanType), FalseLiteral)


Use Coalesce?

Sorry, but I don't understand your suggestion: Coalesce returns the first non-null value. Here we should return null when the value is null, and false otherwise. I can't think of a function that does this.

If v is not nullable, we should return false.

True. The current conversion does not help performance. We just need to convert it to false when we know the left side is not nullable.
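The semantics being debated here can be sketched in plain Scala. This is only an illustration of the behavior described in this thread, not Spark code: `None` stands in for SQL NULL, and the names `EmptyInSemantics`, `inEmptyList`, and `canRewriteToFalse` are hypothetical.

```scala
// Illustration only: models the semantics discussed in this thread,
// using Option[Boolean] where None stands for SQL NULL.
object EmptyInSemantics {

  /** Result of `v IN ()` as described above: NULL for a NULL value, false otherwise.
    * This mirrors If(IsNull(v), Literal.create(null, BooleanType), FalseLiteral). */
  def inEmptyList[A](v: Option[A]): Option[Boolean] = v match {
    case None    => None        // NULL value: the result is NULL
    case Some(_) => Some(false) // non-NULL value: nothing can match an empty list
  }

  /** Rewriting the whole expression to a constant false is only safe
    * when the left-hand side can never be NULL. */
  def canRewriteToFalse(nullable: Boolean): Boolean = !nullable
}
```

For a nullable column the If(...) form still has to inspect every value, which is why it brings no performance benefit; only the non-nullable case collapses to a constant.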

BTW, we should submit a separate PR for this optimizer change.

We need to backport the fix to 2.2

But if we don't change the plan here, then maybe it's worth keeping the initial change in buildFilters, so that it returns false there without actually evaluating the filter itself, which is not needed in that case. What do you think?

Should I also create a JIRA for the optimizer change then?

According to the SQL standard, the original fix is wrong. More importantly, the fix does not bring any noticeable performance improvement, because buildFilter is only used for partition pruning. In the future, we might enhance it for more advanced statistics-based filter inference. For example, foldable expressions could be evaluated earlier, and this code change could then cause a regression.

Yes. Please open a new JIRA for optimizer enhancement.

gatorsmile · 2017-10-17T22:02:16Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala


-    case In(_: AttributeReference, list: Seq[Expression]) if list.isEmpty => Literal.FalseLiteral
-    case In(a: AttributeReference, list: Seq[Expression]) if list.forall(_.isInstanceOf[Literal]) =>
+    case In(a: AttributeReference, list: Seq[Expression])


Do we still need `a`?

Yes, it is used in the body of the case.

gatorsmile · 2017-10-17T22:07:56Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala

-    case In(_: AttributeReference, list: Seq[Expression]) if list.isEmpty => Literal.FalseLiteral
-    case In(a: AttributeReference, list: Seq[Expression]) if list.forall(_.isInstanceOf[Literal]) =>
+    case In(a: AttributeReference, list: Seq[Expression])
+      if list.forall(_.isInstanceOf[Literal]) && list.nonEmpty =>


Could we add a unit test case for buildFilter? You might need a new test suite here.

I will, as soon as we decide which is the right behavior, thanks.
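Why the `list.nonEmpty` guard matters can be shown with a small standalone sketch. It assumes that the case body combines one bounds check per literal using `reduce`, which the diff hunk above does not show; `ColumnStats` and `mightContain` are hypothetical stand-ins, not the actual InMemoryTableScanExec code.

```scala
import scala.util.Try

// Illustration only: a simplified stand-in for statistics-based batch pruning.
object BuildFilterGuard {

  // Hypothetical per-batch statistics for a single integer column.
  final case class ColumnStats(lowerBound: Int, upperBound: Int)

  // reduce on an empty collection throws, so an unguarded empty IN list would fail.
  val emptyReduceFails: Boolean =
    Try(Seq.empty[Boolean].reduce(_ || _)).isFailure

  /** Some(true/false) when the statistics allow a pruning decision for `col IN (values)`;
    * None when the list is empty and the case should simply not match. */
  def mightContain(stats: ColumnStats, values: Seq[Int]): Option[Boolean] =
    if (values.nonEmpty)
      Some(values.map(v => stats.lowerBound <= v && v <= stats.upperBound).reduce(_ || _))
    else
      None // no decision here; the predicate is evaluated normally later
}
```

With the guard in place, an empty list falls through to the default handling instead of producing a constant, which keeps the NULL semantics discussed in the other thread intact.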


mgaido91 · 2017-10-18T00:15:34Z

@gatorsmile, thanks, I updated the PR according to your comments. It should be OK now. I am creating a new JIRA for the changes to the optimizer. Thanks.

gatorsmile · 2017-10-18T05:28:37Z

ok to test

SparkQA · 2017-10-18T07:05:02Z

Test build #82873 has finished for PR 19522 at commit e95bc7b.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-10-18T10:00:59Z

Test build #3952 has finished for PR 19522 at commit e95bc7b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-10-18T16:15:12Z

Thanks! Merged to master/2.2

…n the optimizer

## What changes were proposed in this pull request?

This PR addresses the comments by gatorsmile on [the previous PR](#19494).

## How was this patch tested?

Previous UT and added UT.

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #19522 from mgaido91/SPARK-22249_FOLLOWUP.

(cherry picked from commit 1f25d86)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

…ue and empty list

## What changes were proposed in this pull request?

For performance reasons, we should resolve an In operation on an empty list as false in the optimization phase, as discussed in apache#19522.

## How was this patch tested?

Added UT

cc gatorsmile

Author: Marco Gaido <marcogaido91@gmail.com>
Author: Marco Gaido <mgaido@hortonworks.com>

Closes apache#19523 from mgaido91/SPARK-22301.


[SPARK-22249][FOLLOWUP][SQL] Check if list of value for IN is empty i…

2327ad1

…n the optimizer

mgaido91 mentioned this pull request Oct 17, 2017

[SPARK-22249][SQL] isin with empty list throws exception on cached DataFrame #19494

Closed

gatorsmile reviewed Oct 17, 2017


mgaido91 added 2 commits October 17, 2017 23:58

add check to list in InMemoryTableScanExec

ac81901

fix comment

c990323

gatorsmile reviewed Oct 17, 2017


mgaido91 added 3 commits October 18, 2017 00:53

optimizein only when attribute is not nullable

55d84e6

add UT to check buildFilters behavior

8594231

Revert "[SPARK-22249][FOLLOWUP][SQL] Check if list of value for IN is…

e95bc7b

… empty in the optimizer"

mgaido91 mentioned this pull request Oct 18, 2017

[SPARK-22301][SQL] Add rule to Optimizer for In with not-nullable value and empty list #19523

Closed

asfgit closed this in 1f25d86 Oct 18, 2017

mgaido91 deleted the SPARK-22249_FOLLOWUP branch November 4, 2017 08:49
