Skip to content

Conversation

@mgaido91
Copy link
Contributor

What changes were proposed in this pull request?

This PR addresses the comments by @gatorsmile on the previous PR.

How was this patch tested?

Previous UT and added UT.

case IsNotNull(a: Attribute) => statsFor(a).count - statsFor(a).nullCount > 0

case In(_: AttributeReference, list: Seq[Expression]) if list.isEmpty => Literal.FalseLiteral
// We rely on the optimizations in org.apache.spark.sql.catalyst.optimizer.OptimizeIn
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not rely on Optimizer for fixing the bugs.

We need to fix the line 107 anyway.

object OptimizeIn extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case q: LogicalPlan => q transformExpressionsDown {
case expr @ In(v, _) if expr.isListEmpty =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the comment in the rule.

def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case q: LogicalPlan => q transformExpressionsDown {
case expr @ In(v, _) if expr.isListEmpty =>
If(IsNull(v), Literal.create(null, BooleanType), FalseLiteral)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Coalesce?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, but I can't understand your suggestion: Coalesce returns the first non-null value. Here we should return Null when the value is null, false otherwise. I can't think of a function doing this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If v is not nullable, we should return false.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. The current conversion does not help the perf. We just need to convert it to false, if we know the left side is not nullable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, we should submit a separate PR for this optimizer change.

We need to backport the fix to 2.2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if we don't change the plan here, then maybe it's worth to keep the initial change in the buildFilters to return false there without actually evaluating the filter itself, which is not needed in that case. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I also create a JIRA for the optimizer change then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the SQL standard, the original fix is wrong. More importantly, the fix does not bring any noticeable perf improvement, because buildFilter is only used for partition pruning. In the future, we might enhance it for more advanced statistic-based filter inference. For example, foldable expressions can be evaluated earlier and this code change could cause a regression.

Yes. Please open a new JIRA for optimizer enhancement.


case In(_: AttributeReference, list: Seq[Expression]) if list.isEmpty => Literal.FalseLiteral
case In(a: AttributeReference, list: Seq[Expression]) if list.forall(_.isInstanceOf[Literal]) =>
case In(a: AttributeReference, list: Seq[Expression])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need a ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it is used in the body of the case

case In(_: AttributeReference, list: Seq[Expression]) if list.isEmpty => Literal.FalseLiteral
case In(a: AttributeReference, list: Seq[Expression]) if list.forall(_.isInstanceOf[Literal]) =>
case In(a: AttributeReference, list: Seq[Expression])
if list.forall(_.isInstanceOf[Literal]) && list.nonEmpty =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a unit test case for buildFilter? You might need a new test suite here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will, as soon as we decide which is the right behavior, thanks.

@mgaido91
Copy link
Contributor Author

@gatorsmile , thanks, I updated the PR according to your comments. Now it should be ok. I am creating a new JIRA with for the changes to the optimizer. Thanks.

@gatorsmile
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Oct 18, 2017

Test build #82873 has finished for PR 19522 at commit e95bc7b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 18, 2017

Test build #3952 has finished for PR 19522 at commit e95bc7b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Thanks! Merged to master/2.2

asfgit pushed a commit that referenced this pull request Oct 18, 2017
…n the optimizer

## What changes were proposed in this pull request?

This PR addresses the comments by gatorsmile on [the previous PR](#19494).

## How was this patch tested?

Previous UT and added UT.

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #19522 from mgaido91/SPARK-22249_FOLLOWUP.

(cherry picked from commit 1f25d86)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@asfgit asfgit closed this in 1f25d86 Oct 18, 2017
ghost pushed a commit to dbtsai/spark that referenced this pull request Oct 24, 2017
…ue and empty list

## What changes were proposed in this pull request?

For performance reason, we should resolve in operation on an empty list as false in the optimizations phase, ad discussed in apache#19522.

## How was this patch tested?
Added UT

cc gatorsmile

Author: Marco Gaido <marcogaido91@gmail.com>
Author: Marco Gaido <mgaido@hortonworks.com>

Closes apache#19523 from mgaido91/SPARK-22301.
@mgaido91 mgaido91 deleted the SPARK-22249_FOLLOWUP branch November 4, 2017 08:49
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…n the optimizer

## What changes were proposed in this pull request?

This PR addresses the comments by gatorsmile on [the previous PR](apache#19494).

## How was this patch tested?

Previous UT and added UT.

Author: Marco Gaido <marcogaido91@gmail.com>

Closes apache#19522 from mgaido91/SPARK-22249_FOLLOWUP.

(cherry picked from commit 1f25d86)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants