Skip to content

Conversation

@davidvrba
Copy link
Contributor

What changes were proposed in this pull request?

The purpose of this pr is to remove explicit null checks if they are not needed in order to simplify the generated code. Here is one example:

Expressions of this type

CASE WHEN isnull(title#5) THEN title#5 ELSE substring(title#5, 0, 3) END

are simplified to

substring(title#5, 0, 3)

if the considered expression is null-intolerant.

Why are the changes needed?

It simplifies expressions in the query plan which leads to potential optimization due to simplified codegen.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New tests are added.

@davidvrba
Copy link
Contributor Author

@cloud-fan kindly asking for review. Thanks for the help.

}

def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
case i @ If(predicate, trueValue, falseValue) => predicate match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a rule NullPropagation to optimize IsNull and IsNotNull to literals, and also a rule SimplifyConditionals to optimize If and CaseWhen if the condition is literal.

I'm curious about why they don't work and we need this extra rule.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NullPropagation rule simplifies expressions to literals. But i feel that my pr is covering slightly different case. Here the expression that is being null-checked is in general not Literal and can not be converted to Literal (in general).
However I can also see that the logic of my rule can be moved to SimplifyConditionals, so I can move it there if this is the preferred way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, it is moved

@cloud-fan
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented Jan 17, 2020

Test build #116942 has finished for PR 27231 at commit de84684.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28478] [SQL] Remove redundant null checks [SPARK-28478][SQL] Remove redundant null checks Jan 20, 2020
(ifNullExpr == checkedExpr || ifNullExpr == Literal.create(null, checkedExpr.dataType))
&& e.children.contains(checkedExpr)) => true
case _ => false
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: How about this style?

  private def isRedundantNullCheck(
      ifNullExpr: Expression,
      ifNotNullExpr: Expression,
      checkedExpr: Expression): Boolean = {
    ifNotNullExpr.isInstanceOf[NullIntolerant] && {
      (ifNullExpr == checkedExpr || ifNullExpr == Literal.create(null, checkedExpr.dataType)) &&
        ifNotNullExpr.children.contains(checkedExpr)
    }
  }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first condition ifNullExpr == checkedExpr -> ifNullExpr.semanticEquals(checkedExpr)? e.g., if isnull(a + b) b + a else xxx

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second condition ifNullExpr == Literal.create(null, checkedExpr.dataType) -> ifNullExpr.foldable && ifNullExpr.eval() == null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you generalize the last condition more? e.g., how about the case, substring(other_func(title#5), 0, 3) in the example you described?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that should be possible.

import org.apache.spark.sql.catalyst.plans.PlanTest
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.rules._
import org.apache.spark.sql.catalyst.plans.{PlanTest}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: {PlanTest} -> PlanTest

@SparkQA
Copy link

SparkQA commented Jan 20, 2020

Test build #117088 has finished for PR 27231 at commit 81354dd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Jan 21, 2020

retest this please

case IsNull(child) if isRedundantNullCheck(trueValue, falseValue, child) => falseValue
case IsNotNull(child) if isRedundantNullCheck(falseValue, trueValue, child) => trueValue
case _ => i
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this format?;

      // If the null-check is redundant, remove it
      case If(IsNull(child), trueValue, falseValue)
        if isRedundantNullCheck(trueValue, falseValue, child) => falseValue
      case If(IsNotNull(child), trueValue, falseValue)
        if isRedundantNullCheck(falseValue, trueValue, child) => trueValue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you add the inner pattern-matching (cond match { )? I think its better to avoid unnecessary pattern matching (In the current fix, all the cases for If exprs can be matched in the line 466).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. You are right, i do not need the inner pattern match, i will fix that.

elseValue.getOrElse(Literal.create(null, child.dataType)),
child) => elseValue.getOrElse(Literal.create(null, child.dataType))
case _ => e
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this?

      // remove redundant null checks for CaseWhen with one branch
      case CaseWhen(Seq((IsNotNull(child), trueValue)), Some(falseValue))
        if isRedundantNullCheck(falseValue, trueValue, child) => trueValue
      case CaseWhen(Seq((IsNull(child), trueValue)), Some(falseValue))
        if isRedundantNullCheck(trueValue, falseValue, child) => falseValue
      case CaseWhen(Seq((IsNotNull(child), trueValue)), None)
        if isRedundantNullCheck(Literal.create(null, child.dataType), trueValue, child) => trueValue
      case e @ CaseWhen(Seq((IsNull(child), trueValue)), None) =>
        val nullValue = Literal.create(null, child.dataType)
        if (isRedundantNullCheck(trueValue, nullValue, child)) {
          nullValue
        } else {
          e
        }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright

val actual = Optimize.execute(Project(Alias(e1, "out")() :: Nil, OneRowRelation()).analyze)
val correctAnswer = Project(Alias(e2, "out")() :: Nil, LocalRelation('a.int)).analyze
val actual = Optimize.execute(
Project(Alias(e1, "out")() :: Nil, LocalRelation('a.int)).analyze)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you don't need to break this line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see

import org.apache.spark.sql.catalyst.rules.RuleExecutor
import org.apache.spark.sql.types.{IntegerType, NullType}


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: You need to avoid unnecessary changes like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

// remove redundant null checks for CaseWhen with one branch
branches(0)._1 match {
case IsNotNull(child) if isRedundantNullCheck(
elseValue.getOrElse(Literal.create(null, child.dataType)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

child.dataType -> e.dataType?


val isNotNullCond = IsNotNull(UnresolvedAttribute(Seq("a")))
val isNullCond = IsNull(UnresolvedAttribute("b"))
private val nullValue = Literal.create(null, IntegerType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change from NullType to IntegerType here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need the same dataType as i have for the a attribute. But i can just add another nullValue to the test and keep the previous with the original dataType.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I think its better to avoid the behaviour changes in the existing tests.

assertEquivalent(
CaseWhen((isNotNullCond, Subtract(Literal(3), Literal(2))) ::
(isNullCond, Literal(1)) ::
(isNullCondB, Literal(1)) ::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to change the existing tests where possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, i will try to avoid that.

@SparkQA
Copy link

SparkQA commented Jan 21, 2020

Test build #117143 has finished for PR 27231 at commit 81354dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 22, 2020

Test build #117200 has finished for PR 27231 at commit 956413f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davidvrba
Copy link
Contributor Author

@cloud-fan @maropu do you have any more suggestions / comments / recommendations to this pr? In the last commit i added a bit of generalization to include cases where the null-checked column is not necessarily a direct child of the ifNotNullExpr.

ifNotNullExpr: Expression,
checkedExpr: Expression): Boolean = {
val isNullIntolerant = ifNotNullExpr.find { x =>
!x.isInstanceOf[NullIntolerant] && x.find(e => e.semanticEquals(checkedExpr)).nonEmpty
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually i think we need slightly different logic. Consider these two examples where x will be the null-checked column:

  1. substring(x, coalesce(a, b), c)

  2. substring(coalesce(x, d), a, c)

For 1. we need to be null-intolerant (even though coalesce is null-tolerant), so if x is null, we replace the substring with null value no matter what are the other children. For 2. we need to be null-tolerant and we will not replace the substring by null value. So we need to check the expression with respect to the position of x (the column that is being null-checked). Does it make sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, you meant FiterExec.isNullIntolerant(ifNotNullExpr) || additional checks for the case having null-tolerant exprs inside ifNotNullExpr? (FiterExec.isNullIntolerant is private though...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, FilterExec.isNullIntolerant(ifNotNullExpr) is a stronger condition than we need so in case there is null-tolerant expr inside we need to check if the null-checked column is in its subtree. Using the logic from FilterExec.isNullIntolerant the function could look like this:

def isNullIntolerant(expr: Expression): Boolean = expr match {
  case e: NullIntolerant => e.children.forall(isNullIntolerant)
  case e if e.find(x => x.semanticEquals(checkedExpr)).isEmpty => true
  case _ => false
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. For better code readability, could you split the condition into the two parts as I suggested above? Also, I think its better to leave some comments about why we need more checks there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the committed code is not very intuitive so i can think of this way which seems to be more readable (added also some comments):

private def isRedundantNullCheck(
    ifNullExpr: Expression,
    ifNotNullExpr: Expression,
    checkedExpr: Expression): Boolean = {
  
  // checks if expr is null-intolerant with respect to checkedExpr
  def isNullIntolerant(expr: Expression): Boolean = expr match {
    case e: NullIntolerant => e.children.forall(isNullIntolerant)
    // if some child is null-tolerant but the checkedEpxr is not in its subtree
    // we can still consider the whole expr as null-intolerant 
    // with respect to checkedExpr
    case e if e.find(x => x.semanticEquals(checkedExpr)).isEmpty => true
    case _ => false
  }
  
  isNullIntolerant(ifNotNullExpr) && {
    (ifNullExpr.semanticEquals(checkedExpr) ||
      (ifNullExpr.foldable && ifNullExpr.eval() == null)) &&
      // we still need to make sure that checkedExpr is inside ifNotNullExpr
      ifNotNullExpr.find(x => x.semanticEquals(checkedExpr)).nonEmpty
  }
}

But not sure if this is what you had in mind when suggesting to split the condition. Can you think of a better way how to compose this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, that still looks complicated.. If we cannot avoid the complexity for the stronger condition, as another option, I think we can cover the simple case (FiterExec.isNullIntolerant(ifNotNullExpr)) only in this pr. If necessary, we might be able to optimize the condition in future work. I think keeping the code simple is more important. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the think is that if we use the simple version with FilterExec.isNullIntolerant(ifNutNullExpr) we will loose (because of the recursive check) all expressions that contain literals (because literals are null-tolerant), so for example expressions like this substring(title#5, 0, 3) will not be included in the optimization (which the jira was targeted for in the first place). So I suggest one of these 2 options:

  1. Use the complex version of the code and thus include more expressions in the optimization
  2. Have the code more simple and use the original version before the generalization step, i.e.
private def isRedundantNullCheck(
    ifNullExpr: Expression,
    ifNotNullExpr: Expression,
    checkedExpr: Expression): Boolean = {
  ifNotNullExpr.isInstanceOf[NullIntolerant] && {
    (ifNullExpr == checkedExpr || ifNullExpr == Literal.create(null, checkedExpr.dataType)) &&
      ifNotNullExpr.children.contains(checkedExpr)
  }
}

where checkedExpr must be direct child and thus we don't have to check the whole subtree for null-intolerance (so expressions that have Literals in the subtree are still included).
I am fine with either of these 2 options. What do you think?

case IsNull(child) if isRedundantNullCheck(trueValue, falseValue, child) => falseValue
case IsNotNull(child) if isRedundantNullCheck(falseValue, trueValue, child) => trueValue
case _ => i
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you add the inner pattern-matching (cond match { )? I think its better to avoid unnecessary pattern matching (In the current fix, all the cases for If exprs can be matched in the line 466).

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants