[SPARK-28478][SQL] Remove redundant null checks #27231

davidvrba · 2020-01-16T06:49:00Z

What changes were proposed in this pull request?

This PR removes explicit null checks where they are not needed, in order to simplify the generated code. Here is one example:

Expressions of this type

CASE WHEN isnull(title#5) THEN title#5 ELSE substring(title#5, 0, 3) END

are simplified to

substring(title#5, 0, 3)

if the considered expression is null-intolerant.
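The rewrite described above can be sketched on a toy expression tree. This is only an illustration: the types `Expr`, `Attr`, `Lit`, `IsNull`, `If`, and `Substring` below are hypothetical stand-ins, not Spark's Catalyst classes, and `Substring` is simply assumed to be null-intolerant (null in, null out).

```scala
// Toy expression tree; hypothetical stand-ins for Catalyst expressions.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(value: Any) extends Expr
case class IsNull(child: Expr) extends Expr
case class If(cond: Expr, thenExpr: Expr, elseExpr: Expr) extends Expr
// Assumed null-intolerant: evaluates to null whenever any argument is null.
case class Substring(str: Expr, pos: Expr, len: Expr) extends Expr

object RemoveRedundantNullCheck {
  // The null check is redundant when the "is null" branch yields null anyway
  // (either the checked expression itself or a null literal) and the other
  // branch is a null-intolerant node with the checked expression as a direct child.
  private def isRedundant(ifNull: Expr, ifNotNull: Expr, checked: Expr): Boolean = {
    val nullBranchIsNull = ifNull == checked || ifNull == Lit(null)
    val intolerantOverChecked = ifNotNull match {
      case Substring(s, p, l) => Seq(s, p, l).contains(checked)
      case _ => false
    }
    nullBranchIsNull && intolerantOverChecked
  }

  def simplify(e: Expr): Expr = e match {
    case If(IsNull(c), ifNull, ifNotNull) if isRedundant(ifNull, ifNotNull, c) => ifNotNull
    case other => other
  }
}
```

With `title = Attr("title")`, calling `simplify` on `If(IsNull(title), title, Substring(title, Lit(0), Lit(3)))` returns the bare `Substring`, while an `If` whose null branch is some other value is left untouched.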

Why are the changes needed?

It simplifies expressions in the query plan, which can enable further optimization through simpler generated code.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New tests are added.

davidvrba · 2020-01-16T08:32:37Z

@cloud-fan kindly asking for review. Thanks for the help.

cloud-fan · 2020-01-16T08:49:53Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

+  }
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case i @ If(predicate, trueValue, falseValue) => predicate match {


We have a rule NullPropagation to optimize IsNull and IsNotNull to literals, and also a rule SimplifyConditionals to optimize If and CaseWhen if the condition is literal.

I'm curious about why they don't work and we need this extra rule.

The NullPropagation rule simplifies expressions to literals, but I feel that my PR covers a slightly different case: here the expression being null-checked is in general not a Literal and cannot be converted to one.
However, I can also see that the logic of my rule could be moved to SimplifyConditionals, so I can move it there if that is the preferred way.

yes please.

OK, it has been moved.

cloud-fan · 2020-01-17T12:19:41Z

ok to test

SparkQA · 2020-01-17T16:42:00Z

Test build #116942 has finished for PR 27231 at commit de84684.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-01-20T04:53:37Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

+      (ifNullExpr == checkedExpr || ifNullExpr == Literal.create(null, checkedExpr.dataType))
+      && e.children.contains(checkedExpr)) => true
+    case _ => false
+  }


nit: How about this style?

private def isRedundantNullCheck(
    ifNullExpr: Expression,
    ifNotNullExpr: Expression,
    checkedExpr: Expression): Boolean = {
  ifNotNullExpr.isInstanceOf[NullIntolerant] && {
    (ifNullExpr == checkedExpr || ifNullExpr == Literal.create(null, checkedExpr.dataType)) &&
      ifNotNullExpr.children.contains(checkedExpr)
  }
}

The first condition ifNullExpr == checkedExpr -> ifNullExpr.semanticEquals(checkedExpr)? e.g., if isnull(a + b) b + a else xxx

The second condition ifNullExpr == Literal.create(null, checkedExpr.dataType) -> ifNullExpr.foldable && ifNullExpr.eval() == null?

Makes sense.

Can you generalize the last condition more? e.g., how about the case, substring(other_func(title#5), 0, 3) in the example you described?

yes, that should be possible.
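The difference between `==` and a semantic comparison, as in the `isnull(a + b)` / `b + a` example above, can be illustrated with a toy model. The `Expr`, `Attr`, and `Add` types and the `canonicalize` helper are hypothetical; they only mimic the idea of comparing canonical forms and make no claim about how Catalyst canonicalizes expressions.

```scala
// Toy expression tree; hypothetical, not Catalyst.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

object Canonical {
  // Simplistic canonical form: order the operands of the commutative Add
  // by their string representation. Nested additions are not flattened.
  def canonicalize(e: Expr): Expr = e match {
    case Add(l, r) =>
      val ops = Seq(canonicalize(l), canonicalize(r)).sortBy(_.toString)
      Add(ops(0), ops(1))
    case other => other
  }

  // Two expressions are semantically equal when their canonical forms match.
  def semanticEquals(x: Expr, y: Expr): Boolean =
    canonicalize(x) == canonicalize(y)
}
```

In this sketch `Add(Attr("a"), Attr("b"))` and `Add(Attr("b"), Attr("a"))` are structurally different, so `==` is false, yet `Canonical.semanticEquals` treats them as the same expression.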

maropu · 2020-01-20T04:54:05Z

...talyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalSuite.scala

-import org.apache.spark.sql.catalyst.plans.PlanTest
-import org.apache.spark.sql.catalyst.plans.logical._
-import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.plans.{PlanTest}


nit: {PlanTest} -> PlanTest

SparkQA · 2020-01-20T08:05:02Z

Test build #117088 has finished for PR 27231 at commit 81354dd.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-01-21T01:15:17Z

retest this please

maropu · 2020-01-21T01:31:12Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

+        case IsNull(child) if isRedundantNullCheck(trueValue, falseValue, child) => falseValue
+        case IsNotNull(child) if isRedundantNullCheck(falseValue, trueValue, child) => trueValue
+        case _ => i
+      }


How about this format?

// If the null-check is redundant, remove it
case If(IsNull(child), trueValue, falseValue)
    if isRedundantNullCheck(trueValue, falseValue, child) => falseValue
case If(IsNotNull(child), trueValue, falseValue)
    if isRedundantNullCheck(falseValue, trueValue, child) => trueValue

Why did you add the inner pattern matching (cond match { )? I think it's better to avoid unnecessary pattern matching (in the current fix, all the cases for If exprs can be matched in line 466).

I see. You are right, I do not need the inner pattern match. I will fix that.

maropu · 2020-01-21T02:06:33Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

+            elseValue.getOrElse(Literal.create(null, child.dataType)),
+            child) => elseValue.getOrElse(Literal.create(null, child.dataType))
+          case _ => e
+        }


How about this?

// remove redundant null checks for CaseWhen with one branch
case CaseWhen(Seq((IsNotNull(child), trueValue)), Some(falseValue))
    if isRedundantNullCheck(falseValue, trueValue, child) => trueValue
case CaseWhen(Seq((IsNull(child), trueValue)), Some(falseValue))
    if isRedundantNullCheck(trueValue, falseValue, child) => falseValue
case CaseWhen(Seq((IsNotNull(child), trueValue)), None)
    if isRedundantNullCheck(Literal.create(null, child.dataType), trueValue, child) => trueValue
case e @ CaseWhen(Seq((IsNull(child), trueValue)), None) =>
  val nullValue = Literal.create(null, child.dataType)
  if (isRedundantNullCheck(trueValue, nullValue, child)) {
    nullValue
  } else {
    e
  }

maropu · 2020-01-21T02:08:58Z

...talyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalSuite.scala

-    val actual = Optimize.execute(Project(Alias(e1, "out")() :: Nil, OneRowRelation()).analyze)
+    val correctAnswer = Project(Alias(e2, "out")() :: Nil, LocalRelation('a.int)).analyze
+    val actual = Optimize.execute(
+      Project(Alias(e1, "out")() :: Nil, LocalRelation('a.int)).analyze)


nit: you don't need to break this line.

maropu · 2020-01-21T02:09:31Z

...talyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalSuite.scala

+import org.apache.spark.sql.catalyst.rules.RuleExecutor
 import org.apache.spark.sql.types.{IntegerType, NullType}

-


nit: You need to avoid unnecessary changes like this.

maropu · 2020-01-21T02:17:45Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

+        // remove redundant null checks for CaseWhen with one branch
+        branches(0)._1 match {
+          case IsNotNull(child) if isRedundantNullCheck(
+            elseValue.getOrElse(Literal.create(null, child.dataType)),


child.dataType -> e.dataType?

maropu · 2020-01-21T02:20:14Z

...talyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalSuite.scala

-
-  val isNotNullCond = IsNotNull(UnresolvedAttribute(Seq("a")))
-  val isNullCond = IsNull(UnresolvedAttribute("b"))
+  private val nullValue = Literal.create(null, IntegerType)


Why did you change from NullType to IntegerType here?

I need the same dataType as the one I have for the a attribute. But I can just add another nullValue to the test and keep the previous one with the original dataType.

Yeah, I think it's better to avoid behaviour changes in the existing tests.

maropu · 2020-01-21T02:21:39Z

...talyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalSuite.scala

    assertEquivalent(
      CaseWhen((isNotNullCond, Subtract(Literal(3), Literal(2))) ::
-        (isNullCond, Literal(1)) ::
+        (isNullCondB, Literal(1)) ::


Please avoid changing the existing tests where possible.

OK, I will try to avoid that.

SparkQA · 2020-01-21T05:21:31Z

Test build #117143 has finished for PR 27231 at commit 81354dd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-22T00:42:58Z

Test build #117200 has finished for PR 27231 at commit 956413f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davidvrba · 2020-01-26T17:07:25Z

@cloud-fan @maropu do you have any more suggestions, comments, or recommendations for this PR? In the last commit I added a bit of generalization to include cases where the null-checked column is not necessarily a direct child of the ifNotNullExpr.

maropu · 2020-01-28T00:52:21Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

+      ifNotNullExpr: Expression,
+      checkedExpr: Expression): Boolean = {
+    val isNullIntolerant = ifNotNullExpr.find { x =>
+      !x.isInstanceOf[NullIntolerant] && x.find(e => e.semanticEquals(checkedExpr)).nonEmpty


The same logic? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala#L105-L109

Actually, I think we need slightly different logic. Consider these two examples, where x is the null-checked column:

1. substring(x, coalesce(a, b), c)

2. substring(coalesce(x, d), a, c)

In case 1 the expression must be treated as null-intolerant (even though coalesce is null-tolerant): if x is null, the substring evaluates to null no matter what the other children are. In case 2 it must be treated as null-tolerant, and we cannot replace the substring with a null value. So we need to check the expression with respect to the position of x (the column that is being null-checked). Does that make sense?
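The position-sensitive check for substring(x, coalesce(a, b), c) versus substring(coalesce(x, d), a, c) can be sketched on a toy tree. Everything here is an assumption made for illustration: the `Expr` types are hypothetical, `Substring` is taken to be null-intolerant and `Coalesce` null-tolerant, and `isNullIntolerantWrt` is not the code from this PR.

```scala
// Toy expression tree; hypothetical, not Catalyst.
sealed trait Expr { def children: Seq[Expr] }
case class Attr(name: String) extends Expr { def children: Seq[Expr] = Nil }
// Assumed null-intolerant: null in any argument yields null.
case class Substring(str: Expr, pos: Expr, len: Expr) extends Expr {
  def children: Seq[Expr] = Seq(str, pos, len)
}
// Assumed null-tolerant: returns the first non-null argument.
case class Coalesce(first: Expr, second: Expr) extends Expr {
  def children: Seq[Expr] = Seq(first, second)
}

object NullIntolerance {
  // True when `target` occurs anywhere in the subtree rooted at `e`.
  def contains(e: Expr, target: Expr): Boolean =
    e == target || e.children.exists(contains(_, target))

  // True when a null `checked` cannot be absorbed inside `e`: every path from
  // `e` down to `checked` passes only through null-intolerant nodes.
  // Subtrees that do not mention `checked` are irrelevant and pass trivially.
  def isNullIntolerantWrt(e: Expr, checked: Expr): Boolean = e match {
    case _ if e == checked => true
    case s: Substring => s.children.forall(isNullIntolerantWrt(_, checked))
    case _ => !contains(e, checked)
  }
}
```

Under these assumptions the first expression passes the check for x, because the coalesce subtree does not contain x, while the second fails, because x sits underneath a null-tolerant coalesce.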

Probably you meant FilterExec.isNullIntolerant(ifNotNullExpr) || additional checks for the case of null-tolerant exprs inside ifNotNullExpr? (FilterExec.isNullIntolerant is private, though...)

Yeah, FilterExec.isNullIntolerant(ifNotNullExpr) is a stronger condition than we need, so if there is a null-tolerant expr inside, we need to check whether the null-checked column is in its subtree. Using the logic from FilterExec.isNullIntolerant, the function could look like this:

def isNullIntolerant(expr: Expression): Boolean = expr match {
  case e: NullIntolerant => e.children.forall(isNullIntolerant)
  case e if e.find(x => x.semanticEquals(checkedExpr)).isEmpty => true
  case _ => false
}

Ah, I see. For better code readability, could you split the condition into the two parts as I suggested above? Also, I think it's better to leave some comments about why we need more checks there.

I agree that the committed code is not very intuitive. Here is a version that seems more readable (I also added some comments):

private def isRedundantNullCheck(
    ifNullExpr: Expression,
    ifNotNullExpr: Expression,
    checkedExpr: Expression): Boolean = {
  // checks if expr is null-intolerant with respect to checkedExpr
  def isNullIntolerant(expr: Expression): Boolean = expr match {
    case e: NullIntolerant => e.children.forall(isNullIntolerant)
    // if some child is null-tolerant but the checkedExpr is not in its subtree
    // we can still consider the whole expr as null-intolerant
    // with respect to checkedExpr
    case e if e.find(x => x.semanticEquals(checkedExpr)).isEmpty => true
    case _ => false
  }

  isNullIntolerant(ifNotNullExpr) && {
    (ifNullExpr.semanticEquals(checkedExpr) ||
      (ifNullExpr.foldable && ifNullExpr.eval() == null)) &&
    // we still need to make sure that checkedExpr is inside ifNotNullExpr
    ifNotNullExpr.find(x => x.semanticEquals(checkedExpr)).nonEmpty
  }
}

But I am not sure if this is what you had in mind when suggesting to split the condition. Can you think of a better way to compose this?

Hmmm, that still looks complicated. If we cannot avoid the complexity for the stronger condition, then as another option I think we can cover only the simple case (FilterExec.isNullIntolerant(ifNotNullExpr)) in this PR. If necessary, we might be able to optimize the condition in future work. I think keeping the code simple is more important. WDYT?

Well, the thing is that if we use the simple version with FilterExec.isNullIntolerant(ifNotNullExpr), we will lose (because of the recursive check) all expressions that contain literals (because literals are null-tolerant). For example, expressions like substring(title#5, 0, 3) will not be covered by the optimization (which is what the JIRA targeted in the first place). So I suggest one of these two options:

1. Use the complex version of the code and thus include more expressions in the optimization.

2. Keep the code simpler and use the original version from before the generalization step, i.e.

private def isRedundantNullCheck(
    ifNullExpr: Expression,
    ifNotNullExpr: Expression,
    checkedExpr: Expression): Boolean = {
  ifNotNullExpr.isInstanceOf[NullIntolerant] && {
    (ifNullExpr == checkedExpr || ifNullExpr == Literal.create(null, checkedExpr.dataType)) &&
      ifNotNullExpr.children.contains(checkedExpr)
  }
}

where checkedExpr must be a direct child, and thus we don't have to check the whole subtree for null-intolerance (so expressions that have Literals in the subtree are still included).
I am fine with either of these two options. What do you think?
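The trade-off between the strict recursive check and the direct-child check can be sketched as follows. This is a toy model under stated assumptions: the `Expr` types and the `marked` helper are hypothetical, attributes and `Substring` are treated as null-intolerant, and literals are treated as not null-intolerant, as described in the comment above.

```scala
// Toy expression tree; hypothetical, not Catalyst.
sealed trait Expr { def children: Seq[Expr] }
case class Attr(name: String) extends Expr { def children: Seq[Expr] = Nil }
case class Lit(value: Any) extends Expr { def children: Seq[Expr] = Nil }
case class Substring(str: Expr, pos: Expr, len: Expr) extends Expr {
  def children: Seq[Expr] = Seq(str, pos, len)
}

object Checks {
  // Which toy nodes count as null-intolerant (an assumption of this sketch).
  // Literals are deliberately excluded.
  private def marked(e: Expr): Boolean = e match {
    case _: Attr | _: Substring => true
    case _: Lit => false
  }

  // Strict version: the node and every node in its subtree must be marked.
  def strict(e: Expr): Boolean = marked(e) && e.children.forall(strict)

  // Direct-child version: only the top node must be marked, and the
  // null-checked expression must be one of its direct children.
  def directChild(e: Expr, checked: Expr): Boolean =
    marked(e) && e.children.contains(checked)
}
```

For `Substring(Attr("title"), Lit(0), Lit(3))` the strict check fails because of the two literals, whereas the direct-child check succeeds for `Attr("title")`, which is the behaviour option 2 relies on.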

maropu · 2020-01-28T00:57:02Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

+        case IsNull(child) if isRedundantNullCheck(trueValue, falseValue, child) => falseValue
+        case IsNotNull(child) if isRedundantNullCheck(falseValue, trueValue, child) => trueValue
+        case _ => i
+      }


Why did you add the inner pattern matching (cond match { )? I think it's better to avoid unnecessary pattern matching (in the current fix, all the cases for If exprs can be matched in line 466).

github-actions · 2020-05-21T00:13:34Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

spark-28478 remove unnecessary null checks

5d5a184

cloud-fan reviewed Jan 16, 2020

View reviewed changes

spark-28478 logic moved to SimplifyConditionals

de84684

dongjoon-hyun added the SQL label Jan 20, 2020

dongjoon-hyun changed the title ~~[SPARK-28478] [SQL] Remove redundant null checks~~ [SPARK-28478][SQL] Remove redundant null checks Jan 20, 2020

maropu reviewed Jan 20, 2020

View reviewed changes

spark-28478 small changes after CR

81354dd

maropu reviewed Jan 21, 2020

View reviewed changes

spark-28478 generalize for more complex expressions

956413f

maropu reviewed Jan 28, 2020

View reviewed changes

dongjoon-hyun added OPTIMIZER and removed OPTIMIZER labels Feb 5, 2020

github-actions bot added the Stale label May 21, 2020

github-actions bot closed this May 22, 2020

maropu mentioned this pull request Aug 28, 2020

[SPARK-32721][SQL] Simplify if clauses with null and boolean #29567

Closed
