-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-28478][SQL] Remove redundant null checks #27231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-28478][SQL] Remove redundant null checks #27231
Conversation
|
@cloud-fan kindly asking for review. Thanks for the help. |
| } | ||
|
|
||
| def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions { | ||
| case i @ If(predicate, trueValue, falseValue) => predicate match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a rule NullPropagation to optimize IsNull and IsNotNull to literals, and also a rule SimplifyConditionals to optimize If and CaseWhen if the condition is literal.
I'm curious about why they don't work and we need this extra rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NullPropagation rule simplifies expressions to literals. But i feel that my pr is covering slightly different case. Here the expression that is being null-checked is in general not Literal and can not be converted to Literal (in general).
However I can also see that the logic of my rule can be moved to SimplifyConditionals, so I can move it there if this is the preferred way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, it is moved
|
ok to test |
|
Test build #116942 has finished for PR 27231 at commit
|
| (ifNullExpr == checkedExpr || ifNullExpr == Literal.create(null, checkedExpr.dataType)) | ||
| && e.children.contains(checkedExpr)) => true | ||
| case _ => false | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: How about this style?
private def isRedundantNullCheck(
ifNullExpr: Expression,
ifNotNullExpr: Expression,
checkedExpr: Expression): Boolean = {
ifNotNullExpr.isInstanceOf[NullIntolerant] && {
(ifNullExpr == checkedExpr || ifNullExpr == Literal.create(null, checkedExpr.dataType)) &&
ifNotNullExpr.children.contains(checkedExpr)
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first condition ifNullExpr == checkedExpr -> ifNullExpr.semanticEquals(checkedExpr)? e.g., if isnull(a + b) b + a else xxx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second condition ifNullExpr == Literal.create(null, checkedExpr.dataType) -> ifNullExpr.foldable && ifNullExpr.eval() == null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you generalize the last condition more? e.g., how about the case, substring(other_func(title#5), 0, 3) in the example you described?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that should be possible.
| import org.apache.spark.sql.catalyst.plans.PlanTest | ||
| import org.apache.spark.sql.catalyst.plans.logical._ | ||
| import org.apache.spark.sql.catalyst.rules._ | ||
| import org.apache.spark.sql.catalyst.plans.{PlanTest} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: {PlanTest} -> PlanTest
|
Test build #117088 has finished for PR 27231 at commit
|
|
retest this please |
| case IsNull(child) if isRedundantNullCheck(trueValue, falseValue, child) => falseValue | ||
| case IsNotNull(child) if isRedundantNullCheck(falseValue, trueValue, child) => trueValue | ||
| case _ => i | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this format?;
// If the null-check is redundant, remove it
case If(IsNull(child), trueValue, falseValue)
if isRedundantNullCheck(trueValue, falseValue, child) => falseValue
case If(IsNotNull(child), trueValue, falseValue)
if isRedundantNullCheck(falseValue, trueValue, child) => trueValue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you add the inner pattern-matching (cond match { )? I think its better to avoid unnecessary pattern matching (In the current fix, all the cases for If exprs can be matched in the line 466).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. You are right, i do not need the inner pattern match, i will fix that.
| elseValue.getOrElse(Literal.create(null, child.dataType)), | ||
| child) => elseValue.getOrElse(Literal.create(null, child.dataType)) | ||
| case _ => e | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this?
// remove redundant null checks for CaseWhen with one branch
case CaseWhen(Seq((IsNotNull(child), trueValue)), Some(falseValue))
if isRedundantNullCheck(falseValue, trueValue, child) => trueValue
case CaseWhen(Seq((IsNull(child), trueValue)), Some(falseValue))
if isRedundantNullCheck(trueValue, falseValue, child) => falseValue
case CaseWhen(Seq((IsNotNull(child), trueValue)), None)
if isRedundantNullCheck(Literal.create(null, child.dataType), trueValue, child) => trueValue
case e @ CaseWhen(Seq((IsNull(child), trueValue)), None) =>
val nullValue = Literal.create(null, child.dataType)
if (isRedundantNullCheck(trueValue, nullValue, child)) {
nullValue
} else {
e
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright
| val actual = Optimize.execute(Project(Alias(e1, "out")() :: Nil, OneRowRelation()).analyze) | ||
| val correctAnswer = Project(Alias(e2, "out")() :: Nil, LocalRelation('a.int)).analyze | ||
| val actual = Optimize.execute( | ||
| Project(Alias(e1, "out")() :: Nil, LocalRelation('a.int)).analyze) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you don't need to break this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see
| import org.apache.spark.sql.catalyst.rules.RuleExecutor | ||
| import org.apache.spark.sql.types.{IntegerType, NullType} | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: You need to avoid unnecessary changes like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
| // remove redundant null checks for CaseWhen with one branch | ||
| branches(0)._1 match { | ||
| case IsNotNull(child) if isRedundantNullCheck( | ||
| elseValue.getOrElse(Literal.create(null, child.dataType)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
child.dataType -> e.dataType?
|
|
||
| val isNotNullCond = IsNotNull(UnresolvedAttribute(Seq("a"))) | ||
| val isNullCond = IsNull(UnresolvedAttribute("b")) | ||
| private val nullValue = Literal.create(null, IntegerType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you change from NullType to IntegerType here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need the same dataType as i have for the a attribute. But i can just add another nullValue to the test and keep the previous with the original dataType.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I think its better to avoid the behaviour changes in the existing tests.
| assertEquivalent( | ||
| CaseWhen((isNotNullCond, Subtract(Literal(3), Literal(2))) :: | ||
| (isNullCond, Literal(1)) :: | ||
| (isNullCondB, Literal(1)) :: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to change the existing tests where possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, i will try to avoid that.
|
Test build #117143 has finished for PR 27231 at commit
|
|
Test build #117200 has finished for PR 27231 at commit
|
|
@cloud-fan @maropu do you have any more suggestions / comments / recommendations to this pr? In the last commit i added a bit of generalization to include cases where the null-checked column is not necessarily a direct child of the |
| ifNotNullExpr: Expression, | ||
| checkedExpr: Expression): Boolean = { | ||
| val isNullIntolerant = ifNotNullExpr.find { x => | ||
| !x.isInstanceOf[NullIntolerant] && x.find(e => e.semanticEquals(checkedExpr)).nonEmpty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually i think we need slightly different logic. Consider these two examples where x will be the null-checked column:
-
substring(x, coalesce(a, b), c) -
substring(coalesce(x, d), a, c)
For 1. we need to be null-intolerant (even though coalesce is null-tolerant), so if x is null, we replace the substring with null value no matter what are the other children. For 2. we need to be null-tolerant and we will not replace the substring by null value. So we need to check the expression with respect to the position of x (the column that is being null-checked). Does it make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, you meant FiterExec.isNullIntolerant(ifNotNullExpr) || additional checks for the case having null-tolerant exprs inside ifNotNullExpr? (FiterExec.isNullIntolerant is private though...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, FilterExec.isNullIntolerant(ifNotNullExpr) is a stronger condition than we need so in case there is null-tolerant expr inside we need to check if the null-checked column is in its subtree. Using the logic from FilterExec.isNullIntolerant the function could look like this:
def isNullIntolerant(expr: Expression): Boolean = expr match {
case e: NullIntolerant => e.children.forall(isNullIntolerant)
case e if e.find(x => x.semanticEquals(checkedExpr)).isEmpty => true
case _ => false
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. For better code readability, could you split the condition into the two parts as I suggested above? Also, I think its better to leave some comments about why we need more checks there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the committed code is not very intuitive so i can think of this way which seems to be more readable (added also some comments):
private def isRedundantNullCheck(
ifNullExpr: Expression,
ifNotNullExpr: Expression,
checkedExpr: Expression): Boolean = {
// checks if expr is null-intolerant with respect to checkedExpr
def isNullIntolerant(expr: Expression): Boolean = expr match {
case e: NullIntolerant => e.children.forall(isNullIntolerant)
// if some child is null-tolerant but the checkedEpxr is not in its subtree
// we can still consider the whole expr as null-intolerant
// with respect to checkedExpr
case e if e.find(x => x.semanticEquals(checkedExpr)).isEmpty => true
case _ => false
}
isNullIntolerant(ifNotNullExpr) && {
(ifNullExpr.semanticEquals(checkedExpr) ||
(ifNullExpr.foldable && ifNullExpr.eval() == null)) &&
// we still need to make sure that checkedExpr is inside ifNotNullExpr
ifNotNullExpr.find(x => x.semanticEquals(checkedExpr)).nonEmpty
}
}
But not sure if this is what you had in mind when suggesting to split the condition. Can you think of a better way how to compose this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, that still looks complicated.. If we cannot avoid the complexity for the stronger condition, as another option, I think we can cover the simple case (FiterExec.isNullIntolerant(ifNotNullExpr)) only in this pr. If necessary, we might be able to optimize the condition in future work. I think keeping the code simple is more important. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the think is that if we use the simple version with FilterExec.isNullIntolerant(ifNutNullExpr) we will loose (because of the recursive check) all expressions that contain literals (because literals are null-tolerant), so for example expressions like this substring(title#5, 0, 3) will not be included in the optimization (which the jira was targeted for in the first place). So I suggest one of these 2 options:
- Use the complex version of the code and thus include more expressions in the optimization
- Have the code more simple and use the original version before the generalization step, i.e.
private def isRedundantNullCheck(
ifNullExpr: Expression,
ifNotNullExpr: Expression,
checkedExpr: Expression): Boolean = {
ifNotNullExpr.isInstanceOf[NullIntolerant] && {
(ifNullExpr == checkedExpr || ifNullExpr == Literal.create(null, checkedExpr.dataType)) &&
ifNotNullExpr.children.contains(checkedExpr)
}
}
where checkedExpr must be direct child and thus we don't have to check the whole subtree for null-intolerance (so expressions that have Literals in the subtree are still included).
I am fine with either of these 2 options. What do you think?
| case IsNull(child) if isRedundantNullCheck(trueValue, falseValue, child) => falseValue | ||
| case IsNotNull(child) if isRedundantNullCheck(falseValue, trueValue, child) => trueValue | ||
| case _ => i | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you add the inner pattern-matching (cond match { )? I think its better to avoid unnecessary pattern matching (In the current fix, all the cases for If exprs can be matched in the line 466).
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
The purpose of this pr is to remove explicit null checks if they are not needed in order to simplify the generated code. Here is one example:
Expressions of this type
are simplified to
if the considered expression is null-intolerant.
Why are the changes needed?
It simplifies expressions in the query plan which leads to potential optimization due to simplified codegen.
Does this PR introduce any user-facing change?
No
How was this patch tested?
New tests are added.