-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-35829][SQL] Clean up evaluates subexpressions and add more flexibility to evaluate particular subexpressoin #32980
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
e8a03f5
777b9a4
c774797
4574b30
9428a48
2ecb592
014bc8b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -76,24 +76,38 @@ object ExprCode { | |
| /** | ||
| * State used for subexpression elimination. | ||
| * | ||
| * @param isNull A term that holds a boolean value representing whether the expression evaluated | ||
| * to null. | ||
| * @param value A term for a value of a common sub-expression. Not valid if `isNull` | ||
| * is set to `true`. | ||
| * @param eval The source code for evaluating the subexpression. | ||
| * @param children The sequence of subexpressions as the children expressions. Before | ||
| * evaluating this subexpression, we should evaluate all children | ||
| * subexpressions first. This is used if we want to selectively evaluate | ||
| * particular subexpressions, instead of all at once. In the case, we need | ||
| * to make sure we evaluate all children subexpressions too. | ||
| */ | ||
| case class SubExprEliminationState(isNull: ExprValue, value: ExprValue) | ||
| case class SubExprEliminationState( | ||
cloud-fan marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| eval: ExprCode, | ||
| children: Seq[SubExprEliminationState]) | ||
|
|
||
| object SubExprEliminationState { | ||
| def apply(eval: ExprCode): SubExprEliminationState = { | ||
| new SubExprEliminationState(eval, Seq.empty) | ||
| } | ||
|
|
||
| def apply( | ||
| eval: ExprCode, | ||
| children: Seq[SubExprEliminationState]): SubExprEliminationState = { | ||
| new SubExprEliminationState(eval, children.reverse) | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Codes and common subexpressions mapping used for subexpression elimination. | ||
| * | ||
| * @param codes Strings representing the codes that evaluate common subexpressions. | ||
| * @param states Foreach expression that is participating in subexpression elimination, | ||
| * the state to use. | ||
| * @param exprCodesNeedEvaluate Some expression codes that need to be evaluated before | ||
| * calling common subexpressions. | ||
| */ | ||
| case class SubExprCodes( | ||
| codes: Seq[String], | ||
| states: Map[Expression, SubExprEliminationState], | ||
| exprCodesNeedEvaluate: Seq[ExprCode]) | ||
|
|
||
|
|
@@ -1030,11 +1044,55 @@ class CodegenContext extends Logging { | |
| } | ||
|
|
||
| /** | ||
| * Checks and sets up the state and codegen for subexpression elimination. This finds the | ||
| * common subexpressions, generates the code snippets that evaluate those expressions and | ||
| * populates the mapping of common subexpressions to the generated code snippets. The generated | ||
| * code snippets will be returned and should be inserted into generated codes before these | ||
| * common subexpressions actually are used first time. | ||
| * Evaluates a sequence of `SubExprEliminationState` which represent subexpressions. After | ||
| * evaluating a subexpression, this method will clean up the code block to avoid duplicate | ||
| * evaluation. | ||
| */ | ||
| def evaluateSubExprEliminationState(subExprStates: Iterable[SubExprEliminationState]): String = { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit:
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. All its caller side use |
||
| val code = new StringBuilder() | ||
|
|
||
| subExprStates.foreach { state => | ||
| val currentCode = evaluateSubExprEliminationState(state.children) + "\n" + state.eval.code | ||
| code.append(currentCode + "\n") | ||
| state.eval.code = EmptyBlock | ||
| } | ||
|
|
||
| code.toString() | ||
| } | ||
|
|
||
| /** | ||
| * Checks and sets up the state and codegen for subexpression elimination in whole-stage codegen. | ||
| * | ||
| * This finds the common subexpressions, generates the code snippets that evaluate those | ||
| * expressions and populates the mapping of common subexpressions to the generated code snippets. | ||
| * | ||
| * The generated code snippet for subexpression is wrapped in `SubExprEliminationState`, which | ||
| * contains an `ExprCode` and the children `SubExprEliminationState` if any. The `ExprCode` | ||
| * includes java source code, result variable name and is-null variable name of the subexpression. | ||
| * | ||
| * Besides, this also returns a sequences of `ExprCode` which are expression codes that need to | ||
| * be evaluated (as their input parameters) before evaluating subexpressions. | ||
| * | ||
| * To evaluate the returned subexpressions, please call `evaluateSubExprEliminationState` with | ||
| * the `SubExprEliminationState`s to be evaluated. During generating the code, it will cleanup | ||
| * the states to avoid duplicate evaluation. | ||
| * | ||
| * The details of subexpression generation: | ||
| * 1. Gets subexpression set. See `EquivalentExpressions`. | ||
| * 2. Generate code of subexpressions as a whole block of code (non-split case) | ||
| * 3. Check if the total length of the above block is larger than the split-threshold. If so, | ||
| * try to split it in step 4, otherwise returning the non-split code block. | ||
| * 4. Check if parameter lengths of all subexpressions satisfy the JVM limitation, if so, | ||
| * try to split, otherwise returning the non-split code block. | ||
| * 5. For each subexpression, generating a function and put the code into it. To evaluate the | ||
| * subexpression, just call the function. | ||
| * | ||
| * The explanation of subexpression codegen: | ||
| * 1. Wrapping in `withSubExprEliminationExprs` call with current subexpression map. Each | ||
| * subexpression may depends on other subexpressions (children). So when generating code | ||
| * for subexpressions, we iterate over each subexpression and put the mapping between | ||
| * (subexpression -> `SubExprEliminationState`) into the map. So in next subexpression | ||
| * evaluation, we can look for generated subexpressions and do replacement. | ||
| */ | ||
| def subexpressionEliminationForWholeStageCodegen(expressions: Seq[Expression]): SubExprCodes = { | ||
| // Create a clear EquivalentExpressions and SubExprEliminationState mapping | ||
|
|
@@ -1049,17 +1107,25 @@ class CodegenContext extends Logging { | |
| // elimination. | ||
| val commonExprs = equivalentExpressions.getAllEquivalentExprs(1) | ||
|
|
||
| val nonSplitExprCode = { | ||
| val nonSplitCode = { | ||
| val allStates = mutable.ArrayBuffer.empty[SubExprEliminationState] | ||
| commonExprs.map { exprs => | ||
| val eval = withSubExprEliminationExprs(localSubExprEliminationExprsForNonSplit.toMap) { | ||
| withSubExprEliminationExprs(localSubExprEliminationExprsForNonSplit.toMap) { | ||
| val eval = exprs.head.genCode(this) | ||
| // Generate the code for this expression tree. | ||
| val state = SubExprEliminationState(eval.isNull, eval.value) | ||
| // Collects other subexpressions from the children. | ||
| val childrenSubExprs = mutable.ArrayBuffer.empty[SubExprEliminationState] | ||
| exprs.head.foreach { | ||
| case e if subExprEliminationExprs.contains(e) => | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We need to add some comments to explain the assumption: this code works because
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. BTW collecting child expressions here looks really inefficient, but I don't have a better idea for now ...
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see. This is not general expression but special (subexpr) ones, so we don't do collecting child expressions in general but in limited range. Except that if you have many subexpr and they are highly nested.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
As I commented before, I plan to remove the sorting. A better idea is to add |
||
| childrenSubExprs += subExprEliminationExprs(e) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Q: Is it difficult to add some tests for this new behaviour?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let me add a few tests.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added new test. |
||
| case _ => | ||
| } | ||
| val state = SubExprEliminationState(eval, childrenSubExprs.toSeq) | ||
| exprs.foreach(localSubExprEliminationExprsForNonSplit.put(_, state)) | ||
| allStates += state | ||
| Seq(eval) | ||
| }.head | ||
| eval.code.toString | ||
| } | ||
| } | ||
| allStates.toSeq | ||
| } | ||
|
|
||
| // For some operators, they do not require all its child's outputs to be evaluated in advance. | ||
|
|
@@ -1071,14 +1137,13 @@ class CodegenContext extends Logging { | |
| (inputVars.toSeq, exprCodes.toSeq) | ||
| }.unzip | ||
|
|
||
| val splitThreshold = SQLConf.get.methodSplitThreshold | ||
|
|
||
| val (codes, subExprsMap, exprCodes) = if (nonSplitExprCode.map(_.length).sum > splitThreshold) { | ||
| val needSplit = nonSplitCode.map(_.eval.code.length).sum > SQLConf.get.methodSplitThreshold | ||
| val (subExprsMap, exprCodes) = if (needSplit) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not related to this PR: so here we repeat the logic of generating
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Previously it is lazy so we can do non-split conditionally. Now we nestedly generate subExprs so it cannot be lazy now. |
||
| if (inputVarsForAllFuncs.map(calculateParamLengthFromExprValues).forall(isValidParamLength)) { | ||
| val localSubExprEliminationExprs = | ||
| mutable.HashMap.empty[Expression, SubExprEliminationState] | ||
|
|
||
| val splitCodes = commonExprs.zipWithIndex.map { case (exprs, i) => | ||
| commonExprs.zipWithIndex.foreach { case (exprs, i) => | ||
| val expr = exprs.head | ||
| val eval = withSubExprEliminationExprs(localSubExprEliminationExprs.toMap) { | ||
| Seq(expr.genCode(this)) | ||
|
|
@@ -1111,24 +1176,34 @@ class CodegenContext extends Logging { | |
| |} | ||
| """.stripMargin | ||
|
|
||
| val state = SubExprEliminationState(isNull, JavaCode.global(value, expr.dataType)) | ||
| exprs.foreach(localSubExprEliminationExprs.put(_, state)) | ||
| // Collects other subexpressions from the children. | ||
| val childrenSubExprs = mutable.ArrayBuffer.empty[SubExprEliminationState] | ||
| exprs.head.foreach { | ||
| case e if localSubExprEliminationExprs.contains(e) => | ||
| childrenSubExprs += localSubExprEliminationExprs(e) | ||
| case _ => | ||
| } | ||
|
|
||
| val inputVariables = inputVars.map(_.variableName).mkString(", ") | ||
| s"${addNewFunction(fnName, fn)}($inputVariables);" | ||
| val code = code"${addNewFunction(fnName, fn)}($inputVariables);" | ||
| val state = SubExprEliminationState( | ||
| ExprCode(code, isNull, JavaCode.global(value, expr.dataType)), | ||
| childrenSubExprs.toSeq) | ||
| exprs.foreach(localSubExprEliminationExprs.put(_, state)) | ||
| } | ||
| (splitCodes, localSubExprEliminationExprs, exprCodesNeedEvaluate) | ||
| (localSubExprEliminationExprs, exprCodesNeedEvaluate) | ||
| } else { | ||
| if (Utils.isTesting) { | ||
| throw QueryExecutionErrors.failedSplitSubExpressionError(MAX_JVM_METHOD_PARAMS_LENGTH) | ||
| } else { | ||
| logInfo(QueryExecutionErrors.failedSplitSubExpressionMsg(MAX_JVM_METHOD_PARAMS_LENGTH)) | ||
| (nonSplitExprCode, localSubExprEliminationExprsForNonSplit, Seq.empty) | ||
| (localSubExprEliminationExprsForNonSplit, Seq.empty) | ||
| } | ||
| } | ||
| } else { | ||
| (nonSplitExprCode, localSubExprEliminationExprsForNonSplit, Seq.empty) | ||
| (localSubExprEliminationExprsForNonSplit, Seq.empty) | ||
| } | ||
| SubExprCodes(codes, subExprsMap.toMap, exprCodes.flatten) | ||
| SubExprCodes(subExprsMap.toMap, exprCodes.flatten) | ||
| } | ||
cloud-fan marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| /** | ||
|
|
@@ -1174,10 +1249,12 @@ class CodegenContext extends Logging { | |
| // Currently, we will do this for all non-leaf only expression trees (i.e. expr trees with | ||
| // at least two nodes) as the cost of doing it is expected to be low. | ||
|
|
||
| val subExprCode = s"${addNewFunction(fnName, fn)}($INPUT_ROW);" | ||
| subexprFunctions += s"${addNewFunction(fnName, fn)}($INPUT_ROW);" | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: shall we use
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh yes, as the functions in class is a map, it will overwrite. But yes, we should use
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
| val state = SubExprEliminationState( | ||
| JavaCode.isNullGlobal(isNull), | ||
| JavaCode.global(value, expr.dataType)) | ||
| ExprCode(code"$subExprCode", | ||
| JavaCode.isNullGlobal(isNull), | ||
| JavaCode.global(value, expr.dataType))) | ||
| subExprEliminationExprs ++= e.map(_ -> state).toMap | ||
| } | ||
| } | ||
|
|
@@ -1776,9 +1853,8 @@ object CodeGenerator extends Logging { | |
| while (stack.nonEmpty) { | ||
| stack.pop() match { | ||
| case e if subExprs.contains(e) => | ||
| val SubExprEliminationState(isNull, value) = subExprs(e) | ||
| collectLocalVariable(value) | ||
| collectLocalVariable(isNull) | ||
| collectLocalVariable(subExprs(e).eval.value) | ||
| collectLocalVariable(subExprs(e).eval.isNull) | ||
|
|
||
| case ref: BoundReference if ctx.currentVars != null && | ||
| ctx.currentVars(ref.ordinal) != null => | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your previous PR improves
EquivalentExpressionsto always return child subexpression first. It seems that PR is not useful after this PR because we track the children explicitly?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactly. We need to return child subexpressions first. So we can make sure child subexpression is codegen-ed and put into the map before parent subexpression. When we want to codegen parent subexpression, it can look up the child subexpression and put it as child of the parent.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I have a new idea for how to codegen subexpression following child-parent orders without sorting. It is more reliable than the sorting approach. I will open another PR for that.