-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-35829][SQL] Clean up evaluates subexpressions and add more flexibility to evaluate particular subexpressoin #32980
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
This comment has been minimized.
This comment has been minimized.
|
Test build #140030 has finished for PR 32980 at commit
|
|
cc @rednaxelafx too FYI |
| * | ||
| * @param codes Strings representing the codes that evaluate common subexpressions. | ||
| * @param codes all `SubExprEliminationState` representing the codes that evaluate common | ||
| * subexpressions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a bit hard to understand. what's the difference between SubExprEliminationState here and in states?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The're the same SubExprEliminationState. states is used as map when we look for subexpressions to replace in an expression. codes are all values in the map, and they are in the sequence when we create them.
Now I'm thinking it more, maybe we don't need to keep the sequence (codes). As this PR cleans up child subexpressions during evaluation. The order of evaluation seems not important anymore.
|
Test build #140069 has finished for PR 32980 at commit
|
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
Show resolved
Hide resolved
| childrenSubExprs += subExprEliminationExprs(e) | ||
| case _ => | ||
| } | ||
| val state = SubExprEliminationState(eval.code, eval.isNull, eval.value, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems it's simpler if we define SubExprEliminationState as SubExprEliminationState(eval: ExprValue, children: ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean SubExprEliminationState(eval: ExprCode, children: ...)?
...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
Show resolved
Hide resolved
|
Hmm, directly use the values in the map causes some test failure. |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #140099 has finished for PR 32980 at commit
|
|
Test build #140102 has finished for PR 32980 at commit
|
| * evaluating this subexpression, we should evaluate all children | ||
| * subexpressions first. This is used if we want to selectively evaluate | ||
| * particular subexpressions, instead of all at once. In the case, we need | ||
| * to make sure we evaluate all children subexpressions too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your previous PR improves EquivalentExpressions to always return child subexpression first. It seems that PR is not useful after this PR because we track the children explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactly. We need to return child subexpressions first. So we can make sure child subexpression is codegen-ed and put into the map before parent subexpression. When we want to codegen parent subexpression, it can look up the child subexpression and put it as child of the parent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I have a new idea for how to codegen subexpression following child-parent orders without sorting. It is more reliable than the sorting approach. I will open another PR for that.
|
Any more thoughts? @cloud-fan @maropu |
|
@cloud-fan Could you take another look? Thanks! |
| * evaluating a subexpression, this method will clean up the code block to avoid duplicate | ||
| * evaluation. | ||
| */ | ||
| def evaluateSubExprEliminationState(subExprStates: Iterable[SubExprEliminationState]): String = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Iterable -> Seq?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All its caller side use Iterable. If changing to Seq here, all callers need to add .toSeq.
| * expressions and populates the mapping of common subexpressions to the generated code snippets. | ||
| * | ||
| * The generated code snippet for subexpression is wrapped in `SubExprEliminationState`, which | ||
| * contains a `ExprCode` and the children `SubExprEliminationState` if any. The `ExprCode` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: a ExprCode -> an ExprCode
| childrenSubExprs += subExprEliminationExprs(e) | ||
| case _ => | ||
| } | ||
| val state = SubExprEliminationState(eval, childrenSubExprs.toSeq.reverse) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
childrenSubExprs.toSeq.reverse -> childrenSubExprs.reverse?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw, how about moving .reverse into the SubExprEliminationState side if we always need to sort it;
object SubExprEliminationState {
def apply(eval: ExprCode, children: Seq[SubExprEliminationState]): SubExprEliminationState = {
new SubExprEliminationState(eval, children.reverse)
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay
| val childrenSubExprs = mutable.ArrayBuffer.empty[SubExprEliminationState] | ||
| exprs.head.foreach { | ||
| case e if subExprEliminationExprs.contains(e) => | ||
| childrenSubExprs += subExprEliminationExprs(e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: Is it difficult to add some tests for this new behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me add a few tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added new test.
| object SubExprEliminationState { | ||
| def apply( | ||
| eval: ExprCode, | ||
| children: Seq[SubExprEliminationState] = Seq.empty): SubExprEliminationState = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: def apply(eval: ExprCode): .... If children parameter is also provided, the default case class apply should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for @maropu's comment #32980 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or you mean to also add def apply(eval: ExprCode) here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added def apply(eval: ExprCode).
| // Collects other subexpressions from the children. | ||
| val childrenSubExprs = mutable.ArrayBuffer.empty[SubExprEliminationState] | ||
| exprs.head.foreach { | ||
| case e if subExprEliminationExprs.contains(e) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to add some comments to explain the assumption: this code works because EquivalentExpressions returns child expressions first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW collecting child expressions here looks really inefficient, but I don't have a better idea for now ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. This is not general expression but special (subexpr) ones, so we don't do collecting child expressions in general but in limited range. Except that if you have many subexpr and they are highly nested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to add some comments to explain the assumption: this code works because EquivalentExpressions returns child expressions first.
As I commented before, I plan to remove the sorting. A better idea is to add SubExprEliminationState first into the map (not codegen yet). Then during codegen, we can look at the map to chain children.
|
|
||
| val (codes, subExprsMap, exprCodes) = if (nonSplitExprCode.map(_.length).sum > splitThreshold) { | ||
| val needSplit = nonSplitCode.map(_.eval.code.length).sum > SQLConf.get.methodSplitThreshold | ||
| val (subExprsMap, exprCodes) = if (needSplit) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related to this PR: so here we repeat the logic of generating SubExprEliminationStates with splitting the code? nonSplitCode is totally wasted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously it is lazy so we can do non-split conditionally. Now we nestedly generate subExprs so it cannot be lazy now. SubExprEliminationStates are needed to nestedly generate code for them.
cloud-fan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except a few minor comments
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #140369 has finished for PR 32980 at commit
|
|
Test build #140394 has finished for PR 32980 at commit
|
|
@maropu Any more comments? Otherwise I will merge this later. Thanks. |
|
Thanks for review! Merging to master. |
|
Thank you, @viirya . late lgtm. |
| // at least two nodes) as the cost of doing it is expected to be low. | ||
|
|
||
| val subExprCode = s"${addNewFunction(fnName, fn)}($INPUT_ROW);" | ||
| subexprFunctions += s"${addNewFunction(fnName, fn)}($INPUT_ROW);" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: shall we use subexprFunctions += subExprCode here? otherwise we are calling addNewFunction twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, as the functions in class is a map, it will overwrite. But yes, we should use subExprCode. Let me submit a followup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…of addNewFunction ### What changes were proposed in this pull request? A followup of #32980. We should use `subExprCode` to avoid duplicate call of `addNewFunction`. ### Why are the changes needed? Avoid duplicate all of `addNewFunction`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test. Closes #33305 from viirya/fix-minor. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
What changes were proposed in this pull request?
This patch refactors the evaluation of subexpressions.
There are two changes:
Why are the changes needed?
Currently
subexpressionEliminationForWholeStageCodegenreturn the gen-ed code of subexpressions. The caller simply puts the code into its code block. We need more flexible evaluation here. For example, for Filter operator's subexpression evaluation, we may need to evaluate particular subexpression for one predicate. Current approach cannot satisfy the requirement.Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests.