[SPARK-35829][SQL] Clean up evaluates subexpressions and add more flexibility to evaluate particular subexpressoin #32980

viirya · 2021-06-19T19:40:14Z

What changes were proposed in this pull request?

This patch refactors the evaluation of subexpressions.

There are two changes:

1. Clean up subexpression code after evaluation to avoid duplicate evaluation.
2. Evaluate all children subexpressions when evaluating a subexpression.
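
To make the two changes concrete, here is a minimal sketch of the idea. It uses simplified stand-in classes rather than Spark's real `ExprCode` and `CodegenContext`, and the object and method names `SubExprEvalSketch` and `evaluate` are hypothetical. Each state emits its children's code first, then its own, and blanks the emitted code so that a second request produces nothing.

```scala
// Simplified stand-ins for illustration only; these are NOT Spark's real classes.
final class ExprCode(var code: String, val value: String)

final case class SubExprEliminationState(
    eval: ExprCode,
    children: Seq[SubExprEliminationState] = Seq.empty)

object SubExprEvalSketch {
  // Emits the code of every state after the code of its children, then blanks
  // the emitted code so that asking for the same state again yields nothing.
  def evaluate(states: Iterable[SubExprEliminationState]): String = {
    val parts = states.toSeq.map { state =>
      val childCode = evaluate(state.children)
      val own = state.eval.code
      state.eval.code = ""
      Seq(childCode, own).filter(_.nonEmpty).mkString("\n")
    }
    parts.filter(_.nonEmpty).mkString("\n")
  }
}
```

With this shape, a caller can evaluate only the subexpressions one predicate needs, and a later evaluation of the remaining ones will not repeat the code that was already emitted.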

Why are the changes needed?

Currently, subexpressionEliminationForWholeStageCodegen returns the generated code of subexpressions, and the caller simply puts that code into its code block. We need more flexible evaluation here. For example, for the Filter operator's subexpression evaluation, we may need to evaluate a particular subexpression for one predicate. The current approach cannot satisfy this requirement.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests.

SparkQA · 2021-06-19T21:08:00Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44555/

SparkQA · 2021-06-19T21:43:26Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44555/

SparkQA · 2021-06-19T22:02:39Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44557/

SparkQA · 2021-06-19T22:35:54Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44557/

SparkQA · 2021-06-20T01:23:26Z

Test build #140030 has finished for PR 32980 at commit e8a03f5.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class SubExprEliminationState(

HyukjinKwon · 2021-06-20T02:29:26Z

cc @rednaxelafx too FYI

viirya · 2021-06-20T02:33:54Z

cc @maropu @cloud-fan

cloud-fan · 2021-06-21T03:35:44Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

 *
- * @param codes Strings representing the codes that evaluate common subexpressions.
+ * @param codes all `SubExprEliminationState` representing the codes that evaluate common
+ *              subexpressions.


this is a bit hard to understand. what's the difference between SubExprEliminationState here and in states?

They're the same SubExprEliminationState. states is used as a map when we look for subexpressions to replace in an expression. codes holds all the values in the map, in the order in which we created them.

Now that I think about it more, maybe we don't need to keep the sequence (codes). Since this PR cleans up child subexpressions during evaluation, the order of evaluation no longer seems important.

SparkQA · 2021-06-21T09:19:12Z

Test build #140069 has finished for PR 32980 at commit 777b9a4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-06-21T10:18:40Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44597/

cloud-fan · 2021-06-21T17:23:31Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+              childrenSubExprs += subExprEliminationExprs(e)
+            case _ =>
+          }
+          val state = SubExprEliminationState(eval.code, eval.isNull, eval.value,


seems it's simpler if we define SubExprEliminationState as SubExprEliminationState(eval: ExprValue, children: ...)

You mean SubExprEliminationState(eval: ExprCode, children: ...)?

viirya · 2021-06-21T20:28:09Z

Hmm, directly using the values in the map causes some test failures.

SparkQA · 2021-06-21T22:12:17Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44627/

SparkQA · 2021-06-21T22:20:56Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44627/

SparkQA · 2021-06-21T23:07:28Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44630/

SparkQA · 2021-06-21T23:16:19Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44630/

SparkQA · 2021-06-22T02:00:04Z

Test build #140099 has finished for PR 32980 at commit c774797.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-06-22T02:41:56Z

Test build #140102 has finished for PR 32980 at commit 4574b30.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-06-22T09:30:44Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+ *                 evaluating this subexpression, we should evaluate all children
+ *                 subexpressions first. This is used if we want to selectively evaluate
+ *                 particular subexpressions, instead of all at once. In the case, we need
+ *                 to make sure we evaluate all children subexpressions too.


Your previous PR improves EquivalentExpressions to always return child subexpression first. It seems that PR is not useful after this PR because we track the children explicitly?

Not exactly. We still need to return child subexpressions first, so that a child subexpression is codegen-ed and put into the map before its parent subexpression. When we codegen the parent subexpression, it can look up the child subexpression and record it as a child of the parent.

Actually I have a new idea for how to codegen subexpressions in child-parent order without sorting. It is more reliable than the sorting approach. I will open another PR for that.
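
The dependency on child-first ordering can be illustrated with a small sketch. It assumes a hypothetical miniature expression tree (`Leaf`, `Add`, `Mul`) and a hypothetical `buildStates` helper, none of which are Catalyst classes. A parent can only chain the children that are already in the map when the parent itself is processed.

```scala
import scala.collection.mutable

object ChildFirstSketch {
  // Hypothetical miniature expression tree, not Catalyst's Expression.
  sealed trait Expr { def children: Seq[Expr] }
  final case class Leaf(name: String) extends Expr {
    def children: Seq[Expr] = Nil
  }
  final case class Add(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }
  final case class Mul(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }

  final case class State(variable: String, children: Seq[State])

  // All nodes strictly below `e`, in pre-order.
  private def descendants(e: Expr): Seq[Expr] =
    e.children.flatMap(c => c +: descendants(c))

  // `commonExprs` is expected to list child subexpressions before their parents.
  // Otherwise the lookup for a parent misses children that are not in the map yet.
  def buildStates(commonExprs: Seq[Expr]): Map[Expr, State] = {
    val states = mutable.LinkedHashMap.empty[Expr, State]
    commonExprs.zipWithIndex.foreach { case (expr, i) =>
      val kids = descendants(expr).filter(states.contains).distinct.map(states)
      states(expr) = State(s"subExpr_$i", kids)
    }
    states.toMap
  }
}
```

If the parent is listed before its child, the parent's `children` stay empty in this sketch, which is the situation the child-first ordering is meant to prevent.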

viirya · 2021-06-24T05:00:27Z

Any more thoughts? @cloud-fan @maropu

viirya · 2021-06-25T21:26:32Z

@cloud-fan Could you take another look? Thanks!

maropu · 2021-06-28T00:29:37Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+   * evaluating a subexpression, this method will clean up the code block to avoid duplicate
+   * evaluation.
+   */
+  def evaluateSubExprEliminationState(subExprStates: Iterable[SubExprEliminationState]): String = {


nit: Iterable -> Seq?

All of its callers use Iterable. If we change it to Seq here, every caller needs to add .toSeq.

maropu · 2021-06-28T00:30:34Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+   * expressions and populates the mapping of common subexpressions to the generated code snippets.
+   *
+   * The generated code snippet for subexpression is wrapped in `SubExprEliminationState`, which
+   * contains a `ExprCode` and the children `SubExprEliminationState` if any. The `ExprCode`


nit: a ExprCode -> an ExprCode

maropu · 2021-06-28T00:32:22Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+              childrenSubExprs += subExprEliminationExprs(e)
+            case _ =>
+          }
+          val state = SubExprEliminationState(eval, childrenSubExprs.toSeq.reverse)


childrenSubExprs.toSeq.reverse -> childrenSubExprs.reverse?

btw, how about moving .reverse into the SubExprEliminationState side if we always need to sort it?

object SubExprEliminationState {
  def apply(eval: ExprCode, children: Seq[SubExprEliminationState]): SubExprEliminationState = {
    new SubExprEliminationState(eval, children.reverse)
  }
}

maropu · 2021-06-28T00:43:58Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+          val childrenSubExprs = mutable.ArrayBuffer.empty[SubExprEliminationState]
+          exprs.head.foreach {
+            case e if subExprEliminationExprs.contains(e) =>
+              childrenSubExprs += subExprEliminationExprs(e)


Q: Is it difficult to add some tests for this new behaviour?

Let me add a few tests.

Added new test.

cloud-fan · 2021-06-29T14:26:02Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+object SubExprEliminationState {
+  def apply(
+      eval: ExprCode,
+      children: Seq[SubExprEliminationState] = Seq.empty): SubExprEliminationState = {


nit: def apply(eval: ExprCode): .... If children parameter is also provided, the default case class apply should work.

This is for @maropu's comment #32980 (comment).

Or you mean to also add def apply(eval: ExprCode) here?

Added def apply(eval: ExprCode).
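
A minimal sketch of what the extra overload could look like, assuming a simplified `ExprCode` stand-in and a hypothetical wrapper object `ApplySketch` (the merged code may differ). The single-argument `apply` lives in the companion, while the two-argument `apply` is the one the compiler synthesizes for the case class.

```scala
object ApplySketch {
  // Hypothetical stand-in for the generated-code holder.
  final case class ExprCode(code: String, value: String)

  final case class SubExprEliminationState(
      eval: ExprCode,
      children: Seq[SubExprEliminationState])

  object SubExprEliminationState {
    // Extra overload for states without child subexpressions. The two-argument
    // apply is still the one synthesized for the case class.
    def apply(eval: ExprCode): SubExprEliminationState =
      new SubExprEliminationState(eval, Seq.empty)
  }
}
```

Because the overloads differ in arity, both call forms resolve without ambiguity and no default argument is needed.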

cloud-fan · 2021-06-29T14:41:10Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+          // Collects other subexpressions from the children.
+          val childrenSubExprs = mutable.ArrayBuffer.empty[SubExprEliminationState]
+          exprs.head.foreach {
+            case e if subExprEliminationExprs.contains(e) =>


We need to add some comments to explain the assumption: this code works because EquivalentExpressions returns child expressions first.

BTW collecting child expressions here looks really inefficient, but I don't have a better idea for now ...

I see. These are not general expressions but special (subexpr) ones, so we only collect child expressions within a limited range rather than in general. The exception is when you have many subexpressions and they are highly nested.

We need to add some comments to explain the assumption: this code works because EquivalentExpressions returns child expressions first.

As I commented before, I plan to remove the sorting. A better idea is to first add the SubExprEliminationState to the map (without codegen yet). Then, during codegen, we can look at the map to chain children.

cloud-fan · 2021-06-29T14:49:04Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

-
-    val (codes, subExprsMap, exprCodes) = if (nonSplitExprCode.map(_.length).sum > splitThreshold) {
+    val needSplit = nonSplitCode.map(_.eval.code.length).sum > SQLConf.get.methodSplitThreshold
+    val (subExprsMap, exprCodes) = if (needSplit) {


Not related to this PR: so here we repeat the logic of generating SubExprEliminationStates with splitting the code? nonSplitCode is totally wasted?

Previously it was lazy, so we could do the non-split codegen conditionally. Now we generate subExprs in a nested way, so it can no longer be lazy: the SubExprEliminationStates are needed to generate the nested code.

cloud-fan

LGTM except a few minor comments

SparkQA · 2021-06-29T20:22:40Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44885/

SparkQA · 2021-06-29T21:13:09Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44885/

SparkQA · 2021-06-30T00:41:44Z

Test build #140369 has finished for PR 32980 at commit 2ecb592.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class MergedBlockMetaRequest extends AbstractMessage implements RequestMessage
public class MergedBlockMetaSuccess extends AbstractResponseMessage
public abstract class AbstractFetchShuffleBlocks extends BlockTransferMessage
public class FetchShuffleBlockChunks extends AbstractFetchShuffleBlocks
public class FetchShuffleBlocks extends AbstractFetchShuffleBlocks
final case class FileNameSpec(prefix: String, suffix: String)
class AvroSchemaHelper(avroSchema: Schema, avroPath: Seq[String])
class DecimalOps(FractionalOps):
class IntegralExtensionOps(IntegralOps):
class FractionalExtensionOps(FractionalOps):
class StringExtensionOps(StringOps):
new_class = type("NameType", (NameTypeHolder,),
class GroupBy(Generic[T_Frame], metaclass=ABCMeta):
class DataFrameGroupBy(GroupBy[DataFrame]):
class SeriesGroupBy(GroupBy[Series]):
new_class = type("NameType", (NameTypeHolder,),
class SparkIndexOpsMethods(Generic[T_IndexOps], metaclass=ABCMeta):
class SparkSeriesMethods(SparkIndexOpsMethods["ps.Series"]):
class SparkIndexMethods(SparkIndexOpsMethods["ps.Index"]):
class RollingAndExpanding(Generic[T_Frame], metaclass=ABCMeta):
class RollingLike(RollingAndExpanding[T_Frame]):
class Rolling(RollingLike[T_Frame]):
class RollingGroupby(RollingLike[T_Frame]):
class ExpandingLike(RollingAndExpanding[T_Frame]):
class Expanding(ExpandingLike[T_Frame]):
class ExpandingGroupby(ExpandingLike[T_Frame]):
sealed trait FieldName extends LeafExpression with Unevaluable
case class UnresolvedFieldName(name: Seq[String]) extends FieldName
case class ResolvedFieldName(name: Seq[String]) extends FieldName
case class Cast(
case class GetTimestampWithoutTZ(
case class ParseToTimestampWithoutTZ(
case class RebalancePartitions(
trait AlterTableCommand extends UnaryCommand
case class AlterTableDropColumns(
case class AlterTableRenameColumn(
new SparkException(s"Cannot find catalog plugin class for catalog '$name': $pluginClassName")
new SparkException("Cannot instantiate abstract catalog plugin class for " +
new SparkException(s"Can not load in UserDefinedType $
final class ParquetReadState
case class MergingSessionsExec(
class MergingSessionsIterator(
trait StatefulOperatorCustomMetric
case class StatefulOperatorCustomSumMetric(name: String, desc: String)
trait TestGroupState[S] extends GroupState[S]

SparkQA · 2021-06-30T00:56:38Z

Test build #140394 has finished for PR 32980 at commit 014bc8b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2021-06-30T01:02:08Z

@maropu Any more comments? Otherwise I will merge this later. Thanks.

viirya · 2021-06-30T05:14:16Z

Thanks for review! Merging to master.

maropu · 2021-07-01T02:11:57Z

Thank you, @viirya . late lgtm.

cloud-fan · 2021-07-12T18:05:18Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

      // at least two nodes) as the cost of doing it is expected to be low.

+      val subExprCode = s"${addNewFunction(fnName, fn)}($INPUT_ROW);"
      subexprFunctions += s"${addNewFunction(fnName, fn)}($INPUT_ROW);"


nit: shall we use subexprFunctions += subExprCode here? otherwise we are calling addNewFunction twice.

Oh yes. Since the functions in the class are kept in a map, the second call just overwrites the first. But yes, we should use subExprCode. Let me submit a followup.
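
A small sketch of why the duplicate call was harmless but still worth removing. The `FunctionRegistry` class below is a hypothetical stand-in keyed by function name, not Spark's `CodegenContext`, and the `register` helper is made up for illustration. Registering the same name twice only overwrites the entry, and building the call string once avoids the second registration entirely.

```scala
import scala.collection.mutable

object AddFunctionSketch {
  // Hypothetical registry keyed by function name; NOT Spark's CodegenContext.
  final class FunctionRegistry {
    private val functions = mutable.LinkedHashMap.empty[String, String]
    var calls: Int = 0

    // Registers (or overwrites) the function body and returns the name to call.
    def addNewFunction(name: String, body: String): String = {
      calls += 1
      functions(name) = body
      name
    }

    def size: Int = functions.size
  }

  def register(registry: FunctionRegistry): (String, Seq[String]) = {
    val subexprFunctions = mutable.ArrayBuffer.empty[String]
    val fnName = "subExpr_0"
    val fn = "/* function body */"
    val inputRow = "i"
    // Build the call string once and reuse it, so addNewFunction runs only once.
    val subExprCode = s"${registry.addNewFunction(fnName, fn)}($inputRow);"
    subexprFunctions += subExprCode
    (subExprCode, subexprFunctions.toSeq)
  }
}
```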

…of addNewFunction

### What changes were proposed in this pull request?

A followup of #32980. We should use `subExprCode` to avoid a duplicate call of `addNewFunction`.

### Why are the changes needed?

Avoid a duplicate call of `addNewFunction`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test.

Closes #33305 from viirya/fix-minor.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

github-actions bot added the SQL label Jun 19, 2021

Refactor how to evaluate subexpressions.

e8a03f5

viirya force-pushed the subexpr-eval branch from 59c7213 to e8a03f5 Compare June 19, 2021 21:10

cloud-fan reviewed Jun 21, 2021

Remove the duplicate SubExprEliminationState.

777b9a4

cloud-fan reviewed Jun 21, 2021

cloud-fan reviewed Jun 21, 2021

viirya added 2 commits June 21, 2021 14:05

Fix incorrect map.

c774797

Simplify SubExprEliminationState and add more doc.

4574b30

cloud-fan reviewed Jun 22, 2021

maropu reviewed Jun 28, 2021

viirya added 2 commits June 28, 2021 12:46

Add new test.

9428a48

Merge remote-tracking branch 'upstream/master' into subexpr-eval

2ecb592

cloud-fan reviewed Jun 29, 2021

cloud-fan approved these changes Jun 29, 2021

Add another apply.

014bc8b

viirya mentioned this pull request Jun 29, 2021

[SPARK-33337][SQL] Support subexpression elimination in branches of conditional expressions #30245

Closed

viirya closed this in 064230d Jun 30, 2021

viirya deleted the subexpr-eval branch June 30, 2021 05:15

cloud-fan reviewed Jul 12, 2021

viirya mentioned this pull request Jul 12, 2021

[SPARK-35829][SQL][FOLLOWUP] Use subExprCode to avoid duplicate call of addNewFunction #33305

Closed
