[SPARK-33337][SQL] Support subexpression elimination in branches of conditional expressions #30245

viirya · 2020-11-04T06:24:49Z

What changes were proposed in this pull request?

Currently we skip subexpression elimination in branches of conditional expressions including If, CaseWhen, and Coalesce. Actually we can do subexpression elimination for such branches if the subexpression is common across all branches. This patch proposes to support subexpression elimination in branches of conditional expressions.

Why are the changes needed?

We may miss subexpression elimination opportunities in branches of conditional expressions. This kind of subexpression is frequently seen. It may be written manually by users or come from the query optimizer. For example, project collapsing could embed expressions between two Projects and produce a conditional expression like:

CASE WHEN jsonToStruct(json).a = '1' THEN 1.0 WHEN jsonToStruct(json).a = '2' THEN 2.0 ... ELSE 1.2 END

If jsonToStruct(json) is an expensive expression, we currently do not eliminate the duplication and waste time evaluating it repeatedly.
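To make the cost concrete, here is a minimal sketch in plain Scala (no Spark dependency). `expensiveField` and `parseCalls` are hypothetical names standing in for an expensive expression such as `jsonToStruct(json).a`; this only illustrates repeated evaluation and is not how the generated code looks.

```scala
object RepeatedEvalDemo {
  // Counts how often the "expensive" expression is evaluated.
  var parseCalls = 0

  // Hypothetical stand-in for jsonToStruct(json).a; not a real JSON parser.
  def expensiveField(json: String): String = {
    parseCalls += 1
    json.stripPrefix("{\"a\":\"").stripSuffix("\"}")
  }

  // Without elimination: each WHEN condition re-evaluates the expression.
  def withoutElimination(json: String): Double =
    if (expensiveField(json) == "1") 1.0
    else if (expensiveField(json) == "2") 2.0
    else 1.2

  // With elimination: evaluate once and reuse the result in every condition.
  def withElimination(json: String): Double = {
    val a = expensiveField(json)
    if (a == "1") 1.0 else if (a == "2") 2.0 else 1.2
  }
}
```

The second variant calls the expensive function once per row regardless of how many WHEN branches are tested.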

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test.

viirya · 2020-11-05T01:51:05Z

cc @cloud-fan @maropu @dongjoon-hyun

dongjoon-hyun

Thank you for pinging me, @viirya . I took a look briefly and this looks useful. I'll revisit tomorrow.

viirya · 2020-11-05T05:24:34Z

Thank you @dongjoon-hyun

SparkQA · 2020-11-05T06:30:10Z

Test build #130627 has finished for PR 30245 at commit cd3776c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-11-05T22:27:39Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+        }
+      })
+      exprSetForAll = exprSetForAll.intersect(exprSet)
+    }


Do we need to handle head and tail separately?

For the head expression, we add its underlying expressions into the exprSetForAll set. For expressions in the tail, we keep the intersection of exprSetForAll and exprSet.

We can merge the two blocks, but then inside the block we need to check whether the current expression is the head expression and apply different logic based on that check.

I prefer the current one since it looks simpler.
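For readers following along, the head/tail intersection can be sketched on a toy expression tree. This is an illustrative model only: `Expr`, `Attr`, `Lit`, `Add`, `subtrees`, and `commonSubexprs` are hypothetical names, not the classes or methods in `EquivalentExpressions`. The inputs used in the usage note mirror the `(a + (b + (c + 1)))` and `(d + (e + (c + 1)))` example from the doc comment quoted later in this thread.

```scala
object CommonExprSketch {
  // Toy expression tree; not Spark's Expression class.
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // All subtrees of an expression, including the expression itself.
  def subtrees(e: Expr): Set[Expr] = e match {
    case Add(l, r) => subtrees(l) ++ subtrees(r) + e
    case leaf      => Set(leaf)
  }

  // Non-leaf subtrees that occur in every one of the given expressions.
  def commonSubexprs(exprs: Seq[Expr]): Set[Expr] = exprs match {
    case Seq() => Set.empty
    case head +: tail =>
      // Seed with the head's subtrees, then intersect with each tail entry.
      tail
        .foldLeft(subtrees(head))((acc, e) => acc.intersect(subtrees(e)))
        .filter {
          case _: Add => true
          case _      => false
        }
  }
}
```

Calling `commonSubexprs` on the two example expressions leaves only the shared `Add(Attr("c"), Lit(1))` node. Seeding the fold with the head is the same head/tail split discussed above, just written as a single `foldLeft`.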

dongjoon-hyun · 2020-11-05T22:30:53Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

-      childrenToRecurse.foreach(addExprTree)
+    // For some special expressions we cannot just recurse into all of its children, but we can
+    // recursively add the common expressions shared between all of its children.
+    def commonChildrenToRecurse: Seq[Seq[Expression]] = expr match {


nit. Although this is used only here, can we declare this outside of this function as a private method? Currently, addExprTree seems to grow unnecessarily.

dongjoon-hyun

Also, could you add a negative test case in which the expression cannot be eliminated from conditional expressions?

viirya · 2020-11-05T22:42:16Z

Also, could you add a negative test case in which the expression cannot be eliminated from conditional expressions?

I mixed positive and negative test cases. I think I can add some comments to explain it.

SparkQA · 2020-11-06T09:21:55Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35315/

SparkQA · 2020-11-06T09:43:57Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35315/

cloud-fan · 2020-11-06T09:52:33Z

Do you still remember why "subexpression elimination" must be eagerly executed? Because implementing "lazy" is expensive?

SparkQA · 2020-11-06T13:02:43Z

Test build #130705 has finished for PR 30245 at commit 9182e3d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-11-06T17:18:43Z

Do you still remember why "subexpression elimination" must be eagerly executed? Because implementing "lazy" is expensive?

I don't remember whether we have tried to implement "lazy" behavior in codegen. It looks like it would at least add complexity, as we would need an extra variable to track whether a subexpression has already been evaluated. Every time we use a subexpression, we might need to first check that extra variable and decide whether to evaluate the subexpression or just reuse the evaluated value.
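As a rough illustration of the bookkeeping described above, here is a minimal sketch of a lazily evaluated subexpression in plain Scala. `LazySubexpr` is a hypothetical helper, not something in Spark's codegen; the `Option` plays the role of the extra "already evaluated" variable.

```scala
object LazySubexprSketch {
  final class LazySubexpr[A](compute: () => A) {
    // None means "not evaluated yet"; this is the extra state each use must check.
    private var cached: Option[A] = None

    def get: A = cached match {
      case Some(v) => v // already evaluated: reuse the value
      case None =>
        val v = compute() // first use: evaluate and remember
        cached = Some(v)
        v
    }
  }
}
```

Every call to `get` pays for the check, which is the per-use overhead the comment refers to. Eager evaluation avoids the check but may run the subexpression when no branch ends up needing it.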

cloud-fan · 2020-11-10T04:51:25Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+   * For example, given two expressions `(a + (b + (c + 1)))` and `(d + (e + (c + 1)))`,
+   * the common expression `(c + 1)` will be added into `equivalenceMap`.
+   */
+  def addCommonExprs(exprs: Seq[Expression], addFunc: Expression => Boolean = addExpr): Unit = {


This can be private as well.

cloud-fan · 2020-11-10T04:53:31Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+      val otherExprSet = mutable.Set[Expr]()
+
+      addExprTree(expr, (innerExpr: Expression) => {
+        if (innerExpr.deterministic) {


Similar code appears twice. Can we create a method for it?

cloud-fan · 2020-11-10T04:59:55Z

...src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala

+    val equivalence1 = new EquivalentExpressions
+    equivalence1.addExprTree(caseWhenExpr1)
+
+    // `add2` is repeatedly in all conditions.


add1 is also repeated. Why is it not included?

We treat the first condition specially because it is definitely run, so it contributes one count for add2. The other conditions all contain add2, so together they contribute one more. That is where the count of 2 for add2 comes from.

For add1, although all values contain it, it is definitely run, so we count it once. If no other expression contains add1, we don't extract a subexpression for add1, as it will run just once (we only run one value of CaseWhen).
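The counting rule in this reply can be modelled with a small sketch. It assumes each condition or value is represented simply as the set of subexpression names it contains, and that there is at least one condition; `counts` is a hypothetical helper and only mirrors the explanation above, not the actual `EquivalentExpressions` code.

```scala
object CaseWhenCountSketch {
  // Each condition/value is modelled as the set of subexpression names it contains.
  // Assumes `conditions` is non-empty, as a CASE WHEN has at least one branch.
  def counts(conditions: Seq[Set[String]], values: Seq[Set[String]]): Map[String, Int] = {
    // Names present in every set of the group; empty if the group is empty.
    def common(sets: Seq[Set[String]]): Set[String] =
      if (sets.isEmpty) Set.empty else sets.reduce(_ intersect _)

    val occurrences: Seq[String] =
      conditions.head.toSeq ++          // first condition always runs
        common(conditions.tail).toSeq ++ // shared by all remaining conditions
        common(values).toSeq             // shared by all values

    occurrences.groupBy(identity).map { case (name, hits) => name -> hits.size }
  }
}
```

Under this model a name is a candidate for extraction only when its count reaches 2, which is why add2 qualifies and add1 does not in the test being discussed.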

SparkQA · 2020-11-10T05:46:35Z

Test build #130813 has finished for PR 30245 at commit 33f3bd3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-10T06:21:07Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35444/

SparkQA · 2020-11-10T06:23:20Z

Test build #130812 has finished for PR 30245 at commit 16314a9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-10T06:51:17Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35444/

SparkQA · 2020-11-10T08:05:01Z

Test build #130835 has finished for PR 30245 at commit b415728.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-11-10T08:35:53Z

retest this please

SparkQA · 2020-11-10T10:12:43Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35461/

SparkQA · 2020-11-10T10:43:06Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35461/

SparkQA · 2020-11-10T13:46:03Z

Test build #130853 has finished for PR 30245 at commit b415728.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-11-10T17:11:45Z

Thanks! Passed Jenkins and GitHub Actions. Will merge this today.

viirya · 2020-11-11T00:16:24Z

Thanks! Merging to master.

leoluan2009 · 2020-11-12T01:18:24Z

...src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala

+    assert(equivalence1.getAllEquivalentExprs.filter(_.size == 2).head == Seq(add, add))
+    // one-time expressions: only ifExpr and its predicate expression
+    assert(equivalence1.getAllEquivalentExprs.count(_.size == 1) == 2)
+    assert(equivalence1.getAllEquivalentExprs.filter(_.size == 1).head == Seq(ifExpr1))


Should we use the contains method? HashMap does not guarantee ordering.

OK, I will create a follow-up to make sure the test cannot be flaky. Thanks.

Created #30371.

Thank you, @leoluan2009 and @viirya . The follow-up is merged to reduce the flakiness.
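For context, an order-independent form of such an assertion might look like the sketch below. This is only an illustration of the idea raised in the review; it is not necessarily what the follow-up changed, and `hasGroup` is a hypothetical helper operating on plain strings rather than expressions.

```scala
object OrderIndependentAssert {
  // True if any group equals `expected`, regardless of the order in which
  // the groups are returned by the underlying hash-based collection.
  def hasGroup[A](groups: Iterable[Seq[A]], expected: Seq[A]): Boolean =
    groups.exists(_ == expected)
}
```

An assertion written as `assert(hasGroup(groups.filter(_.size == 1), Seq(x)))` does not depend on which single-element group happens to come first, unlike `.head`.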

…es if elseValue is set

### What changes were proposed in this pull request?

This PR fixes a bug with subexpression elimination for CaseWhen statements. #30245 added support for creating subexpressions that are present in all branches of conditional statements. However, for a statement to be in "all branches" of a CaseWhen statement, it must also be in the elseValue.

### Why are the changes needed?

Fix a bug where a subexpression can be created and run for branches of a conditional that don't pass. This can cause issues especially with a UDF in a branch that gets executed assuming the condition is true.

### Does this PR introduce _any_ user-facing change?

Yes, fixes a potential bug where a UDF could be eagerly executed even though it might expect to have already passed some form of validation. For example:

```
val col = when($"id" < 0, myUdf($"id"))
spark.range(1).select(when(col > 0, col)).show()
```

`myUdf($"id")` is considered a subexpression and eagerly evaluated, because it is pulled out as a common expression from both executions of the when clause, but if `id >= 0` it should never actually be run.

### How was this patch tested?

Updated existing test with new case.

Closes #32595 from Kimahriman/bug-case-subexpr-elimination.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

… values if elseValue is set

### What changes were proposed in this pull request?

This PR fixes a bug with subexpression elimination for CaseWhen statements. #30245 added support for creating subexpressions that are present in all branches of conditional statements. However, for a statement to be in "all branches" of a CaseWhen statement, it must also be in the elseValue.

### Why are the changes needed?

Fix a bug where a subexpression can be created and run for branches of a conditional that don't pass. This can cause issues especially with a UDF in a branch that gets executed assuming the condition is true.

### Does this PR introduce _any_ user-facing change?

Yes, fixes a potential bug where a UDF could be eagerly executed even though it might expect to have already passed some form of validation. For example:

```
val col = when($"id" < 0, myUdf($"id"))
spark.range(1).select(when(col > 0, col)).show()
```

`myUdf($"id")` is considered a subexpression and eagerly evaluated, because it is pulled out as a common expression from both executions of the when clause, but if `id >= 0` it should never actually be run.

### How was this patch tested?

Updated existing test with new case.

Closes #32651 from Kimahriman/bug-case-subexpr-elimination-3.1.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

cloud-fan · 2021-06-29T18:34:51Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+      // if it is shared among conditions, but it doesn't need to be shared in values. Similarly,
+      // a subexpression among values doesn't need to be in conditions because no matter which
+      // condition is true, it will be evaluated.
+      val conditions = c.branches.tail.map(_._1)


There is a flaw here: we exclude the first condition, so a subexpression common to the rest of the conditions is not necessarily always evaluated.

e.g. in CaseWhen(cond1, ..., cond2, ..., cond2, ...), cond2 is shared among the remaining conditions but it is not always evaluated.

Yes, this is related to #32977. This looks like a more aggressive optimization. If we respect short-circuit evaluation for CaseWhen, this might be an issue when users rely on short-circuit evaluation to guard later conditions.

The safest approach is to only consider all conditions.

WDYT? Should we only consider all conditions?

I think we should. I hit an issue caused by it in my refactor and I'll open a PR for the refactor with multiple bugs fixed.

Ok, thanks!

BTW, does #32980 conflict with your refactor?

Only some trivial conflicts. #32980 should be merged first as it has been reviewed and approved.

FWIW, I also addressed this issue in #32987, which assumes CaseWhen (and Coalesce) should short-circuit and guard later conditions. The main benefit/difference is that if you have

CaseWhen(cond1, ..., cond1, ..., cond2, ...), cond1 gets pulled out as a subexpression, which I think would not happen otherwise, even with #33142.
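The short-circuit concern can be reproduced in a small plain-Scala sketch. `guarded` is a hypothetical expression that is only safe when the first condition is false, and `eagerlyHoisted` models what happens if an expression shared by the later conditions is evaluated up front; neither is actual Spark code.

```scala
object ShortCircuitSketch {
  var guardedCalls = 0

  // Hypothetical expression that is only safe to evaluate when x is non-negative.
  def guarded(x: Int): Boolean = {
    guardedCalls += 1
    require(x >= 0, "guarded must not run for negative input")
    x % 2 == 0
  }

  // Models CASE WHEN x < 0 ... WHEN guarded(x) ... WHEN guarded(x) ... ELSE ...
  // The first condition short-circuits, so guarded(x) never runs for x < 0.
  def shortCircuit(x: Int): String =
    if (x < 0) "neg" else if (guarded(x)) "even" else if (guarded(x)) "dup" else "odd"

  // Hoists guarded(x) because it is common to all conditions except the first.
  def eagerlyHoisted(x: Int): String = {
    val g = guarded(x) // runs even when x < 0
    if (x < 0) "neg" else if (g) "even" else if (g) "dup" else "odd"
  }
}
```

For a negative input, `shortCircuit` returns without touching `guarded`, while `eagerlyHoisted` throws from the `require`. That is the kind of behavior change that treating "common to the remaining conditions" as "always evaluated" can introduce.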

viirya marked this pull request as draft November 4, 2020 06:24

Support subexpression elimination in branches of conditional expressi…

db0cfcc

…ons.

viirya force-pushed the SPARK-33337 branch from a8e0c22 to db0cfcc Compare November 4, 2020 07:15

Add test cases for CaseWhen and Coalesce.

cd3776c

viirya marked this pull request as ready for review November 5, 2020 01:50

viirya changed the title ~~[WIP][SPARK-33337][SQL] Support subexpression elimination in branches of conditional expressions~~ [SPARK-33337][SQL] Support subexpression elimination in branches of conditional expressions Nov 5, 2020

dongjoon-hyun reviewed Nov 5, 2020

For review comments.

9182e3d

cloud-fan reviewed Nov 10, 2020

For review comments.

b415728

cloud-fan approved these changes Nov 10, 2020

viirya closed this in 6fa80ed Nov 11, 2020

leoluan2009 reviewed Nov 12, 2020

Kimahriman mentioned this pull request May 19, 2021

[SPARK-35449][SQL] Only extract common expressions from CaseWhen values if elseValue is set #32595

Closed

Kimahriman mentioned this pull request May 24, 2021

[SPARK-35449][SQL][3.1] Only extract common expressions from CaseWhen values if elseValue is set #32651

Closed

cloud-fan reviewed Jun 29, 2021

cloud-fan mentioned this pull request Jun 29, 2021

[SPARK-35940][SQL] Refactor EquivalentExpressions to make it more efficient #33142

Closed

viirya deleted the SPARK-33337 branch December 27, 2023 18:28
