[SPARK-35410][SQL] SubExpr elimination should not include redundant children exprs in conditional expression #32559

viirya · 2021-05-16T07:48:01Z

What changes were proposed in this pull request?

This patch fixes an issue when dealing with common expressions in conditional expressions such as CaseWhen during subexpression elimination.

For example, previously we found common expressions among the conditions of CaseWhen, but their children expressions were also counted. We should not count these children expressions as common expressions.

Why are the changes needed?

If the redundant children expressions are also counted as common expressions, they are evaluated redundantly and the subexpression elimination opportunity is missed.
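A minimal sketch of the idea, using a toy expression tree rather than Spark's `Expression` and `EquivalentExpressions` classes. All names here (`RedundantChildSketch`, `commonExprs`, `dropRedundantChildren`) are hypothetical and only illustrate why a child such as `(a + b)` is redundant once its parent `((a + b) + c)` is already a common expression.

```scala
object RedundantChildSketch {
  // Toy expression tree; this is NOT Spark's Expression class.
  sealed trait Expr
  case class Col(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr

  // All nodes of the tree rooted at `e`, including `e` itself.
  def nodes(e: Expr): Seq[Expr] = e match {
    case Add(l, r) => e +: (nodes(l) ++ nodes(r))
    case _         => Seq(e)
  }

  // Non-leaf expressions that occur in every one of the given expressions.
  def commonExprs(exprs: Seq[Expr]): Set[Expr] =
    exprs.map(e => nodes(e).filter(_.isInstanceOf[Add]).toSet).reduce(_ intersect _)

  // Keep a candidate only if no other candidate contains it as a descendant.
  def dropRedundantChildren(candidates: Set[Expr]): Set[Expr] =
    candidates.filter { c =>
      candidates.forall(other => other == c || !nodes(other).contains(c))
    }

  val ab: Expr  = Add(Col("a"), Col("b"))
  val abc: Expr = Add(ab, Col("c"))

  // Two branch conditions that both contain ((a + b) + c).
  val conditions: Seq[Expr] = Seq(Add(abc, Col("x")), Add(abc, Col("y")))

  // Without filtering, both (a + b) and ((a + b) + c) are reported as common.
  val before: Set[Expr] = commonExprs(conditions)
  // With filtering, only the outermost common expression remains.
  val after: Set[Expr] = dropRedundantChildren(before)
}
```

In this toy model `before` contains both `ab` and `abc`, while `after` contains only `abc`, which matches the behavior the description asks for.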

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added tests.

SparkQA · 2021-05-16T09:21:25Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43106/

SparkQA · 2021-05-16T09:21:26Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43106/

SparkQA · 2021-05-16T13:06:32Z

Test build #138585 has finished for PR 32559 at commit 4111a04.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2021-05-16T16:20:03Z

cc @maropu @cloud-fan

SparkQA · 2021-05-16T18:19:39Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43112/

SparkQA · 2021-05-16T21:55:27Z

Test build #138591 has finished for PR 32559 at commit ddb911e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2021-05-17T04:30:34Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

    }

-    commonExprSet.foreach(expr => addFunc(expr.e))
+    // Not all expressions in the set should be added. We should filter out the subexprs.


Do we need to revise line 83 consistently?

Yea, revised the method comment. Thanks.

SparkQA · 2021-05-17T04:57:56Z

Test build #138603 has finished for PR 32559 at commit 01a8c02.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM.

SparkQA · 2021-05-17T05:02:57Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43124/

dongjoon-hyun · 2021-05-17T05:03:13Z

BTW, master branch is currently broken at compilation.

dongjoon-hyun · 2021-05-17T05:15:08Z

master branch compilation is recovered. Could you rebase to the master branch, @viirya ?

dongjoon-hyun · 2021-05-17T05:16:14Z

Also it would be better to wait for Takeshi and Wenchen's review.

viirya · 2021-05-17T06:24:42Z

Thanks @dongjoon-hyun! Yea, just rebased to the master branch. I will leave this open to wait for the review from @maropu and @cloud-fan.

SparkQA · 2021-05-17T07:31:13Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43130/

SparkQA · 2021-05-17T07:31:14Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43130/

SparkQA · 2021-05-17T10:55:21Z

Test build #138609 has finished for PR 32559 at commit d062001.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu

Is this a bug? It looks like a kind of improvement to me.

maropu · 2021-05-17T07:23:54Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+   * the common expression `(c + 1)` will be added into `equivalenceMap`. Note that if an
+   * expression and its child expressions are all commonly occurred in each of given expressions,
+   * we filter out the child expressions. For example, if `((a + b) + c)` and `(a + b)` are
+   * common expressions, we only add `((a + b) + c)`.


If the redundant children expressions are also counted as common expressions, they are evaluated redundantly and the subexpression elimination opportunity is missed.

Could you leave comments here about why we need to filter out these exprs here?

Just a question: even if we filter out the redundant expr (e.g., (a + b) in this case) here, can the suboptimal case this PR points out still happen if the expr (a + b) is added as a common one in another part? I was thinking of a query like this: Seq((1, 1, 1)).toDF("a", "b", "c").select(when($"a" + $"b" + $"c" > 0, $"a" + $"b" + $"c").when($"a" + $"b" + $"c" <= 0, $"a" + $"b")).

The so-called common expressions must occur in all branches/values. So in the above case, (a + b) is actually the only common expression among the two values $"a" + $"b" + $"c" and $"a" + $"b".
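The reply above can be checked on a toy model. The sketch below is an illustration only (it does not use Spark, and `CommonAcrossValuesSketch` and `subExprs` are hypothetical names): it intersects the non-leaf sub-expressions of the two branch values from the query.

```scala
object CommonAcrossValuesSketch {
  // Toy expression tree; this is NOT Spark's Expression class.
  sealed trait Expr
  case class Col(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr

  // Non-leaf sub-expressions of `e`, including `e` itself when it is an Add.
  def subExprs(e: Expr): Set[Expr] = e match {
    case Add(l, r) => subExprs(l) ++ subExprs(r) + e
    case _         => Set.empty
  }

  val ab: Expr  = Add(Col("a"), Col("b"))
  val abc: Expr = Add(ab, Col("c"))

  // The two branch values from the query: a + b + c and a + b.
  val values: Seq[Expr] = Seq(abc, ab)

  // An expression is "common" only if it appears in every value.
  val common: Set[Expr] = values.map(subExprs).reduce(_ intersect _)
}
```

Here `common` holds only `ab`: `abc` is not a sub-expression of the second value, so there is no parent left that could make `(a + b)` redundant.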

Updated the comment.

maropu · 2021-05-17T12:31:59Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+    val commonExprSet = candidateExprs.filter { candidateExpr =>
+      candidateExprs.forall { expr =>
+        expr == candidateExpr || expr.e.find(_.semanticEquals(candidateExpr.e)).isEmpty
+      }


Isn't this loop expensive? It seems the time complexity is O((the total number of expr nodes in candidateExprs) x (candidateExprs.size)^2)?

+1, but I don't have a better idea now...

Yea, I considered this part but didn't come up with a better one.

Yea, okay. I don't have an idea either... That was just a question.
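To get a feel for the cost being discussed, here is a rough sketch that runs the same shape of pairwise filter on a toy tree and counts node comparisons. It is an assumption-laden model, not a benchmark of Spark: `FilterCostSketch`, `filterWithCost`, and `chain` are hypothetical helpers, and plain `==` stands in for `semanticEquals`.

```scala
object FilterCostSketch {
  // Toy expression tree; this is NOT Spark's Expression class.
  sealed trait Expr
  case class Col(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr

  // All nodes of the tree rooted at `e`, including `e` itself.
  def nodes(e: Expr): Seq[Expr] = e match {
    case Add(l, r) => e +: (nodes(l) ++ nodes(r))
    case _         => Seq(e)
  }

  // Runs the pairwise filter and returns (kept candidates, node comparisons made).
  def filterWithCost(candidates: Seq[Expr]): (Seq[Expr], Long) = {
    var comparisons = 0L
    val kept = candidates.filter { c =>
      candidates.forall { other =>
        other == c || !nodes(other).exists { n => comparisons += 1; n == c }
      }
    }
    (kept, comparisons)
  }

  // A left-deep chain of candidates: a0 + a1, (a0 + a1) + a2, ...
  def chain(n: Int): Seq[Expr] =
    (1 to n).scanLeft(Col("a0"): Expr)((acc, i) => Add(acc, Col(s"a$i"))).tail
}
```

Calling `FilterCostSketch.filterWithCost(FilterCostSketch.chain(n))` for growing `n` keeps only the outermost expression and shows the comparison count rising with both the number of candidates and their size. Because `forall` and `exists` short-circuit, the count is below the worst-case bound mentioned in the question.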

cloud-fan · 2021-05-17T14:37:17Z

...src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala

  }
+
+  test("SPARK-35410: SubExpr elimination should not include redundant child exprs " +
+    "for conditional expressions") {


is this only a problem for conditional expression?

So far it is the only one I can think of.

Found a non-conditional example that still is an issue even with this update (a bit contrived, but I'm sure there's a real use case)

    val myUdf = udf(() => {
      println("In UDF")
      1
    }).withName("myUdf")
    spark.range(1).withColumn("a", myUdf()).select(($"a" + $"a") / ($"a" + $"a")).show()

This generates subexpressions myUdf() and (myUdf() + myUdf()), even though only the second one is used.
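The observation can be reproduced on a toy tree without Spark. The following is a sketch under the assumption that a naive pass treats every expression occurring more than once as a candidate; `NestedSubExprSketch`, `occurrences`, and the `Udf` node are hypothetical stand-ins, not Spark classes.

```scala
object NestedSubExprSketch {
  // Toy expression tree; this is NOT Spark's Expression class.
  sealed trait Expr
  case class Udf(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr
  case class Div(left: Expr, right: Expr) extends Expr

  // All nodes of the tree rooted at `e`, including `e` itself.
  def nodes(e: Expr): Seq[Expr] = e match {
    case Add(l, r) => e +: (nodes(l) ++ nodes(r))
    case Div(l, r) => e +: (nodes(l) ++ nodes(r))
    case _         => Seq(e)
  }

  // How many times each distinct node occurs in the tree.
  def occurrences(e: Expr): Map[Expr, Int] =
    nodes(e).groupBy(identity).map { case (k, v) => k -> v.size }

  val a: Expr    = Udf("myUdf")
  val sum: Expr  = Add(a, a)
  val root: Expr = Div(sum, sum) // (a + a) / (a + a)

  // A naive pass treats everything that occurs more than once as a candidate.
  val candidates: Set[Expr] = occurrences(root).filter(_._2 > 1).keySet
}
```

Both `a` and `sum` end up in `candidates`, yet once `sum` is extracted, `a` is only ever referenced from inside it, which is the redundancy described above.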

Thanks @Kimahriman. I see. Let me also look at it. It is a non-conditional case, but it looks similar. Let me see if it can be solved similarly.

Oh, I figured it out. This might have been an issue ever since we have had sub-expr elimination. We also need to remove redundant children exprs for non-conditional cases.

But the fix might be different. I will work on it locally and submit another fix for it.

Any more thoughts on this? Was the subexpr sorting supposed to address this?

It might need another fix. I'm working on it and will submit it after these PRs are merged.

viirya · 2021-05-17T16:23:26Z

Is this a bug? It looks like a kind of improvement to me.

You can consider it an improvement, yea. Although from the user's perspective, it is somewhat hard to distinguish the two clearly.

maropu · 2021-05-17T23:55:45Z

Is this a bug? It looks like a kind of improvement to me.

You can consider it an improvement, yea. Although from the user's perspective, it is somewhat hard to distinguish the two clearly.

I was just wondering if we need to backport this fix or not. I think updates to CSE-related code can easily affect the performance of users' queries (e.g., performance penalties caused by the expensive loop), so IMO it's safe to merge it into master only.

Kimahriman · 2021-05-18T00:28:38Z

It is a bit of a performance regression in certain cases so that seems like a bug. We have heavily chained expressions in when clauses and I suspect (but haven't been able to prove yet because of the complexity) it's causing us some issues.

Kimahriman · 2021-05-18T00:33:28Z

I did actually hit a bug today where the when value was being evaluated even though the condition was false. I wasn't able to find the exact root cause yet, but turning off subexpression elimination fixed the issue. It was basically when(col.rlike(...), udf(col)), but more complex on both sides, so somehow the UDF was getting subexpression eval'd early and failed because the input didn't match the regular expression.
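The symptom described here, a guarded value being evaluated before its guard, can be illustrated in plain Scala. This is only a sketch of the failure mode, not a diagnosis of the actual Spark behavior; `strictUdf`, `lazyWhen`, and `hoistedWhen` are hypothetical names.

```scala
object EagerEvalSketch {
  // Hypothetical stand-in for a UDF that only accepts numeric input;
  // it throws NumberFormatException for anything else.
  def strictUdf(s: String): Int = s.toInt

  // Lazy form: the value is computed only when the condition holds,
  // which is what a user expects from when(cond, value).
  def lazyWhen(s: String): Option[Int] =
    if (s.matches("\\d+")) Some(strictUdf(s)) else None

  // Hoisted form: the value is computed up front, as if it had been pulled
  // out as a subexpression, so the guard no longer protects the call.
  def hoistedWhen(s: String): Option[Int] = {
    val hoisted = strictUdf(s) // evaluated unconditionally
    if (s.matches("\\d+")) Some(hoisted) else None
  }
}
```

For the input `"abc"`, `lazyWhen` returns `None`, while `hoistedWhen` throws before the condition is ever checked. For `"42"` both return `Some(42)`.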

SparkQA · 2021-05-18T08:44:59Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43192/

viirya · 2021-05-19T17:44:59Z

Yeah, the idea was that things like conditions/values that are only sometimes evaluated should be evaluated as a subexpression IF that expression is used somewhere else that's always evaluated anyway. Right now (assuming the above patch is applied), extra conditions and values that are only sometimes evaluated might not be pulled out as subexpressions even if they could be. So you would never evaluate an expression eagerly if we weren't definitely going to evaluate it at some point. I can try to make a PR to explain what I mean (and fix the bug I mentioned).

Okay. Could you submit the bug fix as a separate PR? For the other idea, it is another improvement and it is better not to mix them together.

viirya · 2021-05-19T17:46:29Z

@maropu @cloud-fan Do you have other comments on this change? Thanks.

viirya · 2021-05-19T19:40:15Z

Okay. Could you submit the bug fix as a separate PR? For the other idea, it is another improvement and it is better not to mix them together.

@Kimahriman Created a JIRA for the elseValue issue: https://issues.apache.org/jira/browse/SPARK-35449

viirya · 2021-05-19T21:02:30Z

Oh, BTW, I think SPARK-35449 is actually the bug you hit. This could be seen as an improvement as @maropu suggested.

Kimahriman · 2021-05-19T21:18:43Z

Oh, BTW, I think SPARK-35449 is actually the bug you hit. This could be seen as an improvement as @maropu suggested.

Yeah I think that's correct. Though I checked one of my queries and it generated 34 subexpressions and only used one of them. So depends if you consider that a bug or improvement hah

Kimahriman · 2021-05-19T23:16:10Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

+      accum.add(1)
+      s
+    })
+    val df1 = spark.range(5).select(when(functions.length(simpleUDF($"id")) > 0,


I think the fix for https://issues.apache.org/jira/browse/SPARK-35449 will break this, since it's really a "bug" that the case value is included in subexpression resolution without an else value. Not a huge deal, I can try to fix in my follow up once this is merged

dongjoon-hyun · 2021-05-20T20:35:33Z

For this one, we are going to revisit after #32586 to be safe? Did I understand correctly, @viirya ?

viirya · 2021-05-20T20:59:07Z

For this one, we are going to revisit after #32586 to be safe? Did I understand correctly, @viirya ?

I think they are orthogonal improvements and can be merged independently.

dongjoon-hyun · 2021-05-20T21:13:28Z

Both are addressing corner cases for SubExprs. I mean they are touching the same problem domains.

Kimahriman · 2021-05-20T21:15:16Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

  private def addCommonExprs(
      exprs: Seq[Expression],
      addFunc: Expression => Boolean = addExpr): Unit = {
    val exprSetForAll = mutable.Set[Expr]()


One potentially unrelated thing I just noticed, do we need to keep track of all of the Expressions here as well (as in an Expr -> Seq[Expression] map)? It's really basically keeping the first Expression found, but the codegen looks like it uses the Expression hash (versus the semantic hash) to lookup subexpressions. Very much an edge case, just wondering if I'm understanding things correctly

You mean equivalenceMap?

I don't mean add it directly to that here. I'm just thinking of a really stupid example, when((col + 1) > 0, col + 1).otherwise(1 + col). Wouldn't col + 1 and 1 + col resolve as a common expression because they're semantically equal, but only col + 1 is added to equivalenceMap, so during codegen 1 + col wouldn't be resolved to the subexpression?

col + 1 and 1 + col will both be recognized as subexpressions.

Yeah, but won't the codegen stage fail to replace 1 + col, since only col + 1 will be added to the equivalenceMap entry for Expr(col + 1)? For non-commonExprs cases, both would be in equivalenceMap so that the codegen stage maps both of those expressions to the resulting subexpression. Again, not super related to this PR, but this was the easiest place to ask.

Both 1 + col and col + 1 will be replaced with the extracted subexpression during codegen. We don't just look up the key in equivalenceMap when replacing with the subexpression.
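One way to picture how both spellings can resolve to the same subexpression is a lookup keyed on a canonical form instead of the original tree. The sketch below is an assumption for illustration only: it does not reproduce Spark's canonicalization or its codegen lookup, and `canonical`, `semanticEquals`, and `lookup` are hypothetical helpers on a toy tree.

```scala
object SemanticEqualsSketch {
  // Toy expression tree; this is NOT Spark's Expression class.
  sealed trait Expr
  case class Col(name: String) extends Expr
  case class Lit(value: Int) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr

  // Canonical form: order the operands of the commutative Add deterministically.
  def canonical(e: Expr): Expr = e match {
    case Add(l, r) =>
      val (cl, cr) = (canonical(l), canonical(r))
      if (cl.toString <= cr.toString) Add(cl, cr) else Add(cr, cl)
    case other => other
  }

  // Two expressions are semantically equal if their canonical forms match.
  def semanticEquals(a: Expr, b: Expr): Boolean = canonical(a) == canonical(b)

  val colPlusOne: Expr = Add(Col("col"), Lit(1))
  val onePlusCol: Expr = Add(Lit(1), Col("col"))

  // Only col + 1 is registered, but the key is its canonical form.
  val subExprs: Map[Expr, String] = Map(canonical(colPlusOne) -> "subExpr_0")

  // Lookups canonicalize first, so either spelling finds the same entry.
  def lookup(e: Expr): Option[String] = subExprs.get(canonical(e))
}
```

With this keying, `lookup(onePlusCol)` and `lookup(colPlusOne)` return the same entry even though the two trees are not structurally equal.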

viirya · 2021-05-20T21:59:26Z

Both are addressing corner cases for SubExprs. I mean they are touching the same problem domains.

Yea, sure. I agree.

SparkQA · 2021-05-21T19:03:34Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43335/

SparkQA · 2021-05-21T19:39:23Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43335/

SparkQA · 2021-05-21T22:25:40Z

Test build #138813 has finished for PR 32559 at commit 9973c1a.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class DeferFetchRequestResult(fetchRequest: FetchRequest) extends FetchResult
class DataTypeOps(object, metaclass=ABCMeta):
"""The base class for binary operations of pandas-on-Spark objects (of different data types)."""
class BooleanOps(DataTypeOps):
class CategoricalOps(DataTypeOps):
class DateOps(DataTypeOps):
class DatetimeOps(DataTypeOps):
class NumericOps(DataTypeOps):
class IntegralOps(NumericOps):
class FractionalOps(NumericOps):
class StringOps(DataTypeOps):
case class ReferenceEqualPlanWrapper(plan: LogicalPlan)
class ExpressionContainmentOrdering extends Ordering[Expression]
new RuntimeException(s"class $clsName has unexpected serializer: $objSerializer")
case class UpdatingSessionsExec(
class UpdatingSessionsIterator(

viirya · 2021-05-22T06:27:24Z

#32586 was merged. Could we check whether this is good to go? Thanks. cc @cloud-fan @dongjoon-hyun @maropu

maropu

okay, this improvement looks fine to me.

dongjoon-hyun

+1, LGTM.

dongjoon-hyun · 2021-05-23T15:26:01Z

Could you make a backport to branch-3.1, @viirya ? There was a conflict on it.

viirya · 2021-05-23T17:04:11Z

Sure. Thanks @dongjoon-hyun @maropu @cloud-fan @Kimahriman

viirya · 2021-05-23T21:55:08Z

Ah, as this could be considered an improvement (#32559 (review), #32559 (comment)), we can just have it merged to master only.

dongjoon-hyun · 2021-05-23T22:18:33Z

Got it!

…hildren exprs in conditional expression

This patch fixes a bug when dealing with common expressions in conditional expressions such as `CaseWhen` during subexpression elimination.

For example, previously we find common expressions among conditions of `CaseWhen`, but children expressions are also counted into. We should not count these children expressions as common expressions.

If the redundant children expressions are counted as common expressions too, they will be redundantly evaluated and miss the subexpression elimination opportunity.

No

Added tests.

Closes apache#32559 from viirya/SPARK-35410.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

SubExpr elimination should not include redundant child exprs.

4111a04

Update test.

ddb911e

github-actions bot added the SQL label May 16, 2021

dongjoon-hyun reviewed May 17, 2021

Revise comment.

01a8c02

dongjoon-hyun approved these changes May 17, 2021

Merge remote-tracking branch 'upstream/master' into SPARK-35410

d062001

maropu reviewed May 17, 2021

cloud-fan reviewed May 17, 2021

Update comment.

4278e70

Kimahriman reviewed May 19, 2021

Kimahriman mentioned this pull request May 20, 2021

[SPARK-35449][SQL] Only extract common expressions from CaseWhen values if elseValue is set #32595

Closed

Kimahriman reviewed May 20, 2021

Merge remote-tracking branch 'upstream/master' into SPARK-35410

9973c1a

maropu approved these changes May 22, 2021

dongjoon-hyun approved these changes May 23, 2021

dongjoon-hyun closed this in 9e1b204 May 23, 2021

Kimahriman mentioned this pull request May 29, 2021

[SPARK-35560][SQL] Remove redundant subexpression evaluation in nested subexpressions #32699

Closed

viirya deleted the SPARK-35410 branch December 27, 2023 18:25
