@viirya
Member

@viirya viirya commented May 16, 2021

What changes were proposed in this pull request?

This patch fixes an issue when dealing with common expressions in conditional expressions such as CaseWhen during subexpression elimination.

For example, we previously found common expressions among the conditions of CaseWhen, but their child expressions were also counted. We should not count these child expressions as common expressions.

Why are the changes needed?

If the redundant child expressions are also counted as common expressions, they are evaluated redundantly and the subexpression elimination opportunity is missed.
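As a rough illustration of the idea (not the code in this patch), the filtering step can be sketched on a toy expression tree. The `Expr`, `Attr`, and `Add` types and the `dropRedundantChildren` helper below are hypothetical stand-ins, not Spark classes; the sketch assumes plain structural equality in place of semantic equality.

```scala
object ChildFilterSketch {
  // Toy expression tree; illustrative only, not Spark's Expression hierarchy.
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // True if `target` occurs anywhere inside `root`, including `root` itself.
  def contains(root: Expr, target: Expr): Boolean =
    root == target || (root match {
      case Add(l, r) => contains(l, target) || contains(r, target)
      case _         => false
    })

  // Keep a candidate only if no *other* candidate contains it.
  def dropRedundantChildren(candidates: Set[Expr]): Set[Expr] =
    candidates.filter { c =>
      candidates.forall(other => other == c || !contains(other, c))
    }

  val aPlusB: Expr = Add(Attr("a"), Attr("b"))
  val aPlusBPlusC: Expr = Add(aPlusB, Attr("c"))
}
```

With `(a + b)` and `((a + b) + c)` both in the candidate set, only `((a + b) + c)` survives, since evaluating it already covers `(a + b)`.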

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added tests.

@SparkQA

SparkQA commented May 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43106/

@SparkQA

SparkQA commented May 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43106/

@SparkQA

SparkQA commented May 16, 2021

Test build #138585 has finished for PR 32559 at commit 4111a04.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented May 16, 2021

cc @maropu @cloud-fan

@github-actions github-actions bot added the SQL label May 16, 2021
@SparkQA

SparkQA commented May 16, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43112/

@SparkQA

SparkQA commented May 16, 2021

Test build #138591 has finished for PR 32559 at commit ddb911e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

commonExprSet.foreach(expr => addFunc(expr.e))
// Not all expressions in the set should be added. We should filter out the subexprs.
Member

Do we need to revise line 83 consistently?

Member Author

Yea, revised the method comment. Thanks.

@SparkQA

SparkQA commented May 17, 2021

Test build #138603 has finished for PR 32559 at commit 01a8c02.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@SparkQA

SparkQA commented May 17, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43124/

@dongjoon-hyun
Member

BTW, master branch is currently broken at compilation.

@dongjoon-hyun
Member

master branch compilation is recovered. Could you rebase to the master branch, @viirya ?

@dongjoon-hyun
Member

Also it would be better to wait for Takeshi and Wenchen's review.

@viirya
Member Author

viirya commented May 17, 2021

Thanks @dongjoon-hyun! Yea, just rebased to the master branch. I will leave this open to wait for the review from @maropu and @cloud-fan.

@SparkQA

SparkQA commented May 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43130/

@SparkQA

SparkQA commented May 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43130/

@SparkQA

SparkQA commented May 17, 2021

Test build #138609 has finished for PR 32559 at commit d062001.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@maropu maropu left a comment

Is this a bug? It looks like a kind of improvement to me.

* the common expression `(c + 1)` will be added into `equivalenceMap`. Note that if an
* expression and its child expressions are all commonly occurred in each of given expressions,
* we filter out the child expressions. For example, if `((a + b) + c)` and `(a + b)` are
* common expressions, we only add `((a + b) + c)`.
Member

If the redundant children expressions are counted as common expressions too, they will be redundantly evaluated and miss the subexpression elimination opportunity.

Could you leave comments here about why we need to filter out these exprs here?

Member

Just a question: even if we filter out the redundant expr here (e.g., (a + b) in this case), can the suboptimal case this PR points out still happen if the expr (a + b) is added as a common one elsewhere? I was thinking of a query like this: Seq((1, 1, 1)).toDF("a", "b", "c").select(when($"a" + $"b" + $"c" > 0, $"a" + $"b" + $"c").when($"a" + $"b" + $"c" <= 0, $"a" + $"b")).

Member Author

The so-called common expressions must occur in all branches/values. So in the above case, (a + b) is actually the only common expression among the two values $"a" + $"b" + $"c" and $"a" + $"b".
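To make the "must occur in all branches" rule concrete, here is a minimal sketch on a toy tree. The types and the `common` helper are hypothetical, leaf attributes are ignored for simplicity, and structural equality stands in for semantic equality; it is not how Spark implements this.

```scala
object CommonAcrossBranches {
  // Toy expression tree; illustrative only, not Spark's Expression hierarchy.
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // All non-leaf subtrees of an expression, including the expression itself.
  def nonLeafSubtrees(e: Expr): Set[Expr] = e match {
    case Add(l, r) => nonLeafSubtrees(l) ++ nonLeafSubtrees(r) + e
    case _: Attr   => Set.empty
  }

  // An expression is "common" only if it appears in every branch.
  // Assumes `branches` is non-empty.
  def common(branches: Seq[Expr]): Set[Expr] =
    branches.map(nonLeafSubtrees).reduce(_ intersect _)

  val ab: Expr = Add(Attr("a"), Attr("b"))
  val abc: Expr = Add(ab, Attr("c"))
}
```

For the two values `a + b + c` and `a + b`, the intersection contains only `a + b`, because `a + b + c` does not appear in the second value.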

Member Author

Updated the comment.

val commonExprSet = candidateExprs.filter { candidateExpr =>
  candidateExprs.forall { expr =>
    expr == candidateExpr || expr.e.find(_.semanticEquals(candidateExpr.e)).isEmpty
  }
Member

Isn't this loop expensive? It seems the time complexity is O((the total number of expr nodes in candidateExprs) x (candidateExprs.size)^2)?

Contributor

+1, but I don't have a better idea now...

Member Author

Yea, I considered this part but couldn't come up with a better one.

Member

@maropu maropu May 17, 2021

Yea, okay. I don't have a better idea either... That was just a question.
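The cost discussed in this thread can be made concrete with a small counting sketch. It reproduces the filter/forall/find shape on a hypothetical toy tree and counts only the node visits made by the containment search, assuming each node comparison is constant-time (Spark's semanticEquals compares canonicalized trees, so the real cost per visit may be higher).

```scala
object FilterCostSketch {
  // Toy expression tree; illustrative only, not Spark's Expression hierarchy.
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Runs the nested filter and returns the surviving candidates together with
  // the number of nodes visited by the containment search.
  def run(candidates: Seq[Expr]): (Seq[Expr], Long) = {
    var visits = 0L
    def find(root: Expr, target: Expr): Boolean = {
      visits += 1
      root == target || (root match {
        case Add(l, r) => find(l, target) || find(r, target)
        case _         => false
      })
    }
    val kept = candidates.filter { c =>
      candidates.forall(other => other == c || !find(other, c))
    }
    (kept, visits)
  }

  // Three unrelated candidates with three nodes each: the worst case for this
  // input size, since no search can stop early.
  val disjoint: Seq[Expr] =
    (0 until 3).map(i => Add(Attr(s"a$i"), Attr(s"b$i")))
}
```

With n candidates of m nodes each and no containment between them, the sketch performs n * (n - 1) * m visits, which is 18 for the three candidates above.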

}

test("SPARK-35410: SubExpr elimination should not include redundant child exprs " +
"for conditional expressions") {
Contributor

Is this only a problem for conditional expressions?

Member Author

So far it is the only one I can think of.

Contributor

Found a non-conditional example that still is an issue even with this update (a bit contrived, but I'm sure there's a real use case)

val myUdf = udf(() => {
  println("In UDF")
  1
}).withName("myUdf")
spark.range(1).withColumn("a", myUdf()).select(($"a" + $"a") / ($"a" + $"a")).show()

This generates subexpressions myUdf() and (myUdf() + myUdf()), even though only the second one is used.

Member Author

Thanks @Kimahriman. I see. Let me also look at it. It is a non-conditional case, but it looks similar. Let me see if it can be solved in a similar way.

Member Author

Oh, I figured it out. This might have been an issue ever since we have had sub-expr elimination. We also need to remove redundant child exprs for non-conditional cases.

Member Author

But the fix might be different. I will work on it locally and submit another fix for it.

Contributor

Any more thoughts on this? Was the subexpr sorting supposed to address this?

Member Author

It might need another fix. I'm working on it and will submit it after these PRs are merged.

@viirya
Member Author

viirya commented May 17, 2021

Is this a bug? It looks like a kind of improvement to me.

You can consider it as an improvement, yea. Although from a user's perspective, it is somewhat hard to distinguish them clearly.

@maropu
Member

maropu commented May 17, 2021

Is this a bug? It looks like a kind of improvement to me.

You can consider it as an improvement, yea. Although from a user's perspective, it is somewhat hard to distinguish them clearly.

I was just wondering if we need to backport this fix or not. I think an update to CSE-related code can easily affect the performance of users' queries (e.g., performance penalties caused by the expensive loop), so IMO it's safer to merge it into master only.

@Kimahriman
Contributor

It is a bit of a performance regression in certain cases so that seems like a bug. We have heavily chained expressions in when clauses and I suspect (but haven't been able to prove yet because of the complexity) it's causing us some issues.

@Kimahriman
Contributor

I did actually hit a bug today where the when value was being evaluated even though the condition was false. I wasn't able to find the exact root cause yet, but turning off subexpression elimination fixed the issue. It was basically when(col.rlike(...), udf(col)), but more complex on both sides, so somehow the UDF was being evaluated early as a subexpression and failed because its input didn't match the regular expression.
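The failure mode described here can be modeled in plain Scala, without Spark. In this sketch `parseDigits` is a hypothetical stand-in for the UDF, and the two functions contrast a value computed only under its guard with the same value hoisted ahead of the guard; it illustrates the hazard and makes no claim about what Spark's codegen actually emitted.

```scala
object GuardedEvalSketch {
  // Hypothetical stand-in for a UDF that only accepts digit strings.
  def parseDigits(s: String): Int = {
    require(s.nonEmpty && s.forall(_.isDigit), s"not numeric: $s")
    s.toInt
  }

  // Guarded: the value is computed only when the condition holds,
  // like the value of a when(...) branch.
  def guarded(s: String): Option[Int] =
    if (s.matches("\\d+")) Some(parseDigits(s)) else None

  // Hoisted: the value is computed up front, before the condition is checked,
  // so it throws for inputs the guard would have excluded.
  def hoisted(s: String): Option[Int] = {
    val v = parseDigits(s)
    if (s.matches("\\d+")) Some(v) else None
  }
}
```

For an input such as "abc", `guarded` returns None while `hoisted` fails inside `parseDigits`, which is why a value that is only conditionally evaluated should not be pulled out eagerly.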

@SparkQA

SparkQA commented May 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43192/

@viirya
Member Author

viirya commented May 19, 2021

Yeah, the idea was that for things like conditions/values that are only sometimes evaluated, we evaluate them as a subexpression only IF that expression is also used somewhere else that's always evaluated anyway. Right now (assuming the above patch is applied), extra conditions and values that are only sometimes evaluated might not be pulled out as subexpressions even if they could be. So you would never evaluate an expression eagerly if we weren't definitely going to evaluate it at some point. I can try to make a PR to explain what I mean (and fix the bug I mentioned).

Okay. Could you submit the bug fix as a separate PR? For the other idea, it is another improvement and it is better not to mix them together.

@viirya
Member Author

viirya commented May 19, 2021

@maropu @cloud-fan Do you have other comments on this change? Thanks.

@viirya
Member Author

viirya commented May 19, 2021

Okay. Could you submit the bug fix as a separate PR? For the other idea, it is another improvement and it is better not to mix them together.

@Kimahriman Created a JIRA for the elseValue issue: https://issues.apache.org/jira/browse/SPARK-35449

@viirya
Member Author

viirya commented May 19, 2021

Oh, BTW, I think SPARK-35449 is actually the bug you hit. This could be seen as an improvement as @maropu suggested.

@Kimahriman
Contributor

Oh, BTW, I think SPARK-35449 is actually the bug you hit. This could be seen as an improvement as @maropu suggested.

Yeah I think that's correct. Though I checked one of my queries and it generated 34 subexpressions and only used one of them. So depends if you consider that a bug or improvement hah

accum.add(1)
s
})
val df1 = spark.range(5).select(when(functions.length(simpleUDF($"id")) > 0,
Contributor

I think the fix for https://issues.apache.org/jira/browse/SPARK-35449 will break this, since it's really a "bug" that the case value is included in subexpression resolution without an else value. Not a huge deal, I can try to fix in my follow up once this is merged

@dongjoon-hyun
Member

dongjoon-hyun commented May 20, 2021

For this one, we are going to revisit after #32586 to be safe? Did I understand correctly, @viirya ?

@viirya
Member Author

viirya commented May 20, 2021

For this one, we are going to revisit after #32586 to be safe? Did I understand correctly, @viirya ?

I think they are orthogonal improvements and can be merged independently.

@dongjoon-hyun
Member

Both are addressing corner cases for SubExprs. I mean they are touching the same problem domains.

private def addCommonExprs(
    exprs: Seq[Expression],
    addFunc: Expression => Boolean = addExpr): Unit = {
  val exprSetForAll = mutable.Set[Expr]()
Contributor

@Kimahriman Kimahriman May 20, 2021

One potentially unrelated thing I just noticed: do we need to keep track of all of the Expressions here as well (as in an Expr -> Seq[Expression] map)? It basically keeps only the first Expression found, but the codegen looks like it uses the Expression hash (versus the semantic hash) to look up subexpressions. Very much an edge case, just wondering if I'm understanding things correctly.

Member Author

You mean equivalenceMap?

Contributor

I don't mean add it directly to that here. I'm just thinking of a really stupid example, when((col + 1) > 0, col + 1).otherwise(1 + col). Wouldn't col + 1 and 1 + col resolve as a common expression because they're semantically equal, but only col + 1 is added to equivalenceMap, so during codegen 1 + col wouldn't be resolved to the subexpression?

Member Author

col + 1 and 1 + col will both be recognized as subexpressions.

Contributor

Yeah, but won't the codegen stage fail to replace 1 + col, since only col + 1 will be added to the equivalenceMap entry for Expr(col + 1)? For non-commonExprs cases, both would be in equivalenceMap so that the codegen stage maps both of those expressions to the resulting subexpression. Again, not super related to this PR, but this was the easiest place to ask.

Member Author

Both 1 + col and col + 1 will be replaced with the extracted subexpression during codegen. We don't just look up the key in equivalenceMap when replacing with the subexpression.
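The distinction in this thread between a plain hash and a semantic hash can be sketched with a wrapper key whose equality is based on a canonical form. Everything below (`canonical`, `SemanticKey`, `group`, and the toy tree) is hypothetical and only models the idea of an Expr -> Seq[Expression] grouping; it is not Spark's equivalenceMap or its canonicalization rules.

```scala
object SemanticKeySketch {
  // Toy expression tree; illustrative only, not Spark's Expression hierarchy.
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Canonical form: order the operands of a commutative Add deterministically.
  def canonical(e: Expr): Expr = e match {
    case Add(l, r) =>
      val ops = Seq(canonical(l), canonical(r)).sortBy(_.toString)
      Add(ops(0), ops(1))
    case other => other
  }

  // Wrapper whose equals/hashCode use the canonical form, so syntactically
  // different but equivalent expressions land on the same map key.
  final class SemanticKey(val e: Expr) {
    private val canon = canonical(e)
    override def equals(o: Any): Boolean = o match {
      case that: SemanticKey => canon == that.canon
      case _                 => false
    }
    override def hashCode: Int = canon.hashCode
  }

  // Groups expressions by semantic key, keeping every syntactic variant.
  def group(exprs: Seq[Expr]): Map[SemanticKey, Seq[Expr]] =
    exprs.groupBy(e => new SemanticKey(e))
}
```

Grouping `col + 1` and `1 + col` this way yields a single key holding both variants, so a later lookup through the semantic key finds either form.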

@viirya
Member Author

viirya commented May 20, 2021

Both are addressing corner cases for SubExprs. I mean they are touching the same problem domains.

Yea, sure. I agree.

@SparkQA

SparkQA commented May 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43335/

@SparkQA

SparkQA commented May 21, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43335/

@SparkQA

SparkQA commented May 21, 2021

Test build #138813 has finished for PR 32559 at commit 9973c1a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class DeferFetchRequestResult(fetchRequest: FetchRequest) extends FetchResult
  • class DataTypeOps(object, metaclass=ABCMeta):
  • """The base class for binary operations of pandas-on-Spark objects (of different data types)."""
  • class BooleanOps(DataTypeOps):
  • class CategoricalOps(DataTypeOps):
  • class DateOps(DataTypeOps):
  • class DatetimeOps(DataTypeOps):
  • class NumericOps(DataTypeOps):
  • class IntegralOps(NumericOps):
  • class FractionalOps(NumericOps):
  • class StringOps(DataTypeOps):
  • case class ReferenceEqualPlanWrapper(plan: LogicalPlan)
  • class ExpressionContainmentOrdering extends Ordering[Expression]
  • new RuntimeException(s"class $clsName has unexpected serializer: $objSerializer")
  • case class UpdatingSessionsExec(
  • class UpdatingSessionsIterator(

@viirya
Member Author

viirya commented May 22, 2021

#32586 was merged. Can we look at this if it is good to go? Thanks. cc @cloud-fan @dongjoon-hyun @maropu

Member

@maropu maropu left a comment

okay, this improvement looks fine to me.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@dongjoon-hyun
Member

Could you make a backport to branch-3.1, @viirya ? There was a conflict on it.

@viirya
Member Author

viirya commented May 23, 2021

Sure. Thanks @dongjoon-hyun @maropu @cloud-fan @Kimahriman

@viirya
Member Author

viirya commented May 23, 2021

Ah, as this could be considered an improvement (#32559 (review), #32559 (comment)), we can just have it merged to master only.

@dongjoon-hyun
Member

Got it!

Kimahriman pushed a commit to Kimahriman/spark that referenced this pull request Feb 22, 2022
…hildren exprs in conditional expression

This patch fixes a bug when dealing with common expressions in conditional expressions such as `CaseWhen` during subexpression elimination.

For example, we previously found common expressions among the conditions of `CaseWhen`, but their child expressions were also counted. We should not count these child expressions as common expressions.

If the redundant child expressions are also counted as common expressions, they are evaluated redundantly and the subexpression elimination opportunity is missed.

No

Added tests.

Closes apache#32559 from viirya/SPARK-35410.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@viirya viirya deleted the SPARK-35410 branch December 27, 2023 18:25