-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-35410][SQL] SubExpr elimination should not include redundant children exprs in conditional expression #32559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #138585 has finished for PR 32559 at commit
|
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
|
Test build #138591 has finished for PR 32559 at commit
|
| } | ||
|
|
||
| commonExprSet.foreach(expr => addFunc(expr.e)) | ||
| // Not all expressions in the set should be added. We should filter out the subexprs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to revise line 83 consistently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, revised the method comment. Thanks.
|
Test build #138603 has finished for PR 32559 at commit
|
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
|
BTW, |
|
|
|
Also it would be better to wait for Takeshi and Wenchen's review. |
|
Thanks @dongjoon-hyun! Yea, just rebased to the master branch. I will leave this open to wait for the review from @maropu and @cloud-fan. |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #138609 has finished for PR 32559 at commit
|
maropu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a bug? It looks a kind of improvements to me.
| * the common expression `(c + 1)` will be added into `equivalenceMap`. Note that if an | ||
| * expression and its child expressions are all commonly occurred in each of given expressions, | ||
| * we filter out the child expressions. For example, if `((a + b) + c)` and `(a + b)` are | ||
| * common expressions, we only add `((a + b) + c)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the redundant children expressions are counted as common expressions too, they will be redundantly evaluated and miss the subexpression elimination opportunity.
Could you leave comments here about why we need to filter out these exprs here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a question; even if we filter out the redundant expr (e.g., (a + b) in this case) here, the suboptimal (this PR pointed out) case still can happen if the expr, (a + b), is added as a common one in the other part? I thought a query like this: Seq((1, 1, 1)).toDF("a", "b", "c").select(when($"a" + $"b" + $"c" > 0, $"a" + $"b" + $"c").when($"a" + $"b" + $"c" <= 0, $"a" + $"b")).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The so called common expressions must occur at all branches/values. So in the above case, (a + b) is actually the only one common expression among two values $"a" + $"b" + $"c and $"a" + $"b".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the comment.
| val commonExprSet = candidateExprs.filter { candidateExpr => | ||
| candidateExprs.forall { expr => | ||
| expr == candidateExpr || expr.e.find(_.semanticEquals(candidateExpr.e)).isEmpty | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this loop not expensive? It seems the time-complexity is big-O(the total number of expr nodes in candidateExprs) x (candidateExprs.size)^2 )?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, but I don't have a better idea now...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I considered this part but didn't come out better one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, okay. I don't have a idea, too... That was just a question.
| } | ||
|
|
||
| test("SPARK-35410: SubExpr elimination should not include redundant child exprs " + | ||
| "for conditional expressions") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this only a problem for conditional expression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far the only one I can think about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found a non-conditional example that still is an issue even with this update (a bit contrived, but I'm sure there's a real use case)
val myUdf = udf(() => {
println("In UDF")
1
}).withName("myUdf")
spark.range(1).withColumn("a", myUdf()).select(($"a" + $"a") / ($"a" + $"a")).show()
This generates subexpressions myUdf() and (myUdf() + myUdf()), even though only the second one is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Kimahriman. I see. Let me also look at it. As it is non-conditional case, but looks like the similar case. Let me see if it can be solved similarly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I figured out. This might be an issue since we have sub-expr elimination. We also need to remove redundant children exprs for non-conditional cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the fix might be different. I will work on it locally and submit another fix for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any more thoughts on this? Was the subexpr sorting supposed to address this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might need another fix. I'm working on it and will submit it after these PRs merged.
You can consider it as an improvement, yea. Although from user perspective, it is somehow hard to distinguish them clearly. |
I was just wondering if we need to backport this fix or not. I think the update of CSE-related code can affect the performance of user's queries easily (e.g., , performance penalties caused by the expensive loop), so IMO it's safe to merge it into master only. |
|
It is a bit of a performance regression in certain cases so that seems like a bug. We have heavily chained expressions in when clauses and I suspect (but haven't been able to prove yet because of the complexity) it's causing us some issues. |
|
I did actually hit a bug today where the when value was being evaluated even though the condition was false. I wasn't able to find the exact root cause yet but turning off subexpression elimination fixed the issue. It was basically |
|
Kubernetes integration test starting |
Okay. Could you submit the bug fix as a separate PR? For the other idea, it is another improvement and it is better not to mix them together. |
|
@maropu @cloud-fan Do you have other comments on this change? Thanks. |
@Kimahriman Created a JIRA for the elseValue issue: https://issues.apache.org/jira/browse/SPARK-35449 |
|
Oh, BTW, I think SPARK-35449 is actually the bug you hit. This could be seen as an improvement as @maropu suggested. |
Yeah I think that's correct. Though I checked one of my queries and it generated 34 subexpressions and only used one of them. So depends if you consider that a bug or improvement hah |
| accum.add(1) | ||
| s | ||
| }) | ||
| val df1 = spark.range(5).select(when(functions.length(simpleUDF($"id")) > 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the fix for https://issues.apache.org/jira/browse/SPARK-35449 will break this, since it's really a "bug" that the case value is included in subexpression resolution without an else value. Not a huge deal, I can try to fix in my follow up once this is merged
|
Both are addressing corner cases for |
| private def addCommonExprs( | ||
| exprs: Seq[Expression], | ||
| addFunc: Expression => Boolean = addExpr): Unit = { | ||
| val exprSetForAll = mutable.Set[Expr]() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One potentially unrelated thing I just noticed, do we need to keep track of all of the Expressions here as well (as in an Expr -> Seq[Expression] map)? It's really basically keeping the first Expression found, but the codegen looks like it uses the Expression hash (versus the semantic hash) to lookup subexpressions. Very much an edge case, just wondering if I'm understanding things correctly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean equivalenceMap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mean add it directly to that here. I'm just thinking of a really stupid example, when((col + 1) > 0, col + 1).otherwise(1 + col). Wouldn't col + 1 and 1 + col resolve as a common expression because they're semantically equal, but only col + 1 is added to equivalenceMap, so during codegen 1 + col wouldn't be resolved to the subexpression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
col + 1 and 1 + col will both be recognized as subexpression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah but won't the codgen stage not replace 1 + col since only col + 1 will be added to the equivalenceMap entry for Expr(col + 1)? For non commonExprs cases, both would be in equivalenceMap so that the codegen stage maps both of those expressions to the resulting subexpression. Again, not super related to this PR, but was the easiest place to ask
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both 1 + col and col + 1 will be replaced with the extracted subexpression during codege. We don't just look of key at equivalenceMap when replacing with subexpression.
Yea, sure. I agree. |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #138813 has finished for PR 32559 at commit
|
|
#32586 was merged. Can we look at this if it is good to go? Thanks. cc @cloud-fan @dongjoon-hyun @maropu |
maropu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, this improvement looks fine to me.
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
|
Could you make a backport to branch-3.1, @viirya ? There was a conflict on it. |
|
Sure. Thanks @dongjoon-hyun @maropu @cloud-fan @Kimahriman |
|
Ah, as this could be considered as an improvement (#32559 (review), #32559 (comment), ), we can just have it merged to master only. |
|
Got it! |
…hildren exprs in conditional expression This patch fixes a bug when dealing with common expressions in conditional expressions such as `CaseWhen` during subexpression elimination. For example, previously we find common expressions among conditions of `CaseWhen`, but children expressions are also counted into. We should not count these children expressions as common expressions. If the redundant children expressions are counted as common expressions too, they will be redundantly evaluated and miss the subexpression elimination opportunity. No Added tests. Closes apache#32559 from viirya/SPARK-35410. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
This patch fixes an issue when dealing with common expressions in conditional expressions such as
CaseWhenduring subexpression elimination.For example, previously we find common expressions among conditions of
CaseWhen, but children expressions are also counted into. We should not count these children expressions as common expressions.Why are the changes needed?
If the redundant children expressions are counted as common expressions too, they will be redundantly evaluated and miss the subexpression elimination opportunity.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added tests.