[SPARK-35449][SQL] Only extract common expressions from CaseWhen values if elseValue is set #32595
ok to test |
viirya
left a comment
looks good. waiting for CI
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #138722 has finished for PR 32595 at commit
|
|
cc @maropu @cloud-fan too |
conditions4 is not used anywhere.
Copy pasta, fixed the below line
|
is it a perf-only issue? |
In most cases, it is. But from what I understand of @Kimahriman's description, it seems that in their usage it could also cause a query error. |
abc6150 to
a8b9e89
Compare
Yeah because of the UDF issue I'd consider it more a bug with performance side-effects. Whether those side-effects are positive or negative largely depends on whether #32559 is merged. Without it, this can increase performance by reducing the cases where you could have unused subexpressions generated. With it, it can decrease performance by not being able to create subexpressions for simple when clauses like |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #138746 has finished for PR 32595 at commit
|
|
@Kimahriman would you mind describing what behaviour change (bug fix) happens in "Does this PR introduce any user-facing change?"? The fix itself makes sense, but it would be great to clarify which bug it fixes too. |
Updated, sorry I never know what to put for that section for bug fixes. |
maropu
left a comment
The fix itself looks fine. I left minor comments. btw, could you fill the description in the jira? It looks empty now: https://issues.apache.org/jira/browse/SPARK-35449
However, for a statement to be in "all branches" of a CaseWhen statement, it must also be in the elseValue.
It's better to leave a simple comment here about why we need to check elseValue.
Added a comment in
Could you put this new test in a new test unit, `test("SPARK-35449: ...")`?
Moved to a new test
a8b9e89 to
e024cec
Compare
I updated the JIRA and rebased off master to get the latest subexpression PRs in. Had to update one test based on these changes. |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
e024cec to
5f203d0
Compare
|
Test build #138849 has finished for PR 32595 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #138850 has finished for PR 32595 at commit
|
|
Thanks all! Merging to master. |
|
This should be fixed in branch-3.1 too. @Kimahriman Can you make a backport PR? Also cc @dongjoon-hyun as he is the release manager of 3.1.2. |
|
@viirya It looks like you forgot to close the JIRA? |
|
Ah, right, @maropu. The backport to 3.1 failed, so the script didn't update it, and I forgot to do it manually too. Let me do it now. Thanks! |
Created #32651 |
…es if elseValue is set

This PR fixes a bug with subexpression elimination for CaseWhen statements. apache#30245 added support for creating subexpressions that are present in all branches of conditional statements. However, for a statement to be in "all branches" of a CaseWhen statement, it must also be in the elseValue.

Fix a bug where a subexpression can be created and run for branches of a conditional that don't pass. This can cause issues especially with a UDF in a branch that gets executed assuming the condition is true.

Yes, fixes a potential bug where a UDF could be eagerly executed even though it might expect to have already passed some form of validation. For example:
```
val col = when($"id" < 0, myUdf($"id"))
spark.range(1).select(when(col > 0, col)).show()
```
`myUdf($"id")` is considered a subexpression and eagerly evaluated, because it is pulled out as a common expression from both executions of the when clause, but if `id >= 0` it should never actually be run.

Updated existing test with new case.

Closes apache#32595 from Kimahriman/bug-case-subexpr-elimination.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
What changes were proposed in this pull request?
This PR fixes a bug with subexpression elimination for CaseWhen statements. #30245 added support for creating subexpressions that are present in all branches of conditional statements. However, for a statement to be in "all branches" of a CaseWhen statement, it must also be in the elseValue.
Why are the changes needed?
Fix a bug where a subexpression can be created and run for branches of a conditional that don't pass. This can cause issues especially with a UDF in a branch that gets executed assuming the condition is true.
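The rule described above can be sketched with a toy model in plain Scala. This is only an illustration under stated assumptions: the `Expr`, `Ref`, `Lit`, `Call` and `CaseWhen` types and the `CommonValueExprs` helper are hypothetical stand-ins written for this sketch, not Spark's Catalyst classes or the code changed in this PR. It shows why branch values can only contribute common subexpressions when an else value exists: without one, there is an execution path on which no value is evaluated at all.

```scala
// Toy expression tree. These types are illustrative only and are NOT Spark's Catalyst classes.
sealed trait Expr
final case class Ref(name: String) extends Expr
final case class Lit(value: Int) extends Expr
final case class Call(fn: String, args: List[Expr]) extends Expr
final case class CaseWhen(branches: List[(Expr, Expr)], elseValue: Option[Expr]) extends Expr

object CommonValueExprs {
  // Collects every Call inside `e`, including `e` itself.
  // Leaves and nested CaseWhen nodes are ignored to keep the sketch small.
  def subExprs(e: Expr): Set[Expr] = e match {
    case c @ Call(_, args) => args.flatMap(a => subExprs(a)).toSet[Expr] + c
    case _                 => Set.empty[Expr]
  }

  // Subexpressions that are evaluated no matter which branch value is selected.
  def commonInValues(cw: CaseWhen): Set[Expr] = cw.elseValue match {
    // Without an else value, a row that matches no condition evaluates no value,
    // so nothing inside the values is guaranteed to run.
    case None => Set.empty[Expr]
    // With an else value, exactly one value always runs, so the intersection is safe.
    case Some(e) =>
      (cw.branches.map(_._2) :+ e).map(v => subExprs(v)).reduce(_ intersect _)
  }
}

object CommonValueExprsDemo {
  def main(args: Array[String]): Unit = {
    val udfCall = Call("myUdf", List(Ref("id")))
    val cond = Call("lt", List(Ref("id"), Lit(0)))
    val noElse = CaseWhen(List((cond, udfCall)), None)
    val withElse = noElse.copy(elseValue = Some(Call("plus", List(udfCall, Lit(1)))))
    println(CommonValueExprs.commonInValues(noElse))
    println(CommonValueExprs.commonInValues(withElse))
  }
}
```

In this model the single-branch `CaseWhen` without an else yields no candidates, while adding an else value that also contains the `myUdf` call makes that call a safe candidate for hoisting.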
Does this PR introduce any user-facing change?
Yes, fixes a potential bug where a UDF could be eagerly executed even though it might expect to have already passed some form of validation. For example:
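The two-line example from the commit message can be expanded into a self-contained program roughly as follows. This is a sketch under assumptions: the body of `myUdf` is not shown anywhere in this PR, so the `require` check here is a hypothetical stand-in for "a UDF that expects its input to have been validated", and the local `SparkSession` setup and object name are likewise illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{udf, when}

object CaseWhenUdfExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-35449-example")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical UDF body: it assumes the caller already checked id < 0.
    val myUdf = udf((id: Long) => {
      require(id < 0, s"expected a negative id, got $id")
      id * 2
    })

    // Same shape as the example in the PR description: `col` is used twice,
    // once in the outer condition and once as the outer value.
    val col = when($"id" < 0, myUdf($"id"))
    spark.range(1).select(when(col > 0, col)).show()

    spark.stop()
  }
}
```

The only row produced by `spark.range(1)` has `id = 0`, so the guard `id < 0` is false. If `myUdf($"id")` is hoisted as a common subexpression and evaluated eagerly, as described for the bug, the `require` in this sketch would be expected to fail; with the fix, the UDF should not be invoked for that row.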
`myUdf($"id")` is considered a subexpression and eagerly evaluated, because it is pulled out as a common expression from both executions of the when clause, but if `id >= 0` it should never actually be run.
How was this patch tested?
Updated existing test with new case.