Skip to content

Conversation

@Kimahriman
Copy link
Contributor

@Kimahriman Kimahriman commented May 19, 2021

What changes were proposed in this pull request?

This PR fixes a bug with subexpression elimination for CaseWhen statements. #30245 added support for creating subexpressions that are present in all branches of conditional statements. However, for a statement to be in "all branches" of a CaseWhen statement, it must also be in the elseValue.

Why are the changes needed?

Fix a bug where a subexpression can be created and run for branches of a conditional that don't pass. This can cause issues especially with a UDF in a branch that gets executed assuming the condition is true.

Does this PR introduce any user-facing change?

Yes, fixes a potential bug where a UDF could be eagerly executed even though it might expect to have already passed some form of validation. For example:

val col = when($"id" < 0, myUdf($"id"))
spark.range(1).select(when(col > 0, col)).show()

myUdf($"id") is considered a subexpression and eagerly evaluated, because it is pulled out as a common expression from both executions of the when clause, but if id >= 0 it should never actually be run.

How was this patch tested?

Updated existing test with new case.

@github-actions github-actions bot added the SQL label May 19, 2021
@Kimahriman
Copy link
Contributor Author

@viirya

@viirya
Copy link
Member

viirya commented May 20, 2021

ok to test

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. waiting for CI

@SparkQA
Copy link

SparkQA commented May 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43245/

@SparkQA
Copy link

SparkQA commented May 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43245/

@SparkQA
Copy link

SparkQA commented May 20, 2021

Test build #138722 has finished for PR 32595 at commit abc6150.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented May 20, 2021

cc @maropu @cloud-fan too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

conditions4 is not used anywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy pasta, fixed the below line

@cloud-fan
Copy link
Contributor

is it a perf-only issue?

@viirya
Copy link
Member

viirya commented May 20, 2021

is it a perf-only issue?

In most cases, it is. But what I understand from @Kimahriman's description, seems in their usage like when(..., udf(col)), udf(col) will be wrongly treated as sub-expr and run in advance. But the udf fails for the input because it doesn't satisfy the condition.

So it could cause query error.

@Kimahriman Kimahriman force-pushed the bug-case-subexpr-elimination branch from abc6150 to a8b9e89 Compare May 20, 2021 10:23
@Kimahriman
Copy link
Contributor Author

Kimahriman commented May 20, 2021

is it a perf-only issue?

Yeah because of the UDF issue I'd consider it more a bug with performance side-effects. Whether those side-effects are positive or negative largely depends on whether #32559 is merged. Without it, this can increase performance by reducing the cases where you could have unused subexpressions generated. With it, it can decrease performance by not being able to create subexpressions for simple when clauses like when(length(col) > 0, col). I'm working on a follow up to address cases like that for a performance improvement

@SparkQA
Copy link

SparkQA commented May 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43268/

@SparkQA
Copy link

SparkQA commented May 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43268/

@SparkQA
Copy link

SparkQA commented May 20, 2021

Test build #138746 has finished for PR 32595 at commit a8b9e89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

@Kimahriman would you mind describing what behaviour change (bug fix) happens in "Does this PR introduce any user-facing change?"? The fix itself looks making sense but it would be great to clarify what bug is fixed by this too.

@Kimahriman
Copy link
Contributor Author

@Kimahriman would you mind describing what behaviour change (bug fix) happens in "Does this PR introduce any user-facing change?"? The fix itself looks making sense but it would be great to clarify what bug is fixed by this too.

Updated, sorry I never know what to put for that section for bug fixes.

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix itself looks fine. I left minor comments. btw, could you fill the description in the jira? It looks empty now: https://issues.apache.org/jira/browse/SPARK-35449

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, for a statement to be in "all branches" of a CaseWhen statement, it must also be in the elseValue.

It's better to leave a simple comment here about why we need to check elseValue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment in

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you put this new test in a new test unit `test("SPARK-35449: ..."?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to a new test

@Kimahriman Kimahriman force-pushed the bug-case-subexpr-elimination branch from a8b9e89 to e024cec Compare May 23, 2021 16:15
@Kimahriman
Copy link
Contributor Author

The fix itself looks fine. I left minor comments. btw, could you fill the description in the jira? It looks empty now: https://issues.apache.org/jira/browse/SPARK-35449

I updated the JIRA and rebased off master to get the latest subexpression PRs in. Had to update one test based on these changes.

@SparkQA
Copy link

SparkQA commented May 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43371/

@SparkQA
Copy link

SparkQA commented May 23, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43371/

@Kimahriman Kimahriman force-pushed the bug-case-subexpr-elimination branch from e024cec to 5f203d0 Compare May 23, 2021 19:13
@SparkQA
Copy link

SparkQA commented May 23, 2021

Test build #138849 has finished for PR 32595 at commit e024cec.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43372/

@SparkQA
Copy link

SparkQA commented May 23, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43372/

@SparkQA
Copy link

SparkQA commented May 23, 2021

Test build #138850 has finished for PR 32595 at commit 5f203d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented May 24, 2021

Thanks all! Merging to master.

@viirya viirya closed this in 6c0c617 May 24, 2021
@viirya
Copy link
Member

viirya commented May 24, 2021

This should to be fixed in branch-3.1 too. @Kimahriman Can you make a backport PR?

Also cc @dongjoon-hyun as he is release manager of 3.1.2.

@maropu
Copy link
Member

maropu commented May 24, 2021

@viirya It looks you forgot to close jira?

@viirya
Copy link
Member

viirya commented May 24, 2021

Ah, right, @maropu, because backport to 3.1 fails, so the script doesn't update it. I forgot to do it manually too. Let me do it now. Thanks!

@Kimahriman
Copy link
Contributor Author

This should to be fixed in branch-3.1 too. @Kimahriman Can you make a backport PR?

Also cc @dongjoon-hyun as he is release manager of 3.1.2.

Created #32651

Kimahriman added a commit to Kimahriman/spark that referenced this pull request Feb 22, 2022
…es if elseValue is set

This PR fixes a bug with subexpression elimination for CaseWhen statements. apache#30245 added support for creating subexpressions that are present in all branches of conditional statements. However, for a statement to be in "all branches" of a CaseWhen statement, it must also be in the elseValue.

Fix a bug where a subexpression can be created and run for branches of a conditional that don't pass. This can cause issues especially with a UDF in a branch that gets executed assuming the condition is true.

Yes, fixes a potential bug where a UDF could be eagerly executed even though it might expect to have already passed some form of validation. For example:
```
val col = when($"id" < 0, myUdf($"id"))
spark.range(1).select(when(col > 0, col)).show()
```

`myUdf($"id")` is considered a subexpression and eagerly evaluated, because it is pulled out as a common expression from both executions of the when clause, but if `id >= 0` it should never actually be run.

Updated existing test with new case.

Closes apache#32595 from Kimahriman/bug-case-subexpr-elimination.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants