Member

@viirya viirya commented Nov 4, 2020

What changes were proposed in this pull request?

Currently we skip subexpression elimination in the branches of conditional expressions, including If, CaseWhen, and Coalesce. However, we can eliminate a subexpression in such branches if it is common to all branches. This patch proposes to support subexpression elimination in the branches of conditional expressions.

Why are the changes needed?

We may miss opportunities for subexpression elimination in the branches of conditional expressions. This kind of subexpression is seen frequently. It may be written manually by users or come from the query optimizer. For example, project collapsing could embed expressions between two Projects and produce a conditional expression like:

CASE WHEN jsonToStruct(json).a = '1' THEN 1.0 WHEN jsonToStruct(json).a = '2' THEN 2.0 ... ELSE 1.2 END

If jsonToStruct(json) is an expensive expression, we currently don't eliminate the duplication and waste time evaluating it repeatedly.
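
A minimal sketch of the idea, using a toy expression tree rather than Spark's Catalyst classes. The `Expr`, `Attr`, `Lit`, and `Call` types and the `commonToAll` helper are hypothetical and exist only for illustration, and structural equality stands in for whatever equivalence check the real code uses. The sketch collects the sub-trees of each condition and intersects them, so only sub-trees present in every condition remain:

```scala
object CommonSubexprSketch {
  // Toy expression tree (hypothetical, not Spark's Catalyst Expression).
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Lit(value: Any) extends Expr
  case class Call(fn: String, args: Seq[Expr]) extends Expr

  // All sub-trees of an expression, including the expression itself.
  def subtrees(e: Expr): Set[Expr] = e match {
    case Call(_, args) => Set[Expr](e) ++ args.flatMap(a => subtrees(a))
    case leaf => Set[Expr](leaf)
  }

  // Non-leaf sub-trees that occur in every one of the given expressions.
  def commonToAll(exprs: Seq[Expr]): Set[Expr] = {
    val shared = exprs.map(subtrees).reduceOption(_ intersect _).getOrElse(Set.empty[Expr])
    shared.filter(_.isInstanceOf[Call])
  }

  val parsed: Expr = Call("jsonToStruct", Seq(Attr("json")))
  val fieldA: Expr = Call("getField", Seq(parsed, Lit("a")))

  // Conditions of: CASE WHEN jsonToStruct(json).a = '1' ... WHEN jsonToStruct(json).a = '2' ...
  val conditions: Seq[Expr] = Seq(
    Call("=", Seq(fieldA, Lit("1"))),
    Call("=", Seq(fieldA, Lit("2"))))

  val shared: Set[Expr] = commonToAll(conditions)
}
```

Here `shared` holds `jsonToStruct(json)` and the field access on it, which are the candidates that could be evaluated once and reused across the conditions.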

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test.

@viirya viirya marked this pull request as draft November 4, 2020 06:24
@viirya viirya marked this pull request as ready for review November 5, 2020 01:50
@viirya viirya changed the title from "[WIP][SPARK-33337][SQL] Support subexpression elimination in branches of conditional expressions" to "[SPARK-33337][SQL] Support subexpression elimination in branches of conditional expressions" Nov 5, 2020
Member Author

viirya commented Nov 5, 2020

cc @cloud-fan @maropu @dongjoon-hyun

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thank you for pinging me, @viirya . I took a look briefly and this looks useful. I'll revisit tomorrow.

Member Author

viirya commented Nov 5, 2020

Thank you @dongjoon-hyun


SparkQA commented Nov 5, 2020

Test build #130627 has finished for PR 30245 at commit cd3776c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
})
exprSetForAll = exprSetForAll.intersect(exprSet)
}
Member

Do we need to handle head and tail separately?

Member Author

For the head expression, we add its underlying expressions to the exprSetForAll set. For the expressions in the tail, we keep only the intersection of exprSetForAll and exprSet.

We could merge the two blocks, but then inside the block we would need to check whether the current expression is the head and apply different logic based on that check.

I prefer the current one since it looks simpler.
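
The two shapes being compared can be sketched on plain sets. This is an illustration under assumptions, not the code from the patch: `intersectAll` and `intersectAllMerged` are hypothetical names, and generic sets stand in for the sets of expressions.

```scala
object HeadTailSketch {
  // Separate handling: seed the set from the head, then intersect with each tail element.
  def intersectAll[A](sets: Seq[Set[A]]): Set[A] =
    if (sets.isEmpty) Set.empty[A]
    else {
      var exprSetForAll: Set[A] = sets.head
      sets.tail.foreach { exprSet =>
        exprSetForAll = exprSetForAll.intersect(exprSet)
      }
      exprSetForAll
    }

  // Merged alternative: a single loop, but every iteration has to check
  // whether it is looking at the first element.
  def intersectAllMerged[A](sets: Seq[Set[A]]): Set[A] = {
    var acc: Option[Set[A]] = None
    sets.foreach { s =>
      acc = acc match {
        case None => Some(s)
        case Some(prev) => Some(prev.intersect(s))
      }
    }
    acc.getOrElse(Set.empty[A])
  }
}
```

Both functions return the same result; the difference is only where the "is this the head?" decision lives.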

childrenToRecurse.foreach(addExprTree)
// For some special expressions we cannot just recurse into all of its children, but we can
// recursively add the common expressions shared between all of its children.
def commonChildrenToRecurse: Seq[Seq[Expression]] = expr match {
Member

nit. Although this is used only here, can we declare this outside of this function as a private method? Currently, addExprTree seems to grow unnecessarily.

Member Author

Ok.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Also, could you add a negative test case in which the expression cannot be eliminated from conditional expressions?

Member Author

viirya commented Nov 5, 2020

Also, could you add a negative test case in which the expression cannot be eliminated from conditional expressions?

I mixed positive and negative test cases. I can add some comments to explain them.


SparkQA commented Nov 6, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35315/


SparkQA commented Nov 6, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35315/

@cloud-fan
Contributor

Do you still remember why "subexpression elimination" must be eagerly executed? Because implementing "lazy" is expensive?


SparkQA commented Nov 6, 2020

Test build #130705 has finished for PR 30245 at commit 9182e3d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

viirya commented Nov 6, 2020

Do you still remember why "subexpression elimination" must be eagerly executed? Because implementing "lazy" is expensive?

I don't remember whether we have tried to implement "lazy" behavior in codegen. At the least, it would add complexity, as we need an extra variable to track whether a subexpression has already been evaluated. Every time we use a subexpression, we might need to first check that variable and decide whether to evaluate the subexpression or just reuse the evaluated value.
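
The trade-off described here can be sketched in plain Scala. This is not Spark's generated code; `expensive`, `eager`, and `lazily` are hypothetical names, and the counter exists only to make the number of evaluations observable.

```scala
object LazySubexprSketch {
  var evalCount = 0

  // Stand-in for a costly subexpression; counts how often it runs.
  def expensive(x: Int): Int = { evalCount += 1; x * x }

  // Eager: evaluate once up front, whether or not any consumer needs the value.
  def eager(x: Int, needIt: Boolean): Int = {
    val subExpr = expensive(x)
    if (needIt) subExpr + subExpr else 0
  }

  // Lazy: extra state records whether the value exists yet,
  // and every use has to check that state first.
  def lazily(x: Int, needIt: Boolean): Int = {
    var isEvaluated = false
    var value = 0
    def subExpr(): Int = {
      if (!isEvaluated) { value = expensive(x); isEvaluated = true }
      value
    }
    if (needIt) subExpr() + subExpr() else 0
  }
}
```

The eager version runs `expensive` even when the result is unused; the lazy version avoids that at the cost of a flag and a check on each use.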

* For example, given two expressions `(a + (b + (c + 1)))` and `(d + (e + (c + 1)))`,
* the common expression `(c + 1)` will be added into `equivalenceMap`.
*/
def addCommonExprs(exprs: Seq[Expression], addFunc: Expression => Boolean = addExpr): Unit = {
Contributor

This can be private as well.

val otherExprSet = mutable.Set[Expr]()

addExprTree(expr, (innerExpr: Expression) => {
if (innerExpr.deterministic) {
Contributor

Similar code appears twice. Can we create a method for it?

val equivalence1 = new EquivalentExpressions
equivalence1.addExprTree(caseWhenExpr1)

// `add2` is repeatedly in all conditions.
Contributor

add1 is also repeated. Why is it not included?

Member Author

We treat the first condition specially because it is definitely run, so it counts once for add2. The other conditions all contain add2, so together they count once more. That is where the count of 2 for add2 comes from.

For add1, all values contain it, so it is definitely run and we count it once. If no other expression contains add1, we don't extract a subexpression for it, as it will run just once (we only run one value of CaseWhen).
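
The counting described in this reply can be sketched as follows, assuming each condition and value is modelled simply as the set of subexpression names it contains. The `counts` and `common` helpers are hypothetical, and the sketch mirrors only the rule as stated in this comment at the time of the PR.

```scala
object CaseWhenCountSketch {
  // Names present in every one of the given sets.
  def common(sets: Seq[Set[String]]): Set[String] =
    sets.reduceOption(_ intersect _).getOrElse(Set.empty[String])

  def counts(conditions: Seq[Set[String]], values: Seq[Set[String]]): Map[String, Int] = {
    // The first condition always runs, so everything in it counts once.
    val alwaysRun = conditions.headOption.getOrElse(Set.empty[String])
    // Names shared by all remaining conditions count once more.
    val sharedByOtherConditions = common(conditions.drop(1))
    // Names shared by all values count once, since only one value is run.
    val sharedByValues = common(values)

    val occurrences =
      alwaysRun.toSeq ++ sharedByOtherConditions.toSeq ++ sharedByValues.toSeq
    occurrences.groupBy(identity).map { case (name, hits) => name -> hits.size }
  }

  // add2 appears in every condition, add1 in every value.
  val result: Map[String, Int] = counts(
    conditions = Seq(Set("add2"), Set("add2"), Set("add2")),
    values = Seq(Set("add1"), Set("add1"), Set("add1")))
}
```

With these inputs `add2` ends up with a count of 2 and `add1` with a count of 1, so only `add2` would qualify for extraction.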


SparkQA commented Nov 10, 2020

Test build #130813 has finished for PR 30245 at commit 33f3bd3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35444/


SparkQA commented Nov 10, 2020

Test build #130812 has finished for PR 30245 at commit 16314a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35444/


SparkQA commented Nov 10, 2020

Test build #130835 has finished for PR 30245 at commit b415728.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please


SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35461/


SparkQA commented Nov 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35461/


SparkQA commented Nov 10, 2020

Test build #130853 has finished for PR 30245 at commit b415728.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

viirya commented Nov 10, 2020

Thanks! Passed Jenkins and GitHub Actions. Will merge this today.

Member Author

viirya commented Nov 11, 2020

Thanks! Merging to master.

@viirya viirya closed this in 6fa80ed Nov 11, 2020
assert(equivalence1.getAllEquivalentExprs.filter(_.size == 2).head == Seq(add, add))
// one-time expressions: only ifExpr and its predicate expression
assert(equivalence1.getAllEquivalentExprs.count(_.size == 1) == 2)
assert(equivalence1.getAllEquivalentExprs.filter(_.size == 1).head == Seq(ifExpr1))
Contributor

Should we use the contains method? A HashMap cannot guarantee the order.
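
A small sketch of the difference, assuming strings as stand-ins for expressions and a plain `Map` as a stand-in for whatever backs `getAllEquivalentExprs`. The data and names below are hypothetical.

```scala
object OrderIndependentAssertSketch {
  // Hypothetical stand-in for the groups of equivalent expressions.
  val byKey: Map[String, Seq[String]] = Map(
    "ifExpr1" -> Seq("ifExpr1"),
    "predicate" -> Seq("predicate"),
    "add" -> Seq("add", "add"))

  val groups: Seq[Seq[String]] = byKey.values.toSeq
  val singletons: Seq[Seq[String]] = groups.filter(_.size == 1)

  // Fragile: the result depends on which singleton group the map yields first.
  // assert(singletons.head == Seq("ifExpr1"))

  // Order-independent: only checks that the expected group is present.
  val hasIfExpr: Boolean = singletons.contains(Seq("ifExpr1"))
}
```

Checking membership keeps the assertion valid regardless of the iteration order of the underlying map.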

Member Author

Ok, I will create a follow-up to make sure this cannot be flaky. Thanks.

Member Author

Created #30371.

Member

Thank you, @leoluan2009 and @viirya . The follow-up is merged to reduce the flakiness.

viirya pushed a commit that referenced this pull request May 24, 2021
…es if elseValue is set

### What changes were proposed in this pull request?

This PR fixes a bug with subexpression elimination for CaseWhen statements. #30245 added support for creating subexpressions that are present in all branches of conditional statements. However, for a statement to be in "all branches" of a CaseWhen statement, it must also be in the elseValue.

### Why are the changes needed?

Fix a bug where a subexpression can be created and run for branches of a conditional that don't pass. This can cause issues especially with a UDF in a branch that gets executed assuming the condition is true.

### Does this PR introduce _any_ user-facing change?

Yes, fixes a potential bug where a UDF could be eagerly executed even though it might expect to have already passed some form of validation. For example:
```
val col = when($"id" < 0, myUdf($"id"))
spark.range(1).select(when(col > 0, col)).show()
```

`myUdf($"id")` is considered a subexpression and eagerly evaluated, because it is pulled out as a common expression from both executions of the when clause, but if `id >= 0` it should never actually be run.

### How was this patch tested?

Updated existing test with new case.

Closes #32595 from Kimahriman/bug-case-subexpr-elimination.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
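
The rule stated in the commit message above ("in all branches" must include the elseValue) can be sketched like this. The `guaranteedInValues` helper is hypothetical and models each value as the set of subexpression names it contains; it is an illustration of the stated rule, not the code from the fix.

```scala
object ElseValueSketch {
  // Subexpressions that are guaranteed to run among the values of a CASE WHEN.
  // Without an else value, it is possible that no value runs at all,
  // so nothing from the values can be treated as always evaluated.
  def guaranteedInValues(
      values: Seq[Set[String]],
      elseValue: Option[Set[String]]): Set[String] =
    elseValue match {
      case Some(e) => (values :+ e).reduce(_ intersect _)
      case None => Set.empty[String]
    }
}
```

For the `when($"id" < 0, myUdf($"id"))` example there is no else value, so under this rule `myUdf($"id")` would not be hoisted out of the branch.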
viirya pushed a commit that referenced this pull request May 24, 2021
… values if elseValue is set

Closes #32651 from Kimahriman/bug-case-subexpr-elimination-3.1.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
// if it is shared among conditions, but it doesn't need to be shared in values. Similarly,
// a subexpression among values doesn't need to be in conditions because no matter which
// condition is true, it will be evaluated.
val conditions = c.branches.tail.map(_._1)
Contributor

There is a flaw here: we exclude the first condition, so a subexpression common to the rest of the conditions is not necessarily always evaluated.

e.g. in CaseWhen(cond1, ..., cond2, ..., cond2, ...), cond2 is shared among the remaining conditions, but it is not always evaluated.
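
The flaw can be reproduced with a toy evaluator that short-circuits the way CASE WHEN does. The `caseWhen` function and its tuple encoding of branches are hypothetical and only serve to record which conditions actually run.

```scala
object ShortCircuitSketch {
  import scala.collection.mutable.ArrayBuffer

  // Evaluates branches in order and stops at the first true condition.
  // Returns the chosen value and the names of the conditions that were evaluated.
  def caseWhen(
      branches: Seq[(String, () => Boolean, Int)],
      elseValue: Int): (Int, Seq[String]) = {
    val seen = ArrayBuffer.empty[String]
    val hit = branches.find { case (name, cond, _) =>
      seen += name
      cond()
    }
    (hit.map(_._3).getOrElse(elseValue), seen.toList)
  }

  // CaseWhen(cond1, 1, cond2, 2, cond2, 3): cond2 is common to all conditions after the first.
  val (value, evaluatedConditions) = caseWhen(
    Seq(
      ("cond1", () => true, 1),
      ("cond2", () => false, 2),
      ("cond2", () => false, 3)),
    elseValue = 0)
}
```

When `cond1` is true, `cond2` never runs, so hoisting it as an eagerly evaluated common subexpression would execute something that short-circuit evaluation would have skipped.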

Member Author

@viirya viirya Jun 29, 2021

Yes, this is related to #32977. This looks more like an aggressive optimization. If we respect short-circuit evaluation for CaseWhen, this might be an issue when users rely on short-circuit evaluation to guard later conditions.

The safest approach is to only consider subexpressions shared by all conditions.

Member Author

WDYT? Should we only consider all conditions?

Contributor

I think we should. I hit an issue caused by it in my refactor and I'll open a PR for the refactor with multiple bugs fixed.

Member Author

Ok, thanks!

Member Author

BTW, does #32980 conflict with your refactor?

Contributor

only some trivial conflicts, #32980 should be merged first as it has been reviewed and approved.

Contributor

FWIW, I also addressed this issue in #32987, which assumes that CaseWhen (and Coalesce) should short-circuit and guard later conditions. The main benefit/difference is if you have

CaseWhen(cond1, ..., cond1, ..., cond2, ...), cond1 gets pulled out as a subexpression when it wouldn't otherwise even with #33142 I think

flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
… values if elseValue is set

Closes apache#32651 from Kimahriman/bug-case-subexpr-elimination-3.1.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
… values if elseValue is set

Closes apache#32651 from Kimahriman/bug-case-subexpr-elimination-3.1.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
Kimahriman added a commit to Kimahriman/spark that referenced this pull request Feb 22, 2022
…es if elseValue is set

Closes apache#32595 from Kimahriman/bug-case-subexpr-elimination.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
@viirya viirya deleted the SPARK-33337 branch December 27, 2023 18:28