-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-17733][SQL] InferFiltersFromConstraints rule never terminates for query #15319
Conversation
…verging set of constraints
@@ -2678,4 +2678,45 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { | |||
} | |||
} | |||
} | |||
|
|||
test("SPARK-17733 InferFiltersFromConstraints rule never terminates for query") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we construct a unit test rather than an end-to-end test here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - Perhaps we could add new testcases in InferFiltersFromConstraintsSuite
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that you already have a unit test for cases like these, how about we remove this now? This test was randomly generated to catch issues like this and in its current form, it isn't very obvious how this query has anything to do with InferFiltersFromConstraints
.
Test build #66205 has finished for PR 15319 at commit
|
retest this please |
Test build #66206 has finished for PR 15319 at commit
|
.analyze | ||
val correctAnswer = t1.where(IsNotNull('a) && 'a === Coalesce(Seq('a, 'b)) | ||
&& IsNotNull('b) && 'b === Coalesce(Seq('a, 'b)) | ||
&& IsNotNull(Coalesce(Seq('a, 'b))) && 'a === 'b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These predicates are infered from t.a = t2.a, t.d = t2.a, t.int_col = t2.a
, which in line with our expectation.
Test build #66212 has finished for PR 15319 at commit
|
Test build #66213 has finished for PR 15319 at commit
|
After more thoughts, I discovered various ways to create non-converge constraint set(in the following cases, a represents For the Cond. 1/2/3, we can avoid to create non-converge constraint set by checking whether constraint contains A new approach will be avoid replace an |
val batches = Batch("InferFilters", FixedPoint(5), InferFiltersFromConstraints) :: | ||
Batch("PredicatePushdown", FixedPoint(5), PushPredicateThroughJoin) :: | ||
Batch("CombineFilters", FixedPoint(5), CombineFilters) :: Nil | ||
val batches = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous batches
will not apply InferFiltersFromConstraints
after PushPredicateThroughJoin
.
Test build #66223 has finished for PR 15319 at commit
|
Test build #66224 has finished for PR 15319 at commit
|
Test build #66234 has finished for PR 15319 at commit
|
@sameeragarwal Could you review this PR please? |
// because then both `QueryPlan.inferAdditionalConstraints` and | ||
// `UnaryNode.getAliasedConstraints` applies and may produce a non-converging set of | ||
// constraints. | ||
// For more details, infer https://issues.apache.org/jira/browse/SPARK-17733 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo "infer" -> "refer" (to)?
// `UnaryNode.getAliasedConstraints` applies and may produce a non-converging set of | ||
// constraints. | ||
// For more details, infer https://issues.apache.org/jira/browse/SPARK-17733 | ||
val aliasMap = AttributeMap((expressions ++ children.flatMap(_.expressions)).collect { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not using AttributeSet
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, AttributeSet
is a better choice here.
Test build #66597 has finished for PR 15319 at commit
|
retest this please. |
Test build #66631 has finished for PR 15319 at commit
|
Not sure why these testcases are failing, they passed in my local envirement. |
Test build #3308 has finished for PR 15319 at commit
|
retest this please. |
Test build #66658 has finished for PR 15319 at commit
|
Test build #3319 has finished for PR 15319 at commit
|
retest this please. |
Test build #67007 has finished for PR 15319 at commit
|
Test build #67020 has finished for PR 15319 at commit
|
ping @sameeragarwal |
This PR is ready for review, would anyone look at it? |
var inferredConstraints = Set.empty[Expression] | ||
constraints.foreach { | ||
case eq @ EqualTo(l: Attribute, r: Attribute) => | ||
inferredConstraints ++= (constraints - eq).map(_ transform { | ||
case a: Attribute if a.semanticEquals(l) => r | ||
case a: Attribute if a.semanticEquals(l) && !aliasSet.contains(r) => r |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jiangxb1987 isn't this a fairly restrictive way to solve this problem? You are essentially not inferring any additional constraints from those that contain aliases. For e.g., if we have a subquery SELECT a AS a1, b AS b1 WHERE a1 = 1 AND a1 = b1
, this change would never allow us to infer a filter/constraint on b = 1
. Can we identify and just disallow recursive constraints?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you a lot for your advice! Perhaps We should generate the following two sets:
- Set of
Alias
which have references to other expressions, for instance,Alias(f(b, c), "a")
, orAlias(a, "a1")
; - Generate sets of equivalence classes out of
EqualTo
operators in constraints, e.g., when we havea = b
andc = b
ande = f
, then the sets would be ((a, b, c), (e, f)).
Here, for any expressions to be used to infer new constraints, we should check that either it's not in ourAliasSet
, or its reference doesn't contain any expressions in the corresponding equivalence classes set.
I'll update this check rule ASAP. Thank you for helping!
Test build #67195 has finished for PR 15319 at commit
|
@sameeragarwal I've updated the rule, please review when you have time! Thank you! |
Thanks @jiangxb1987, this equivalence class approach looks pretty solid. I'll take a closer look tomorrow! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @jiangxb1987, I just left some comments around structure and styling but overall this approach looks good to me.
// Don't apply transform on constraints if the replacement will cause an recursive deduction, | ||
// when that happens a non-converging set of constraints will be created and finally throw | ||
// OOM Exception. | ||
// For more details, refer to https://issues.apache.org/jira/browse/SPARK-17733 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment doesn't seem to give a lot of context about the underlying issue. How about we just add a top-level comment for this method summarizing the issue and remove this? Perhaps, something along the lines of the following:
/**
* Infers an additional set of constraints from a given set of equality constraints.
* For e.g., if an operator has constraints of the form (`a = 5`, `a = b`), this returns an
* additional constraint of the form `b = 5`.
*
* [SPARK-17733] We explicitly prevent producing recursive constraints of the form `a = f(a, b)`
* as they are often useless and can lead to a non-converging set of constraints.
*/
private def inferAdditionalConstraints(constraints: Set[Expression]): Set[Expression]
// when that happens a non-converging set of constraints will be created and finally throw | ||
// OOM Exception. | ||
// For more details, refer to https://issues.apache.org/jira/browse/SPARK-17733 | ||
val aliasMap = AttributeMap((expressions ++ children.flatMap(_.expressions)).collect { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since aliasMap
is referenced at a number of places, let's just make this a private lazy val and move it outside of this method in QueryPlan
.
*/ | ||
private def generateEqualExpressionSets( | ||
constraints: Set[Expression], | ||
aliasMap: AttributeMap[Expression]): Seq[Set[Expression]] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can now remove aliasMap
from here
case a: Alias => (a.toAttribute, a.child) | ||
}) | ||
|
||
val equalExprSets = generateEqualExpressionSets(constraints, aliasMap) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: How about calling them val constraintClasses
?
* expression sets: (Set(a, b, c), Set(e, f)). This will be used to search all expressions equal | ||
* to an selected attribute. | ||
*/ | ||
private def generateEqualExpressionSets( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about generateEquivalentConstraintClasses
or generateEquivalentConstraintSet
?
.join(t2, Inner, Some("t.a".attr === "t2.a".attr && "t.d".attr === "t2.a".attr)) | ||
.analyze | ||
val currectAnswer = t1.where(IsNotNull('a) && IsNotNull('b) | ||
&& 'a <=> 'a && 'b <=> 'b &&'a === 'b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: 2 spaces
.join(t2, Inner, Some("t.a".attr === "t2.a".attr && "t.int_col".attr === "t2.a".attr)) | ||
.analyze | ||
val currectAnswer = t1.where(IsNotNull('a) && IsNotNull(Coalesce(Seq('a, 'b))) | ||
&&'a === Coalesce(Seq('a, 'b))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: 2 spaces
&& Coalesce(Seq('b, 'b)) <=> Coalesce(Seq('b, 'b)) && 'b <=> 'b) | ||
.select('a, 'b.as('d), Coalesce(Seq('a, 'b)).as('int_col)).as("t") | ||
.join(t2.where(IsNotNull('a) && IsNotNull(Coalesce(Seq('a, 'a))) | ||
&& 'a === Coalesce(Seq('a, 'a)) && 'a <=> Coalesce(Seq('a, 'a)) && 'a <=> 'a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: 2 spaces
.select('a, 'b.as('d), Coalesce(Seq('a, 'b)).as('int_col)).as("t") | ||
.join(t2.where(IsNotNull('a) && IsNotNull(Coalesce(Seq('a, 'a))) | ||
&& 'a === Coalesce(Seq('a, 'a)) && 'a <=> Coalesce(Seq('a, 'a)) && 'a <=> 'a | ||
&& Coalesce(Seq('a, 'a)) <=> Coalesce(Seq('a, 'a))), Inner, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: 2 spaces
@@ -2678,4 +2678,45 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { | |||
} | |||
} | |||
} | |||
|
|||
test("SPARK-17733 InferFiltersFromConstraints rule never terminates for query") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that you already have a unit test for cases like these, how about we remove this now? This test was randomly generated to catch issues like this and in its current form, it isn't very obvious how this query has anything to do with InferFiltersFromConstraints
.
Thank you @sameeragarwal ! I've updated these codes following your advice, please have a look at them when you have time. |
Test build #67495 has finished for PR 15319 at commit
|
Test build #67497 has finished for PR 15319 at commit
|
LGTM, Thanks! |
Merging to master/2.0! Thanks! |
…for query ## What changes were proposed in this pull request? The function `QueryPlan.inferAdditionalConstraints` and `UnaryNode.getAliasedConstraints` can produce a non-converging set of constraints for recursive functions. For instance, if we have two constraints of the form(where a is an alias): `a = b, a = f(b, c)` Applying both these rules in the next iteration would infer: `f(b, c) = f(f(b, c), c)` This process repeated, the iteration won't converge and the set of constraints will grow larger and larger until OOM. ~~To fix this problem, we collect alias from expressions and skip infer constraints if we are to transform an `Expression` to another which contains it.~~ To fix this problem, we apply additional check in `inferAdditionalConstraints`, when it's possible to generate recursive constraints, we skip generate that. ## How was this patch tested? Add new testcase in `SQLQuerySuite`/`InferFiltersFromConstraintsSuite`. Author: jiangxingbo <jiangxb1987@gmail.com> Closes #15319 from jiangxb1987/constraints. (cherry picked from commit 3c02357) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
…for query ## What changes were proposed in this pull request? The function `QueryPlan.inferAdditionalConstraints` and `UnaryNode.getAliasedConstraints` can produce a non-converging set of constraints for recursive functions. For instance, if we have two constraints of the form(where a is an alias): `a = b, a = f(b, c)` Applying both these rules in the next iteration would infer: `f(b, c) = f(f(b, c), c)` This process repeated, the iteration won't converge and the set of constraints will grow larger and larger until OOM. ~~To fix this problem, we collect alias from expressions and skip infer constraints if we are to transform an `Expression` to another which contains it.~~ To fix this problem, we apply additional check in `inferAdditionalConstraints`, when it's possible to generate recursive constraints, we skip generate that. ## How was this patch tested? Add new testcase in `SQLQuerySuite`/`InferFiltersFromConstraintsSuite`. Author: jiangxingbo <jiangxb1987@gmail.com> Closes apache#15319 from jiangxb1987/constraints.
…for query ## What changes were proposed in this pull request? The function `QueryPlan.inferAdditionalConstraints` and `UnaryNode.getAliasedConstraints` can produce a non-converging set of constraints for recursive functions. For instance, if we have two constraints of the form(where a is an alias): `a = b, a = f(b, c)` Applying both these rules in the next iteration would infer: `f(b, c) = f(f(b, c), c)` This process repeated, the iteration won't converge and the set of constraints will grow larger and larger until OOM. ~~To fix this problem, we collect alias from expressions and skip infer constraints if we are to transform an `Expression` to another which contains it.~~ To fix this problem, we apply additional check in `inferAdditionalConstraints`, when it's possible to generate recursive constraints, we skip generate that. ## How was this patch tested? Add new testcase in `SQLQuerySuite`/`InferFiltersFromConstraintsSuite`. Author: jiangxingbo <jiangxb1987@gmail.com> Closes apache#15319 from jiangxb1987/constraints.
## What changes were proposed in this pull request? Improve QueryPlanConstraints framework, make it robust and simple. In apache#15319, constraints for expressions like `a = f(b, c)` is resolved. However, for expressions like ```scala a = f(b, c) && c = g(a, b) ``` The current QueryPlanConstraints framework will produce non-converging constraints. Essentially, the problem is caused by having both the name and child of aliases in the same constraint set. We infer constraints, and push down constraints as predicates in filters, later on these predicates are propagated as constraints, etc.. Simply using the alias names only can resolve these problems. The size of constraints is reduced without losing any information. We can always get these inferred constraints on child of aliases when pushing down filters. Also, the EqualNullSafe between name and child in propagating alias is meaningless ```scala allConstraints += EqualNullSafe(e, a.toAttribute) ``` It just produces redundant constraints. ## How was this patch tested? Unit test Author: Wang Gengliang <ltnwgl@gmail.com> Closes apache#19201 from gengliangwang/QueryPlanConstraints.
What changes were proposed in this pull request?
The function
QueryPlan.inferAdditionalConstraints
andUnaryNode.getAliasedConstraints
can produce a non-converging set of constraints for recursive functions. For instance, if we have two constraints of the form(where a is an alias):a = b, a = f(b, c)
Applying both these rules in the next iteration would infer:
f(b, c) = f(f(b, c), c)
This process repeated, the iteration won't converge and the set of constraints will grow larger and larger until OOM.
To fix this problem, we collect alias from expressions and skip infer constraints if we are to transform anExpression
to another which contains it.To fix this problem, we apply additional check in
inferAdditionalConstraints
, when it's possible to generate recursive constraints, we skip generate that.How was this patch tested?
Add new testcase in
SQLQuerySuite
/InferFiltersFromConstraintsSuite
.