[SPARK-17733][SQL] InferFiltersFromConstraints rule never terminates for query #15319

Closed
wants to merge 13 commits into from

jiangxb1987
Contributor

@jiangxb1987 jiangxb1987 commented Oct 1, 2016

What changes were proposed in this pull request?

The functions QueryPlan.inferAdditionalConstraints and UnaryNode.getAliasedConstraints can together produce a non-converging set of constraints for recursive functions. For instance, suppose we have two constraints of the form (where a is an alias):
a = b, a = f(b, c)
Applying both rules in the next iteration would infer:
f(b, c) = f(f(b, c), c)
If this process repeats, the iteration never converges and the set of constraints grows larger and larger until OOM.

~~To fix this problem, we collect alias from expressions and skip infer constraints if we are to transform an Expression to another which contains it.~~
To fix this problem, we apply an additional check in inferAdditionalConstraints: when a substitution could generate a recursive constraint, we skip it.
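
A minimal sketch of why the iteration grows without bound, using a toy expression tree rather than Catalyst's classes. The names `Expr`, `Attr`, `Func`, `Eq`, `substitute`, `depth`, and `step` are hypothetical and exist only for this illustration. One step first rewrites b as a (using a = b), then expands the alias a back to f(b, c), mimicking the two functions being applied in turn:

```scala
object NonConvergingDemo {
  // Toy expression model; these are NOT Catalyst classes.
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Func(name: String, args: List[Expr]) extends Expr
  case class Eq(left: Expr, right: Expr)

  val a: Expr = Attr("a")
  val b: Expr = Attr("b")
  // The alias `a` stands for f(b, c).
  val aliasChild: Expr = Func("f", List(b, Attr("c")))

  // Replace every occurrence of `from` inside `e` with `to`.
  def substitute(e: Expr, from: Expr, to: Expr): Expr =
    if (e == from) to
    else e match {
      case Func(name, args) => Func(name, args.map(substitute(_, from, to)))
      case other => other
    }

  // Nesting depth of function calls in an expression.
  def depth(e: Expr): Int = e match {
    case Attr(_) => 0
    case Func(_, args) => 1 + args.map(depth).foldLeft(0)(_ max _)
  }

  // One round: infer via a = b (rewrite b as a), then expand the alias a.
  def step(c: Eq): Eq = {
    val inferred = Eq(substitute(c.left, b, a), substitute(c.right, b, a))
    Eq(substitute(inferred.left, a, aliasChild), substitute(inferred.right, a, aliasChild))
  }

  def main(args: Array[String]): Unit = {
    val depths = Iterator.iterate(Eq(a, aliasChild))(step)
      .take(5).map(c => depth(c.right)).toList
    println(depths) // prints List(1, 2, 3, 4, 5)
  }
}
```

Starting from a = f(b, c), one call to `step` yields f(b, c) = f(f(b, c), c), and each further call nests the right-hand side one level deeper, so a fixed point is never reached.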

How was this patch tested?

Added new test cases in SQLQuerySuite/InferFiltersFromConstraintsSuite.

@@ -2678,4 +2678,45 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
}
}
}

test("SPARK-17733 InferFiltersFromConstraints rule never terminates for query") {
Contributor

can we construct a unit test rather than an end-to-end test here?

Contributor Author

Yes, perhaps we could add new test cases in InferFiltersFromConstraintsSuite.

Contributor

+1

Member

Given that you already have a unit test for cases like these, how about we remove this now? This test was randomly generated to catch issues like this and in its current form, it isn't very obvious how this query has anything to do with InferFiltersFromConstraints.

@SparkQA

SparkQA commented Oct 1, 2016

Test build #66205 has finished for PR 15319 at commit ebba446.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

retest this please

@SparkQA

SparkQA commented Oct 1, 2016

Test build #66206 has finished for PR 15319 at commit ebba446.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.analyze
val correctAnswer = t1.where(IsNotNull('a) && 'a === Coalesce(Seq('a, 'b))
&& IsNotNull('b) && 'b === Coalesce(Seq('a, 'b))
&& IsNotNull(Coalesce(Seq('a, 'b))) && 'a === 'b)
Contributor Author

These predicates are inferred from t.a = t2.a, t.d = t2.a, and t.int_col = t2.a, which is in line with our expectation.

@SparkQA

SparkQA commented Oct 1, 2016

Test build #66212 has finished for PR 15319 at commit 7d9e2b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 1, 2016

Test build #66213 has finished for PR 15319 at commit 3b93209.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor Author

After more thought, I found several ways to create a non-converging constraint set (in the following cases, a represents Alias(f(b, c), "a")):
Condition 1. a = b will infer f(b, c) = f(f(b, c), c);
Condition 2. a = d, b = d will infer a = b and therefore deduce f(b, c) = f(f(b, c), c);
Condition 3. a = d, d = e, e = f, ..., x = b will also deduce f(b, c) = f(f(b, c), c) after a certain number of iterations;
Condition 4. a = d, b = d will also infer f(d, c) = f(f(d, c), c), and so on.

For Conditions 1, 2, and 3, we can avoid creating a non-converging constraint set by checking whether a constraint contains a.child (e.g. f(b, c)), but for Condition 4 I haven't figured out a workaround.

A new approach is to avoid substituting an Alias into a constraint. This ensures that QueryPlan.inferAdditionalConstraints and UnaryNode.getAliasedConstraints are never both applied, so no non-converging constraint set is created. With this approach we may miss some constraints that could have been inferred, but I think that is not very harmful, because these constraints are usually complex and rarely filter out more data.
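
One possible reading of the containment check mentioned for Conditions 1 to 3, sketched on a toy expression tree. The types and the helpers `containsSubtree` and `isRecursive` are hypothetical and are not taken from the patch. The idea shown is simply to detect a constraint in which one side occurs strictly inside the other, such as f(b, c) = f(f(b, c), c):

```scala
object RecursiveConstraintDemo {
  // Toy expression model; these are NOT Catalyst classes.
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Func(name: String, args: List[Expr]) extends Expr
  case class EqualTo(left: Expr, right: Expr)

  // True when `needle` occurs anywhere inside `haystack`, including at the root.
  def containsSubtree(haystack: Expr, needle: Expr): Boolean =
    haystack == needle || (haystack match {
      case Func(_, args) => args.exists(containsSubtree(_, needle))
      case _ => false
    })

  // A constraint is treated as recursive when one side strictly contains the other.
  def isRecursive(c: EqualTo): Boolean =
    c.left != c.right &&
      (containsSubtree(c.right, c.left) || containsSubtree(c.left, c.right))

  def main(args: Array[String]): Unit = {
    val child = Func("f", List(Attr("b"), Attr("c")))
    // f(b, c) = f(f(b, c), c) is flagged as recursive.
    println(isRecursive(EqualTo(child, Func("f", List(child, Attr("c"))))))
    // a = f(b, c) is not.
    println(isRecursive(EqualTo(Attr("a"), child)))
  }
}
```

A check like this catches constraints that already have the recursive shape; as the comment notes, it does not by itself address Condition 4.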

- val batches = Batch("InferFilters", FixedPoint(5), InferFiltersFromConstraints) ::
- Batch("PredicatePushdown", FixedPoint(5), PushPredicateThroughJoin) ::
- Batch("CombineFilters", FixedPoint(5), CombineFilters) :: Nil
+ val batches =
Contributor Author

Previous batches will not apply InferFiltersFromConstraints after PushPredicateThroughJoin.

@SparkQA

SparkQA commented Oct 1, 2016

Test build #66223 has finished for PR 15319 at commit 5b25fce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 1, 2016

Test build #66224 has finished for PR 15319 at commit e5912f8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 2, 2016

Test build #66234 has finished for PR 15319 at commit 9639c71.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

cc @sameeragarwal

@jiangxb1987
Contributor Author

@sameeragarwal Could you review this PR please?

// because then both `QueryPlan.inferAdditionalConstraints` and
// `UnaryNode.getAliasedConstraints` applies and may produce a non-converging set of
// constraints.
// For more details, infer https://issues.apache.org/jira/browse/SPARK-17733
Contributor

typo "infer" -> "refer" (to)?

// `UnaryNode.getAliasedConstraints` applies and may produce a non-converging set of
// constraints.
// For more details, infer https://issues.apache.org/jira/browse/SPARK-17733
val aliasMap = AttributeMap((expressions ++ children.flatMap(_.expressions)).collect {
Contributor

Why not use AttributeSet?

Contributor Author

Yes, AttributeSet is a better choice here.

@SparkQA

SparkQA commented Oct 9, 2016

Test build #66597 has finished for PR 15319 at commit 1558d4c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor Author

retest this please.

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66631 has finished for PR 15319 at commit 1558d4c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor Author

Not sure why these test cases are failing; they passed in my local environment.

@SparkQA

SparkQA commented Oct 10, 2016

Test build #3308 has finished for PR 15319 at commit 1558d4c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor Author

retest this please.

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66658 has finished for PR 15319 at commit 1558d4c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 10, 2016

Test build #3319 has finished for PR 15319 at commit 1558d4c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor Author

retest this please.

@SparkQA

SparkQA commented Oct 15, 2016

Test build #67007 has finished for PR 15319 at commit 1558d4c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 15, 2016

Test build #67020 has finished for PR 15319 at commit 388443d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor Author

ping @sameeragarwal

@jiangxb1987
Contributor Author

This PR is ready for review; could someone take a look?

  var inferredConstraints = Set.empty[Expression]
  constraints.foreach {
  case eq @ EqualTo(l: Attribute, r: Attribute) =>
  inferredConstraints ++= (constraints - eq).map(_ transform {
- case a: Attribute if a.semanticEquals(l) => r
+ case a: Attribute if a.semanticEquals(l) && !aliasSet.contains(r) => r
Member

@sameeragarwal sameeragarwal Oct 18, 2016

@jiangxb1987 isn't this a fairly restrictive way to solve this problem? You are essentially not inferring any additional constraints from those that contain aliases. For e.g., if we have a subquery SELECT a AS a1, b AS b1 WHERE a1 = 1 AND a1 = b1, this change would never allow us to infer a filter/constraint on b = 1. Can we identify and just disallow recursive constraints?
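
To make the concern concrete, here is a small self-contained model of the substitution loop with and without an alias guard. It is a sketch on toy types, not the Catalyst code: `Attr`, `Lit`, `EqualTo`, `replace`, and `infer` are hypothetical stand-ins, and the guard is applied in both substitution directions as an assumption. With a1 = 1 and a1 = b1, unguarded inference yields b1 = 1 (from which a filter on b could follow below the projection); once both a1 and b1 are in the alias set, nothing is inferred:

```scala
object AliasGuardDemo {
  // Toy expression model; these are NOT Catalyst classes.
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Lit(value: Int) extends Expr
  case class EqualTo(left: Expr, right: Expr) extends Expr

  // Replace attribute `from` with attribute `to` inside a constraint.
  def replace(e: Expr, from: Attr, to: Attr): Expr = e match {
    case EqualTo(l, r) => EqualTo(replace(l, from, to), replace(r, from, to))
    case attr: Attr if attr == from => to
    case other => other
  }

  // For each attribute equality l = r, rewrite the other constraints in both
  // directions, skipping any rewrite whose replacement is in `aliasSet`.
  def infer(constraints: Set[Expr], aliasSet: Set[Attr]): Set[Expr] = {
    val inferred: Set[Expr] = constraints.flatMap {
      case eq @ EqualTo(l: Attr, r: Attr) =>
        val others = constraints - eq
        val leftToRight =
          if (aliasSet.contains(r)) Set.empty[Expr] else others.map(replace(_, l, r))
        val rightToLeft =
          if (aliasSet.contains(l)) Set.empty[Expr] else others.map(replace(_, r, l))
        leftToRight ++ rightToLeft
      case _ => Set.empty[Expr]
    }
    inferred -- constraints // keep only constraints that are actually new
  }

  def main(args: Array[String]): Unit = {
    val a1 = Attr("a1")
    val b1 = Attr("b1")
    val constraints: Set[Expr] = Set(EqualTo(a1, Lit(1)), EqualTo(a1, b1))
    println(infer(constraints, Set.empty))   // unguarded: contains b1 = 1
    println(infer(constraints, Set(a1, b1))) // guarded: empty
  }
}
```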

Contributor Author

@jiangxb1987 jiangxb1987 Oct 18, 2016

Thank you very much for your advice! Perhaps we should generate the following two sets:

  1. The set of Aliases that reference other expressions, for instance Alias(f(b, c), "a") or Alias(a, "a1");
  2. The sets of equivalence classes derived from the EqualTo operators in the constraints, e.g., when we have a = b, c = b, and e = f, the sets would be ((a, b, c), (e, f)).

Then, for any expression used to infer new constraints, we should check that either it is not in our AliasSet, or its references do not contain any expression in the corresponding equivalence class.

I'll update this check rule ASAP. Thank you for helping!
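
The two sets described above can be sketched roughly as follows. This is an illustration on attribute names only, under the assumption that an alias is represented by the set of names its child references. The helpers `equivalenceClasses` and `safeToSubstitute` and the `aliasRefs` map are hypothetical and are not the patch's actual code:

```scala
object EquivalenceClassesDemo {
  // Merge each equality (l, r) into the classes built so far.
  def equivalenceClasses(equalities: Seq[(String, String)]): Seq[Set[String]] =
    equalities.foldLeft(Seq.empty[Set[String]]) { case (classes, (l, r)) =>
      // Classes that already mention l or r are merged into a single class.
      val (touching, rest) = classes.partition(c => c.contains(l) || c.contains(r))
      val merged: Set[String] = touching.foldLeft(Set(l, r))(_ ++ _)
      rest :+ merged
    }

  // Replacing `source` with `replacement` is allowed when `replacement` is not
  // an alias, or when the alias child references nothing in the class of `source`.
  // `aliasRefs` maps an alias name to the names its child references,
  // e.g. "a" -> Set("b", "c") for Alias(f(b, c), "a").
  def safeToSubstitute(
      replacement: String,
      source: String,
      aliasRefs: Map[String, Set[String]],
      classes: Seq[Set[String]]): Boolean = {
    val sourceClass = classes.find(_.contains(source)).getOrElse(Set(source))
    aliasRefs.get(replacement) match {
      case None => true
      case Some(refs) => refs.intersect(sourceClass).isEmpty
    }
  }

  def main(args: Array[String]): Unit = {
    // a = b, c = b, e = f gives the classes (a, b, c) and (e, f).
    println(equivalenceClasses(Seq("a" -> "b", "c" -> "b", "e" -> "f")))
  }
}
```

Under this reading, substituting a for b is rejected when a is Alias(f(b, c), "a"), because the alias child references b, which is in the same class. Substituting b1 for a1 in the subquery example stays allowed, because the child of b1 references only b.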

@SparkQA

SparkQA commented Oct 19, 2016

Test build #67195 has finished for PR 15319 at commit 909d2cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor Author

@sameeragarwal I've updated the rule, please review when you have time! Thank you!

@sameeragarwal
Member

Thanks @jiangxb1987, this equivalence class approach looks pretty solid. I'll take a closer look tomorrow!

Member

@sameeragarwal sameeragarwal left a comment

Thanks again @jiangxb1987, I just left some comments around structure and styling but overall this approach looks good to me.

// Don't apply transform on constraints if the replacement will cause an recursive deduction,
// when that happens a non-converging set of constraints will be created and finally throw
// OOM Exception.
// For more details, refer to https://issues.apache.org/jira/browse/SPARK-17733
Member

This comment doesn't seem to give a lot of context about the underlying issue. How about we just add a top-level comment for this method summarizing the issue and remove this? Perhaps, something along the lines of the following:

  /**
   * Infers an additional set of constraints from a given set of equality constraints.
   * For e.g., if an operator has constraints of the form (`a = 5`, `a = b`), this returns an
   * additional constraint of the form `b = 5`.
   *
   * [SPARK-17733] We explicitly prevent producing recursive constraints of the form `a = f(a, b)`
   * as they are often useless and can lead to a non-converging set of constraints.
   */
  private def inferAdditionalConstraints(constraints: Set[Expression]): Set[Expression]

// when that happens a non-converging set of constraints will be created and finally throw
// OOM Exception.
// For more details, refer to https://issues.apache.org/jira/browse/SPARK-17733
val aliasMap = AttributeMap((expressions ++ children.flatMap(_.expressions)).collect {
Member

Since aliasMap is referenced at a number of places, let's just make this a private lazy val and move it outside of this method in QueryPlan.

*/
private def generateEqualExpressionSets(
constraints: Set[Expression],
aliasMap: AttributeMap[Expression]): Seq[Set[Expression]] = {
Member

we can now remove aliasMap from here

case a: Alias => (a.toAttribute, a.child)
})

val equalExprSets = generateEqualExpressionSets(constraints, aliasMap)
Member

nit: How about calling them val constraintClasses?

* expression sets: (Set(a, b, c), Set(e, f)). This will be used to search all expressions equal
* to an selected attribute.
*/
private def generateEqualExpressionSets(
Member

How about generateEquivalentConstraintClasses or generateEquivalentConstraintSet?

.join(t2, Inner, Some("t.a".attr === "t2.a".attr && "t.d".attr === "t2.a".attr))
.analyze
val currectAnswer = t1.where(IsNotNull('a) && IsNotNull('b)
&& 'a <=> 'a && 'b <=> 'b &&'a === 'b)
Member

nit: 2 spaces

.join(t2, Inner, Some("t.a".attr === "t2.a".attr && "t.int_col".attr === "t2.a".attr))
.analyze
val currectAnswer = t1.where(IsNotNull('a) && IsNotNull(Coalesce(Seq('a, 'b)))
&&'a === Coalesce(Seq('a, 'b)))
Member

nit: 2 spaces

&& Coalesce(Seq('b, 'b)) <=> Coalesce(Seq('b, 'b)) && 'b <=> 'b)
.select('a, 'b.as('d), Coalesce(Seq('a, 'b)).as('int_col)).as("t")
.join(t2.where(IsNotNull('a) && IsNotNull(Coalesce(Seq('a, 'a)))
&& 'a === Coalesce(Seq('a, 'a)) && 'a <=> Coalesce(Seq('a, 'a)) && 'a <=> 'a
Member

nit: 2 spaces

.select('a, 'b.as('d), Coalesce(Seq('a, 'b)).as('int_col)).as("t")
.join(t2.where(IsNotNull('a) && IsNotNull(Coalesce(Seq('a, 'a)))
&& 'a === Coalesce(Seq('a, 'a)) && 'a <=> Coalesce(Seq('a, 'a)) && 'a <=> 'a
&& Coalesce(Seq('a, 'a)) <=> Coalesce(Seq('a, 'a))), Inner,
Member

nit: 2 spaces

@jiangxb1987
Contributor Author

Thank you @sameeragarwal! I've updated the code following your advice; please have a look when you have time.

@SparkQA

SparkQA commented Oct 25, 2016

Test build #67495 has finished for PR 15319 at commit 905eaa1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 25, 2016

Test build #67497 has finished for PR 15319 at commit 45308d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal
Member

LGTM, Thanks!

@hvanhovell
Contributor

hvanhovell commented Oct 26, 2016

Merging to master/2.0! Thanks!

asfgit pushed a commit that referenced this pull request Oct 26, 2016
…for query

## What changes were proposed in this pull request?

The functions `QueryPlan.inferAdditionalConstraints` and `UnaryNode.getAliasedConstraints` can together produce a non-converging set of constraints for recursive functions. For instance, suppose we have two constraints of the form (where `a` is an alias):
`a = b, a = f(b, c)`
Applying both rules in the next iteration would infer:
`f(b, c) = f(f(b, c), c)`
If this process repeats, the iteration never converges and the set of constraints grows larger and larger until OOM.

~~To fix this problem, we collect alias from expressions and skip infer constraints if we are to transform an `Expression` to another which contains it.~~
To fix this problem, we apply an additional check in `inferAdditionalConstraints`: when a substitution could generate a recursive constraint, we skip it.

## How was this patch tested?

Added new test cases in `SQLQuerySuite`/`InferFiltersFromConstraintsSuite`.

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes #15319 from jiangxb1987/constraints.

(cherry picked from commit 3c02357)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
@asfgit asfgit closed this in 3c02357 Oct 26, 2016
@jiangxb1987 jiangxb1987 deleted the constraints branch October 27, 2016 02:03
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
ghost pushed a commit to dbtsai/spark that referenced this pull request Sep 12, 2017
## What changes were proposed in this pull request?

Improve the QueryPlanConstraints framework, making it robust and simple.
In apache#15319, non-converging constraints for expressions like `a = f(b, c)` were resolved.
However, for expressions like
```scala
a = f(b, c) && c = g(a, b)
```
the current QueryPlanConstraints framework will still produce non-converging constraints.
Essentially, the problem is caused by having both the name and the child of an alias in the same constraint set. We infer constraints and push them down as predicates in filters; later on, these predicates are propagated as constraints, and so on.
Simply using only the alias names resolves these problems. The size of the constraint set is reduced without losing any information. We can always recover the inferred constraints on the children of aliases when pushing down filters.

Also, the EqualNullSafe between the name and the child when propagating an alias is meaningless:
```scala
allConstraints += EqualNullSafe(e, a.toAttribute)
```
It just produces redundant constraints.

## How was this patch tested?

Unit test

Author: Wang Gengliang <ltnwgl@gmail.com>

Closes apache#19201 from gengliangwang/QueryPlanConstraints.