Conversation

@sameeragarwal
Member

What changes were proposed in this pull request?

This PR adds support for inferring an additional set of data constraints based on attribute equality. For example, if an operator has constraints of the form (`a = 5`, `a = b`), we can now automatically infer an additional constraint of the form `b = 5`.
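The substitution described above can be sketched with a toy expression tree. The class names below mirror Catalyst's but are simplified stand-ins, not Spark's actual API, and plain `==` stands in for Catalyst's `semanticEquals` since the toy attributes carry no expression ids:

```scala
// Toy stand-ins for Catalyst's expression classes (illustrative only).
sealed trait Expression {
  // Bottom-up rewrite: transform children first, then try f on the result.
  def transform(f: PartialFunction[Expression, Expression]): Expression = {
    val withNewChildren = this match {
      case EqualTo(l, r) => EqualTo(l.transform(f), r.transform(f))
      case leaf          => leaf
    }
    f.applyOrElse(withNewChildren, identity[Expression])
  }
}
case class Attribute(name: String) extends Expression
case class Literal(value: Int) extends Expression
case class EqualTo(left: Expression, right: Expression) extends Expression

// For every equality between two attributes, substitute one side for the
// other in all remaining constraints, e.g. (a = 5, a = b) also yields b = 5.
def inferEqualityConstraints(constraints: Set[Expression]): Set[Expression] = {
  var inferred = Set.empty[Expression]
  constraints.foreach {
    case eq @ EqualTo(l: Attribute, r: Attribute) =>
      inferred ++= (constraints - eq).map(_.transform {
        case a: Attribute if a == l => r
        case a: Attribute if a == r => l
      })
    case _ => // only attribute-to-attribute equalities trigger inference
  }
  inferred -- constraints
}

val a = Attribute("a")
val b = Attribute("b")
val constraints: Set[Expression] = Set(EqualTo(a, Literal(5)), EqualTo(a, b))
println(inferEqualityConstraints(constraints)) // contains b = 5
```

Subtracting the original `constraints` at the end keeps only the genuinely new facts, which matters for the fixed-point behavior discussed later in this thread.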

How was this patch tested?

Tested that new constraints are properly inferred for filters (by adding a new test) and for equi-joins (by modifying an existing test).

@sameeragarwal
Member Author

cc @nongli

@SparkQA

SparkQA commented Mar 10, 2016

Test build #52779 has finished for PR 11618 at commit 1149d20.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal sameeragarwal force-pushed the infer-isequal-constraints branch from 1149d20 to 64e3320 on March 10, 2016 at 00:25
@SparkQA

SparkQA commented Mar 10, 2016

Test build #52784 has finished for PR 11618 at commit 64e3320.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

        case a: Attribute if a.semanticEquals(r) => l
      }))
    case _ =>
      Set.empty[Expression]
Contributor

maybe not foldLeft -- most of the time fold is used, it can be rewritten into something imperative that is simpler (and a lot faster)

Member Author

Thank you, fixed!

@SparkQA

SparkQA commented Mar 10, 2016

Test build #52819 has finished for PR 11618 at commit c57db5b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

var inferredConstraints = Set.empty[Expression]
constraints.foreach {
  case eq @ EqualTo(l: Attribute, r: Attribute) =>
    inferredConstraints ++= (constraints - eq).map(_ transform {
Contributor

why `semanticEquals` instead of `a == l`?

Member Author

Just to infer equality based on the expression ids of their attribute references: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L44-L49. Is that not necessary in this case?
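A toy illustration of the distinction being discussed (the class, fields, and canonical form below are simplified stand-ins, not Spark's implementation — real attribute references also carry data types and nullability): two references can differ cosmetically yet denote the same column, which Catalyst tracks with an expression id, and canonicalization keeps only that id.

```scala
// Toy attribute reference: name and qualifier are cosmetic, exprId is
// the identity of the underlying column.
case class AttributeRef(name: String, qualifier: Option[String], exprId: Long) {
  // Canonical form retains only what identifies the column.
  private def canonicalized: Long = exprId
  def semanticEquals(other: AttributeRef): Boolean =
    canonicalized == other.canonicalized
}

val a1 = AttributeRef("a", Some("t1"), exprId = 7) // t1.a
val a2 = AttributeRef("A", None, exprId = 7)       // same column, different spelling
println(a1 == a2)              // false: case-class equality compares all fields
println(a1.semanticEquals(a2)) // true: same expression id
```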

@nongli
Contributor

nongli commented Mar 11, 2016

lgtm

@rxin
Contributor

rxin commented Mar 11, 2016

Thanks - merging in master.

@asfgit asfgit closed this in c3a6269 Mar 11, 2016
@gatorsmile
Member

Great! Based on this PR, it sounds like I can reopen my PR #10490. Thanks!

@sameeragarwal
Member Author

@gatorsmile instead of having a special rule for join, we can probably infer all possible filters based on constraints (... something along the lines of sameeragarwal@ce4c944). This should now subsume the predicate transitivity optimization, right?

@gatorsmile
Member

@sameeragarwal Yeah, it sounds like you have already started working on it.

True, I also did a similar thing, but we need extra handling for predicate pushdown. How about you deliver your current changes first, and I will do the remaining part? Thanks!

@sameeragarwal
Member Author

Shouldn't the existing rule for PushPredicateThroughJoin automatically take care of predicate pushdown? By the way, please feel free to take over my branch if you'd like :)

@gatorsmile
Member

A couple of issues I hit when I tried to do it:

  1. The first issue is the one I submitted earlier regarding IsNotNull on compound expressions.
  2. Duplicate conditions could be added if we do not change the existing rules, e.g., PushPredicateThroughJoin.
  3. From a = b && a = c we can also infer b = c. However, that last constraint is useless and burns extra cost. Ideally, we should select predicates based on cardinality, but that information is missing, so we cannot do it now. I am still thinking about how to do it without cardinality info. So far, the best approach I can think of is to push the constraints down without adding them to the conditions unless it is necessary.

Maybe more. I am still writing more test cases and hope to find all of them.

Thank you! If you do not mind, I will try it this weekend.
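The transitivity in point 3 can be made concrete with a small union-find over attribute names (a toy model, not Spark code): once a, b, and c land in one equivalence class, b = c is derivable, and the open question is which of those derived pairs are worth materializing as predicates.

```scala
// Group attributes into equivalence classes implied by equality constraints.
def equivalenceClasses(equalities: Seq[(String, String)]): Map[String, Set[String]] = {
  // parent maps each attribute to its representative; unmapped names are roots.
  var parent = Map.empty[String, String].withDefault(identity)
  def find(x: String): String =
    if (parent(x) == x) x else find(parent(x))
  // Union the two sides of every equality constraint.
  equalities.foreach { case (l, r) =>
    parent = parent.updated(find(l), find(r))
  }
  val members = equalities.flatMap(e => Seq(e._1, e._2)).distinct
  members.groupBy(find).map { case (root, ms) => root -> ms.toSet }
}

val classes = equivalenceClasses(Seq("a" -> "b", "a" -> "c"))
// a, b, and c end up in one class, so b = c is derivable from a = b, a = c.
println(classes.values.toSet) // one class containing a, b, c
```

Any pair drawn from the same class is a valid inferred equality; the cost concern above is that a class of n attributes yields n(n-1)/2 such pairs, most of which are redundant as filter conditions.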

@sameeragarwal
Member Author

Sounds good, thank you. In my branch, I try to address (2) by not adding new conditions if the child node(s) already have the given constraint. For (3), please note that pushing b = c down isn't useless if a comes from the left side of the join while b and c come from the right. It is of course useless if b and c come from different sides of the join. I think a slightly smarter filter inference rule for joins could identify the latter case.
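The side-awareness suggested above can be sketched as a tiny helper (hypothetical names, not Spark's API): an inferred predicate is pushable to one side of a join only when all of its attributes come from that side.

```scala
// Decide which side of a join a predicate can be pushed to, given the
// attribute sets of the left and right children (toy model).
def pushdownSide(predicateAttrs: Set[String],
                 leftAttrs: Set[String],
                 rightAttrs: Set[String]): Option[String] =
  if (predicateAttrs.subsetOf(leftAttrs)) Some("left")
  else if (predicateAttrs.subsetOf(rightAttrs)) Some("right")
  else None // spans both sides: not pushable as a single-side filter

// a from the left; b and c from the right: b = c is pushable to the right.
println(pushdownSide(Set("b", "c"), Set("a"), Set("b", "c"))) // Some(right)
// b and c from different sides: not pushable, so inferring b = c is useless.
println(pushdownSide(Set("b", "c"), Set("a", "b"), Set("c"))) // None
```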

roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016

Author: Sameer Agarwal <sameer@databricks.com>

Closes apache#11618 from sameeragarwal/infer-isequal-constraints.