
Conversation

@jiangxb1987 (Contributor) commented Jul 1, 2016

What changes were proposed in this pull request?

Currently our Optimizer may reorder predicates to run them more efficiently, but when the condition contains non-deterministic expressions, changing the order between the deterministic and non-deterministic parts may change the number of input rows. For example:
SELECT a FROM t WHERE rand() < 0.1 AND a = 1
and
SELECT a FROM t WHERE a = 1 AND rand() < 0.1
may call rand() a different number of times, and therefore their output rows differ.

This PR improves this by only pushing a predicate down if it is placed before any non-deterministic predicates.
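To make the call-count difference concrete, here is a small Scala sketch using plain collections (not Spark APIs; the `rand` helper and row values are made up for illustration):

```scala
import scala.util.Random

// Counts how many times the non-deterministic predicate is evaluated.
var calls = 0
val rng = new Random(42)
def rand(): Double = { calls += 1; rng.nextDouble() }

val rows = Seq(1, 1, 2, 3)

// rand() < 0.1 AND a = 1: rand() runs for every row.
calls = 0
rows.filter(a => rand() < 0.1 && a == 1)
val callsWhenFirst = calls // 4 -- one call per row

// a = 1 AND rand() < 0.1: rand() runs only for rows where a == 1.
calls = 0
rows.filter(a => a == 1 && rand() < 0.1)
val callsWhenSecond = calls // 2 -- only the two matching rows
```

Because rand() is evaluated against a different set of rows in the two orderings, the filter can keep different rows, which is why the optimizer must not move a deterministic predicate past a non-deterministic one.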

How was this patch tested?

Expanded the related test cases in FilterPushdownSuite.

@jiangxb1987 (Contributor, Author)

cc @liancheng @cloud-fan

@cloud-fan (Contributor)

ok to test

@SparkQA commented Jul 3, 2016

Test build #61691 has finished for PR 14012 at commit 856d86d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng (Contributor)

add to whitelist

Inline review on this diff hunk in PushDownPredicate (old and new versions of the predicate split, flattened by the page):

val (pushDown, stayUp) = splitConjunctivePredicates(condition).partition { cond =>
  cond.references.subsetOf(partitionAttrs) && cond.deterministic &&
  isPredicatePushdownAble = isPredicatePushdownAble && cond.deterministic
  isPredicatePushdownAble && cond.references.subsetOf(partitionAttrs) &&
Review comment (Contributor):
The following can be easier to read:

val (candidates, containingNonDeterministic) =
  splitConjunctivePredicates(condition).span(_.deterministic)

val (pushDown, rest) = candidates.partition { cond =>
  cond.references.subsetOf(partitionAttrs) && partitionAttrs.forall(_.isInstanceOf[Attribute])
}

val stayUp = rest ++ containingNonDeterministic
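For readers unfamiliar with `span`: unlike `partition`, it splits at the first element that fails the predicate, which is exactly what keeps deterministic conjuncts from being hoisted over a non-deterministic one. A minimal sketch with a simplified conjunct model (tuples standing in for Catalyst Expressions):

```scala
// (sql, deterministic) pairs standing in for Catalyst expressions.
val conjuncts = Seq(
  ("a = 1", true),
  ("rand() < 0.1", false),
  ("b = 2", true))

// span keeps only the longest deterministic prefix; "b = 2" stays up
// because it sits after the non-deterministic rand() conjunct.
val (candidates, containingNonDeterministic) = conjuncts.span(_._2)
// candidates == Seq(("a = 1", true))

// partition would also select "b = 2", illegally reordering it past rand().
val (det, nonDet) = conjuncts.partition(_._2)
// det == Seq(("a = 1", true), ("b = 2", true))
```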

Review comment (Contributor):

And we should move the partitionAttrs.forall(_.isInstanceOf[Attribute]) predicate out of the closure.

Reply from the author (@jiangxb1987):

Sure, I'll update that!

@SparkQA commented Jul 5, 2016

Test build #61748 has finished for PR 14012 at commit 856d86d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 5, 2016

Test build #61751 has finished for PR 14012 at commit c005645.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 5, 2016

Test build #61752 has finished for PR 14012 at commit a4e62f0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 5, 2016

Test build #61753 has finished for PR 14012 at commit 33d45de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 (Contributor, Author)

cc @liancheng please review this PR, thanks!

project.copy(child = Filter(replaceAlias(condition, aliasMap), grandChild))

// Push [[Filter]] operators through [[Window]] operators. Parts of the predicate that can be
// pushed beneath must satisfy the following two conditions:
Review comment (Contributor):

Nit: Remove "two".

@liancheng (Contributor)

LGTM except for some minor comments. Thanks for improving this!

@liancheng (Contributor)

One more thing, please complete the PR title.

@jiangxb1987 jiangxb1987 changed the title [SPARK-16343][SQL] Improve the PushDownPredicate rule to pushdown pre… [SPARK-16343][SQL] Improve the PushDownPredicate rule to pushdown predicates currectly in non-deterministic condition. Jul 8, 2016
// This is for ensuring all the partitioning expressions have been converted to alias
// in Analyzer. Thus, we do not need to check if the expressions in conditions are
// the same as the expressions used in partitioning columns.
if (partitionAttrs.forall(_.isInstanceOf[Attribute])) {
Review comment (Contributor):
I don't think this check is necessary. partitionAttrs is an AttributeSet, and AttributeSet extends Traversable[Attribute].
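The observation can be illustrated outside Catalyst: when a collection's element type is already the type being tested, the runtime check is vacuously true. Here `Attribute` is a hypothetical simplified stand-in for Catalyst's class:

```scala
// Hypothetical simplified stand-in for Catalyst's Attribute.
case class Attribute(name: String)

// Like AttributeSet, the element type is statically Attribute...
val partitionAttrs: Set[Attribute] = Set(Attribute("a"), Attribute("b"))

// ...so this forall can never be false, making the check redundant.
val alwaysTrue = partitionAttrs.forall(_.isInstanceOf[Attribute])
// alwaysTrue == true
```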

Reply from the author (@jiangxb1987):

Yep, I'll remove that check, thanks!

@cloud-fan (Contributor)

is it a typo in PR title? currectly -> correctly

@jiangxb1987 jiangxb1987 changed the title [SPARK-16343][SQL] Improve the PushDownPredicate rule to pushdown predicates currectly in non-deterministic condition. [SPARK-16343][SQL] Improve the PushDownPredicate rule to pushdown predicates correctly in non-deterministic condition. Jul 8, 2016
@SparkQA commented Jul 8, 2016

Test build #61985 has finished for PR 14012 at commit 98d369e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

 * @return (pushDown, stayUp)
 */
private def splitPushdownPredicates(
    condition: Expression)(specificRules: (Expression) => Boolean) = {
Review comment (Contributor):

Hmmm, it looks like duplicating this code is more readable. @liancheng what do you think?

Review comment (Contributor):

Yea... @jiangxb1987 Sorry, but I have to agree with @cloud-fan. It seems that factoring out this method makes the code harder to understand, mostly because the semantics of specificRules are quite convoluted. Could you please revert this part? Sorry again for the trouble!

Reply from the author (@jiangxb1987):

I have reverted this part, thanks!

@jiangxb1987 (Contributor, Author)

@liancheng please find some time to review the latest updates, thanks!

@SparkQA commented Jul 11, 2016

Test build #62100 has finished for PR 14012 at commit 9b2b5a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng (Contributor)

Thanks! Merged this to master.

@JoshRosen (Contributor)

Since this is a correctness fix, albeit a minor one, I'd like to backport it. I'm going to cherry-pick this to branch-2.0.

asfgit pushed a commit that referenced this pull request Sep 29, 2016
…dicates correctly in non-deterministic condition.

## What changes were proposed in this pull request?

Currently our Optimizer may reorder predicates to run them more efficiently, but when the condition contains non-deterministic expressions, changing the order between the deterministic and non-deterministic parts may change the number of input rows. For example:
```SELECT a FROM t WHERE rand() < 0.1 AND a = 1```
and
```SELECT a FROM t WHERE a = 1 AND rand() < 0.1```
may call rand() a different number of times, and therefore their output rows differ.

This PR improves this by only pushing a predicate down if it is placed before any non-deterministic predicates.

## How was this patch tested?

Expanded the related test cases in FilterPushdownSuite.

Author: 蒋星博 <jiangxingbo@meituan.com>

Closes #14012 from jiangxb1987/ppd.

(cherry picked from commit f376c37)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>