Conversation

wzhfy (Contributor) commented Dec 9, 2016

What changes were proposed in this pull request?

Support cardinality estimation and stats propagation for all join types.

Limitations:

  • For inner/outer joins without any equality condition, we estimate them as a cartesian product.
  • For left semi/anti joins, since we can't apply the inner-join heuristics to them, for now we just propagate the statistics from the left side. We should support them once more advanced stats (e.g. histograms) are available in Spark.

How was this patch tested?

Add a new test suite.

wzhfy (Contributor Author) commented Dec 9, 2016

cc @rxin @srinathshankar @cloud-fan

SparkQA commented Dec 9, 2016

Test build #69907 has started for PR 16228 at commit f0bbb43.

wzhfy (Contributor Author) commented Dec 9, 2016

I've left two issues undecided:

  1. Where can we turn CBO estimation on/off? I think we need such a switch, because otherwise we will use statistics whenever they exist in the metastore, even though they may be stale.
  2. Currently we use the column name as the key for column statistics, which is problematic: if the output of a join has columns from different tables with the same column name, they can't be distinguished. Can we use a combination string like table name + column name?

Tagar commented Dec 9, 2016

  1. That is great. Would it be easier to use an FK when it is available (HMS has FKs since Hive 2.1: https://issues.apache.org/jira/browse/HIVE-13076), and fall back to stats when no FK is defined between the columns?

Also, do I understand correctly that the assumption is: if two tables are joined by columns with the same name, the join columns have the same stats / set of values?

* formula:
* T(A IJ B) = T(A) * T(B) / (max(V(A.k1), V(B.k1)) * max(V(A.k2), V(B.k2)) * ... * max(V(A.kn), V(B.kn)))
* However, the denominator can become very large and excessively reduce the result, so we use a
* conservative strategy to take only the largest max(V(A.ki), V(B.ki)) as the denominator.
wzhfy (Contributor Author) commented Dec 9, 2016

Here, Hive uses an exponential decay to compute the denominator when the number of join keys > the number of join tables, i.e. ndv1 * ndv2^(1/2) * ndv3^(1/4)... I just use a more conservative strategy: max(ndv1, ndv2, ...). I'm not sure which one is better. Do you know of any theoretical or empirical support for Hive's strategy? @rxin @srinathshankar
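A minimal sketch contrasting the two denominator strategies under discussion (this is not the PR's code; it assumes the per-key NDVs passed in are already the pairwise max(V(A.ki), V(B.ki)) values, and that Hive applies the decay to NDVs in descending order):

```scala
object JoinDenominator {
  // Hive-style exponential decay: ndv1 * ndv2^(1/2) * ndv3^(1/4) * ...
  def exponentialDecay(ndvs: Seq[Double]): Double =
    ndvs.sortBy(-_).zipWithIndex.map { case (ndv, i) =>
      math.pow(ndv, 1.0 / (1L << i))
    }.product

  // The conservative strategy used in this PR: only the largest NDV.
  def conservativeMax(ndvs: Seq[Double]): Double = ndvs.max

  def main(args: Array[String]): Unit = {
    val ndvs = Seq(1000.0, 100.0, 10.0)
    println(exponentialDecay(ndvs)) // 1000 * 100^(1/2) * 10^(1/4) ≈ 17783
    println(conservativeMax(ndvs))  // 1000.0
  }
}
```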

Contributor

They probably estimate the number of distinct values in a vector of columns assuming a uniform distribution.

SparkQA commented Dec 9, 2016

Test build #69911 has started for PR 16228 at commit 64603b5.

joinType match {
case LeftAnti | LeftSemi =>
// LeftSemi and LeftAnti won't ever be bigger than left
left.statistics.copy()
Contributor

why do we need copy here?

Contributor

Statistics is immutable, so I think it's safe without copy.

Contributor Author

ok

rightKeys: Seq[Expression]): Seq[(AttributeReference, AttributeReference)] = {
leftKeys.zip(rightKeys).flatMap {
case (ExtractAttr(left), ExtractAttr(right)) => Some((left, right))
// Currently we don't deal with equal joins like key1 = key2 + 5.
Contributor

If we are not ready for expressions, I think we should not handle Cast either, as it may be tricky to handle overflow correctly (e.g. casting long to int).

Contributor Author

Yes, but I'm a little worried about losing estimation opportunities because of this rare case (in my understanding, this kind of downgrading cast is not common).

Contributor Author

Can we define a new parent class for such downgrading casts? Would this change be big?

Contributor

I don't think we should add a new expression just for the current implementation's limitations. BTW, handling only Cast may also lose a lot of estimation opportunities; we should support expressions before we release this feature.

wzhfy (Contributor Author) commented Dec 10, 2016

@Tagar Thanks for sharing this information. Yes, it would be better to use PK/FK, but that won't be done in this PR; we need to implement PK/FK constraints in Spark first.

the assumption is: if two tables are joined by columns with the same name, the join columns have the same stats / set of values?

It is true for inner join, but not true for outer joins, right?

Tagar commented Dec 11, 2016

@wzhfy, thanks for the feedback.
For outer joins, the cardinality estimates should be (see the sketch after this list):

  • left_outer_join_cardinality(table_A, table_B) = MAX(cardinality(A), inner_join_cardinality(table_A, table_B))
    So with a left outer join, you'll get a join cardinality of at least the number of rows on the left (table A);
  • right outer join is similar to left outer join (with exception s/A/B/g);
  • full_outer_join_cardinality(table_A, table_B) = cardinality(A) + cardinality(B) - inner_join_cardinality(table_A, table_B)
    Does this sound about right?
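A minimal sketch of these bounds (hypothetical helper names; innerCard stands for the NDV-based equi-join estimate discussed above):

```scala
// Left/right outer joins: at least the preserved side's row count.
def leftOuterCard(cardA: BigInt, innerCard: BigInt): BigInt = cardA.max(innerCard)
def rightOuterCard(cardB: BigInt, innerCard: BigInt): BigInt = cardB.max(innerCard)

// Full outer join via inclusion-exclusion: |A| + |B| - |A inner-join B|.
def fullOuterCard(cardA: BigInt, cardB: BigInt, innerCard: BigInt): BigInt =
  cardA + cardB - innerCard
```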

* T(A IJ B) = T(A) * T(B) / (max(V(A.k1), V(B.k1)) * max(V(A.k2), V(B.k2)) * ... * max(V(A.kn), V(B.kn)))
* However, the denominator can become very large and excessively reduce the result, so we use a
* conservative strategy to take only the largest max(V(A.ki), V(B.ki)) as the denominator.
*
Contributor

Can we also include a short description of how column stats are computed after the join?

Contributor Author

Yes, of course

wzhfy (Contributor Author) commented Dec 15, 2016

@Tagar For full outer join, how about cardinality = MAX(card(A) + card(B), innerCard(AB))?

isBroadcastable = false))

case _ => None
}
Contributor

I have a design comment. Given a join, this function computes predicate selectivity, plan cardinality, and column statistics. I wonder if it would make sense to encapsulate predicate-selectivity computation in its own function, i.e. selectivity is a property of a predicate, while cardinality (the number of rows) is a property of the data stream. Also, there might be different ways to compute the selectivity of a predicate (e.g. uniform vs. non-uniform distribution), so it might make sense to separate the computation of the two properties. Then, in the future, a selectivity hint could be used to override the default CBO selectivity computation.
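A hypothetical sketch of that separation (Stats and all names below are illustrative stand-ins, not the PR's types): selectivity comes from a pluggable strategy, and cardinality is derived from it afterwards.

```scala
final case class Stats(rowCount: BigInt, ndv: Map[String, BigInt])

trait SelectivityEstimator {
  def selectivity(joinKeys: Seq[(String, String)], left: Stats, right: Stats): Double
}

// Default uniform-distribution strategy: 1 / max(V(A.ki), V(B.ki)),
// keeping only the largest denominator (the conservative choice above).
object UniformSelectivity extends SelectivityEstimator {
  def selectivity(joinKeys: Seq[(String, String)], left: Stats, right: Stats): Double = {
    val maxNdv = joinKeys.map { case (lk, rk) => left.ndv(lk).max(right.ndv(rk)) }.max
    1.0 / maxNdv.toDouble
  }
}

// Cardinality is a separate property of the data stream.
def joinCardinality(left: Stats, right: Stats, sel: Double): BigInt =
  (BigDecimal(left.rowCount * right.rowCount) * BigDecimal(sel))
    .setScale(0, BigDecimal.RoundingMode.CEILING).toBigInt
```

A selectivity hint could then swap in a different SelectivityEstimator without touching the cardinality logic.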

Contributor Author

Thanks for the advice. We will consider adding such a function when we check in the code for Filter estimation; then we can look at this more comprehensively. I think that PR will be submitted soon.

Tagar commented Dec 15, 2016

@wzhfy, it's easier to check the validity of these kinds of formulas when you look at extreme cases.
Your formula for full outer join cardinality,

cardinality = MAX(card(A) + card(B), innerCard(AB))

in the extreme case where set(A) and set(B) are the same set, would give a calculated cardinality 2 times the actual cardinality.

While

full_outer_join_cardinality(table_A, table_B) = cardinality(A) + cardinality(B) - inner_join_cardinality(table_A, table_B)

will produce the correct result.

ps. I find this visualization http://www.radacad.com/wp-content/uploads/2015/07/joins.jpg very helpful.
https://en.wikipedia.org/wiki/Inclusion%E2%80%93exclusion_principle: |A ∪ B| = |A| + |B| - |A ∩ B|

Hope this helps. Thanks!
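A worked instance of this extreme case (hypothetical numbers): A and B are identical 100-row tables joined on a unique key, so the inner join also has 100 rows.

```scala
val cardA, cardB, innerCard = BigInt(100)
val inclusionExclusion = cardA + cardB - innerCard // 100: the true full outer size
val maxFormula = (cardA + cardB).max(innerCard)    // 200: twice the actual cardinality
```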

wzhfy (Contributor Author) commented Dec 15, 2016

@Tagar We can always find extreme cases to which these formulas don't apply. In my opinion, it's better to over-estimate than to under-estimate, since under-estimation can lead to OOM problems, e.g. broadcasting a very large result.

If A is a big table and B is a small one, and every A.k has a match in B (a common case for PK and FK), then

cardinality(A) + cardinality(B) - inner_join_cardinality(table_A, table_B)

becomes card(B), which is dramatically smaller than the real outer join cardinality (a worked example follows below). Worse, it can even be negative: if all A.k and B.k have the same value, the inner join becomes a cartesian product.

This formula,

cardinality = MAX(card(A) + card(B), innerCard(AB))

although it sometimes over-estimates, is still obviously better than the original one in Spark: card(A) * card(B).
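The worked PK/FK case (hypothetical numbers), where every row of A matches exactly one row of B:

```scala
val cardA = BigInt(1000000) // big fact table
val cardB = BigInt(1000)    // small dimension table
val innerCard = cardA       // one match per A row, so the inner join has card(A) rows
val inclusionExclusion = cardA + cardB - innerCard // 1000 = card(B): far below the real size
val maxFormula = (cardA + cardB).max(innerCard)    // 1001000: close to the real size
```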

Tagar commented Dec 15, 2016

@wzhfy, I think overestimating cardinality could be as bad as underestimating it.
For example, the optimizer could prematurely switch to SortMergeJoin when it could have used a broadcast hash join.
But I agree, this PR is a great improvement over the current cardinality estimates.

wzhfy (Contributor Author) commented Dec 17, 2016

To solve the two issues I mentioned above, I've sent a separate PR here.
We'll need to rebase this PR after that one is resolved.

@wzhfy wzhfy changed the title [WIP] [SPARK-17076] [SQL] Cardinality estimation for join based on basic column statistics [SPARK-17076] [SQL] Cardinality estimation for join based on basic column statistics Dec 23, 2016
SparkQA commented Dec 23, 2016

Test build #70533 has finished for PR 16228 at commit c3e3a48.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class LeftSemiAntiEstimation(join: Join)

Contributor Author

@Tagar I took your advice about the full outer join, but with a small change: lower-bounding it by innerRows.
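A minimal sketch of the resulting full-outer estimate as described (variable names are illustrative; the exact code is in the PR diff):

```scala
// Inclusion-exclusion, lower-bounded by the inner-join cardinality.
def fullOuterRows(leftRows: BigInt, rightRows: BigInt, innerRows: BigInt): BigInt =
  (leftRows + rightRows - innerRows).max(innerRows)
```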

SparkQA commented Dec 23, 2016

Test build #70544 has finished for PR 16228 at commit 2c9d6c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 26, 2016

Test build #70589 has finished for PR 16228 at commit de63b59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 29, 2016

Test build #70708 has started for PR 16228 at commit df839f8.

wzhfy (Contributor Author) commented Dec 29, 2016

retest this please

SparkQA commented Dec 29, 2016

Test build #70717 has finished for PR 16228 at commit df839f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 13, 2017

Test build #71307 has finished for PR 16228 at commit ffb9eee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

wzhfy (Contributor Author) commented Jan 13, 2017

This PR is rebased and ready for review. @rxin

// 2. Estimate the number of output rows
val leftRows = leftStats.rowCount.get
val rightRows = rightStats.rowCount.get
val innerRows = ceil(BigDecimal(leftRows * rightRows) * selectivity)
Contributor

nit: joinedRows

}

// 3. Update statistics based on the output of join
val intersectedStats = if (selectivity == 0) {
Contributor

What's the difference between selectivity == 0 and outputRows == 0? Does it only matter for outer joins?

Contributor

For outer joins, if the selectivity is 0, then the number of output rows is the same as the number of left/right side rows, and the column stats should also be the same as the left/right side columns, while the other side's columns are all null.

Contributor

let's name it joinKeyStats

Contributor Author

yea good point, thanks

leftKeys: Seq[Expression],
rightKeys: Seq[Expression]): Seq[(AttributeReference, AttributeReference)] = {
leftKeys.zip(rightKeys).flatMap {
case (lk: AttributeReference, rk: AttributeReference) => Some((lk, rk))
Contributor

we can check column stats existence here, so that we don't need to do columnStatsExist((leftStats, leftKey), (rightStats, rightKey)) again and again later.

protected val conf = SimpleCatalystConf(caseSensitiveAnalysis = true, cboEnabled = true)

def getColSize(attribute: Attribute, colStat: ColumnStat): Long = attribute.dataType match {
case StringType => colStat.avgLen + 8 + 4
Contributor

explain the + 8 + 4?
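For context, a hedged guess rather than anything stated in this PR: the + 8 + 4 most likely accounts for the fixed per-value overhead of a string in Spark's internal row format, roughly an 8-byte offset/pointer word plus a 4-byte length field, added on top of the average payload length avgLen.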

def apply(min: Option[Any], max: Option[Any], dataType: DataType): Range = dataType match {
case StringType | BinaryType => new DefaultRange()
case _ if min.isEmpty || max.isEmpty => new NullRange()
case _ => toNumericRange(min.get, max.get, dataType)
Contributor

This doesn't work for the empty column stats you defined in https://github.com/apache/spark/pull/16228/files#diff-6387e7aaeb7d8e0cb1457b9d0fe5cd00R270

That's why I was worried about the empty stats: they break some assumptions, like numeric-type stats must have max/min.

Contributor Author

OK, let's remove empty column stats. When we know rowCount = 0, we can derive that the column stats are empty; we don't need to keep them. What do you think?

Contributor

SGTM


/** Set up tables and their columns for testing */
private val columnInfo: AttributeMap[ColumnStat] = AttributeMap(Seq(
attr("key11") -> ColumnStat(distinctCount = 5, min = Some(1), max = Some(5), nullCount = 0,
Contributor

How about key-1-5, key-5-9, etc.? Then we can tell the key's value range directly from the name.

// table2 (key21 int, key22 int): (1, 2), (2, 3), (2, 4)
// key12 and key22 are disjoint
val join = Join(table1, table2, Inner, Some(
And(EqualTo(nameToAttr("key11"), nameToAttr("key21")),
Contributor

Actually we can just use key12 = key22, so that it's more distinct from the inner join test with multiple equi-join keys.

wzhfy (Contributor Author) commented Feb 14, 2017

This PR is updated, please review. @cloud-fan

val minNdv = leftKeyStats.distinctCount.min(rightKeyStats.distinctCount)
val (newMin1, newMax1, newMin2, newMax2) =
Range.intersect(lRange, rRange, leftKey.dataType, rightKey.dataType)
intersectedStats.put(leftKey, intersectedColumnStat(leftKeyStats, minNdv,
Contributor

Logically the join keys should have the same column stats; we can write it more explicitly:

assert(leftKey.dataType.sameType(rightKey.dataType))
val stats = ColumnStat(minNdv, newMin, newMax, nullCount = 0) // plus some more logic to update the avg/max length.
intersectedStats.put(leftKey, stats)
intersectedStats.put(rightKey, stats)

r1: Range,
r2: Range,
dt1: DataType,
dt2: DataType): (Option[Any], Option[Any], Option[Any], Option[Any]) = {
Contributor

We can simplify this: we only calculate intersections for same-type ranges.

updateAttrStats(outputRows, fromLeft, inputAttrStats, joinKeyStats) ++
fromRight.map(a => (a, inputAttrStats(a)))
case FullOuter =>
attributesWithStat.map(a => (a, inputAttrStats(a)))
Contributor

this is just inputAttrStats right?


/**
* Propagate or update column stats for output attributes.
* 1. For empty output, we don't need to keep any column stats.
Contributor

when we hit this method, the outputRows will never be 0 right?

if (joinKeyStats.contains(a)) {
outputAttrStats += a -> joinKeyStats(a)
} else {
val oldCS = oldAttrStats(a)
Contributor

nit: oldColumnStats

val inputAttrStats = AttributeMap(
leftStats.attributeStats.toSeq ++ rightStats.attributeStats.toSeq)
// Propagate the original column stats
val outputAttrStats = getOutputMap(inputAttrStats, join.output)
Contributor

is it just inputAttrStats?

if (rowCountsExist(conf, join.left)) {
val leftStats = join.left.stats(conf)
// Propagate the original column stats for cartesian product
val outputAttrStats = getOutputMap(leftStats.attributeStats, join.output)
Contributor

is it just leftStats.attributeStats?

)

/** Columns in a table with two rows */
val columnInfo2 = mutable.LinkedHashMap[Attribute, ColumnStat](
Contributor

Are these totally the same as columnInfo1? Maybe we can create a method to do this.

cloud-fan (Contributor)

LGTM except some minor comments, thanks for working on it!

SparkQA commented Feb 15, 2017

Test build #72903 has finished for PR 16228 at commit e8930d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 15, 2017

Test build #72916 has finished for PR 16228 at commit 8e2d5ae.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 15, 2017

Test build #72917 has finished for PR 16228 at commit 8182123.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor)

thanks, merging to master!

@asfgit asfgit closed this in 601b9c3 Feb 15, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 16, 2017
…umn statistics

Author: Zhenhua Wang <wzh_zju@163.com>
Author: wangzhenhua <wangzhenhua@huawei.com>

Closes apache#16228 from wzhfy/joinEstimate.