
Conversation

@wzhfy
Contributor

@wzhfy wzhfy commented Dec 17, 2016

What changes were proposed in this pull request?

Statistics in LogicalPlan should use attributes to refer to columns rather than column names, because two columns from two relations can have the same column name. But CatalogTable has no concept of attributes, nor of the broadcast hint that Statistics carries. Therefore, putting Statistics in CatalogTable is confusing.

We define a separate statistics structure for CatalogTable, which is responsible only for interacting with the metastore, and which is converted to the Statistics used in LogicalPlan when needed.
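
For illustration, a minimal sketch of the metastore-facing structure this PR introduces (field names are taken from the diff discussed below; the exact code may differ):

// Stored in CatalogTable; column stats are keyed by column name because the
// metastore knows nothing about catalyst attributes or the broadcast hint.
case class CatalogStatistics(
    sizeInBytes: BigInt,
    rowCount: Option[BigInt] = None,
    colStats: Map[String, ColumnStat] = Map.empty)

When a relation is planned, this structure is converted into the plan-side Statistics, whose column stats are keyed by Attribute (see the toPlanStats discussion later in this thread).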

How was this patch tested?

add test cases

@SparkQA

SparkQA commented Dec 17, 2016

Test build #70295 has finished for PR 16323 at commit d1679a3.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
  • case class CatalogStatistics(

@SparkQA

SparkQA commented Dec 17, 2016

Test build #70300 has finished for PR 16323 at commit f9db620.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CatalogStatistics(

@SparkQA

SparkQA commented Dec 17, 2016

Test build #70304 has finished for PR 16323 at commit 72a16e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Dec 18, 2016

cc @rxin @cloud-fan

/**
 * This class of Statistics is used in [[CatalogTable]] to interact with metastore.
 */
case class CatalogStatistics(
Contributor

Shall we define this class in the same file as CatalogTable?

Contributor Author

ok

@cloud-fan
Contributor

cloud-fan commented Dec 18, 2016

I think it's pretty safe to use the table stats as the stats of the leaf node (table relation), including column stats. The actually dangerous case is when we're going to estimate something, e.g. in Join or Aggregate.

So logically we should read the conf in Join or Aggregate and decide whether we want to estimate something or just do a naive calculation. However, the problem is that we can't get the conf in LogicalPlan.statistics.

A possible approach is to change LogicalPlan.statistics to def statistics(conf: CatalystConf). We would need to update all the implementations and call sites though.

@wzhfy
Contributor Author

wzhfy commented Dec 19, 2016

Changing LogicalPlan.statistics to def statistics(conf: CatalystConf) could cause two problems:

  1. we can't override it as a lazy val, and a def means we need to estimate the plan every time statistics is called, which will be a performance hit.
  2. we need to make sure a conf is available everywhere def statistics(conf: CatalystConf) is called.

@SparkQA

SparkQA commented Dec 19, 2016

Test build #70328 has finished for PR 16323 at commit 5dbaade.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CatalogStatistics(

@wzhfy
Contributor Author

wzhfy commented Dec 19, 2016

retest this please

@cloud-fan
Contributor

cloud-fan commented Dec 19, 2016

we can't override it as lazy val, and def means we need to estimate the plan every time statistics is called, which will be a performance hit.

I think we can do the caching manually:

// cache both variants so repeated calls don't re-estimate the plan
@transient var estimatedStats: Statistics = null
@transient var simpleStats: Statistics = null

def statistics(conf: CatalystConf): Statistics = {
  if (conf.enableCbo) {
    // CBO enabled: compute the estimated stats once and cache them
    if (estimatedStats == null) {
      estimatedStats = ...
    }
    estimatedStats
  } else {
    // CBO disabled: compute the naive stats once and cache them
    if (simpleStats == null) {
      simpleStats = ...
    }
    simpleStats
  }
}

we need to make sure we have conf everywhere def statistics(conf: CatalystConf) is used.

Is that a problem? I think all of the places that need to call statistics can access a CatalystConf.
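
For illustration, a tiny caller-side sketch under the proposed signature (hypothetical helper, not code from this PR; it only shows that a call site which already has a CatalystConf in scope can thread it through):

// hypothetical caller: any rule or strategy that already has a CatalystConf available
def expectedRowCount(plan: LogicalPlan, conf: CatalystConf): Option[BigInt] =
  plan.statistics(conf).rowCount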

@SparkQA

SparkQA commented Dec 19, 2016

Test build #70330 has finished for PR 16323 at commit 5dbaade.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CatalogStatistics(

@wzhfy
Contributor Author

wzhfy commented Dec 19, 2016

OK, I think it's doable. But since it's not a small change, let's wait for @rxin's comment.

@SparkQA

SparkQA commented Dec 19, 2016

Test build #70339 has started for PR 16323 at commit bd5eacc.

@wzhfy
Contributor Author

wzhfy commented Dec 19, 2016

retest this please

@SparkQA

SparkQA commented Dec 19, 2016

Test build #70347 has finished for PR 16323 at commit bd5eacc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

locationUri, inputFormat, outputFormat, serde, compressed, properties))
}

def withStats(cboStatsEnabled: Boolean): CatalogTable = {
Member
@viirya viirya Dec 20, 2016

Do we really need to get rid of the CatalogStatistics if the config is off? Actually, I think you can decide whether to use it or not later, when doing estimation, depending on this config. It seems there is no harm in always attaching this info to CatalogTable.

Contributor Author

Yes, I also think that's better, but as @cloud-fan said, we can't get the config in def statistics, so we would have to modify many places to support this. I'm about to make those modifications; do you have any advice for minimizing the changes?

Member

I can think of two approaches:

  1. Keep the current naive version of statistics and add a new statistics function that takes a conf. A default implementation of the new function simply returns the naive statistics. In Join or Aggregate, the new function can include more complex logic and return either the naive calculation or an estimation. The caller always calls the new function and passes in the current conf (a sketch follows below).
  2. Add a new statisticsCBO which doesn't take a conf, because it is called only when CBO is enabled. The caller then decides whether to call the non-CBO statistics or the CBO statisticsCBO.
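
A minimal sketch of the first approach (the method name stats and the helper estimatedStatistics are hypothetical; the conf.enableCbo flag name is reused from the earlier snippet in this thread):

// in LogicalPlan: a new conf-aware entry point whose default implementation
// simply returns the existing naive statistics
def stats(conf: CatalystConf): Statistics = statistics

// in Join (or Aggregate): override it to choose between estimation and the naive calculation
override def stats(conf: CatalystConf): Statistics = {
  if (conf.enableCbo) estimatedStatistics // hypothetical helper computing the CBO estimation
  else statistics                         // fall back to the naive calculation
}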

Contributor Author

Thanks. I think the first one is better; the second one would lead to many if-else branches on the caller side.

@wzhfy
Contributor Author

wzhfy commented Dec 21, 2016

Since adding a switch for CBO is not trivial, I want to do it in a separate PR and let this one deal only with decoupling Statistics from CatalogTable. Do you agree? @cloud-fan

@cloud-fan
Contributor

SGTM

@wzhfy wzhfy changed the title from "[SPARK-18911] [SQL] Define CatalogStatistics to interact with metastore and convert it to Statistics based on cbo switch" to "[SPARK-18911] [SQL] Define CatalogStatistics to interact with metastore and convert it to Statistics in relations" on Dec 22, 2016
@wzhfy
Contributor Author

wzhfy commented Dec 22, 2016

Updated. Please review @cloud-fan

@SparkQA

SparkQA commented Dec 22, 2016

Test build #70505 has finished for PR 16323 at commit d3227dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

/** Readable string representation for the CatalogStatistics. */
def simpleString: String = {
Contributor

why do you define a simpleString instead of overriding toString?

Contributor Author

Because we don't print column stats in it, it's not a "complete" string representation. Column stats can be too verbose and would make CatalogTable unreadable.

def simpleString: String = {
  Seq(s"sizeInBytes=$sizeInBytes",
    if (rowCount.isDefined) s"rowCount=${rowCount.get}" else ""
  ).filter(_.nonEmpty).mkString(", ")
Contributor

val rowCountString = if (rowCount.isDefined) s", ${rowCount.get} rows" else ""
s"$sizeInBytes bytes$rowCountString" 

Contributor Author

fixed

  if (colStats.contains(attr.name)) {
    matched.put(attr, colStats(attr.name))
  }
}
Contributor
@cloud-fan cloud-fan Dec 22, 2016

attributes.flatMap(a => colStats.get(a.name).map(a -> _)).toMap

Contributor Author

fixed

 * Convert [[CatalogStatistics]] to [[Statistics]], and match column stats to attributes based
 * on column names.
 */
def convert(attributes: Seq[Attribute]): Statistics = {
Contributor
@cloud-fan cloud-fan Dec 22, 2016

This is a bad name; it doesn't tell you anything without looking at the doc.
How about def toPlanStats(planOutput: ...)?

Contributor Author

That's a lot better, thanks!
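
For reference, a minimal sketch of the renamed method, combining the suggested signature with the attribute-matching one-liner from the review comment above (assumed shape, not necessarily the final code):

// On CatalogStatistics: build plan-side Statistics by matching the name-keyed
// column stats against the plan's output attributes.
def toPlanStats(planOutput: Seq[Attribute]): Statistics = {
  val matched = AttributeMap(planOutput.flatMap(a => colStats.get(a.name).map(a -> _)))
  Statistics(sizeInBytes = sizeInBytes, rowCount = rowCount, attributeStats = matched)
}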



/**
 * This class of statistics is used in [[CatalogTable]] to interact with metastore.
Member

Can you add a few words explaining why we don't use Statistics for CatalogTable?

Contributor Author

ok

sizeInBytes: BigInt,
rowCount: Option[BigInt] = None,
colStats: Map[String, ColumnStat] = Map.empty,
attributeStats: AttributeMap[ColumnStat] = AttributeMap(Nil),
Member

Will we estimate statistics for all attributes in the logical plan?

I mean, if an attribute comes not from a leaf node but from a later plan like Join, do we still have a ColumnStat for it?

If not, I don't think we need to call this parameter attributeStats instead of the original colStats.

Contributor Author

We will estimate attributes in the logical plan from the bottom up.

@SparkQA

SparkQA commented Dec 23, 2016

Test build #70547 has finished for PR 16323 at commit 573b560.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


private def checkStatsConversion(tableName: String, isDatasourceTable: Boolean): Unit = {
  // Create an empty table and run analyze command on it.
  val col = "c1"
Contributor

nit: c1 is so simple that we can write it directly instead of using a variable

Contributor Author

fixed

  // Create an empty table and run analyze command on it.
  val col = "c1"
  val createTableSql = if (isDatasourceTable) {
    s"CREATE TABLE $tableName ($col INT) USING PARQUET"
Contributor

let's create a table with 2 columns and analyze only one of them, to see if attributeStats contains only one entry.

Contributor Author

ok, fixed
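
For reference, a rough sketch of that check (hypothetical column names; it assumes the ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS command and the name-to-attribute conversion discussed above):

sql(s"CREATE TABLE $tableName (c1 INT, c2 STRING) USING PARQUET")
sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS c1")
// Only the analyzed column should end up in the plan-side attribute stats.
val planStats = spark.table(tableName).queryExecution.optimizedPlan.statistics
assert(planStats.attributeStats.size == 1)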

@cloud-fan
Contributor

LGTM, pending jenkins

@SparkQA

SparkQA commented Dec 24, 2016

Test build #70561 has finished for PR 16323 at commit 978bb11.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Dec 24, 2016

Test build #70564 has finished for PR 16323 at commit 978bb11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 3cff816 Dec 24, 2016
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Dec 24, 2016
[SPARK-18911] [SQL] Define CatalogStatistics to interact with metastore and convert it to Statistics in relations

Author: wangzhenhua <wangzhenhua@huawei.com>
Author: Zhenhua Wang <wzh_zju@163.com>

Closes apache#16323 from wzhfy/nameToAttr.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017