
Conversation

@wzhfy
Contributor

@wzhfy wzhfy commented Dec 17, 2016

What changes were proposed in this pull request?

Statistics in LogicalPlan should use attributes to refer to columns rather than column names, because two columns from two relations can have the same column name. But CatalogTable has no concept of attributes, nor of the broadcast hint that Statistics carries. Therefore, putting Statistics in CatalogTable is confusing.

We define a separate statistics structure for CatalogTable, which is responsible only for interacting with the metastore, and which is converted to the Statistics used in LogicalPlan when needed.
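
For illustration, a minimal sketch of the metastore-facing structure this PR introduces (field names are taken from the diff discussed below; the exact code may differ):

// Stored in CatalogTable; column stats are keyed by column name because the
// metastore knows nothing about catalyst attributes or the broadcast hint.
case class CatalogStatistics(
    sizeInBytes: BigInt,
    rowCount: Option[BigInt] = None,
    colStats: Map[String, ColumnStat] = Map.empty)

When a relation is planned, this structure is converted into the plan-side Statistics, whose column stats are keyed by Attribute (see the toPlanStats discussion later in this thread).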

How was this patch tested?

add test cases

@SparkQA

SparkQA commented Dec 17, 2016

Test build #70295 has finished for PR 16323 at commit d1679a3.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
  • case class CatalogStatistics(

@SparkQA

SparkQA commented Dec 17, 2016

Test build #70300 has finished for PR 16323 at commit f9db620.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CatalogStatistics(

@SparkQA

SparkQA commented Dec 17, 2016

Test build #70304 has finished for PR 16323 at commit 72a16e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Dec 18, 2016

cc @rxin @cloud-fan

/**
 * This class of Statistics is used in [[CatalogTable]] to interact with metastore.
 */
case class CatalogStatistics(
Contributor

Shall we define this class in the same file as CatalogTable?

Contributor Author

ok

@cloud-fan
Contributor

cloud-fan commented Dec 18, 2016

I think it's pretty safe to use the table stats as the stats of the leaf node (table relation), including column stats. The actually dangerous case is when we're going to estimate something, e.g. in Join or Aggregate.

So logically we should read the conf in Join or Aggregate and decide whether we want to estimate something or just do a naive calculation. However, the problem is that we can't get the conf in LogicalPlan.statistics.

A possible approach is to change LogicalPlan.statistics to def statistics(conf: CatalystConf). We would need to update all the implementations and call sites though.

@wzhfy
Contributor Author

wzhfy commented Dec 19, 2016

Changing LogicalPlan.statistics to def statistics(conf: CatalystConf) could cause two problems:

  1. we can't override it as a lazy val, and a def means we need to estimate the plan every time statistics is called, which will be a performance hit.
  2. we need to make sure a conf is available everywhere def statistics(conf: CatalystConf) is called.

@SparkQA

SparkQA commented Dec 19, 2016

Test build #70328 has finished for PR 16323 at commit 5dbaade.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CatalogStatistics(

@wzhfy
Contributor Author

wzhfy commented Dec 19, 2016

retest this please

@cloud-fan
Contributor

cloud-fan commented Dec 19, 2016

we can't override it as lazy val, and def means we need to estimate the plan every time statistics is called, which will be a performance hit.

I think we can do the caching manually:

// cache both variants so repeated calls don't re-estimate the plan
@transient var estimatedStats: Statistics = null
@transient var simpleStats: Statistics = null

def statistics(conf: CatalystConf): Statistics = {
  if (conf.enableCbo) {
    // CBO enabled: compute the estimated stats once and cache them
    if (estimatedStats == null) {
      estimatedStats = ...
    }
    estimatedStats
  } else {
    // CBO disabled: compute the naive stats once and cache them
    if (simpleStats == null) {
      simpleStats = ...
    }
    simpleStats
  }
}

we need to make sure we have conf everywhere def statistics(conf: CatalystConf) is used.

Is that a problem? I think all of the places that need to call statistics can access a CatalystConf.
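
For illustration, a tiny caller-side sketch under the proposed signature (hypothetical helper, not code from this PR; it only shows that a call site which already has a CatalystConf in scope can thread it through):

// hypothetical caller: any rule or strategy that already has a CatalystConf available
def expectedRowCount(plan: LogicalPlan, conf: CatalystConf): Option[BigInt] =
  plan.statistics(conf).rowCount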

@SparkQA

SparkQA commented Dec 19, 2016

Test build #70330 has finished for PR 16323 at commit 5dbaade.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CatalogStatistics(

@wzhfy
Contributor Author

wzhfy commented Dec 19, 2016

OK, I think it's doable. But since it's not a small change, let's wait for @rxin's comment.

@SparkQA

SparkQA commented Dec 19, 2016

Test build #70339 has started for PR 16323 at commit bd5eacc.

@wzhfy
Contributor Author

wzhfy commented Dec 19, 2016

retest this please

@SparkQA

SparkQA commented Dec 19, 2016

Test build #70347 has finished for PR 16323 at commit bd5eacc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

locationUri, inputFormat, outputFormat, serde, compressed, properties))
}

def withStats(cboStatsEnabled: Boolean): CatalogTable = {
Member
@viirya viirya Dec 20, 2016

Do we really need to get rid of the CatalogStatistics if the config is off? Actually, I think you can decide whether to use it or not later, when doing estimation, depending on this config. It seems there is no harm in always attaching this info to CatalogTable.

Contributor Author

Yes, I also think that's better, but as @cloud-fan said, we can't get the config in def statistics, so we would have to modify many places to support this. I'm about to make those modifications; do you have any advice for minimizing the changes?

Member

I can think of two approaches:

  1. Keep the current naive version of statistics and add a new statistics function that takes a conf. A default implementation of the new function simply returns the naive statistics. In Join or Aggregate, the new function can include more complex logic and return either the naive calculation or an estimation. The caller always calls the new function and passes in the current conf (a sketch follows below).
  2. Add a new statisticsCBO which doesn't take a conf, because it is called only when CBO is enabled. The caller then decides whether to call the non-CBO statistics or the CBO statisticsCBO.
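
A minimal sketch of the first approach (the method name stats and the helper estimatedStatistics are hypothetical; the conf.enableCbo flag name is reused from the earlier snippet in this thread):

// in LogicalPlan: a new conf-aware entry point whose default implementation
// simply returns the existing naive statistics
def stats(conf: CatalystConf): Statistics = statistics

// in Join (or Aggregate): override it to choose between estimation and the naive calculation
override def stats(conf: CatalystConf): Statistics = {
  if (conf.enableCbo) estimatedStatistics // hypothetical helper computing the CBO estimation
  else statistics                         // fall back to the naive calculation
}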

Contributor Author

Thanks. I think the first one is better; the second one would lead to many if-else branches on the caller side.

@wzhfy
Contributor Author

wzhfy commented Dec 21, 2016

Since adding a switch for CBO is not trivial, I want to do it in a separate PR and let this one deal only with decoupling Statistics from CatalogTable. Do you agree? @cloud-fan

@cloud-fan
Contributor

SGTM

@wzhfy wzhfy changed the title from "[SPARK-18911] [SQL] Define CatalogStatistics to interact with metastore and convert it to Statistics based on cbo switch" to "[SPARK-18911] [SQL] Define CatalogStatistics to interact with metastore and convert it to Statistics in relations" on Dec 22, 2016
@wzhfy
Contributor Author

wzhfy commented Dec 22, 2016

Updated. Please review @cloud-fan

@SparkQA

SparkQA commented Dec 22, 2016

Test build #70505 has finished for PR 16323 at commit d3227dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

/** Readable string representation for the CatalogStatistics. */
def simpleString: String = {
Contributor

why do you define a simpleString instead of overriding toString?

Contributor Author

Because we don't print column stats in it, it's not a "complete" string representation. Column stats can be too verbose and would make CatalogTable unreadable.

def simpleString: String = {
  Seq(s"sizeInBytes=$sizeInBytes",
    if (rowCount.isDefined) s"rowCount=${rowCount.get}" else ""
  ).filter(_.nonEmpty).mkString(", ")
Contributor

val rowCountString = if (rowCount.isDefined) s", ${rowCount.get} rows" else ""
s"$sizeInBytes bytes$rowCountString" 

Contributor Author

fixed

  if (colStats.contains(attr.name)) {
    matched.put(attr, colStats(attr.name))
  }
}
Contributor
@cloud-fan cloud-fan Dec 22, 2016

attributes.flatMap(a => colStats.get(a.name).map(a -> _)).toMap

Contributor Author

fixed

 * Convert [[CatalogStatistics]] to [[Statistics]], and match column stats to attributes based
 * on column names.
 */
def convert(attributes: Seq[Attribute]): Statistics = {
Contributor
@cloud-fan cloud-fan Dec 22, 2016

This is a bad name; it doesn't tell you anything without looking at the doc.
How about def toPlanStats(planOutput: ...)?

Contributor Author

That's a lot better, thanks!
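
For reference, a minimal sketch of the renamed method, combining the suggested signature with the attribute-matching one-liner from the review comment above (assumed shape, not necessarily the final code):

// On CatalogStatistics: build plan-side Statistics by matching the name-keyed
// column stats against the plan's output attributes.
def toPlanStats(planOutput: Seq[Attribute]): Statistics = {
  val matched = AttributeMap(planOutput.flatMap(a => colStats.get(a.name).map(a -> _)))
  Statistics(sizeInBytes = sizeInBytes, rowCount = rowCount, attributeStats = matched)
}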



/**
 * This class of statistics is used in [[CatalogTable]] to interact with metastore.
Member

Can you add a few words explaining why we don't use Statistics for CatalogTable?

Contributor Author

ok

sizeInBytes: BigInt,
rowCount: Option[BigInt] = None,
colStats: Map[String, ColumnStat] = Map.empty,
attributeStats: AttributeMap[ColumnStat] = AttributeMap(Nil),
Member

Will we estimate statistics for all attributes in the logical plan?

I mean, if an attribute comes not from a leaf node but from a later plan like Join, do we still have a ColumnStat for it?

If not, I don't think we need to call this parameter attributeStats instead of the original colStats.

Contributor Author

We will estimate attributes in the logical plan from the bottom up.

@SparkQA

SparkQA commented Dec 23, 2016

Test build #70547 has finished for PR 16323 at commit 573b560.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


private def checkStatsConversion(tableName: String, isDatasourceTable: Boolean): Unit = {
  // Create an empty table and run analyze command on it.
  val col = "c1"
Contributor

nit: c1 is so simple that we can write it directly instead of using a variable

Contributor Author

fixed

  // Create an empty table and run analyze command on it.
  val col = "c1"
  val createTableSql = if (isDatasourceTable) {
    s"CREATE TABLE $tableName ($col INT) USING PARQUET"
Contributor

let's create a table with 2 columns and analyze only one of them, to see if attributeStats contains only one entry.

Contributor Author

ok, fixed
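
For reference, a rough sketch of that check (hypothetical column names; it assumes the ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS command and the name-to-attribute conversion discussed above):

sql(s"CREATE TABLE $tableName (c1 INT, c2 STRING) USING PARQUET")
sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS c1")
// Only the analyzed column should end up in the plan-side attribute stats.
val planStats = spark.table(tableName).queryExecution.optimizedPlan.statistics
assert(planStats.attributeStats.size == 1)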

@cloud-fan
Contributor

LGTM, pending jenkins

@SparkQA

SparkQA commented Dec 24, 2016

Test build #70561 has finished for PR 16323 at commit 978bb11.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Dec 24, 2016

Test build #70564 has finished for PR 16323 at commit 978bb11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 3cff816 Dec 24, 2016
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Dec 24, 2016
[SPARK-18911] [SQL] Define CatalogStatistics to interact with metastore and convert it to Statistics in relations

Author: wangzhenhua <wangzhenhua@huawei.com>
Author: Zhenhua Wang <wzh_zju@163.com>

Closes apache#16323 from wzhfy/nameToAttr.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017