-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-18911] [SQL] Define CatalogStatistics to interact with metastore and convert it to Statistics in relations #16323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #70295 has finished for PR 16323 at commit
|
…atistics based on cbo switch
|
Test build #70300 has finished for PR 16323 at commit
|
|
Test build #70304 has finished for PR 16323 at commit
|
|
cc @rxin @cloud-fan |
| /** | ||
| * This class of Statistics is used in [[CatalogTable]] to interact with metastore. | ||
| */ | ||
| case class CatalogStatistics( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we define this class in the same file of CatalogTable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
|
I think it's pretty safe to use table stats as the stats of the leaf node(table relation), including column stats. The actual dangerous one is when we going to estimate something, e.g. in So logically we should read the conf in A possible approach is, we can change |
|
Change LogicalPlan.statistics to def statistics(conf: CatalystConf) could have two problems:
|
|
Test build #70328 has finished for PR 16323 at commit
|
|
retest this please |
I think we can do the cache manually:
Do we have a problem here? I think all of the places needing to call |
|
Test build #70330 has finished for PR 16323 at commit
|
|
Ok, I think it's doable. But since it's not a small change, let's wait @rxin for his comment. |
|
Test build #70339 has started for PR 16323 at commit |
|
retest this please |
|
Test build #70347 has finished for PR 16323 at commit
|
| locationUri, inputFormat, outputFormat, serde, compressed, properties)) | ||
| } | ||
|
|
||
| def withStats(cboStatsEnabled: Boolean): CatalogTable = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to get rid of the CatalogStatistics if the config is off? Actually I think you can decide whether to use it or not when doing estimation later, depending on this config. It seems no harm to always attach this info to CatalogTable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I also think that's better, but as @cloud-fan said, we can't get the config in def statistics, we have to modify many places to support this. I'm about to do such modifications, do you have any advices to minimize the changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can think of two approaches:
We can keep the current naive version of statistics and add new statistics function which takes conf.
A default implementation of the new statistics function simply returns the naive version of statistics.
In Join or Aggregate, we can include more complex logic in the new statistics to return naive calculation or something estimation.
The caller always calls new statistics function and passes in current conf.
Add new statisticsCBO which doesn't take conf because it is called only cbo is enabled. So the caller decides to call non-cbo version statistics or cbo version statisticsCBO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I think the first one is better, the second one will lead to many if-else on caller sides.
|
Since adding a switch for cbo is not trivial, I want to do it in a separate pr, and let this one only deal with decoupling Statistics from CatalogTable. Do you agree? @cloud-fan |
|
SGTM |
|
Updated. Please review @cloud-fan |
|
Test build #70505 has finished for PR 16323 at commit
|
| } | ||
|
|
||
| /** Readable string representation for the CatalogStatistics. */ | ||
| def simpleString: String = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you define a simpleString instead of override toString?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we don't print column stats in it, it's not a "complete" string representation. Column stats can be too much and make CatalogTable unreadable.
| def simpleString: String = { | ||
| Seq(s"sizeInBytes=$sizeInBytes", | ||
| if (rowCount.isDefined) s"rowCount=${rowCount.get}" else "" | ||
| ).filter(_.nonEmpty).mkString(", ") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
val rowCountString = if (rowCount.isDefined) s", ${rowCount.get} rows" else ""
s"$sizeInBytes bytes$rowCountString"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| if (colStats.contains(attr.name)) { | ||
| matched.put(attr, colStats(attr.name)) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
attributes.flatMap(a => colStats.get(a.name).map(a -> _)).toMap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| * Convert [[CatalogStatistics]] to [[Statistics]], and match column stats to attributes based | ||
| * on column names. | ||
| */ | ||
| def convert(attributes: Seq[Attribute]): Statistics = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bad name, it doesn't tell anything, without looking at the doc.
How about def toPlanStats(planOuput: ...)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a lot better, thanks!
|
|
||
|
|
||
| /** | ||
| * This class of statistics is used in [[CatalogTable]] to interact with metastore. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add few words explaining why don't use Statistics for CatalogTable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
| sizeInBytes: BigInt, | ||
| rowCount: Option[BigInt] = None, | ||
| colStats: Map[String, ColumnStat] = Map.empty, | ||
| attributeStats: AttributeMap[ColumnStat] = AttributeMap(Nil), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will we estimate statistics for all attributes in logical plan?
I meant if an attribute is not coming from a leaf node but from a later plan like Join, do we still have ColumnStat for it?
If not, I think we don't need to call this parameter as attributeStats, instead of original colStats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will estimate attributes in logical plan from the bottom up.
|
Test build #70547 has finished for PR 16323 at commit
|
|
|
||
| private def checkStatsConversion(tableName: String, isDatasourceTable: Boolean): Unit = { | ||
| // Create an empty table and run analyze command on it. | ||
| val col = "c1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: c1 is so simple that we can write in directly instead of using a variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| // Create an empty table and run analyze command on it. | ||
| val col = "c1" | ||
| val createTableSql = if (isDatasourceTable) { | ||
| s"CREATE TABLE $tableName ($col INT) USING PARQUET" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's create a table with 2 columns, and only analyze one column, to see if the attributeStats only contains one entry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, fixed
|
LGTM, pending jenkins |
|
Test build #70561 has finished for PR 16323 at commit
|
|
retest this please |
|
Test build #70564 has finished for PR 16323 at commit
|
|
thanks, merging to master! |
…e and convert it to Statistics in relations ## What changes were proposed in this pull request? Statistics in LogicalPlan should use attributes to refer to columns rather than column names, because two columns from two relations can have the same column name. But CatalogTable doesn't have the concepts of attribute or broadcast hint in Statistics. Therefore, putting Statistics in CatalogTable is confusing. We define a different statistic structure in CatalogTable, which is only responsible for interacting with metastore, and is converted to statistics in LogicalPlan when it is used. ## How was this patch tested? add test cases Author: wangzhenhua <wangzhenhua@huawei.com> Author: Zhenhua Wang <wzh_zju@163.com> Closes apache#16323 from wzhfy/nameToAttr.
…e and convert it to Statistics in relations ## What changes were proposed in this pull request? Statistics in LogicalPlan should use attributes to refer to columns rather than column names, because two columns from two relations can have the same column name. But CatalogTable doesn't have the concepts of attribute or broadcast hint in Statistics. Therefore, putting Statistics in CatalogTable is confusing. We define a different statistic structure in CatalogTable, which is only responsible for interacting with metastore, and is converted to statistics in LogicalPlan when it is used. ## How was this patch tested? add test cases Author: wangzhenhua <wangzhenhua@huawei.com> Author: Zhenhua Wang <wzh_zju@163.com> Closes apache#16323 from wzhfy/nameToAttr.
What changes were proposed in this pull request?
Statistics in LogicalPlan should use attributes to refer to columns rather than column names, because two columns from two relations can have the same column name. But CatalogTable doesn't have the concepts of attribute or broadcast hint in Statistics. Therefore, putting Statistics in CatalogTable is confusing.
We define a different statistic structure in CatalogTable, which is only responsible for interacting with metastore, and is converted to statistics in LogicalPlan when it is used.
How was this patch tested?
add test cases