[SPARK-19678][SQL] remove MetastoreRelation #17015
cloud-fan wants to merge 3 commits into apache:master from
Conversation
Test build #73212 has finished for PR 17015 at commit
Do we still need this? Data source tables always do metastore partition pruning.
Checked the history. It sounds like @liancheng can answer whether this is still needed or not. : )
Conceptually, table and relation are the same. How about keeping the original name `CatalogRelation`?
Force-pushed 483fcee to 1c8e0c4
Test build #73233 has finished for PR 17015 at commit
Force-pushed c7a22e6 to a383a13
Test build #73234 has finished for PR 17015 at commit
Test build #73235 has finished for PR 17015 at commit
`dataCols` and `partitionCols` are needed only because `CatalogRelation` extends `MultiInstanceRelation`?
Yes, and at least we need a `Seq[Attribute]` as output.
Force-pushed 247d3df to b61910e
I don't think
Test build #73241 has finished for PR 17015 at commit
Test build #73246 has finished for PR 17015 at commit
Force-pushed b61910e to d9c172b
Add two more asserts?

assert(tableMeta.partitionSchema == StructType.fromAttributes(partitionCols))
assert(tableMeta.dataSchema.asNullable == StructType.fromAttributes(dataCols))

Do we need to call asNullable for partitionSchema? In the original SimpleCatalogRelation, we did it for the output.
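To illustrate why the comparison goes through asNullable: the attributes resolved at analysis time may be marked nullable even when the catalog schema says otherwise, so strict schema equality is too strong. Below is a toy model of that idea; `Field`, `Schema`, and `asNullable` here are simplified stand-ins, not Spark's real `StructField`/`StructType` classes.

```scala
// Toy stand-ins for StructField / StructType, only to show the
// nullability-relaxation trick discussed above.
case class Field(name: String, nullable: Boolean)

case class Schema(fields: Seq[Field]) {
  // Mirrors the idea of StructType.asNullable: relax every field to nullable.
  def asNullable: Schema = Schema(fields.map(_.copy(nullable = true)))
}

val dataSchema  = Schema(Seq(Field("id", nullable = false), Field("name", nullable = true)))
val outputAttrs = Schema(Seq(Field("id", nullable = true), Field("name", nullable = true)))

// Strict equality fails because nullability differs...
assert(dataSchema != outputAttrs)
// ...but comparing the nullable-relaxed forms succeeds.
assert(dataSchema.asNullable == outputAttrs)
```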
Test build #73291 has finished for PR 17015 at commit
How about overriding cleanArgs?

override lazy val cleanArgs: Seq[Any] = Seq(tableMeta)

We did the same thing in LogicalRelation. Then we do not need to implement sameResult for SubqueryAlias; the super sameResult always removes SubqueryAlias first.
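The effect of the suggestion can be sketched with a toy plan-node hierarchy (this is not Spark's real API, just the comparison pattern): two nodes are considered to produce the same result when their *cleaned* constructor arguments match, so overriding cleanArgs to drop cosmetic fields (such as expression ids) is enough, with no custom sameResult needed.

```scala
// Toy illustration of the cleanArgs/sameResult pattern; PlanNode and
// Relation are hypothetical stand-ins for Spark's plan classes.
abstract class PlanNode {
  def args: Seq[Any]
  // Default: compare on all arguments. Subclasses override to drop
  // fields that do not affect the query result.
  def cleanArgs: Seq[Any] = args
  def sameResult(other: PlanNode): Boolean =
    getClass == other.getClass && cleanArgs == other.cleanArgs
}

case class Relation(tableMeta: String, exprId: Long) extends PlanNode {
  def args: Seq[Any] = Seq(tableMeta, exprId)
  // As the review suggests: compare only on the table metadata,
  // ignoring the per-instance expression id.
  override def cleanArgs: Seq[Any] = Seq(tableMeta)
}

// Same table, different expression ids: still "same result".
assert(Relation("t1", 1L).sameResult(Relation("t1", 2L)))
// Different tables: not the same result.
assert(!Relation("t1", 1L).sameResult(Relation("t2", 1L)))
```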
We might need a comment to explain it. (We don't support Hive bucketed tables; getCached is only used for converting Hive tables to data source tables.)
Normally we use fs for the file system. How about renaming it to fsRelation?
Like what we did above, how about adding the same comment?
// We don't support hive bucketed tables, only ones we write out.
This is only used in one place. Can we get rid of it?
Like the comment on the original @param, how about replacing CatalogRelation with CatalogTable?
Force-pushed d9c172b to d10bfbc
Test build #73536 has started for PR 17015 at commit
retest this please
// For data source tables, we will create a `LogicalRelation` and won't call this method; for
// hive serde tables, we always generate statistics.
// TODO: unify the table stats generation.
tableMeta.stats.map(_.toPlanStats(output)).get
Yeah, the value should always be filled by DetermineTableStats, but maybe we can still throw an exception when it is None?
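One way to realize this suggestion is to replace the bare `.get` with an explicit, descriptive failure for the None case. A minimal sketch follows; `Statistics` and `planStats` here are simplified stand-ins, not Spark's real classes.

```scala
// Hypothetical stand-in for Spark's Statistics; only sizeInBytes is modeled.
object PlanStatsSketch {
  case class Statistics(sizeInBytes: BigInt)

  // Instead of stats.map(...).get, fail loudly when the stats were never
  // filled in by the analysis rule (DetermineTableStats in the discussion).
  def planStats(stats: Option[Statistics], table: String): Statistics =
    stats.getOrElse(throw new IllegalStateException(
      s"Statistics for table $table should have been filled by DetermineTableStats"))
}
```

This turns a cryptic `NoSuchElementException: None.get` into an error message that points at the rule that was supposed to run.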
// (see StatsSetupConst in Hive) that we can look at in the future.
// When the table is external, `totalSize` is always zero, which will influence join strategy,
// so when `totalSize` is zero, use `rawDataSize` instead;
// when `rawDataSize` is also zero, use `HiveExternalCatalog.STATISTICS_TOTAL_SIZE`,
This is out of date, I think.
@@ -90,10 +74,10 @@ object AnalyzeColumnCommand extends Logging {
 */
def computeColumnStats(
Now this is no longer used for testing; we can mark it as private.
// Compute stats for each column
val (rowCount, newColStats) =
-  AnalyzeColumnCommand.computeColumnStats(sparkSession, tableIdent.table, relation, columnNames)
+  AnalyzeColumnCommand.computeColumnStats(sparkSession, tableIdentWithDB, columnNames)
object AnalyzeColumnCommand is not needed; we can move computeColumnStats into the case class AnalyzeColumnCommand.
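The shape of that refactoring, in a toy sketch: a helper that lives on a companion object but is called from only one place can become a private method on the case class itself, after which the companion object can be dropped. The class name, fields, and placeholder body below are illustrative, not the real Spark signatures.

```scala
// Hypothetical simplified version of the command after the suggested move.
case class AnalyzeColumnCommandSketch(table: String, columnNames: Seq[String]) {
  def run(): Map[String, Long] = computeColumnStats()

  // Previously a method on `object AnalyzeColumnCommand`; now private here,
  // so the companion object is no longer needed.
  private def computeColumnStats(): Map[String, Long] =
    columnNames.map(c => c -> 0L).toMap // placeholder for the real per-column stats
}
```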
Test build #73566 has finished for PR 17015 at commit
/** An attribute map for determining the ordinal for non-partition columns. */
val columnOrdinals = AttributeMap(dataColKeys.zipWithIndex)

override def inputFiles: Array[String] = {
We also need to add this back.
A hive table may not be a file relation (e.g. with a storage handler), so we should not define inputFiles. BTW, this only outputs the directory names, not leaf files, which is inconsistent with data source tables.
LGTM except a few comments.
Test build #73584 has finished for PR 17015 at commit
LGTM
This PR is pretty big and could cause many conflicts, so let me merge it first. We can address the remaining comments later if anybody has more.
Thanks! Merging to master.
What changes were proposed in this pull request?
MetastoreRelation is used to represent table relations for hive tables, and provides some hive-related information. We resolve SimpleCatalogRelation to MetastoreRelation for hive tables, which is unnecessary as these two are essentially the same. This PR merges SimpleCatalogRelation and MetastoreRelation.

How was this patch tested?

existing tests