[SPARK-19678][SQL] remove MetastoreRelation by cloud-fan · Pull Request #17015 · apache/spark

cloud-fan · 2017-02-21T10:36:01Z

What changes were proposed in this pull request?

MetastoreRelation represents the table relation for Hive tables and provides some Hive-related information. We currently resolve SimpleCatalogRelation to MetastoreRelation for Hive tables, which is unnecessary because the two are essentially the same. This PR merges SimpleCatalogRelation and MetastoreRelation.

How was this patch tested?

existing tests

cloud-fan · 2017-02-21T10:36:13Z

cc @gatorsmile @sameeragarwal

SparkQA · 2017-02-21T11:41:05Z

Test build #73212 has finished for PR 17015 at commit 483fcee.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-02-21T19:57:32Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala

do we still need this? data source tables always do metastore partition pruning.

Checked the history. It sounds like @liancheng can answer whether this is still needed or not. : )

#7421 (comment)

gatorsmile · 2017-02-21T20:02:53Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

Conceptually, table and relation are the same. How about keeping the original name CatalogRelation?

SparkQA · 2017-02-21T20:59:18Z

Test build #73233 has finished for PR 17015 at commit 1c8e0c4.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-02-21T21:04:16Z

Test build #73234 has finished for PR 17015 at commit c7a22e6.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-02-21T22:16:53Z

Test build #73235 has finished for PR 17015 at commit a383a13.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-02-21T22:41:37Z

CatalogRelation is always unresolved for data source tables, but already resolved for Hive serde tables. Do you think we can have an UnresolvedCatalogRelation for both data source tables and Hive serde tables?

gatorsmile · 2017-02-21T22:44:41Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

dataCols and partitionCols are needed only because CatalogRelation extends MultiInstanceRelation?

yes, and at least we need a Seq[Attribute] as output.
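For readers unfamiliar with why a multi-instance relation has to carry its own attributes, here is a minimal sketch. It is a toy model, not Spark code: `Attr`, `TableMeta`, `Relation`, and the id counter are hypothetical stand-ins for `AttributeReference`, `CatalogTable`, and `CatalogRelation`. The idea it illustrates is that `newInstance()` returns a copy of the relation over the same table metadata but with fresh attribute ids, so two references to the same table (for example in a self-join) can be told apart.

```scala
import java.util.concurrent.atomic.AtomicLong

// Toy model only: these types are hypothetical stand-ins, not Spark classes.
object NewInstanceSketch {
  // Source of unique attribute ids (plays the role of an expression id).
  private val nextId = new AtomicLong(0L)

  final case class Attr(name: String, exprId: Long) {
    // Same column name, fresh id.
    def newInstance(): Attr = copy(exprId = nextId.getAndIncrement())
  }

  // Convenience constructor that allocates an id.
  def attr(name: String): Attr = Attr(name, nextId.getAndIncrement())

  final case class TableMeta(name: String)

  final case class Relation(
      tableMeta: TableMeta,
      dataCols: Seq[Attr],
      partitionCols: Seq[Attr]) {

    // The relation's output is its data columns followed by its partition columns.
    def output: Seq[Attr] = dataCols ++ partitionCols

    // Same table metadata, but every output attribute gets a new id.
    def newInstance(): Relation = copy(
      dataCols = dataCols.map(_.newInstance()),
      partitionCols = partitionCols.map(_.newInstance()))
  }
}
```

Because the ids live in the attributes rather than in the table metadata, the relation needs the attribute sequences as constructor arguments in order to produce a stable `output` and a distinct copy on demand.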

cloud-fan · 2017-02-22T01:04:17Z

I don't think UnresolvedCatalogRelation is necessary. A data source table can treat CatalogRelation as unresolved, but that is its own business. Replacing CatalogRelation with LogicalRelation can happen even after the parent nodes are resolved.

SparkQA · 2017-02-22T01:16:52Z

Test build #73241 has finished for PR 17015 at commit 247d3df.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-02-22T02:13:25Z

Test build #73246 has finished for PR 17015 at commit b61910e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-02-22T19:07:40Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

Add two more asserts?

assert(tableMeta.partitionSchema == StructType.fromAttributes(partitionCols))
assert(tableMeta.dataSchema.asNullable == StructType.fromAttributes(dataCols))
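To make the intent of these two checks concrete, here is a self-contained sketch. Everything in it is a simplified, hypothetical stand-in: `Field`, `Schema`, `Attr`, `schemaOf`, `attrsOf`, and `fromTable` are invented for illustration and only mimic the shape of `StructField`, `StructType`, `Attribute`, and `StructType.fromAttributes`. In particular, this `asNullable` only flips top-level fields and ignores nested types.

```scala
// Toy model only: hypothetical stand-ins for Spark's schema and attribute types.
object SchemaAssertSketch {
  final case class Field(name: String, dataType: String, nullable: Boolean)

  final case class Schema(fields: Seq[Field]) {
    // Simplified: marks every top-level field as nullable.
    def asNullable: Schema = Schema(fields.map(_.copy(nullable = true)))
  }

  final case class Attr(name: String, dataType: String, nullable: Boolean)

  // Plays the role of StructType.fromAttributes in this sketch.
  def schemaOf(attrs: Seq[Attr]): Schema =
    Schema(attrs.map(a => Field(a.name, a.dataType, a.nullable)))

  // Inverse direction: build attributes from a schema.
  def attrsOf(schema: Schema): Seq[Attr] =
    schema.fields.map(f => Attr(f.name, f.dataType, f.nullable))

  final case class TableMeta(dataSchema: Schema, partitionSchema: Schema)

  final case class Relation(
      tableMeta: TableMeta,
      dataCols: Seq[Attr],
      partitionCols: Seq[Attr]) {
    // The attributes handed to the relation must agree with the table metadata.
    assert(tableMeta.partitionSchema == schemaOf(partitionCols))
    assert(tableMeta.dataSchema.asNullable == schemaOf(dataCols))

    def output: Seq[Attr] = dataCols ++ partitionCols
  }

  // Builds a relation whose columns satisfy both assertions by construction.
  def fromTable(meta: TableMeta): Relation =
    Relation(meta, attrsOf(meta.dataSchema.asNullable), attrsOf(meta.partitionSchema))
}
```

With these checks in the constructor, a caller that passes non-nullable data columns, or partition columns that differ from the catalog's partition schema, fails immediately rather than producing a relation whose output disagrees with its metadata.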

gatorsmile · 2017-02-22T19:28:39Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala

Do we need to call asNullable for partitionSchema? In the original SimpleCatalogRelation, we did it for output.

SparkQA · 2017-02-22T20:39:36Z

Test build #73291 has finished for PR 17015 at commit d9c172b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-02-23T00:30:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

How about overriding cleanArgs?

override lazy val cleanArgs: Seq[Any] = Seq(tableMeta)

We did the same thing in LogicalRelation. Then we do not need to implement sameResult to handle SubqueryAlias, because the super sameResult always removes SubqueryAlias.
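As a rough illustration of why overriding cleanArgs is enough, here is a toy model of the comparison. It is an assumption-laden sketch, not Spark's implementation: `Plan`, `Relation`, `SubqueryAlias`, `Attr`, and `stripAliases` are simplified, hypothetical versions defined only for this example. The base `sameResult` strips aliases, then compares node class, `cleanArgs`, and children; the relation reports only its table metadata as `cleanArgs`, so differing attribute ids do not affect the result.

```scala
// Toy model only: hypothetical, simplified plan nodes, not Spark classes.
object CleanArgsSketch {
  final case class TableMeta(name: String)
  final case class Attr(name: String, exprId: Long)

  sealed trait Plan {
    def children: Seq[Plan]

    // The constructor arguments that matter when comparing results.
    def cleanArgs: Seq[Any]

    // Shared comparison: remove aliases first, then compare class, cleanArgs, children.
    def sameResult(other: Plan): Boolean = {
      val left = stripAliases(this)
      val right = stripAliases(other)
      left.getClass == right.getClass &&
        left.children.size == right.children.size &&
        left.cleanArgs == right.cleanArgs &&
        left.children.zip(right.children).forall { case (a, b) => a.sameResult(b) }
    }
  }

  final case class Relation(tableMeta: TableMeta, output: Seq[Attr]) extends Plan {
    val children: Seq[Plan] = Nil
    // Only the table metadata counts; attribute ids are ignored.
    override lazy val cleanArgs: Seq[Any] = Seq(tableMeta)
  }

  final case class SubqueryAlias(alias: String, child: Plan) extends Plan {
    def children: Seq[Plan] = Seq(child)
    def cleanArgs: Seq[Any] = Seq(alias)
  }

  // Drops any chain of aliases wrapped around the top of a plan.
  def stripAliases(plan: Plan): Plan = plan match {
    case SubqueryAlias(_, child) => stripAliases(child)
    case other => other
  }
}
```

In this model, two `Relation` nodes over the same `TableMeta` are not equal as case classes when their attribute ids differ, yet `sameResult` returns true for them, with or without a wrapping `SubqueryAlias`, and no relation-specific `sameResult` override is required.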

gatorsmile · 2017-02-23T19:08:05Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala

gatorsmile · 2017-02-23T20:58:57Z