Skip to content

[SPARK-19678][SQL] remove MetastoreRelation#17015

Closed
cloud-fan wants to merge 3 commits intoapache:masterfrom
cloud-fan:table-relation
Closed

[SPARK-19678][SQL] remove MetastoreRelation#17015
cloud-fan wants to merge 3 commits intoapache:masterfrom
cloud-fan:table-relation

Conversation

@cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

MetastoreRelation is used to represent table relation for hive tables, and provides some hive related information. We will resolve SimpleCatalogRelation to MetastoreRelation for hive tables, which is unnecessary as these 2 are the same essentially. This PR merges SimpleCatalogRelation and MetastoreRelation

How was this patch tested?

existing tests

@cloud-fan
Copy link
Contributor Author

cc @gatorsmile @sameeragarwal

@SparkQA
Copy link

SparkQA commented Feb 21, 2017

Test build #73212 has finished for PR 17015 at commit 483fcee.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need this? data source tables always do metastore partition pruning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked the history. It sounds like @liancheng can answer whether this is still needed or not. : )

#7421 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually, table and relation are the same. How about keeping the original name CatalogRelation?

@SparkQA
Copy link

SparkQA commented Feb 21, 2017

Test build #73233 has finished for PR 17015 at commit 1c8e0c4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan force-pushed the table-relation branch 2 times, most recently from c7a22e6 to a383a13 Compare February 21, 2017 21:04
@SparkQA
Copy link

SparkQA commented Feb 21, 2017

Test build #73234 has finished for PR 17015 at commit c7a22e6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 21, 2017

Test build #73235 has finished for PR 17015 at commit a383a13.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

CatalogRelation is always unresolved for data source tables, but already resolved for hive serde tables. Do you think we can have an unresolved UnresolvedCatalogRelation for both data source tables and hive serde tables?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataCols and partitionCols are needed only because CatalogRelation extends MultiInstanceRelation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, and at least we need a Seq[Attribute] as output.

@cloud-fan cloud-fan force-pushed the table-relation branch 2 times, most recently from 247d3df to b61910e Compare February 22, 2017 00:41
@cloud-fan
Copy link
Contributor Author

cloud-fan commented Feb 22, 2017

I don't think UnresolvedCatalogRelation is necessary. Data source table can treat CatalogRelation as unresolved but it's its own business. When we replace CatalogRelation with LogicalRelation, it can happen even after the parent nodes are resolved.

@SparkQA
Copy link

SparkQA commented Feb 22, 2017

Test build #73241 has finished for PR 17015 at commit 247d3df.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 22, 2017

Test build #73246 has finished for PR 17015 at commit b61910e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@gatorsmile gatorsmile Feb 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add two more asserts?

  assert(tableMeta.partitionSchema == StructType.fromAttributes(partitionCols))
  assert(tableMeta.dataSchema.asNullable == StructType.fromAttributes(dataCols))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to call asNullable for partitionSchema? In the original SimpleCatalogRelation, we did it for output

@SparkQA
Copy link

SparkQA commented Feb 22, 2017

Test build #73291 has finished for PR 17015 at commit d9c172b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@gatorsmile gatorsmile Feb 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about overriding cleanArgs?

override lazy val cleanArgs: Seq[Any] = Seq(tableMeta)

We did the same thing in the LogicalRelation. Then, we do not need to implement sameResult for SubqueryAlias. The super function sameResult always remove SubqueryAlias in sameResult

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Useless?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need a comment to explain it. (We don't support hive bucketed tables. This function getCached is only used for converting Hive tables to data source tables)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found, normally, we use fs for the file system. How about changing it to fsRelation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like what we did above, adding the same comment?

// We don't support hive bucketed tables, only ones we write out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just being used in only one place. We can get rid of this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indent?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep it the same?

Copy link
Member

@gatorsmile gatorsmile Feb 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like the comment in the original @param , how about replacing CatalogRelation by CatalogTable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep this test case?

@SparkQA
Copy link

SparkQA commented Feb 27, 2017

Test build #73536 has started for PR 17015 at commit d10bfbc.

@gatorsmile
Copy link
Member

retest this please

// For data source tables, we will create a `LogicalRelation` and won't call this method, for
// hive serde tables, we will always generate a statistics.
// TODO: unify the table stats generation.
tableMeta.stats.map(_.toPlanStats(output)).get
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the value should be always filled by DetermineTableStats, but maybe we still can issue an exception when it is None?

// (see StatsSetupConst in Hive) that we can look at in the future.
// When table is external,`totalSize` is always zero, which will influence join strategy
// so when `totalSize` is zero, use `rawDataSize` instead
// when `rawDataSize` is also zero, use `HiveExternalCatalog.STATISTICS_TOTAL_SIZE`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is out of dated, I think

@@ -90,10 +74,10 @@ object AnalyzeColumnCommand extends Logging {
*/
def computeColumnStats(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, this is not being used for testing. We can mark it as private.

// Compute stats for each column
val (rowCount, newColStats) =
AnalyzeColumnCommand.computeColumnStats(sparkSession, tableIdent.table, relation, columnNames)
AnalyzeColumnCommand.computeColumnStats(sparkSession, tableIdentWithDB, columnNames)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

object AnalyzeColumnCommand is not needed. We can move computeColumnStats into the case class AnalyzeColumnCommand

@SparkQA
Copy link

SparkQA commented Feb 28, 2017

Test build #73566 has finished for PR 17015 at commit d10bfbc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/** An attribute map for determining the ordinal for non-partition columns. */
val columnOrdinals = AttributeMap(dataColKeys.zipWithIndex)

override def inputFiles: Array[String] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to add this back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hive table may not be a file relation(storage handler), so we should not define the inputFiles. BTW this only output the directory names, not leaf files, which is inconsistent with data source table.

@gatorsmile
Copy link
Member

LGTM except a few comments.

@SparkQA
Copy link

SparkQA commented Feb 28, 2017

Test build #73584 has finished for PR 17015 at commit 781f910.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

LGTM

@gatorsmile
Copy link
Member

This PR is pretty big and could cause many conflicts. Let me first merge this. We can resolve the comments if anybody has later.

@gatorsmile
Copy link
Member

Thanks! Merging to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants