@gatorsmile
Member

@gatorsmile gatorsmile commented Aug 8, 2016

What changes were proposed in this pull request?

The existing CREATE TABLE LIKE command has multiple issues:

  • The generated table is non-empty when the source table is a data source table. The main reason is that a data source table uses the table property path to store the location of its contents. Currently, we keep this property unchanged, so the new table is created with the same location as the source table.
  • The table type of the generated table is EXTERNAL when the source table is an external Hive Serde table. Currently, we explicitly set the type to MANAGED, but Hive checks the table property EXTERNAL to decide whether the table is EXTERNAL or not. (See https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1407-L1408) Thus, the created table is still EXTERNAL.
  • When the source table is a VIEW, the metadata of the generated table contains the view text and the view original text of the source. So far, this does not break anything, but it could cause problems in Hive. (For example, https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1405-L1406)
  • Table comment handling: to follow what Hive does, the table comment should be cleared, but the column comments should still be kept.
  • INDEX tables are not supported. Thus, we should throw an exception in this case.
  • owner should not be retained. toHiveTable sets it (see https://github.com/apache/spark/blob/e679bc3c1cd418ef0025d2ecbc547c9660cac433/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L793) no matter which value we set in CatalogTable. We set it to an empty string to avoid confusing output in Explain.
  • Add support for temp tables.
  • Like Hive, we should not copy the table properties from the source table to the created table, especially the statistics-related properties, which could be wrong for the created table.
  • unsupportedFeatures should not be copied from the source table. The created table does not have these unsupported features.
  • When the source table is a view, the target table uses the default format of data source tables: spark.sql.sources.default.

This PR is to fix the above issues.
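
The rules listed above can be summarized in a small sketch. It uses simplified stand-in classes rather than Spark's real CatalogTable, so every name in it (TableDesc, createLike, defaultProvider) is an assumption made for illustration, and it leaves out the INDEX table check and the storage format details.

```scala
// Simplified stand-ins for the catalog classes; all names are illustrative only.
object CreateTableLikeSketch {
  sealed trait TableType
  case object Managed extends TableType
  case object External extends TableType
  case object View extends TableType

  final case class Column(name: String, dataType: String, comment: Option[String] = None)

  final case class TableDesc(
      name: String,
      tableType: TableType,
      schema: Seq[Column],
      provider: Option[String] = None,
      locationUri: Option[String] = None,
      properties: Map[String, String] = Map.empty,
      owner: String = "",
      comment: Option[String] = None,
      viewText: Option[String] = None,
      viewOriginalText: Option[String] = None)

  /** Builds the descriptor of the new table from the source, following the listed rules. */
  def createLike(source: TableDesc, targetName: String, defaultProvider: String): TableDesc =
    TableDesc(
      name = targetName,
      tableType = Managed,        // never EXTERNAL or VIEW
      schema = source.schema,     // column comments travel with the schema
      provider =                  // a view has no format of its own
        if (source.tableType == View) Some(defaultProvider) else source.provider,
      locationUri = None,         // do not point at the source table's data
      properties = Map.empty,     // drops path, EXTERNAL, statistics, and so on
      owner = "",                 // owner is not retained
      comment = None,             // table comment is cleared
      viewText = None,
      viewOriginalText = None)
}
```

Because every field of the target is written out, anything not mentioned (for example a location or a view text) falls back to an empty value instead of leaking from the source.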

How was this patch tested?

Improve the test coverage by adding more test cases

@gatorsmile gatorsmile changed the title [SPARK-16943] [SPAR] [SQL] Fix multiple bugs in CREATE TABLE LIKE command [SPARK-16943] [SPARK-16942] [SQL] Fix multiple bugs in CREATE TABLE LIKE command Aug 8, 2016
@gatorsmile gatorsmile changed the title [SPARK-16943] [SPARK-16942] [SQL] Fix multiple bugs in CREATE TABLE LIKE command [SPARK-16943] [SPARK-16942] [SQL] Fix multiple bugs in CREATE TABLE LIKE command [WIP] Aug 8, 2016
@SparkQA

SparkQA commented Aug 8, 2016

Test build #63339 has finished for PR 14531 at commit 1eb40e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

cc @yhuai @cloud-fan @rxin @liancheng @andrewor14 @hvanhovell

Question: Should we support CREATE TABLE LIKE on INDEX TABLE?

Hive does support it. However, it looks like index tables are not handled by all the DDL statements, e.g., SHOW CREATE TABLE. If we want to support them, we need test cases to verify that what we do is right. Please let me know whether we should do it or not.

Thanks!

@yhuai
Contributor

yhuai commented Aug 8, 2016

We do not support index tables at all (you cannot create such a table). Let's not add support for them right now.

@SparkQA

SparkQA commented Aug 8, 2016

Test build #63372 has finished for PR 14531 at commit d0e9217.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 9, 2016

Test build #63385 has finished for PR 14531 at commit 29e17a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile gatorsmile changed the title [SPARK-16943] [SPARK-16942] [SQL] Fix multiple bugs in CREATE TABLE LIKE command [WIP] [SPARK-16943] [SPARK-16942] [SQL] Fix multiple bugs in CREATE TABLE LIKE command Aug 9, 2016
// (metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1095-L1105)
// Table comment is stored as a table property. To clean it, we also should remove them.
val newTableProp =
  sourceTableDesc.properties.filterKeys(key => key != "EXTERNAL" && key != "comment")
Contributor

Hive's behaviour is weird. Does it prefer the EXTERNAL table property over the table type field?

Member Author

That sounds true! See another fix by @yhuai in toHiveTable:

// For EXTERNAL_TABLE, we also need to set EXTERNAL field in the table properties.
// Otherwise, Hive metastore will change the table to a MANAGED_TABLE.
// (metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1095-L1105)
hiveTable.setTableType(table.tableType match {
  case CatalogTableType.EXTERNAL =>
    hiveTable.setProperty("EXTERNAL", "TRUE")
    HiveTableType.EXTERNAL_TABLE
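
To make the interaction between the type field and the EXTERNAL property concrete, here is a minimal sketch of the behaviour as it is described in this thread. It is a toy model and not Hive's actual code: effectiveTableType and cleanedProperties are hypothetical helpers, and the exact matching rule for the property value is an assumption.

```scala
object ExternalFlagSketch {
  // Toy model: the metastore reconciles the declared type with the EXTERNAL property,
  // and the property wins when the two disagree.
  def effectiveTableType(declaredType: String, props: Map[String, String]): String = {
    val flagged = props.get("EXTERNAL").contains("TRUE")
    declaredType match {
      case "MANAGED_TABLE" if flagged   => "EXTERNAL_TABLE"
      case "EXTERNAL_TABLE" if !flagged => "MANAGED_TABLE"
      case other                        => other
    }
  }

  // Same predicate as the filter under review: drop EXTERNAL and the table comment.
  def cleanedProperties(sourceProps: Map[String, String]): Map[String, String] =
    sourceProps.filter { case (key, _) => key != "EXTERNAL" && key != "comment" }
}
```

Under this model, declaring the new table as MANAGED is not enough while the copied properties still carry EXTERNAL = TRUE, which is why the property has to be filtered out as well.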

@cloud-fan
Contributor

cloud-fan commented Aug 9, 2016

Can we improve the class doc for CreateTableLikeCommand to explain the expected behaviours? thanks!

@gatorsmile
Member Author

Sure, will do it. Thanks!

@SparkQA

SparkQA commented Aug 9, 2016

Test build #63393 has finished for PR 14531 at commit 8434777.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sourceTableDesc.storage.properties
}
val newTableDesc =
  sourceTableDesc.copy(
Contributor

Is it easier to create a new table? How many fields do we need to retain?

Member Author

Almost the same. The following attributes need to be kept: storage, schema, provider, partitionColumnNames, bucketSpec, properties, and unsupportedFeatures.

Found another bug... owner should be replaced...

Contributor

Which fields in storage should we retain?

Contributor

I think it's clearer to list all the fields that need to be retained in the class doc.

Contributor

I'd like to create a new CatalogTable here and copy the fields explicitly, to match the class doc.
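
A short sketch of why explicit construction is safer than copy here. Desc is a hypothetical, cut-down stand-in for CatalogTable, and viaCopy / viaConstructor are names invented for this illustration.

```scala
object ExplicitCopySketch {
  final case class Desc(
      name: String,
      schema: Seq[String],
      owner: String = "",
      properties: Map[String, String] = Map.empty,
      unsupportedFeatures: Seq[String] = Seq.empty)

  // copy(): every field that is not overridden is silently inherited from the source.
  def viaCopy(source: Desc, newName: String): Desc =
    source.copy(name = newName, properties = Map.empty)

  // Explicit construction: only the fields listed here are carried over,
  // everything else falls back to its default.
  def viaConstructor(source: Desc, newName: String): Desc =
    Desc(name = newName, schema = source.schema)
}
```

With copy, the owner and unsupportedFeatures of the source leak into the new table unless someone remembers to reset them. With the explicit constructor, forgetting a field yields the default value instead, and the retained fields can be checked against the class doc line by line.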

Member Author

Sure, will do it.

@SparkQA

SparkQA commented Aug 9, 2016

Test build #63413 has finished for PR 14531 at commit b820be8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 9, 2016

Test build #63454 has finished for PR 14531 at commit 45b51d1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 9, 2016

Test build #63457 has finished for PR 14531 at commit 6180e80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* the table comment is always empty but the column comments are identical to the ones defined
* in the source table.
*
* The CatalogTable attributes copied from the source table include storage(inputFormat,
Contributor

I think these are all of them, so we can say "are" instead of "include".

Member Author

Sure

@rxin
Contributor

rxin commented Aug 10, 2016

Why can't the source table be a temp table? Also why not copy the table comment? Is it the same behavior in Hive / Postgres?

@SparkQA

SparkQA commented Aug 31, 2016

Test build #64725 has finished for PR 14531 at commit ba1b69d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


// Storage format
val newStorage =
  if (sourceTableType == CatalogTableType.VIEW) {
Contributor

I'd like to create a data source table if the source table is a view.

Member Author

I see. No need to strictly follow Hive.

FYI, we currently have two different default formats: #14430.

@gatorsmile gatorsmile changed the title [SPARK-16943] [SPARK-16942] [SQL] Fix multiple bugs in CREATE TABLE LIKE command [SPARK-17353] [SPARK-16943] [SPARK-16942] [SQL] Fix multiple bugs in CREATE TABLE LIKE command Sep 1, 2016
"the view text and original text in the created table must be empty")
// The location of created table should not be empty. Although Spark SQL does not set it,
// when creating it, Hive populates it.
assert(targetTable.storage.locationUri.nonEmpty,
Contributor

What are we checking here? A specific behaviour of the Hive metastore? I mean, other ExternalCatalog implementations may not need to populate it.

Member Author

Yeah, we can remove these Hive-specific checks.

@SparkQA

SparkQA commented Sep 1, 2016

Test build #64754 has finished for PR 14531 at commit 4ce96e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM, pending jenkins

@SparkQA

SparkQA commented Sep 1, 2016

Test build #64763 has finished for PR 14531 at commit 4bcb306.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 1f06a5b Sep 1, 2016
@cloud-fan
Contributor

merging to master! @gatorsmile can you send a new PR to backport it to 2.0? thanks!

@gatorsmile
Member Author

Sure, will do it. Thanks!

asfgit pushed a commit that referenced this pull request Sep 6, 2016
…le bugs in CREATE TABLE LIKE command

### What changes were proposed in this pull request?
This PR is to backport #14531.

The existing `CREATE TABLE LIKE` command has multiple issues:

- The generated table is non-empty when the source table is a data source table. The main reason is that a data source table uses the table property `path` to store the location of its contents. Currently, we keep this property unchanged, so the new table is created with the same location as the source table.

- The table type of the generated table is `EXTERNAL` when the source table is an external Hive Serde table. Currently, we explicitly set the type to `MANAGED`, but Hive checks the table property `EXTERNAL` to decide whether the table is `EXTERNAL` or not. (See https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1407-L1408) Thus, the created table is still `EXTERNAL`.

- When the source table is a `VIEW`, the metadata of the generated table contains the view text and the view original text of the source. So far, this does not break anything, but it could cause problems in Hive. (For example, https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1405-L1406)

- Table `comment` handling: to follow what Hive does, the table comment should be cleared, but the column comments should still be kept.

- `INDEX` tables are not supported. Thus, we should throw an exception in this case.

- `owner` should not be retained. `toHiveTable` sets it [here](https://github.com/apache/spark/blob/e679bc3c1cd418ef0025d2ecbc547c9660cac433/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L793) no matter which value we set in `CatalogTable`. We set it to an empty string to avoid confusing output in Explain.

- Add support for temp tables.

- Like Hive, we should not copy the table properties from the source table to the created table, especially the statistics-related properties, which could be wrong for the created table.

- `unsupportedFeatures` should not be copied from the source table. The created table does not have these unsupported features.

- When the source table is a view, the target table uses the default format of data source tables: `spark.sql.sources.default`.

This PR is to fix the above issues.

### How was this patch tested?
Improve the test coverage by adding more test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14946 from gatorsmile/createTableLike20.
@sitalkedia

@rxin, @gatorsmile - This PR breaks our jobs that use the CREATE TABLE LIKE command, since the table properties of the source table are no longer propagated to the new table. We have table properties like table-retention that are expected to be copied over when the new table is created. How do you think we can fix this issue?

@gatorsmile
Member Author

@sitalkedia For now, you can set table properties on the new table by using a DDL command.
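
The DDL command meant here is presumably ALTER TABLE ... SET TBLPROPERTIES, run after the CREATE TABLE LIKE statement. As a rough sketch, a job could generate that statement from the properties it wants to carry over. TblPropertiesDdl and setTblProperties are hypothetical helper names, the value '30' is made up, and the helper simply rejects quotes and backslashes instead of escaping them.

```scala
object TblPropertiesDdl {
  // Wraps a key or value in single quotes; refuses input that would need escaping.
  private def quote(s: String): String = {
    require(!s.contains("'") && !s.contains("\\"), s"unsupported character in: $s")
    s"'$s'"
  }

  /** Builds an ALTER TABLE ... SET TBLPROPERTIES statement; keys are sorted for stable output. */
  def setTblProperties(table: String, props: Map[String, String]): String = {
    require(props.nonEmpty, "at least one property is required")
    val pairs = props.toSeq.sortBy(_._1).map { case (k, v) => s"${quote(k)} = ${quote(v)}" }
    s"ALTER TABLE $table SET TBLPROPERTIES (${pairs.mkString(", ")})"
  }
}
```

For example, setTblProperties("new_table", Map("table-retention" -> "30")) yields ALTER TABLE new_table SET TBLPROPERTIES ('table-retention' = '30'), which could then be passed to the SQL entry point of the session.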

@rxin @cloud-fan @yhuai Let me know if you need me to submit a PR to make such a change. I think what @sitalkedia said is valid.

@cloud-fan
Contributor

IIUC, before 2.0, we used Hive to run CREATE TABLE LIKE, and Hive doesn't include the table properties. So this PR actually fixed a regression in 2.0, and I think we should keep this behaviour.

cc @rxin @yhuai

@rxin
Contributor

rxin commented Oct 5, 2016

@cloud-fan are you sure Hive doesn't copy the table properties? How would @sitalkedia's case work if it does not copy?

@cloud-fan
Contributor

I suspect @sitalkedia built his application on 2.0 and it broke in 2.0.1. @sitalkedia, is that true?

@gatorsmile can you double check that Hive doesn't copy the table properties?

@gatorsmile
Member Author

@cloud-fan Hive does not copy the table properties in CREATE TABLE LIKE

@sitalkedia

@gatorsmile - I looked into the Hive code base a bit, and it looks like Hive provides a way to specify a whitelisted set of properties that are copied to the newly created table. Spark should provide similar flexibility to mimic the Hive behavior: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L4140
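
A minimal sketch of what such a whitelist could look like, assuming it is supplied as a comma-separated string. Neither the format nor the helper names (parseWhitelist, propertiesToCopy) come from Spark or Hive; they are placeholders for illustration.

```scala
object PropertyWhitelistSketch {
  // Parses a comma-separated list of property keys, ignoring blanks and extra whitespace.
  def parseWhitelist(conf: String): Set[String] =
    conf.split(",").map(_.trim).filter(_.nonEmpty).toSet

  // Keeps only the source properties whose key is whitelisted; everything else is dropped.
  def propertiesToCopy(
      source: Map[String, String],
      whitelist: Set[String]): Map[String, String] =
    source.filter { case (key, _) => whitelist.contains(key) }
}
```

With an empty whitelist nothing is copied, which matches the behaviour introduced by this PR, so a setting of this kind would stay backward compatible by default.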

@gatorsmile
Member Author

@sitalkedia Yeah, I saw it. Thank you for the investigation. Normally, we do not want to add many configuration flags, since that hurts usability. Let @rxin decide whether we should add another flag or not.

@rxin
Contributor

rxin commented Oct 16, 2016

@gatorsmile / @sitalkedia that idea sounds good (similar to Hive's)

@gatorsmile
Member Author

Sure, will submit a PR for it. Thanks!
