[SPARK-17353] [SPARK-16943] [SPARK-16942] [SQL] Fix multiple bugs in CREATE TABLE LIKE command #14531
Conversation
Test build #63339 has finished for PR 14531 at commit
cc @yhuai @cloud-fan @rxin @liancheng @andrewor14 @hvanhovell Question: should we support index tables? Hive does support it. However, it sounds like this is not considered in all the DDL statements. Thanks!
We do not support index tables at all (you cannot create such a table). Let's not add support for them right now.
Test build #63372 has finished for PR 14531 at commit
Test build #63385 has finished for PR 14531 at commit
// (metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1095-L1105)
// Table comment is stored as a table property. To clean it, we also should remove them.
val newTableProp =
  sourceTableDesc.properties.filterKeys(key => key != "EXTERNAL" && key != "comment")
Hive's behaviour is weird: does it prefer the EXTERNAL table property over the table type field?
That sounds right! See another fix by @yhuai in toHiveTable:
spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
Lines 764 to 770 in e679bc3
// For EXTERNAL_TABLE, we also need to set EXTERNAL field in the table properties.
// Otherwise, Hive metastore will change the table to a MANAGED_TABLE.
// (metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1095-L1105)
hiveTable.setTableType(table.tableType match {
  case CatalogTableType.EXTERNAL =>
    hiveTable.setProperty("EXTERNAL", "TRUE")
    HiveTableType.EXTERNAL_TABLE
Can we improve the class doc for this command?
Sure, will do it. Thanks!
Test build #63393 has finished for PR 14531 at commit
  sourceTableDesc.storage.properties
}
val newTableDesc =
  sourceTableDesc.copy(
Is it easier to create a new table? How many fields do we need to retain?
Almost the same. The following attributes need to be kept: storage, schema, provider, partitionColumnNames, bucketSpec, properties, and unsupportedFeatures.
Found another bug... owner should be replaced...
Which fields in storage should we retain?
I think it's clearer to list all the fields that need to be retained in the class doc.
I'd like to create a new CatalogTable here and copy the fields explicitly, to match the class doc.
Sure, will do it.
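For readers of the thread, here is a rough sketch of what "create a new CatalogTable and copy the fields explicitly" could look like; the variable names `targetIdent` and `newStorage` are assumptions for illustration, not the PR's actual code:

```scala
import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType}

// Build the target table from scratch instead of copying the source wholesale;
// only the explicitly listed attributes come from sourceTableDesc, everything
// else (owner, view text, table properties, ...) takes its default value.
val newTableDesc = CatalogTable(
  identifier = targetIdent,                               // assumed: the new table's name
  tableType = CatalogTableType.MANAGED,                   // never inherit EXTERNAL
  storage = newStorage,                                    // assumed: cleaned-up storage format
  schema = sourceTableDesc.schema,                         // column comments travel with the schema
  provider = sourceTableDesc.provider,
  partitionColumnNames = sourceTableDesc.partitionColumnNames,
  bucketSpec = sourceTableDesc.bucketSpec)
```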
Test build #63413 has finished for PR 14531 at commit
Test build #63454 has finished for PR 14531 at commit
Test build #63457 has finished for PR 14531 at commit
 * the table comment is always empty but the column comments are identical to the ones defined
 * in the source table.
 *
 * The CatalogTable attributes copied from the source table include storage(inputFormat,
I think these are all of them, so we can say "are" instead of "include".
Sure
Why can't the source table be a temp table? Also, why not copy the table comment? Is it the same behavior in Hive / Postgres?
Test build #64725 has finished for PR 14531 at commit
// Storage format
val newStorage =
  if (sourceTableType == CatalogTableType.VIEW) {
I'd like to create a data source table if the source table is a view.
I see. No need to strictly follow Hive.
FYI, we have two different default formats: #14430.
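As a reference, here is a minimal sketch of falling back to the default data source format when the source is a view; it assumes the code runs inside the command, where `sparkSession` and `sourceTableDesc` are already in scope:

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTableType

// If the source is a view there is no usable storage format to copy, so fall
// back to the session's default data source (spark.sql.sources.default).
val newProvider = if (sourceTableDesc.tableType == CatalogTableType.VIEW) {
  Some(sparkSession.sessionState.conf.defaultDataSourceName)
} else {
  sourceTableDesc.provider
}
```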
  "the view text and original text in the created table must be empty")
// The location of created table should not be empty. Although Spark SQL does not set it,
// when creating it, Hive populates it.
assert(targetTable.storage.locationUri.nonEmpty,
What are we checking here? A specific behaviour of the Hive metastore? I mean, other ExternalCatalog implementations may not need to populate it.
Yeah, we can remove these Hive-specific checks.
Test build #64754 has finished for PR 14531 at commit
LGTM, pending Jenkins
Test build #64763 has finished for PR 14531 at commit
Merging to master! @gatorsmile can you send a new PR to backport it to 2.0? Thanks!
Sure, will do it. Thanks!
…le bugs in CREATE TABLE LIKE command

### What changes were proposed in this pull request?

This PR is to backport #14531.

The existing `CREATE TABLE LIKE` command has multiple issues:

- The generated table is non-empty when the source table is a data source table. The major reason is that the data source table uses the table property `path` to store the location of the table contents. Currently, we keep it unchanged, so we still create the same table with the same location.
- The table type of the generated table is `EXTERNAL` when the source table is an external Hive Serde table. Currently, we explicitly set it to `MANAGED`, but Hive checks the table property `EXTERNAL` to decide whether the table is `EXTERNAL` or not (see https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1407-L1408). Thus, the created table is still `EXTERNAL`.
- When the source table is a `VIEW`, the metadata of the generated table contains the original view text and view original text. So far this does not break anything, but it could cause something wrong in Hive (for example, https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1405-L1406).
- The issue regarding the table `comment`: to follow what Hive does, the table comment should be cleaned, but the column comments should still be kept.
- The `INDEX` table is not supported, so we should throw an exception in this case.
- `owner` should not be retained. `ToHiveTable` sets it [here](https://github.com/apache/spark/blob/e679bc3c1cd418ef0025d2ecbc547c9660cac433/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L793) no matter which value we set in `CatalogTable`. We set it to an empty string to avoid confusing output in Explain.
- Add support for temp tables.
- Like Hive, we should not copy the table properties from the source table to the created table, especially the statistics-related properties, which could be wrong in the created table.
- `unsupportedFeatures` should not be copied from the source table. The created table does not have these unsupported features.
- When the source table is a view, the target table uses the default format of data source tables: `spark.sql.sources.default`.

This PR is to fix the above issues.

### How was this patch tested?

Improve the test coverage by adding more test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14946 from gatorsmile/createTableLike20.
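For readers skimming the thread, here is a small illustration of the `EXTERNAL` issue described above; the table names and location are made up, and a Hive-enabled SparkSession is assumed:

```scala
// Create an external Hive Serde table, then clone it with CREATE TABLE LIKE.
spark.sql(
  """CREATE EXTERNAL TABLE src (id INT, name STRING)
    |STORED AS PARQUET
    |LOCATION '/tmp/src_table'""".stripMargin)
spark.sql("CREATE TABLE dst LIKE src")

// Before this fix, the EXTERNAL table property leaked into `dst`, so the Hive
// metastore still treated it as an EXTERNAL table even though Spark set the
// table type to MANAGED. After the fix, `dst` is a plain managed table.
spark.sql("DESC EXTENDED dst").show(truncate = false)
```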
@rxin, @gatorsmile - This PR breaks our jobs that use the CREATE TABLE LIKE command, since the table properties from the source tables are no longer propagated to the new table. We have table properties like table-retention which are expected to be copied over when the new table is created. How do you think we can fix this issue?
@sitalkedia For now, you can set table properties on the new table by using the DDL command. @rxin @cloud-fan @yhuai Let me know if you need me to submit a PR to make such a change. I think what @sitalkedia said is valid.
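A sketch of the workaround mentioned above; the property name comes from the report earlier in the thread, and the value is only illustrative:

```scala
// Table properties are no longer copied by CREATE TABLE LIKE, but they can be
// re-applied to the new table with a plain DDL statement afterwards.
spark.sql("CREATE TABLE new_tbl LIKE src_tbl")
spark.sql("ALTER TABLE new_tbl SET TBLPROPERTIES ('table-retention' = '30d')")
// '30d' is a made-up value; use whatever the source table actually carried.
```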
@cloud-fan are you sure Hive doesn't copy the table properties? How would @sitalkedia's case work if it does not copy?
I suspect @sitalkedia built his application based on 2.0 and it got broken in 2.0.1; @sitalkedia, is that true? @gatorsmile, can you double check that Hive doesn't copy the table properties?
@cloud-fan Hive does not copy the table properties in CREATE TABLE LIKE |
@gatorsmile - I looked into the Hive code base a bit and it looks like Hive provides a way to specify a whitelisted set of properties which are copied to the newly created table. Spark should also provide similar flexibility to mimic the Hive behavior - https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L4140
@sitalkedia Yeah, I saw it. Thank you for the investigation. Normally, we do not want to add many configuration flags; it hurts usability. Let @rxin make the decision whether we should add another flag or not.
@gatorsmile / @sitalkedia that idea sounds good (similar to Hive's)
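A rough sketch of what a Hive-style whitelist could look like on the Spark side; the configuration key below is hypothetical and does not exist in Spark, it only illustrates the idea, and `sparkSession` and `sourceTableDesc` are assumed to be in scope:

```scala
// Hypothetical config listing the property keys CREATE TABLE LIKE may copy.
val whitelist: Set[String] = sparkSession.conf
  .get("spark.sql.createTableLike.properties.whitelist", "")   // not a real Spark option
  .split(",").map(_.trim).filter(_.nonEmpty).toSet

// Copy only the whitelisted properties from the source table.
val copiedProps = sourceTableDesc.properties.filterKeys(whitelist.contains)
```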
Sure, will submit a PR for it. Thanks!
What changes were proposed in this pull request?
The existing `CREATE TABLE LIKE` command has multiple issues:

- The generated table is non-empty when the source table is a data source table. The major reason is that the data source table uses the table property `path` to store the location of the table contents. Currently, we keep it unchanged, so we still create the same table with the same location.
- The table type of the generated table is `EXTERNAL` when the source table is an external Hive Serde table. Currently, we explicitly set it to `MANAGED`, but Hive checks the table property `EXTERNAL` to decide whether the table is `EXTERNAL` or not (see https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1407-L1408). Thus, the created table is still `EXTERNAL`.
- When the source table is a `VIEW`, the metadata of the generated table contains the original view text and view original text. So far this does not break anything, but it could cause something wrong in Hive (for example, https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1405-L1406).
- The issue regarding the table `comment`: to follow what Hive does, the table comment should be cleaned, but the column comments should still be kept.
- The `INDEX` table is not supported, so we should throw an exception in this case.
- `owner` should not be retained. `ToHiveTable` sets it here no matter which value we set in `CatalogTable`. We set it to an empty string to avoid confusing output in Explain.
- `unsupportedFeatures` should not be copied from the source table. The created table does not have these unsupported features.
- When the source table is a view, the target table uses the default format of data source tables: `spark.sql.sources.default`.

This PR is to fix the above issues.
How was this patch tested?
Improve the test coverage by adding more test cases
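As a rough idea of the kind of assertions the new tests add (table names are made up, the real tests in the PR are more thorough, and `spark.sessionState` is only accessible from Spark's own test suites):

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.CatalogTableType

// Inspect the metadata of the table created by CREATE TABLE LIKE.
val target = spark.sessionState.catalog.getTableMetadata(TableIdentifier("dst_tbl"))
assert(target.tableType == CatalogTableType.MANAGED)            // never EXTERNAL
assert(!target.properties.contains("EXTERNAL"))                 // Hive's EXTERNAL flag not copied
assert(!target.properties.contains("comment"))                  // table comment cleared
assert(target.viewText.isEmpty && target.viewOriginalText.isEmpty)
```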