
Commit f92d874

gatorsmile authored and cloud-fan committed
[SPARK-17353][SPARK-16943][SPARK-16942][BACKPORT-2.0][SQL] Fix multiple bugs in CREATE TABLE LIKE command
### What changes were proposed in this pull request?

This PR backports #14531. The existing `CREATE TABLE LIKE` command has multiple issues:

- The generated table is non-empty when the source table is a data source table. A data source table stores the location of its contents in the table property `path`; because this property was copied unchanged, the new table pointed at the same location as the source.
- The table type of the generated table is `EXTERNAL` when the source table is an external Hive serde table. We explicitly set it to `MANAGED`, but Hive checks the table property `EXTERNAL` to decide whether a table is external (see https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1407-L1408), so the created table was still `EXTERNAL`.
- When the source table is a `VIEW`, the metadata of the generated table contains the view text and view original text. So far this does not break anything, but it could cause problems in Hive (for example, https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1405-L1406).
- The table `comment`: to follow what Hive does, the table comment should be cleared, but the column comments should still be kept.
- `INDEX` tables are not supported, so we should throw an exception in that case.
- `owner` should not be retained. `toHiveTable` sets it [here](https://github.com/apache/spark/blob/e679bc3c1cd418ef0025d2ecbc547c9660cac433/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L793) regardless of the value we set in `CatalogTable`, so we set it to an empty string to avoid confusing output in EXPLAIN.
- Add support for temporary views/tables as the source of `CREATE TABLE LIKE`.
- Like Hive, we should not copy the table properties from the source table to the created table, especially the statistics-related properties, which could be wrong in the created table.
- `unsupportedFeatures` should not be copied from the source table; the created table does not have these unsupported features.
- When the source table is a view, the target table uses the default data source format (`spark.sql.sources.default`).

This PR fixes the above issues.

### How was this patch tested?

Improved test coverage by adding more test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14946 from gatorsmile/createTableLike20.
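To make the headline bug concrete, here is a minimal sketch (not part of the commit) of the fixed behavior. It assumes a Hive-enabled `SparkSession` named `spark`; the table names are illustrative:

```scala
// A data source table with ten rows; its files live under the warehouse dir.
spark.range(10).write.format("json").saveAsTable("src_tab")

// Copy only the definition. Before this fix, the `path` table property was
// copied too, so `new_tab` silently pointed at src_tab's files and appeared
// non-empty. After the fix, the copy gets its own default location.
spark.sql("CREATE TABLE new_tab LIKE src_tab")

assert(spark.table("src_tab").count() == 10)
assert(spark.table("new_tab").count() == 0)  // definition copied, data not
```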
1 parent e387c8b commit f92d874

File tree

4 files changed: +289, -15 lines

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala

Lines changed: 1 addition & 2 deletions
```diff
@@ -263,8 +263,7 @@ class SessionCatalog(
         CatalogColumn(
           name = c.name,
           dataType = c.dataType.catalogString,
-          nullable = c.nullable,
-          comment = Option(c.name)
+          nullable = c.nullable
         )
       },
       properties = Map(),
```
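A side effect worth calling out: the deleted `comment = Option(c.name)` line had been setting every column's comment to the column's own name, which then leaked into tables created via `CREATE TABLE LIKE`. A tiny sketch of the corrected construction, assuming the Spark 2.0 `CatalogColumn` signature (where `comment` defaults to `None`):

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogColumn

// With the bogus self-comment removed, a column carries a comment
// only if one was actually declared.
val col = CatalogColumn(name = "id", dataType = "bigint", nullable = true)
assert(col.comment.isEmpty)
```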

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

Lines changed: 54 additions & 11 deletions
```diff
@@ -33,8 +33,10 @@ import org.apache.spark.sql.catalyst.catalog.{CatalogColumn, CatalogTable, Catal
 import org.apache.spark.sql.catalyst.catalog.CatalogTableType._
 import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec
 import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
+import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
 import org.apache.spark.sql.catalyst.plans.logical.{Command, LogicalPlan, UnaryNode}
 import org.apache.spark.sql.catalyst.util.quoteIdentifier
+import org.apache.spark.sql.execution.command.CreateDataSourceTableUtils._
 import org.apache.spark.sql.execution.datasources.PartitioningUtils
 import org.apache.spark.sql.types._
 import org.apache.spark.util.Utils
@@ -56,7 +58,12 @@ case class CreateHiveTableAsSelectLogicalPlan(
 }
 
 /**
- * A command to create a table with the same definition of the given existing table.
+ * A command to create a MANAGED table with the same definition of the given existing table.
+ * In the target table definition, the table comment is always empty but the column comments
+ * are identical to the ones defined in the source table.
+ *
+ * The CatalogTable attributes copied from the source table are storage(inputFormat, outputFormat,
+ * serde, compressed, properties), schema, provider, partitionColumnNames, bucketSpec.
  *
  * The syntax of using this command in SQL is:
  * {{{
@@ -75,18 +82,54 @@ case class CreateTableLikeCommand(
       throw new AnalysisException(
         s"Source table in CREATE TABLE LIKE does not exist: '$sourceTable'")
     }
-    if (catalog.isTemporaryTable(sourceTable)) {
-      throw new AnalysisException(
-        s"Source table in CREATE TABLE LIKE cannot be temporary: '$sourceTable'")
+    val sourceTableDesc = catalog.getTableMetadata(sourceTable)
+
+    if (DDLUtils.isDatasourceTable(sourceTableDesc) ||
+        sourceTableDesc.tableType == CatalogTableType.VIEW) {
+      val outputSchema =
+        StructType(sourceTableDesc.schema.map { c =>
+          val builder = new MetadataBuilder
+          c.comment.map(comment => builder.putString("comment", comment))
+          StructField(
+            c.name,
+            CatalystSqlParser.parseDataType(c.dataType),
+            c.nullable,
+            metadata = builder.build())
+        })
+      val (schema, provider) = if (DDLUtils.isDatasourceTable(sourceTableDesc)) {
+        (DDLUtils.getSchemaFromTableProperties(sourceTableDesc).getOrElse(outputSchema),
+          sourceTableDesc.properties(CreateDataSourceTableUtils.DATASOURCE_PROVIDER))
+      } else { // VIEW
+        (outputSchema, sparkSession.sessionState.conf.defaultDataSourceName)
+      }
+      createDataSourceTable(
+        sparkSession = sparkSession,
+        tableIdent = targetTable,
+        userSpecifiedSchema = Some(schema),
+        partitionColumns = Array.empty[String],
+        bucketSpec = None,
+        provider = provider,
+        options = Map("path" -> catalog.defaultTablePath(targetTable)),
+        isExternal = false)
+    } else {
+      val newStorage =
+        sourceTableDesc.storage.copy(
+          locationUri = None,
+          serdeProperties = sourceTableDesc.storage.serdeProperties)
+      val newTableDesc =
+        CatalogTable(
+          identifier = targetTable,
+          tableType = CatalogTableType.MANAGED,
+          storage = newStorage,
+          schema = sourceTableDesc.schema,
+          partitionColumnNames = sourceTableDesc.partitionColumnNames,
+          sortColumnNames = sourceTableDesc.sortColumnNames,
+          bucketColumnNames = sourceTableDesc.bucketColumnNames,
+          numBuckets = sourceTableDesc.numBuckets)
+
+      catalog.createTable(newTableDesc, ifNotExists)
     }
 
-    val tableToCreate = catalog.getTableMetadata(sourceTable).copy(
-      identifier = targetTable,
-      tableType = CatalogTableType.MANAGED,
-      createTime = System.currentTimeMillis,
-      lastAccessTime = -1).withNewStorage(locationUri = None)
-
-    catalog.createTable(tableToCreate, ifNotExists)
     Seq.empty[Row]
   }
 }
```
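The heart of the new data source/view branch is rebuilding the schema while carrying column comments through `StructField` metadata. A standalone sketch of that pattern using plain Spark SQL types (the column tuples are illustrative stand-ins for the source table's `CatalogColumn` entries):

```scala
import org.apache.spark.sql.types._

// (name, dataType, nullable, optional comment) for each source column.
val columns = Seq(
  ("id", LongType, false, Some("surrogate key")),
  ("name", StringType, true, None))

val outputSchema = StructType(columns.map { case (name, dataType, nullable, comment) =>
  val builder = new MetadataBuilder
  // A column comment survives the copy by riding along in the field metadata.
  comment.foreach(c => builder.putString("comment", c))
  StructField(name, dataType, nullable, metadata = builder.build())
})

assert(outputSchema("id").metadata.getString("comment") == "surrogate key")
```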

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala

Lines changed: 3 additions & 1 deletion
```diff
@@ -412,7 +412,9 @@ private[hive] class HiveClientImpl(
         serdeProperties = Option(h.getTTable.getSd.getSerdeInfo.getParameters)
           .map(_.asScala.toMap).orNull
       ),
-      properties = properties.filter(kv => kv._1 != "comment"),
+      // For EXTERNAL_TABLE, the table properties has a particular field "EXTERNAL". This is added
+      // in the function toHiveTable.
+      properties = properties.filter(kv => kv._1 != "comment" && kv._1 != "EXTERNAL"),
       comment = properties.get("comment"),
       viewOriginalText = Option(h.getViewOriginalText),
       viewText = Option(h.getViewExpandedText),
```
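Why filtering `EXTERNAL` matters: Hive's `ObjectStore` decides a table's type from that property, so any metadata copy that retains it resurrects external-ness. A minimal illustration with plain Scala maps (the property values are made up):

```scala
// Raw properties as a metastore might return them for an external table.
val fromMetastore = Map("EXTERNAL" -> "TRUE", "comment" -> "demo", "prop1" -> "value1")

// Mirror the filter above: drop the bookkeeping keys so a later metadata copy
// (e.g. CREATE TABLE LIKE) cannot smuggle the EXTERNAL flag into a MANAGED
// table. toHiveTable re-adds the flag when the table type really is EXTERNAL.
val cleaned = fromMetastore.filter(kv => kv._1 != "comment" && kv._1 != "EXTERNAL")
assert(cleaned == Map("prop1" -> "value1"))
```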

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala

Lines changed: 231 additions & 1 deletion
```diff
@@ -24,8 +24,11 @@ import org.scalatest.BeforeAndAfterEach
 
 import org.apache.spark.internal.config._
 import org.apache.spark.sql.{AnalysisException, QueryTest, Row, SaveMode}
-import org.apache.spark.sql.catalyst.catalog.{CatalogDatabase, CatalogTableType}
+import org.apache.spark.sql.catalyst.catalog.{CatalogDatabase, CatalogTable, CatalogTableType}
 import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.execution.command.{CreateDataSourceTableUtils, DDLUtils}
+import org.apache.spark.sql.execution.command.CreateDataSourceTableUtils._
+import org.apache.spark.sql.execution.datasources.CaseInsensitiveMap
 import org.apache.spark.sql.hive.test.TestHiveSingleton
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.test.SQLTestUtils
@@ -651,6 +654,233 @@ class HiveDDLSuite
     }
   }
 
+
+  test("CREATE TABLE LIKE a temporary view") {
+    val sourceViewName = "tab1"
+    val targetTabName = "tab2"
+    withTempView(sourceViewName) {
+      withTable(targetTabName) {
+        spark.range(10).select('id as 'a, 'id as 'b, 'id as 'c, 'id as 'd)
+          .createTempView(sourceViewName)
+        sql(s"CREATE TABLE $targetTabName LIKE $sourceViewName")
+
+        val sourceTable = spark.sessionState.catalog.getTableMetadata(
+          TableIdentifier(sourceViewName, None))
+        val targetTable = spark.sessionState.catalog.getTableMetadata(
+          TableIdentifier(targetTabName, Some("default")))
+
+        checkCreateTableLike(sourceTable, targetTable)
+      }
+    }
+  }
+
+  test("CREATE TABLE LIKE a data source table") {
+    val sourceTabName = "tab1"
+    val targetTabName = "tab2"
+    withTable(sourceTabName, targetTabName) {
+      spark.range(10).select('id as 'a, 'id as 'b, 'id as 'c, 'id as 'd)
+        .write.format("json").saveAsTable(sourceTabName)
+      sql(s"CREATE TABLE $targetTabName LIKE $sourceTabName")
+
+      val sourceTable =
+        spark.sessionState.catalog.getTableMetadata(TableIdentifier(sourceTabName, Some("default")))
+      val targetTable =
+        spark.sessionState.catalog.getTableMetadata(TableIdentifier(targetTabName, Some("default")))
+      // The table type of the source table should be a Hive-managed data source table
+      assert(DDLUtils.isDatasourceTable(sourceTable))
+      assert(sourceTable.tableType == CatalogTableType.MANAGED)
+
+      checkCreateTableLike(sourceTable, targetTable)
+    }
+  }
+
+  test("CREATE TABLE LIKE an external data source table") {
+    val sourceTabName = "tab1"
+    val targetTabName = "tab2"
+    withTable(sourceTabName, targetTabName) {
+      withTempPath { dir =>
+        val path = dir.getCanonicalPath
+        spark.range(10).select('id as 'a, 'id as 'b, 'id as 'c, 'id as 'd)
+          .write.format("parquet").save(path)
+        sql(s"CREATE TABLE $sourceTabName USING parquet OPTIONS (PATH '$path')")
+        sql(s"CREATE TABLE $targetTabName LIKE $sourceTabName")
+
+        // The source table should be an external data source table
+        val sourceTable = spark.sessionState.catalog.getTableMetadata(
+          TableIdentifier(sourceTabName, Some("default")))
+        val targetTable = spark.sessionState.catalog.getTableMetadata(
+          TableIdentifier(targetTabName, Some("default")))
+        // The table type of the source table should be an external data source table
+        assert(DDLUtils.isDatasourceTable(sourceTable))
+        assert(sourceTable.tableType == CatalogTableType.EXTERNAL)
+
+        checkCreateTableLike(sourceTable, targetTable)
+      }
+    }
+  }
+
+  test("CREATE TABLE LIKE a managed Hive serde table") {
+    val catalog = spark.sessionState.catalog
+    val sourceTabName = "tab1"
+    val targetTabName = "tab2"
+    withTable(sourceTabName, targetTabName) {
+      sql(s"CREATE TABLE $sourceTabName TBLPROPERTIES('prop1'='value1') AS SELECT 1 key, 'a'")
+      sql(s"CREATE TABLE $targetTabName LIKE $sourceTabName")
+
+      val sourceTable = catalog.getTableMetadata(TableIdentifier(sourceTabName, Some("default")))
+      assert(sourceTable.tableType == CatalogTableType.MANAGED)
+      assert(sourceTable.properties.get("prop1").nonEmpty)
+      val targetTable = catalog.getTableMetadata(TableIdentifier(targetTabName, Some("default")))
+
+      checkCreateTableLike(sourceTable, targetTable)
+    }
+  }
+
+  test("CREATE TABLE LIKE an external Hive serde table") {
+    val catalog = spark.sessionState.catalog
+    withTempDir { tmpDir =>
+      val basePath = tmpDir.getCanonicalPath
+      val sourceTabName = "tab1"
+      val targetTabName = "tab2"
+      withTable(sourceTabName, targetTabName) {
+        assert(tmpDir.listFiles.isEmpty)
+        sql(
+          s"""
+             |CREATE EXTERNAL TABLE $sourceTabName (key INT comment 'test', value STRING)
+             |COMMENT 'Apache Spark'
+             |PARTITIONED BY (ds STRING, hr STRING)
+             |LOCATION '$basePath'
+           """.stripMargin)
+        for (ds <- Seq("2008-04-08", "2008-04-09"); hr <- Seq("11", "12")) {
+          sql(
+            s"""
+               |INSERT OVERWRITE TABLE $sourceTabName
+               |partition (ds='$ds',hr='$hr')
+               |SELECT 1, 'a'
+             """.stripMargin)
+        }
+        sql(s"CREATE TABLE $targetTabName LIKE $sourceTabName")
+
+        val sourceTable = catalog.getTableMetadata(TableIdentifier(sourceTabName, Some("default")))
+        assert(sourceTable.tableType == CatalogTableType.EXTERNAL)
+        assert(sourceTable.comment == Option("Apache Spark"))
+        val targetTable = catalog.getTableMetadata(TableIdentifier(targetTabName, Some("default")))
+
+        checkCreateTableLike(sourceTable, targetTable)
+      }
+    }
+  }
+
+  test("CREATE TABLE LIKE a view") {
+    val sourceTabName = "tab1"
+    val sourceViewName = "view"
+    val targetTabName = "tab2"
+    withTable(sourceTabName, targetTabName) {
+      withView(sourceViewName) {
+        spark.range(10).select('id as 'a, 'id as 'b, 'id as 'c, 'id as 'd)
+          .write.format("json").saveAsTable(sourceTabName)
+        sql(s"CREATE VIEW $sourceViewName AS SELECT * FROM $sourceTabName")
+        sql(s"CREATE TABLE $targetTabName LIKE $sourceViewName")
+
+        val sourceView = spark.sessionState.catalog.getTableMetadata(
+          TableIdentifier(sourceViewName, Some("default")))
+        // The original source should be a VIEW with an empty path
+        assert(sourceView.tableType == CatalogTableType.VIEW)
+        assert(sourceView.viewText.nonEmpty && sourceView.viewOriginalText.nonEmpty)
+        val targetTable = spark.sessionState.catalog.getTableMetadata(
+          TableIdentifier(targetTabName, Some("default")))
+
+        checkCreateTableLike(sourceView, targetTable)
+      }
+    }
+  }
+
+  private def getTablePath(table: CatalogTable): Option[String] = {
+    if (DDLUtils.isDatasourceTable(table)) {
+      new CaseInsensitiveMap(table.storage.serdeProperties).get("path")
+    } else {
+      table.storage.locationUri
+    }
+  }
+
+  private def checkCreateTableLike(sourceTable: CatalogTable, targetTable: CatalogTable): Unit = {
+    // The created table should be a MANAGED table with empty view text and original text.
+    assert(targetTable.tableType == CatalogTableType.MANAGED,
+      "the created table must be a Hive managed table")
+    assert(targetTable.viewText.isEmpty && targetTable.viewOriginalText.isEmpty,
+      "the view text and original text in the created table must be empty")
+    assert(targetTable.comment.isEmpty,
+      "the comment in the created table must be empty")
+    assert(targetTable.unsupportedFeatures.isEmpty,
+      "the unsupportedFeatures in the create table must be empty")
+
+    val metastoreGeneratedProperties = Seq(
+      "CreateTime",
+      "transient_lastDdlTime",
+      "grantTime",
+      "lastUpdateTime",
+      "last_modified_by",
+      "last_modified_time",
+      "Owner:",
+      "COLUMN_STATS_ACCURATE",
+      "numFiles",
+      "numRows",
+      "rawDataSize",
+      "totalSize",
+      "totalNumberFiles",
+      "maxFileSize",
+      "minFileSize"
+    )
+    assert(targetTable.properties.filterKeys { key =>
+      !metastoreGeneratedProperties.contains(key) && !key.startsWith(DATASOURCE_PREFIX)
+    }.isEmpty,
+      "the table properties of source tables should not be copied in the created table")
+
+    if (DDLUtils.isDatasourceTable(sourceTable) ||
+        sourceTable.tableType == CatalogTableType.VIEW) {
+      assert(DDLUtils.isDatasourceTable(targetTable),
+        "the target table should be a data source table")
+    } else {
+      assert(!DDLUtils.isDatasourceTable(targetTable),
+        "the target table should be a Hive serde table")
+    }
+
+    if (sourceTable.tableType == CatalogTableType.VIEW) {
+      // Source table is a temporary/permanent view, which does not have a provider. The created
+      // target table uses the default data source format
+      assert(targetTable.properties(CreateDataSourceTableUtils.DATASOURCE_PROVIDER) ==
+        spark.sessionState.conf.defaultDataSourceName)
+    } else if (DDLUtils.isDatasourceTable(sourceTable)) {
+      assert(targetTable.properties(CreateDataSourceTableUtils.DATASOURCE_PROVIDER) ==
+        sourceTable.properties(CreateDataSourceTableUtils.DATASOURCE_PROVIDER))
+    }
+
+    val sourceTablePath = getTablePath(sourceTable)
+    val targetTablePath = getTablePath(targetTable)
+    assert(targetTablePath.nonEmpty, "target table path should not be empty")
+    assert(sourceTablePath != targetTablePath,
+      "source table/view path should be different from target table path")
+
+    // The source table contents should not been seen in the target table.
+    assert(spark.table(sourceTable.identifier).count() != 0, "the source table should be nonempty")
+    assert(spark.table(targetTable.identifier).count() == 0, "the target table should be empty")
+
+    // Their schema should be identical
+    checkAnswer(
+      sql(s"DESC ${sourceTable.identifier}").select("col_name", "data_type"),
+      sql(s"DESC ${targetTable.identifier}").select("col_name", "data_type"))
+
+    withSQLConf("hive.exec.dynamic.partition.mode" -> "nonstrict") {
+      // Check whether the new table can be inserted using the data from the original table
+      sql(s"INSERT INTO TABLE ${targetTable.identifier} SELECT * FROM ${sourceTable.identifier}")
+    }
+
+    // After insertion, the data should be identical
+    checkAnswer(
+      sql(s"SELECT * FROM ${sourceTable.identifier}"),
+      sql(s"SELECT * FROM ${targetTable.identifier}"))
+  }
+
   test("Analyze data source tables(LogicalRelation)") {
     withTable("t1") {
       withTempPath { dir =>
```