[SPARK-17729] [SQL] Enable creating hive bucketed tables #17644
Changes from all commits: 3799d18, 6315dda, 303f442, df3fd7a, b2784ba, 0aa8539, bf306da, 865711f
`HiveClientImpl.scala`:

```diff
@@ -27,10 +27,11 @@ import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.Path
 import org.apache.hadoop.hive.conf.HiveConf
 import org.apache.hadoop.hive.metastore.{TableType => HiveTableType}
-import org.apache.hadoop.hive.metastore.api.{Database => HiveDatabase, FieldSchema}
+import org.apache.hadoop.hive.metastore.api.{Database => HiveDatabase, FieldSchema, Order}
 import org.apache.hadoop.hive.metastore.api.{SerDeInfo, StorageDescriptor}
 import org.apache.hadoop.hive.ql.Driver
 import org.apache.hadoop.hive.ql.metadata.{Hive, Partition => HivePartition, Table => HiveTable}
+import org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.HIVE_COLUMN_ORDER_ASC
 import org.apache.hadoop.hive.ql.processors._
 import org.apache.hadoop.hive.ql.session.SessionState
 import org.apache.hadoop.security.UserGroupInformation
```
```diff
@@ -373,10 +374,30 @@ private[hive] class HiveClientImpl(
     Option(client.getTable(dbName, tableName, false)).map { h =>
       // Note: Hive separates partition columns and the schema, but for us the
       // partition columns are part of the schema
+      val cols = h.getCols.asScala.map(fromHiveColumn)
       val partCols = h.getPartCols.asScala.map(fromHiveColumn)
-      val schema = StructType(h.getCols.asScala.map(fromHiveColumn) ++ partCols)
+      val schema = StructType(cols ++ partCols)
+
+      val bucketSpec = if (h.getNumBuckets > 0) {
+        val sortColumnOrders = h.getSortCols.asScala
+        // Currently Spark only supports columns to be sorted in ascending order
+        // but Hive can support both ascending and descending order. If all the columns
+        // are sorted in ascending order, only then propagate the sortedness information
+        // to downstream processing / optimizations in Spark
+        // TODO: In future we can have Spark support columns sorted in descending order
+        val allAscendingSorted = sortColumnOrders.forall(_.getOrder == HIVE_COLUMN_ORDER_ASC)
+
+        val sortColumnNames = if (allAscendingSorted) {
+          sortColumnOrders.map(_.getCol)
+        } else {
+          Seq()
+        }
+        Option(BucketSpec(h.getNumBuckets, h.getBucketCols.asScala, sortColumnNames))
+      } else {
+        None
+      }
 
-      // Skew spec, storage handler, and bucketing info can't be mapped to CatalogTable (yet)
+      // Skew spec and storage handler can't be mapped to CatalogTable (yet)
       val unsupportedFeatures = ArrayBuffer.empty[String]
 
       if (!h.getSkewedColNames.isEmpty) {
```
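The ascending-only rule described in the comments above means that a Hive table whose sort spec mixes ASC and DESC columns keeps its bucket columns but loses all sort information when mapped to a Spark `BucketSpec`. A minimal standalone sketch of that check, using hypothetical column names and assuming Hive's `BaseSemanticAnalyzer` exposes a `HIVE_COLUMN_ORDER_DESC` constant alongside the `HIVE_COLUMN_ORDER_ASC` one imported in the diff:

```scala
import org.apache.hadoop.hive.metastore.api.Order
import org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.{HIVE_COLUMN_ORDER_ASC, HIVE_COLUMN_ORDER_DESC}

// Hypothetical sort spec for a table declared in Hive as SORTED BY (a ASC, b DESC).
val sortColumnOrders = Seq(
  new Order("a", HIVE_COLUMN_ORDER_ASC),
  new Order("b", HIVE_COLUMN_ORDER_DESC))

// Same check as in the diff: propagate sort columns only if every column is ascending.
val allAscendingSorted = sortColumnOrders.forall(_.getOrder == HIVE_COLUMN_ORDER_ASC)
val sortColumnNames = if (allAscendingSorted) sortColumnOrders.map(_.getCol) else Seq()

// sortColumnNames is empty here, so the resulting BucketSpec would carry the bucket
// count and bucket columns but no sort columns.
```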
```diff
@@ -387,10 +408,6 @@
         unsupportedFeatures += "storage handler"
       }
 
-      if (!h.getBucketCols.isEmpty) {
-        unsupportedFeatures += "bucketing"
-      }
-
       if (h.getTableType == HiveTableType.VIRTUAL_VIEW && partCols.nonEmpty) {
         unsupportedFeatures += "partitioned view"
       }
```
```diff
@@ -408,9 +425,11 @@
         },
         schema = schema,
         partitionColumnNames = partCols.map(_.name),
-        // We can not populate bucketing information for Hive tables as Spark SQL has a different
-        // implementation of hash function from Hive.
-        bucketSpec = None,
+        // If the table is written by Spark, we will put bucketing information in table properties,
+        // and will always overwrite the bucket spec in hive metastore by the bucketing information
+        // in table properties. This means, if we have bucket spec in both hive metastore and
+        // table properties, we will trust the one in table properties.
+        bucketSpec = bucketSpec,
         owner = h.getOwner,
         createTime = h.getTTable.getCreateTime.toLong * 1000,
         lastAccessTime = h.getLastAccessTime.toLong * 1000,
```
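With `bucketSpec` now populated from the metastore, a table that was bucketed on the Hive side should surface its bucketing metadata when inspected from Spark instead of being flagged as using an unsupported feature. A hedged sketch (table and column names are made up, and the exact `DESCRIBE FORMATTED` row labels can vary between Spark versions):

```scala
// Assume this table was created from the Hive CLI:
//   CREATE TABLE hive_bucketed (key INT, value STRING)
//   CLUSTERED BY (key) SORTED BY (key ASC) INTO 8 BUCKETS
//   STORED AS ORC;

// Before this change, CatalogTable.bucketSpec was always None for Hive tables and
// "bucketing" appeared under unsupported features. After it, the spec read from the
// metastore is visible in the table metadata:
spark.sql("DESCRIBE FORMATTED hive_bucketed").show(100, truncate = false)
// Expected to include rows along the lines of:
//   Num Buckets      8
//   Bucket Columns   [`key`]
//   Sort Columns     [`key`]
```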
```diff
@@ -870,6 +889,23 @@ private[hive] object HiveClientImpl {
       hiveTable.setViewOriginalText(t)
       hiveTable.setViewExpandedText(t)
     }
+
+    table.bucketSpec match {
+      case Some(bucketSpec) if DDLUtils.isHiveTable(table) =>
+        hiveTable.setNumBuckets(bucketSpec.numBuckets)
+        hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)
+
+        if (bucketSpec.sortColumnNames.nonEmpty) {
+          hiveTable.setSortCols(
+            bucketSpec.sortColumnNames
+              .map(col => new Order(col, HIVE_COLUMN_ORDER_ASC))
+              .toList
+              .asJava
+          )
+        }
+      case _ =>
+    }
+
     hiveTable
   }
```

Inline review discussion on the `case Some(bucketSpec) if DDLUtils.isHiveTable(table) =>` line:

Contributor: After thinking about it more, I think this one is less important. Our main goal is to allow Spark to read bucketed tables written by Hive, not to allow Hive to read bucketed tables written by Spark. How about we remove it for now and add it later after more discussion?

Author: If Hive is NOT able to read the datasource tables which are bucketed, that's OK as it's not compatible with Hive. But for Hive native tables, interoperability between Spark and Hive is what I want.

Contributor: ok, makes sense
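Since `toHiveTable` now copies the `BucketSpec` onto the Hive `Table` object, the bucketing of a Hive-format table created through Spark ends up in the metastore, which is the interoperability the author argues for above. A sketch with made-up names, assuming the `CLUSTERED BY` / `SORTED BY` clause is accepted for Hive-format `CREATE TABLE` statements issued from Spark (the feature named in the PR title):

```scala
spark.sql("""
  CREATE TABLE spark_created_bucketed (key INT, value STRING)
  CLUSTERED BY (key) SORTED BY (key ASC) INTO 4 BUCKETS
  STORED AS ORC
""")
// toHiveTable maps the BucketSpec to the Hive metastore fields: numBuckets = 4,
// bucketCols = [key], sortCols = [Order(key, ASC)]. A plain Hive client therefore
// sees this as an ordinary bucketed table.
```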
`InsertIntoHiveTable.scala`:

```diff
@@ -307,6 +307,27 @@ case class InsertIntoHiveTable(
       }
     }
 
+    table.bucketSpec match {
+      case Some(bucketSpec) =>
+        // Writes to bucketed hive tables are allowed only if user does not care about maintaining
+        // table's bucketing ie. both "hive.enforce.bucketing" and "hive.enforce.sorting" are
+        // set to false
+        val enforceBucketingConfig = "hive.enforce.bucketing"
+        val enforceSortingConfig = "hive.enforce.sorting"
+
+        val message = s"Output Hive table ${table.identifier} is bucketed but Spark" +
+          "currently does NOT populate bucketed output which is compatible with Hive."
+
+        if (hadoopConf.get(enforceBucketingConfig, "true").toBoolean ||
+            hadoopConf.get(enforceSortingConfig, "true").toBoolean) {
+          throw new AnalysisException(message)
+        } else {
+          logWarning(message + s" Inserting data anyways since both $enforceBucketingConfig and " +
+            s"$enforceSortingConfig are set to false.")
+        }
+      case _ => // do nothing since table has no bucketing
+    }
+
     val committer = FileCommitProtocol.instantiate(
       sparkSession.sessionState.conf.fileCommitProtocolClass,
       jobId = java.util.UUID.randomUUID().toString,
```

Inline review discussion on the `logWarning(...)` branch:

Contributor: shall we remove the bucket properties of the table in this case? what does hive do?

Author: With Spark, the data would currently be written out in a non-conformant way regardless of whether that config is set. This PR follows the model of Hive 1.x. I am working on a follow-up PR which would make things behave like Hive 2.0.

Contributor: so after insertion (if not enforcing), the table is still a bucketed table but reading it will cause wrong results?

Author: In Hive: it would lead to wrong results. In Spark (on master and also after this PR): the table scan operation does not take bucketing into account, so it would be read as a regular table. So it won't be read "wrong"; it's just that we won't take advantage of bucketing.
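A sketch of how a write into such a bucketed Hive table behaves under the check above (names are hypothetical, and it assumes values set via `SET` reach the Hadoop configuration that `InsertIntoHiveTable` consults):

```scala
// With the defaults (both configs treated as "true"), the insert is rejected with an
// AnalysisException stating that the output table is bucketed but Spark does not
// produce Hive-compatible bucketed output.
// spark.sql("INSERT OVERWRITE TABLE hive_bucketed SELECT key, value FROM staging")

// Explicitly opting out of Hive's bucketing and sorting guarantees lets the insert
// run, logging a warning instead of failing. The written data is NOT bucketed by
// Hive's rules, so Hive should not rely on this table's bucketing afterwards.
spark.sql("SET hive.enforce.bucketing=false")
spark.sql("SET hive.enforce.sorting=false")
spark.sql("INSERT OVERWRITE TABLE hive_bucketed SELECT key, value FROM staging")
```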
Review discussion on a test case change (not shown above):

Contributor: @tejasapatil do you still remember why we update it?

Author: yes. This was done because before this PR the test case was removing bucketed columns in the alter operation. We decided to disallow removing bucketing columns and support this if needed in the future. Here is the discussion that we both had about this: #17644 (comment)

Contributor: Huh. If I revert the change to this test case, it does not fail anymore. This is bad because the table properties still say that it's bucketed over `col1` but `col1` is not in the modified table schema. I am taking a look to see what changed.