Support for registering Delta tables in the HiveMetastore #85
Is there any chance that this would be part of the 0.4.0 milestone?
Hard to say. It all depends on the timing of the Spark 3.0.0 release. We are working with the Spark community to add the necessary Datasource V2 APIs that would allow us to plug into all the DDLs like CREATE TABLE, ALTER TABLE, etc. and make them work with a combined Hive metastore + Delta setup. Those new APIs are targeted for Spark 3.0.0.
Registering an existing table in the metastore seems to work already, so is this feature mainly for creating a new table, like #177?
@yucai I got exactly the exception that appears in #177 when I ran your code.
Is there any workaround to register a Delta table in the Hive metastore using the Spark API?
@dgreenshtein it works for me perfectly. I run them like below:
Hm, the only difference is the Delta Lake version; I am running io.delta:delta-core_2.11:0.4.0. It did not work with 0.3.0 either.
@dgreenshtein
It just created metastore_db locally and can't access the Hive metastore.
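For readers hitting the same symptom: a metastore_db directory appearing locally usually means Spark fell back to its default embedded Derby metastore instead of connecting to an external Hive metastore. A minimal sketch of a session wired to an external metastore follows; the thrift URI and application name are placeholders, and the original poster's actual setup is not shown in this thread:
import org.apache.spark.sql.SparkSession
// Without enableHiveSupport() (or a hive-site.xml on the classpath), Spark uses a
// local embedded Derby metastore, which shows up as a metastore_db directory.
val spark = SparkSession.builder()
  .appName("hive-metastore-example")                              // placeholder name
  .config("hive.metastore.uris", "thrift://metastore-host:9083")  // placeholder URI
  .enableHiveSupport()
  .getOrCreate()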
Any news on this? I am also trying to register my Delta table in the Hive metastore, without success. I'm running Spark 2.4.4 as follows:
And trying to run
I'm getting the following error. Any idea?
No updates; we can't support this until Spark 3.0 is released. It's on our future roadmap to support once 3.0 comes out.
Is there any metastore that is currently supported with the 0.5.0 version? Does Glue or any other work? Is there a standalone metastore for Delta Lake? I can't seem to find it in the documentation, and I need to register my tables to perform the auto-compaction optimizations. Thanks
If you're running on Databricks (which I assume you mean by running auto-compaction optimizations, which are only available there), Hive and Glue metastores are both supported. See https://docs.databricks.com/delta/delta-batch.html#create-a-table. This project is for discussion around open-source Delta Lake. If you have any questions about Databricks, feel free to open a support ticket or ask on Slack.
I'm using the open-source version. Do I have any metastore available? Is there a list of features only available in the Databricks distribution? Thanks
Delta Lake on Databricks has some performance optimizations as a result of being part of the Databricks Runtime; we're aiming for full API compatibility in OSS Delta Lake (though for some things like metastore support that requires changes only coming in Spark 3.0). Auto compaction is only available in Databricks; if you're talking about the Hive-ACID compaction, that won't work with Delta Lake. You can implement your own compaction using something like https://docs.delta.io/latest/best-practices.html#compact-files. If you have any further questions, please create a new issue.
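For readers who follow the link above, a minimal manual-compaction sketch along the lines of the Delta best-practices documentation might look like the following; the path, partition filter, and target file count are placeholder values:
// Rewrite one partition of a Delta table into fewer, larger files.
// dataChange = false marks the rewrite as not changing data, so downstream
// streaming readers can ignore it.
val path = "s3://bucket/delta-table"      // placeholder path
val partition = "date = '2019-01-01'"     // placeholder partition filter
val numFiles = 16                         // target number of files
spark.read
  .format("delta")
  .load(path)
  .where(partition)
  .repartition(numFiles)
  .write
  .option("dataChange", "false")
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", partition)
  .save(path)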
We just migrated to Databricks Delta from Parquet using the Hive metastore. So far everything seems to work.
All of the above is executed from a Databricks notebook. My question is: why am I getting two different locations even though the table name is the same? Where is the correct location for the Delta table stored, if not in the hiveMetastore db?
What are the two different locations? Can you show the output of resDF?
I can't show the actual values, but they are completely different S3 paths, @tdas; neither of them is related to the other in any way (e.g., one being the root of the other). Is there anything I should try? Where should the Delta table location be stored in this case?
Neither of them matches the actual one? Either way, this issue is not related to the root issue of metastore support, so maybe we should open a different issue for it. Though I doubt we can do anything without taking a deeper look. It might be better to contact Databricks support so that we can look at what is going on.
One path should look like
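For anyone debugging a similar mismatch, one way to see the location Delta itself reports is DESCRIBE DETAIL, which returns a location column for Delta tables; a small sketch using the database and table names that appear later in this thread:
// Compare the location Delta reports against the paths found in the metastore tables.
val detailDf = spark.sql("DESCRIBE DETAIL db_production.my_table")
detailDf.select("location").show(truncate = false)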
Just a status update on the support for defining Delta-format tables in the Hive Metastore. We are going to add support for defining tables and all the associated DDL commands (CREATE, ALTER, DROP, etc.) in Delta Lake 0.7.0, when we add support for Apache Spark 3.0. Delta Lake 0.7.0 is expected to come out roughly in June/July, whenever the Apache Spark community votes and decides to release 3.0.
My apologies for the late reply; I have some updates regarding this one. I tried to execute the following code:
// import needed for the $"..." and 'symbol column syntax used in the joins below
// (pre-imported automatically in notebooks and spark-shell)
import spark.implicits._
val sdsDF = spark.read
.format("jdbc")
.option("url", activeConnection.url)
.option("dbtable", "hiveMetastore.SDS")
.option("user", activeConnection.user)
.option("password", activeConnection.pwd)
.load()
val tblsDf = spark.read
.format("jdbc")
.option("url", activeConnection.url)
.option("dbtable", "hiveMetastore.TBLS")
.option("user", activeConnection.user)
.option("password", activeConnection.pwd)
.load()
val dbsDf = spark.read
.format("jdbc")
.option("url", activeConnection.url)
.option("dbtable", "hiveMetastore.DBS")
.option("user", activeConnection.user)
.option("password", activeConnection.pwd)
.load()
val paramsDf = spark.read
.format("jdbc")
.option("url", activeConnection.url)
.option("dbtable", "hiveMetastore.TABLE_PARAMS")
.option("user", activeConnection.user)
.option("password", activeConnection.pwd)
.load()
val resDf = sdsDF.join(tblsDf, "SD_ID")
.join(dbsDf, "DB_ID")
.join(paramsDf, "TBL_ID")
.where('TBL_NAME === "my_table" && 'NAME === "db_production")
.select($"TBL_NAME", $"TBL_TYPE", $"NAME".as("DB_NAME"), $"DB_LOCATION_URI", $"LOCATION".as("TABLE_LOCATION"), $"PARAM_KEY", $"PARAM_VALUE") Which is similar to the previous code with the difference that in addition I join with
As you can see, DB_LOCATION and TABLE_LOCATION have invalid values and do not correspond to the actual S3 path. @zsxwing as you can see, the path is still S3, not dbfs. @tdas when I do
Thank you both
Is there any tentative timeline for having the functionality in the DeltaTable API to create managed tables?
We are hoping to make a 0.7.0-preview release on Spark 3.0/RCx in the next couple of weeks. Most of the code changes for metastore support have been merged into master. Hence, I am closing this ticket.
Was this issue closed on the hope of some code getting integrated into a Spark 3.0 preview release, and not based on a verified, tested, publicly available general release? If so, that is really poor release management!!! Please specify exactly what release combinations have been proven to work.
This has been released in Delta 0.7.0 on Spark 3.0. Please see the attached milestone and the corresponding release notes at https://github.com/delta-io/delta/releases
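For reference, using these DDL commands from OSS Delta 0.7.0 on Spark 3.0 requires configuring the Delta SQL extension and catalog on the Spark session; a minimal sketch, with the application name as a placeholder:
import org.apache.spark.sql.SparkSession
// Spark 3.0 session configured so that CREATE/ALTER/DROP TABLE on Delta tables
// go through Delta's catalog implementation and are recorded in the Hive metastore.
val spark = SparkSession.builder()
  .appName("delta-hms-example")  // placeholder name
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .enableHiveSupport()
  .getOrCreate()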
Currently using spark-core_2.12:3.1.2, spark-hive_2.12:3.1.2 and delta-core_2.12:1.0.1. When deploying the table definitions like
I still get the message: WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table <table_name> into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. Is this the correct behavior? When querying the table like below using HiveContext I get a result, so it looks like the table is at least registered in Hive...
Please ignore this error. This is just saying you cannot use Hive to query this table. As long as you are using Spark, it should be fine.
Instead of executing the command spark.sql("""CREATE TABLE XY USING DELTA LOCATION 'XY'""") to create a table in the Hive metastore, is there another command supported when creating a table using spark.sql(...) so that I can query the table with Hive?
Instead of executing the command spark.sql("""CREATE TABLE XY USING DELTA LOCATION 'XY'""") to create a table in the Hive metastore, is there another command supported when creating a table using spark.sql(...) so that I can query the table with Hive?
No. This is not supported today. We have an open issue for this in #1045. Please add new comments in that open ticket instead. Thanks!
To follow up on the side issue here for future readers: you can get it from the SERDE_PARAMS table in the Hive Metastore; there is an entry
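Building on the JDBC snippets earlier in the thread, a sketch of reading that table (activeConnection is the same connection object used above; the exact parameter key is the part elided in the comment, so the "path" filter below is an assumption based on how Spark data source tables are commonly stored):
import org.apache.spark.sql.functions.col
// Read SERDE_PARAMS the same way the earlier snippets read SDS, TBLS and DBS.
val serdeParamsDf = spark.read
  .format("jdbc")
  .option("url", activeConnection.url)
  .option("dbtable", "hiveMetastore.SERDE_PARAMS")
  .option("user", activeConnection.user)
  .option("password", activeConnection.pwd)
  .load()
// Assumed key: Spark data source tables usually keep their location under a "path" entry.
serdeParamsDf.where(col("PARAM_KEY") === "path").show(truncate = false)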
It appears that this error should not be disregarded.
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
val innerStruct = StructType(Array(
StructField("image", StringType),
StructField("content", StringType)
))
val outerStruct = StructType(Array(
StructField("some_detail", innerStruct)
))
val schema = StructType(Array(
StructField("detail", outerStruct)
))
val data = Seq(
Row(Row(Row("image_example", "content_text_example")))
)
val df = spark.createDataFrame(
spark.sparkContext.parallelize(data),
schema
)
df.write.format("delta").save("s3://path") We then created a delta table: CREATE TABLE test.test_table_name USING delta LOCATION 's3://path' However, we ended up with a record in the Hive metastore database hdfs://[our-emr-cluster-id]/xxxxx/yyy instead of |
Here's an example of getting it to work with Spark SQL, HMS, MinIO S3, and StarRocks: https://github.com/StarRocks/demo/tree/master/documentation-samples/deltalake
While Delta tracks its own metadata in the transaction log, the Hive Metastore is still important, as it enables users to find tables without knowing the path to the data.
This ticket tracks adding the ability to run
CREATE TABLE
to create a new metastore table, or to register an existing table in the metastore.
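A brief sketch of the two forms described here, as they became possible once Spark 3.0 support landed; the table names and the location path are placeholders:
// Create a brand-new managed Delta table in the metastore.
spark.sql("CREATE TABLE events (id BIGINT, ts TIMESTAMP) USING DELTA")
// Register an existing Delta table that already lives at a known path.
spark.sql("CREATE TABLE events_external USING DELTA LOCATION 's3://bucket/events'")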