Support for registering Delta tables in the HiveMetastore #85
Is there any chance that this would be part of the 0.4.0 milestone?
Hard to say. It all depends on the timing of the Spark 3.0.0 release. We are working with the Spark community to add the necessary Datasource V2 APIs that would allow us to plug into all the DDLs like CREATE TABLE, ALTER TABLE, etc. and make them work with a combined Hive metastore + Delta setup. Those new APIs are targeted for Spark 3.0.0.
Registering an existing table in the metastore seems to work already, so is this feature mainly for creating a new table, like #177?
@yucai I got exactly the exception that appears in #177 when I ran your code.
Is there any workaround to register a Delta table in the Hive metastore using the Spark API?
@dgreenshtein it works for me perfectly. I run them like below:
Hm, the only difference is the Delta Lake version; I am running io.delta:delta-core_2.11:0.4.0. It did not work with 0.3.0 either.
@dgreenshtein
It just created metastore_db locally and can't access the Hive metastore.
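For readers hitting the same symptom: a metastore_db directory appearing locally usually means Spark fell back to its default embedded Derby metastore instead of connecting to an external Hive metastore. A minimal sketch of a session wired to an external metastore follows; the thrift URI and application name are placeholders, and the original poster's actual setup is not shown in this thread:
import org.apache.spark.sql.SparkSession
// Without enableHiveSupport() (or a hive-site.xml on the classpath), Spark uses a
// local embedded Derby metastore, which shows up as a metastore_db directory.
val spark = SparkSession.builder()
  .appName("hive-metastore-example")                              // placeholder name
  .config("hive.metastore.uris", "thrift://metastore-host:9083")  // placeholder URI
  .enableHiveSupport()
  .getOrCreate()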
Any news on this? I am also trying to register my Delta table in the Hive metastore, without success. I'm running Spark 2.4.4 as follows:
And trying to run
I'm getting the following error. Any idea?
No updates; we can't support this until Spark 3.0 is released. It's on our future roadmap to support once 3.0 comes out.
Is there any metastore that is currently supported with the 0.5.0 version? Does Glue or any other work? Is there a standalone metastore for Delta Lake? I can't seem to find it in the documentation, and I need to register my tables to perform the auto-compaction optimizations. Thanks
If you're running on Databricks (which I assume you mean by running auto-compaction optimizations, which are only available there), Hive and Glue metastores are both supported. See https://docs.databricks.com/delta/delta-batch.html#create-a-table. This project is for discussion around open-source Delta Lake. If you have any questions about Databricks, feel free to open a support ticket or ask on Slack.
I'm using the open-source version. Do I have any metastore available? Is there a list of features only available in the Databricks distribution? Thanks
Delta Lake on Databricks has some performance optimizations as a result of being part of the Databricks Runtime; we're aiming for full API compatibility in OSS Delta Lake (though for some things like metastore support that requires changes only coming in Spark 3.0). Auto compaction is only available in Databricks; if you're talking about the Hive-ACID compaction, that won't work with Delta Lake. You can implement your own compaction using something like https://docs.delta.io/latest/best-practices.html#compact-files. If you have any further questions, please create a new issue.
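For readers who follow the link above, a minimal manual-compaction sketch along the lines of the Delta best-practices documentation might look like the following; the path, partition filter, and target file count are placeholder values:
// Rewrite one partition of a Delta table into fewer, larger files.
// dataChange = false marks the rewrite as not changing data, so downstream
// streaming readers can ignore it.
val path = "s3://bucket/delta-table"      // placeholder path
val partition = "date = '2019-01-01'"     // placeholder partition filter
val numFiles = 16                         // target number of files
spark.read
  .format("delta")
  .load(path)
  .where(partition)
  .repartition(numFiles)
  .write
  .option("dataChange", "false")
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", partition)
  .save(path)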
We just migrated to Databricks Delta from Parquet using the Hive metastore. So far everything seems to work.
All of the above is executed from a Databricks notebook. My question is: why am I getting two different locations even though the table name is the same? Where is the correct location for the Delta table stored, if not in the hiveMetastore db?
What are the two different locations? Can you show the output of resDF?
I can't show the actual values, but they are completely different S3 paths, @tdas; neither of them is related to the other in any way (e.g., one being the root of the other). Is there anything I should try? Where should the Delta table location be stored in this case?
Neither of them matches the actual one? Either way, this issue is not related to the root issue of metastore support, so maybe we should open a different issue for it. Though I doubt we can do anything without taking a deeper look. It might be better to contact Databricks support so that we can look at what is going on.
One path should look like
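For anyone debugging a similar mismatch, one way to see the location Delta itself reports is DESCRIBE DETAIL, which returns a location column for Delta tables; a small sketch using the database and table names that appear later in this thread:
// Compare the location Delta reports against the paths found in the metastore tables.
val detailDf = spark.sql("DESCRIBE DETAIL db_production.my_table")
detailDf.select("location").show(truncate = false)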
Just a status update on the support for defining Delta-format tables in the Hive Metastore. We are going to add support for defining tables and all the associated DDL commands (CREATE, ALTER, DROP, etc.) in Delta Lake 0.7.0, when we add support for Apache Spark 3.0. Delta Lake 0.7.0 is expected to come out roughly in June/July, whenever the Apache Spark community votes and decides to release 3.0.
My apologies for the late reply; I have some updates regarding this one. I tried to execute the following code:
// import needed for the $"..." and 'symbol column syntax used in the joins below
// (pre-imported automatically in notebooks and spark-shell)
import spark.implicits._
val sdsDF = spark.read
.format("jdbc")
.option("url", activeConnection.url)
.option("dbtable", "hiveMetastore.SDS")
.option("user", activeConnection.user)
.option("password", activeConnection.pwd)
.load()
val tblsDf = spark.read
.format("jdbc")
.option("url", activeConnection.url)
.option("dbtable", "hiveMetastore.TBLS")
.option("user", activeConnection.user)
.option("password", activeConnection.pwd)
.load()
val dbsDf = spark.read
.format("jdbc")
.option("url", activeConnection.url)
.option("dbtable", "hiveMetastore.DBS")
.option("user", activeConnection.user)
.option("password", activeConnection.pwd)
.load()
val paramsDf = spark.read
.format("jdbc")
.option("url", activeConnection.url)
.option("dbtable", "hiveMetastore.TABLE_PARAMS")
.option("user", activeConnection.user)
.option("password", activeConnection.pwd)
.load()
val resDf = sdsDF.join(tblsDf, "SD_ID")
.join(dbsDf, "DB_ID")
.join(paramsDf, "TBL_ID")
.where('TBL_NAME === "my_table" && 'NAME === "db_production")
.select($"TBL_NAME", $"TBL_TYPE", $"NAME".as("DB_NAME"), $"DB_LOCATION_URI", $"LOCATION".as("TABLE_LOCATION"), $"PARAM_KEY", $"PARAM_VALUE") Which is similar to the previous code with the difference that in addition I join with
As you can see, DB_LOCATION and TABLE_LOCATION have invalid values and do not correspond to the actual S3 path. @zsxwing as you can see, the path is still S3, not dbfs. @tdas when I do
Thank you both
Is there any tentative timeline for having the functionality in the DeltaTable API to create managed tables?
We are hoping to make a 0.7.0-preview release on Spark 3.0/RCx in the next couple of weeks. Most of the code changes for metastore support have been merged into master. Hence, I am closing this ticket.
Was this issue closed on the hope of some code getting integrated into a Spark 3.0 preview release, and not based on a verified, tested, publicly available general release? If so, that is really poor release management!!! Please specify exactly what release combinations have been proven to work.
This has been released in Delta 0.7.0 on Spark 3.0. Please see the attached milestone and the corresponding release notes at https://github.com/delta-io/delta/releases
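For reference, using these DDL commands from OSS Delta 0.7.0 on Spark 3.0 requires configuring the Delta SQL extension and catalog on the Spark session; a minimal sketch, with the application name as a placeholder:
import org.apache.spark.sql.SparkSession
// Spark 3.0 session configured so that CREATE/ALTER/DROP TABLE on Delta tables
// go through Delta's catalog implementation and are recorded in the Hive metastore.
val spark = SparkSession.builder()
  .appName("delta-hms-example")  // placeholder name
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .enableHiveSupport()
  .getOrCreate()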
Currently using spark-core_2.12:3.1.2, spark-hive_2.12:3.1.2 and delta-core_2.12:1.0.1. When deploying the table definitions like
I still get the message: WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table <table_name> into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. Is this the correct behavior? When querying the table like below using HiveContext I get a result, so it looks like the table is at least registered in Hive...
Please ignore this error. This is just saying you cannot use Hive to query this table. As long as you are using Spark, it should be fine.
Instead of executing the command spark.sql("""CREATE TABLE XY USING DELTA LOCATION 'XY'""") to create a table in the Hive metastore, is there another command supported when creating a table using spark.sql(...) so that I can query the table with Hive?
Instead of executing the command spark.sql("""CREATE TABLE XY USING DELTA LOCATION 'XY'""") to create a table in the Hive metastore, is there another command supported when creating a table using spark.sql(...) so that I can query the table with Hive?
No. This is not supported today. We have an open issue for this in #1045. Please add new comments in that open ticket instead. Thanks!
To follow up on the side issue here for future readers: you can get it from the SERDE_PARAMS table in the Hive Metastore; there is an entry
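Building on the JDBC snippets earlier in the thread, a sketch of reading that table (activeConnection is the same connection object used above; the exact parameter key is the part elided in the comment, so the "path" filter below is an assumption based on how Spark data source tables are commonly stored):
import org.apache.spark.sql.functions.col
// Read SERDE_PARAMS the same way the earlier snippets read SDS, TBLS and DBS.
val serdeParamsDf = spark.read
  .format("jdbc")
  .option("url", activeConnection.url)
  .option("dbtable", "hiveMetastore.SERDE_PARAMS")
  .option("user", activeConnection.user)
  .option("password", activeConnection.pwd)
  .load()
// Assumed key: Spark data source tables usually keep their location under a "path" entry.
serdeParamsDf.where(col("PARAM_KEY") === "path").show(truncate = false)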
It appears that this error should not be disregarded.
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
val innerStruct = StructType(Array(
StructField("image", StringType),
StructField("content", StringType)
))
val outerStruct = StructType(Array(
StructField("some_detail", innerStruct)
))
val schema = StructType(Array(
StructField("detail", outerStruct)
))
val data = Seq(
Row(Row(Row("image_example", "content_text_example")))
)
val df = spark.createDataFrame(
spark.sparkContext.parallelize(data),
schema
)
df.write.format("delta").save("s3://path") We then created a delta table: CREATE TABLE test.test_table_name USING delta LOCATION 's3://path' However, we ended up with a record in the Hive metastore database hdfs://[our-emr-cluster-id]/xxxxx/yyy instead of |
Here's an example of getting it to work with Spark SQL, HMS, MinIO S3, and StarRocks: https://github.com/StarRocks/demo/tree/master/documentation-samples/deltalake
While Delta tracks its own metadata in the transaction log, the Hive Metastore is still important, as it enables users to find tables without knowing the path to the data.
This ticket tracks adding the ability to run
CREATE TABLE
to create a new metastore table, or to register an existing table in the metastore.
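A brief sketch of the two forms described here, as they became possible once Spark 3.0 support landed; the table names and the location path are placeholders:
// Create a brand-new managed Delta table in the metastore.
spark.sql("CREATE TABLE events (id BIGINT, ts TIMESTAMP) USING DELTA")
// Register an existing Delta table that already lives at a known path.
spark.sql("CREATE TABLE events_external USING DELTA LOCATION 's3://bucket/events'")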