[Design Doc] Catalog implementation for AWS Glue Data Catalog #1679
Comments
Thanks @moomindani - this is super interesting. Will review this as soon as possible!
Is there any workaround to make Glue tables behave normally until this is implemented?
Here's a workaround.
Hi guys, at the moment at Decathlon we use EMR to create a lot of Delta tables with
Just so you know, our current workaround is to call
Any news on this subject will be appreciated.
@dennyglee I have folks in the Federal government who can really benefit from this type of functionality being implemented and are interested in seeing this be completed. Anything I can do to upvote this?
Iceberg catalogs have a similar approach: org.apache.iceberg.spark.SparkCatalog which is meant to complement the spark_catalog extending org.apache.iceberg.spark.SparkSessionCatalog... |
Any updates on this? I really need this feature.
I hope this helps others, as it helped me. I think it can be a good starting solution because we only need to make two minor changes; all details are in the PR above. I created the PR against branch 2.3 because it is the version compatible with AWS Glue, but if needed it can be done on the master branch too. **Attention**
@calixtofelipe Thanks! It seems that your PR is very similar to my original PR (#1579), but at that time it introduced another issue. Let me share what we observed back then soon.
Hey @moomindani, in the original PR you changed provider='delta' to 'parquet', and that generated the other issue because many other places check the provider (e.g., it impacts the time travel capability).
@calixtofelipe Thanks for clarifying it. I confirmed that your PR won't cause the same issue that I experienced.
Apologies for posting the same thing in two places. Let's keep the discussion on your PR side.
FYI
I didn't find this config (
You're right, it's only in Databricks.
Authors: Noritaka Sekiyama, Bhavay Pahuja
Motivation
As this PR describes, there are two issues when creating a Delta table definition in an external catalog:
- Issue 1: The table schema is not registered correctly; it appears as a single col (array) column instead of the actual schema.
- Issue 2: IllegalArgumentException: Can not create a Path from an empty string occurs when the database does not have a location.
These issues are causing bad customer experiences: the schema is not registered in the catalog when creating a table via saveAsTable; instead, the schema needs to be maintained by the user manually.
Initially, in the PR, we tried to solve this by introducing two extra parameters to change the behavior when creating Delta tables on the metastore. However, it caused an extra issue: for example, schema evolution did not work, because additional functionalities are built on DeltaParquetFileFormat. Since Delta tables do not provide a native Hive SerDe, it is not straightforward to solve all of these issues with the current implementation.
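For context, the issues above are typically hit when a Delta table is registered in Glue through the Hive-compatible client path. The invocation below is a sketch, not taken from this doc: the database, table, and bucket names are placeholders, and the Glue Hive client factory class is the one commonly used on Amazon EMR.

```shell
# Hypothetical repro of the Motivation's issues: creating a Delta table while
# Spark talks to AWS Glue Data Catalog via the Hive metastore client.
spark-sql \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
  --conf "spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" \
  -e "CREATE TABLE glue_db.events (id BIGINT, ts TIMESTAMP) USING delta LOCATION 's3://my-bucket/events/'"
# Symptom of Issue 1: the Glue console shows a single `col array<string>`
# column instead of (id, ts).
```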
Proposal
This proposal is to add a Spark-native catalog implementation for AWS Glue Data Catalog to the Delta package. The new catalog will be an extension of DeltaCatalog.
AWS Glue Data Catalog is a serverless, Hive-metastore-compatible service in AWS. It is widely used across engines such as Amazon EMR, Amazon Athena, AWS Glue, and Amazon Redshift Spectrum, as well as in OSS Spark, Hive, and so on.
The new catalog class (let's call it GlueDeltaCatalog) implements createTable, alterTable, dropTable, etc. In this approach, GlueDeltaCatalog will be specified by the user as the Spark catalog to be used for the Spark session. This catalog will interact directly with AWS Glue Data Catalog, bypassing the Hive client and thus its limitations. The AWS Glue SDK client will be used to connect to AWS Glue Data Catalog and store the table details with correct schema information in the Glue database. The common methods for creating a Delta Lake table will be abstracted out into a new BaseDeltaCatalog class, which both DeltaCatalog and GlueDeltaCatalog will extend. If users want to, they can still rely on the old catalog and use the Hive client to contact AWS Glue Data Catalog. Extra translation utility methods will be required to convert a Delta Lake table to a Glue Data Catalog table.
In this approach, we will be able to call Delta Lake's CreateDeltaTableCommand directly instead of using a Hive-metastore-compatible class. This approach solves both issue 1 and issue 2, and also does not cause issue 2'. Moreover, it allows Delta Lake to manage any other enhancement without being impacted by Hive metastore client-side limitations.
Requirements
MUST:
- Support saveAsTable.
- Support createTable, updateTable, deleteTable, and so on.
PoC
We have verified that a simple PoC implementation of GlueDeltaCatalog was able to solve Issue 2.
Note: Issue 1 requires extra implementation to automatically convert the schema information.
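The schema conversion that Issue 1 calls for is essentially the "translation utility" mentioned in the Proposal. A minimal, hedged sketch in Python (the function name and type map are hypothetical, not from the PoC; the output dicts match the `Column` shape that Glue's CreateTable API expects in `StorageDescriptor.Columns`):

```python
# Hypothetical sketch of translating a Delta/Spark schema into AWS Glue
# Data Catalog column definitions. Only a few simple types are mapped here.
_SPARK_TO_GLUE_TYPES = {
    "long": "bigint",
    "integer": "int",
    "short": "smallint",
    "byte": "tinyint",
    "string": "string",
    "double": "double",
    "float": "float",
    "boolean": "boolean",
    "timestamp": "timestamp",
    "date": "date",
}

def delta_schema_to_glue_columns(fields):
    """Convert [(name, spark_type)] pairs into Glue Column dicts.

    Unknown types are passed through unchanged; a full implementation
    would also handle nested struct/array/map types recursively.
    """
    columns = []
    for name, spark_type in fields:
        glue_type = _SPARK_TO_GLUE_TYPES.get(spark_type, spark_type)
        columns.append({"Name": name, "Type": glue_type})
    return columns
```

With the correct columns registered this way, the catalog entry would no longer collapse to the single `col (array)` column described in Issue 1.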
Below is a simple implementation of GlueDeltaCatalog.createTable. requireDb and tableExists are commented out, as they were not implemented for this PoC. I hard-coded the call to createTable for PoC purposes in CreateDeltaTableCommand.updateCatalog.
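The PoC itself is Scala inside Delta's catalog classes; as an illustration of the Glue API call it would make, here is a hedged Python sketch. All names are assumptions; `glue_client` is anything exposing a Glue-style `create_table(**kwargs)`, e.g. `boto3.client("glue")`.

```python
# Hypothetical sketch (not the original PoC code): what a Glue-backed
# createTable boils down to - one glue:CreateTable call with the table's
# real schema and location.
def create_delta_table(glue_client, database, table, location, columns):
    """Register a Delta table in AWS Glue Data Catalog.

    `columns` is a list of {"Name": ..., "Type": ...} dicts.
    """
    glue_client.create_table(
        DatabaseName=database,
        TableInput={
            "Name": table,
            "TableType": "EXTERNAL_TABLE",
            # Mark the provider so engines recognize this as a Delta table.
            "Parameters": {"spark.sql.sources.provider": "delta"},
            "StorageDescriptor": {
                "Columns": columns,
                "Location": location,
            },
        },
    )
```

Because the location is passed explicitly from the Delta command, the empty-path failure of Issue 2 does not arise even when the Glue database itself has no location.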
As a result of this PoC, in a Glue database without an explicit location, I was able to create a table using the following command:
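A command along these lines would exercise that path (the database, table, and bucket names here are placeholders, not the original PoC values):

```shell
# Hypothetical stand-in for the PoC command. glue_db_no_location is a Glue
# database created without a Location property, which previously triggered
# the "Can not create a Path from an empty string" failure.
spark-sql -e "
  CREATE TABLE glue_db_no_location.poc_table (id BIGINT, name STRING)
  USING delta
  LOCATION 's3://my-bucket/poc_table/'
"
```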
As shown above, the created table was registered with the correct location in the Glue database.
cc: @dennyglee