
[Design Doc] Catalog implementation for AWS Glue Data Catalog #1679

Open
moomindani opened this issue Apr 4, 2023 · 14 comments
Labels: enhancement (New feature or request)
@moomindani

Authors: Noritaka Sekiyama, Bhavay Pahuja

Motivation

As this PR describes, there are two issues when creating Delta table definitions in an external catalog.

  • Issue 1: The schema cannot be recognized and automatically falls back to a single column col (array)
  • Issue 2: The error IllegalArgumentException: Can not create a Path from an empty string occurs when the database does not have a location

These issues cause a bad customer experience:

  • A location must be configured for the database even when it is not needed.
  • The schema cannot be saved through saveAsTable; instead, it must be maintained by the user manually.

Initially, in the PR, we tried to solve this by introducing two extra parameters that change the behavior when creating Delta tables on the metastore. However, that caused an extra issue.

  • Issue 2’: When parquet is used as the tableProvider instead of delta, several capabilities (e.g. schema evolution) do not work, because the additional functionality is built on DeltaParquetFileFormat.

Since Delta tables do not provide a native Hive SerDe, it is not straightforward to solve all of these issues with the current implementation.

Proposal

This proposal is to add a Spark-native catalog implementation for the AWS Glue Data Catalog to the Delta package. The new catalog will be an extension of DeltaCatalog.

AWS Glue Data Catalog is a serverless, Hive metastore-compatible service in AWS. It is widely used across engines such as Amazon EMR, Amazon Athena, AWS Glue, and Amazon Redshift Spectrum, and also in OSS Spark, Hive, and so on.

The new catalog class (let’s call it GlueDeltaCatalog) implements createTable, alterTable, dropTable, etc. In this approach, the user specifies GlueDeltaCatalog as the Spark catalog to be used for the Spark session. This catalog will interact directly with AWS Glue Data Catalog, bypassing the Hive client and thus its limitations. The Glue AWS SDK client will be used to connect to AWS Glue Data Catalog and store the table details, with correct schema information, in the Glue database. The common methods for creating a Delta Lake table will be abstracted out into a new BaseDeltaCatalog class, which both DeltaCatalog and GlueDeltaCatalog will extend. Users who want to can still rely on the old catalog and use the Hive client to contact AWS Glue Data Catalog. Extra translation utility methods will be required to convert a Delta Lake table into a Glue Data Catalog table.
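For illustration, a user would presumably select the new catalog via Spark configuration, along the lines of the following sketch. The class name and package (io.delta.sql.GlueDeltaCatalog) are assumptions based on this proposal, not a released artifact:

```shell
# Hypothetical: point spark_catalog at the proposed Glue-backed Delta catalog.
spark-shell \
  --conf "spark.sql.catalog.spark_catalog=io.delta.sql.GlueDeltaCatalog" \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension"
```

This mirrors how DeltaCatalog itself is configured today via spark.sql.catalog.spark_catalog.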

GlueDeltaCatalog

In this approach, we can call Delta Lake’s CreateDeltaTableCommand directly instead of going through a Hive metastore-compatible class. This solves both Issue 1 and Issue 2, and it does not cause Issue 2’. Moreover, it allows Delta Lake to ship further enhancements without being constrained by Hive metastore client-side limitations.

Requirements

MUST:

  • Users can save correct table schema information in an external catalog using saveAsTable.
  • Users can create Delta table definitions under a database without database location.
  • Not break compatibility: users can still perform standard catalog operations like createTable, alterTable, dropTable, and so on.
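To illustrate the first requirement, here is a minimal sketch of the user-facing flow that should round-trip the schema through the external catalog (database and table names are made up; this depends on a live Spark session with Delta and Glue configured, so it is a sketch rather than a standalone program):

```scala
// Sketch: write a Delta table and register it via saveAsTable.
// With the proposed GlueDeltaCatalog, the full schema should land in Glue
// instead of the single col (array) fallback.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val df = spark.range(10).withColumnRenamed("id", "id1")

df.write.format("delta").saveAsTable("mydb.events")  // schema saved in Glue
```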

PoC

We have verified that a simple PoC implementation of GlueDeltaCatalog was able to solve Issue 2.
Note: Issue 1 requires extra implementation to automatically convert the schema information.

Below is a simple implementation of GlueDeltaCatalog.createTable:

  override def createTable(tableDefinition: CatalogTable, ignoreIfExists: Boolean): Unit = {
    val glue = GlueClient.builder().build()
    assert(tableDefinition.identifier.database.isDefined)
    val db = tableDefinition.identifier.database.get
    val table = tableDefinition.identifier.table
//    requireDbExists(db)

//    if (tableExists(db, table) && !ignoreIfExists) {
//      throw new TableAlreadyExistsException(db = db, table = table)
//    }

    // Derive a default location only for managed tables without an explicit one,
    // so no database-level location is required (Issue 2).
    val needDefaultTableLocation = tableDefinition.tableType == MANAGED &&
      tableDefinition.storage.locationUri.isEmpty

    val tableLocation = if (needDefaultTableLocation) {
      Some(CatalogUtils.stringToURI(defaultTablePath(tableDefinition.identifier)))
    } else {
      tableDefinition.storage.locationUri
    }

    val tableWithDataSourceProps = tableDefinition.copy(
      storage = tableDefinition.storage.copy(locationUri = tableLocation))

    // Translate the Spark CatalogTable into a Glue TableInput.
    val tableInputBuilder = TableInput.builder
      .tableType(tableWithDataSourceProps.tableType.name)
      .parameters(mapAsJavaMap(tableWithDataSourceProps.properties))
      .name(tableWithDataSourceProps.identifier.table)
      .storageDescriptor(StorageDescriptor.builder
        .parameters(mapAsJavaMap(tableWithDataSourceProps.storage.properties))
        .location(tableWithDataSourceProps.location.toString)
        .build())

    // Register the table directly in AWS Glue Data Catalog via the AWS SDK,
    // bypassing the Hive client entirely.
    glue.createTable(CreateTableRequest.builder
      .databaseName(tableWithDataSourceProps.database)
      .tableInput(tableInputBuilder.build())
      .build)
  }

requireDbExists and tableExists are commented out because they were not implemented for this PoC.
For PoC purposes, I hard-coded the call to createTable in CreateDeltaTableCommand.updateCatalog:

case TableCreationModes.Create =>
  (new GlueCatalog).createTable(
    cleaned,
    ignoreIfExists = existingTableOpt.isDefined)

As a result of this PoC, in a Glue database without an explicit location, I was able to create a table using the following command:

use tvkalyan_deltadb_withoutlocation;
create table e1 (id1 int, id2 int) using delta location "s3://bhavayp-emr-dev/delta-table-e1";

(screenshot: the created table with its location in the Glue console)

As shown above, the created table was registered with the correct location in the Glue database.

cc: @dennyglee

@dennyglee
Contributor

Thanks @moomindani - this is super interesting. Will review this as soon as possible!

@mo2menelzeiny

Is there any workaround to make Glue tables behave normally until this is implemented?

@moomindani
Author

Is there any workaround to make Glue tables behave normally until this is implemented?

Here's a workaround.
For Issue 1, you can use a Glue crawler to create the table definition instead of calling saveAsTable.
For Issue 2, you can simply set a location on your database.
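For the Issue 2 workaround, a database location can be set from the AWS CLI, for example as follows (database and bucket names are placeholders):

```shell
# Attach a LocationUri to an existing Glue database so Spark can derive
# default table paths under it instead of failing on an empty path.
aws glue update-database \
  --name my_delta_db \
  --database-input '{"Name": "my_delta_db", "LocationUri": "s3://my-bucket/my_delta_db/"}'
```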

@BenLaot

BenLaot commented Jul 26, 2023

Hi guys,
Regarding Issue 1: is there any chance of having this feature in the near future?

At the moment, at Decathlon we use EMR to create a lot of Delta tables with saveAsTable while having AWS Glue as the metastore.
We do not want to create crawlers because:

  1. We have one Glue database per data product => a lot of crawlers to create and maintain
  2. The data platform team would have to maintain the Glue crawlers, while we want each data product team to manage its own data and metadata

Just so you know, our current workaround is to call describe detail OUR_DELTA_TABLE, which somehow updates the table's schema with the correct one and generates a new Glue table version.

Any news on this subject would be appreciated.

@samihoda

@dennyglee I have folks in the Federal government who can really benefit from this type of functionality being implemented and are interested in seeing this be completed. Anything I can do to upvote this?

@timvw

timvw commented Aug 28, 2023

Iceberg catalogs have a similar approach: org.apache.iceberg.spark.SparkCatalog which is meant to complement the spark_catalog extending org.apache.iceberg.spark.SparkSessionCatalog...

@keenborder786

Any updates on this? I really need this feature.

@calixtofelipe

calixtofelipe commented Nov 18, 2023

I hope this can help others, as it helped me. I think it can be a good starting solution because we only need two minor changes; all details are in the PR above. I created the PR against the 2.3 branch because it is the version compatible with AWS Glue, but if needed it can be done on the master branch too.
I'm glad to have found this way to solve the issue: it helped me simplify my table-creation process and continue using the AWS Glue catalog, and I hope it can help others too.
In PR #2310 I added samples of how I tested table creation (MANAGED/EXTERNAL).


@moomindani
Author

I hope I can help others, as it helped me. I think it can be a good start solution because we just need to do 2 minor changes. All details in the PR above. I created the PR in the branch 2.3 because it is the compatible version with AWS Glue, but if needed it can be done in the master branch too. I'm glad to find this way to solve the issue, it helped me to simplify my table creation process and continue to use AWS glue catalog and I hope it can help other too. In the PR #2310 I added samples how I tested the table creation (MANAGED/EXTERNAL)

@calixtofelipe Thanks! It seems that your PR looks very similar to my original PR (#1579) but at that time it introduced another issue. Let me share something we observed in the past soon.

@calixtofelipe

calixtofelipe commented Nov 20, 2023

@calixtofelipe Thanks! It seems that your PR looks very similar to my original PR (#1579) but at that time it introduced another issue. Let me share something we observed in the past soon.

Hey @moomindani, in the original PR you changed the provider from 'delta' to 'parquet', and that generated the other issue, because many other places check the provider (e.g. it impacts the time travel capability).
This new PR still changes the schema, but I didn't change the provider as you did in the original PR. I added a command to alter the metadata, and this command updates the Hive metastore successfully without overwriting the schema to empty, as the apache/spark project does when the createTable function is executed. I provided more context in the PR. So, as we are keeping provider='delta', it should be fine; at least in my tests it seems to work. Thanks for the comment.

@moomindani
Author

@calixtofelipe Thanks for clarifying. I confirmed that your PR won't cause the same issue that I experienced.
I added one more comment to your PR.

BTW, if I understand correctly, your PR solves only Issue 1 described in #1679, not Issue 2. Is that correct?

Apologies for posting the same thing in two places. Let's keep the discussion on your PR.

@florent-brosse

florent-brosse commented Dec 21, 2023

FYI: I have a customer who needs to set spark.databricks.delta.catalog.update.hiveSchema.enabled = true because the schema isn't written correctly into Glue unless this option is set.

@calixtofelipe

FYI I have a customer who needs to set spark.databricks.delta.catalog.update.hiveSchema.enabled = true because the schema isn't being correctly written into Glue unless this option is utilized.

I didn't find this config (spark.databricks.delta.catalog.update.hiveSchema.enabled) in Delta 2.3 or Spark 3.3. Maybe it is something created in a specific version.

@florent-brosse

You're right, it's only in Databricks.
