
[Design Doc] Catalog implementation for AWS Glue Data Catalog #1679

Open
moomindani opened this issue Apr 4, 2023 · 14 comments
Labels: enhancement (New feature or request)
@moomindani

Authors: Noritaka Sekiyama, Bhavay Pahuja

Motivation

As this PR describes, there are two issues when creating Delta table definitions in an external catalog.

  • Issue 1: The schema cannot be recognized and automatically falls back to a single column col (array)
  • Issue 2: The error IllegalArgumentException: Can not create a Path from an empty string occurs when the database does not have a location

These issues cause a bad customer experience:

  • A location must be configured for the database even when it is not needed.
  • The schema cannot be saved through saveAsTable; instead, it must be maintained by the user manually.

Initially, in the PR, we tried to solve this by introducing two extra parameters that change the behavior when creating Delta tables on the metastore. However, that caused an extra issue.

  • Issue 2’: When parquet is used as the tableProvider instead of delta, several capabilities (e.g. schema evolution) do not work, because the additional functionality is built on DeltaParquetFileFormat.

Since Delta tables do not provide a native Hive SerDe, it is not straightforward to solve all of these issues with the current implementation.

Proposal

This proposal is to add a Spark-native catalog implementation for the AWS Glue Data Catalog to the Delta package. The new catalog will be an extension of DeltaCatalog.

AWS Glue Data Catalog is a serverless, Hive metastore-compatible service in AWS. It is widely used across engines such as Amazon EMR, Amazon Athena, AWS Glue, and Amazon Redshift Spectrum, and also in OSS Spark, Hive, and so on.

The new catalog class (let’s call it GlueDeltaCatalog) implements createTable, alterTable, dropTable, etc. In this approach, the user specifies GlueDeltaCatalog as the Spark catalog to be used for the Spark session. This catalog will interact directly with AWS Glue Data Catalog, bypassing the Hive client and thus its limitations. The Glue AWS SDK client will be used to connect to AWS Glue Data Catalog and store the table details, with correct schema information, in the Glue database. The common methods for creating a Delta Lake table will be abstracted out into a new BaseDeltaCatalog class, which both DeltaCatalog and GlueDeltaCatalog will extend. Users who want to can still rely on the old catalog and use the Hive client to contact AWS Glue Data Catalog. Extra translation utility methods will be required to convert a Delta Lake table into a Glue Data Catalog table.
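For illustration, a user would presumably select the new catalog via Spark configuration, along the lines of the following sketch. The class name and package (io.delta.sql.GlueDeltaCatalog) are assumptions based on this proposal, not a released artifact:

```shell
# Hypothetical: point spark_catalog at the proposed Glue-backed Delta catalog.
spark-shell \
  --conf "spark.sql.catalog.spark_catalog=io.delta.sql.GlueDeltaCatalog" \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension"
```

This mirrors how DeltaCatalog itself is configured today via spark.sql.catalog.spark_catalog.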

GlueDeltaCatalog

In this approach, we can call Delta Lake’s CreateDeltaTableCommand directly instead of going through a Hive metastore-compatible class. This solves both Issue 1 and Issue 2, and it does not cause Issue 2’. Moreover, it allows Delta Lake to ship further enhancements without being constrained by Hive metastore client-side limitations.

Requirements

MUST:

  • Users can save correct table schema information in an external catalog using saveAsTable.
  • Users can create Delta table definitions under a database without database location.
  • Not break compatibility: users can still perform standard catalog operations like createTable, alterTable, dropTable, and so on.
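To illustrate the first requirement, here is a minimal sketch of the user-facing flow that should round-trip the schema through the external catalog (database and table names are made up; this depends on a live Spark session with Delta and Glue configured, so it is a sketch rather than a standalone program):

```scala
// Sketch: write a Delta table and register it via saveAsTable.
// With the proposed GlueDeltaCatalog, the full schema should land in Glue
// instead of the single col (array) fallback.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val df = spark.range(10).withColumnRenamed("id", "id1")

df.write.format("delta").saveAsTable("mydb.events")  // schema saved in Glue
```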

PoC

We have verified that a simple PoC implementation of GlueDeltaCatalog was able to solve Issue 2.
Note: Issue 1 requires extra implementation to automatically convert the schema information.

Below is a simple implementation of GlueDeltaCatalog.createTable:

  override def createTable(tableDefinition: CatalogTable, ignoreIfExists: Boolean): Unit = {
    val glue = GlueClient.builder().build()
    assert(tableDefinition.identifier.database.isDefined)
    val db = tableDefinition.identifier.database.get
    val table = tableDefinition.identifier.table
//    requireDbExists(db)

//    if (tableExists(db, table) && !ignoreIfExists) {
//      throw new TableAlreadyExistsException(db = db, table = table)
//    }

    // Derive a default location only for managed tables without an explicit one,
    // so no database-level location is required (Issue 2).
    val needDefaultTableLocation = tableDefinition.tableType == MANAGED &&
      tableDefinition.storage.locationUri.isEmpty

    val tableLocation = if (needDefaultTableLocation) {
      Some(CatalogUtils.stringToURI(defaultTablePath(tableDefinition.identifier)))
    } else {
      tableDefinition.storage.locationUri
    }

    val tableWithDataSourceProps = tableDefinition.copy(
      storage = tableDefinition.storage.copy(locationUri = tableLocation))

    // Translate the Spark CatalogTable into a Glue TableInput.
    val tableInputBuilder = TableInput.builder
      .tableType(tableWithDataSourceProps.tableType.name)
      .parameters(mapAsJavaMap(tableWithDataSourceProps.properties))
      .name(tableWithDataSourceProps.identifier.table)
      .storageDescriptor(StorageDescriptor.builder
        .parameters(mapAsJavaMap(tableWithDataSourceProps.storage.properties))
        .location(tableWithDataSourceProps.location.toString)
        .build())

    // Register the table directly in AWS Glue Data Catalog via the AWS SDK,
    // bypassing the Hive client entirely.
    glue.createTable(CreateTableRequest.builder
      .databaseName(tableWithDataSourceProps.database)
      .tableInput(tableInputBuilder.build())
      .build)
  }

requireDbExists and tableExists are commented out because they were not implemented for this PoC.
For PoC purposes, I hard-coded the call to createTable in CreateDeltaTableCommand.updateCatalog:

case TableCreationModes.Create =>
  (new GlueCatalog).createTable(
    cleaned,
    ignoreIfExists = existingTableOpt.isDefined)

As a result of this PoC, in a Glue database without an explicit location, I was able to create a table using the following command:

use tvkalyan_deltadb_withoutlocation;
create table e1 (id1 int, id2 int) using delta location "s3://bhavayp-emr-dev/delta-table-e1";

(screenshot: the created table with its location in the Glue console)

As shown above, the created table was registered with the correct location in the Glue database.

cc: @dennyglee

@dennyglee
Contributor

Thanks @moomindani - this is super interesting. Will review this as soon as possible!

@mo2menelzeiny

Is there any workaround to make Glue tables behave normally until this is implemented?

@moomindani
Author

Is there any workaround to make Glue tables behave normally until this is implemented?

Here's a workaround.
For Issue 1, you can use a Glue crawler to create the table definition instead of calling saveAsTable.
For Issue 2, you can simply set a location on your database.
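For the Issue 2 workaround, a database location can be set from the AWS CLI, for example as follows (database and bucket names are placeholders):

```shell
# Attach a LocationUri to an existing Glue database so Spark can derive
# default table paths under it instead of failing on an empty path.
aws glue update-database \
  --name my_delta_db \
  --database-input '{"Name": "my_delta_db", "LocationUri": "s3://my-bucket/my_delta_db/"}'
```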

@BenLaot

BenLaot commented Jul 26, 2023

Hi guys,
Regarding Issue 1: is there any chance of having this feature in the near future?

At the moment, at Decathlon we use EMR to create a lot of Delta tables with saveAsTable while having AWS Glue as the metastore.
We do not want to create crawlers because:

  1. We have one Glue database per data product => a lot of crawlers to create and maintain
  2. The data platform team would have to maintain the Glue crawlers, while we want each data product team to manage its own data and metadata

Just so you know, our current workaround is to call describe detail OUR_DELTA_TABLE, which somehow updates the table's schema with the correct one and generates a new Glue table version.

Any news on this subject would be appreciated.

@samihoda

@dennyglee I have folks in the Federal government who can really benefit from this type of functionality being implemented and are interested in seeing this be completed. Anything I can do to upvote this?

@timvw

timvw commented Aug 28, 2023

Iceberg catalogs have a similar approach: org.apache.iceberg.spark.SparkCatalog which is meant to complement the spark_catalog extending org.apache.iceberg.spark.SparkSessionCatalog...

@keenborder786

Any updates on this? I really need this feature.

@calixtofelipe

calixtofelipe commented Nov 18, 2023

I hope this can help others, as it helped me. I think it can be a good starting solution because we only need two minor changes; all details are in the PR above. I created the PR against the 2.3 branch because it is the version compatible with AWS Glue, but if needed it can be done on the master branch too.
I'm glad to have found this way to solve the issue: it helped me simplify my table-creation process and continue using the AWS Glue catalog, and I hope it can help others too.
In PR #2310 I added samples of how I tested table creation (MANAGED/EXTERNAL).


@moomindani
Author

I hope I can help others, as it helped me. I think it can be a good start solution because we just need to do 2 minor changes. All details in the PR above. I created the PR in the branch 2.3 because it is the compatible version with AWS Glue, but if needed it can be done in the master branch too. I'm glad to find this way to solve the issue, it helped me to simplify my table creation process and continue to use AWS glue catalog and I hope it can help other too. In the PR #2310 I added samples how I tested the table creation (MANAGED/EXTERNAL)

@calixtofelipe Thanks! It seems that your PR looks very similar to my original PR (#1579) but at that time it introduced another issue. Let me share something we observed in the past soon.

@calixtofelipe

calixtofelipe commented Nov 20, 2023

@calixtofelipe Thanks! It seems that your PR looks very similar to my original PR (#1579) but at that time it introduced another issue. Let me share something we observed in the past soon.

Hey @moomindani, in the original PR you changed the provider from 'delta' to 'parquet', and that generated the other issue, because many other places check the provider (e.g. it impacts the time travel capability).
This new PR still changes the schema, but I didn't change the provider as you did in the original PR. I added a command to alter the metadata, and this command updates the Hive metastore successfully without overwriting the schema to empty, as the apache/spark project does when the createTable function is executed. I provided more context in the PR. So, as we are keeping provider='delta', it should be fine; at least in my tests it seems to work. Thanks for the comment.

@moomindani
Author

@calixtofelipe Thanks for clarifying. I confirmed that your PR won't cause the same issue that I experienced.
I added one more comment to your PR.

BTW, if I understand correctly, your PR solves only Issue 1 described in #1679, not Issue 2. Is that correct?

Apologies for posting the same thing in two places. Let's keep the discussion on your PR.

@florent-brosse

florent-brosse commented Dec 21, 2023

FYI: I have a customer who needs to set spark.databricks.delta.catalog.update.hiveSchema.enabled = true because the schema isn't written correctly into Glue unless this option is set.

@calixtofelipe

FYI I have a customer who needs to set spark.databricks.delta.catalog.update.hiveSchema.enabled = true because the schema isn't being correctly written into Glue unless this option is utilized.

I didn't find this config (spark.databricks.delta.catalog.update.hiveSchema.enabled) in Delta 2.3 or Spark 3.3. Maybe it is something created in a specific version.

@florent-brosse

You're right, it's only in Databricks.
