Adjust the S3DynamoDBLogStore to be compatible with ScyllaDB's Alternator. #2410

Open
wants to merge 13 commits into
base: master

rbushri
Contributor

@rbushri rbushri commented Dec 28, 2023

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • S3DynamoDBLogStore
  • Other (fill in here)

Description

This PR adds a cloud-agnostic solution to the Delta Lake multiple-writers problem on S3, using ScyllaDB's Alternator. It offers an open-source option for S3 and S3-compatible storage that lacks putIfAbsent functionality.
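
To make the missing primitive concrete, here is a minimal sketch of put-if-absent semantics. It assumes an in-memory stand-in for the external table; the names `InMemoryCommitTable` and `put_if_absent` and the bucket path are hypothetical and are not taken from this PR. A real log store would issue a conditional write to DynamoDB or Alternator instead of taking a local lock.

```python
import threading


class InMemoryCommitTable:
    """Hypothetical stand-in for an external table that supports conditional writes."""

    def __init__(self):
        self._entries = {}
        self._lock = threading.Lock()

    def put_if_absent(self, table_path, file_name, payload):
        """Store payload under (table_path, file_name) only if the key is free.

        Returns True if this caller created the entry, False if it already existed.
        """
        key = (table_path, file_name)
        with self._lock:  # the existence check and the write happen as one atomic step
            if key in self._entries:
                return False
            self._entries[key] = payload
            return True


def demo():
    """Two writers race to commit the same log version; exactly one should win."""
    table = InMemoryCommitTable()
    results = []
    file_name = f"{1:020d}.json"  # zero-padded version number, as in a Delta log

    def writer(name):
        won = table.put_if_absent("s3a://example-bucket/tbl", file_name, name)
        results.append((name, won))

    threads = [threading.Thread(target=writer, args=(n,)) for n in ("writer-a", "writer-b")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `demo()` returns one `(name, True)` and one `(name, False)` pair: the losing writer learns that the version is taken and has to retry against a newer version, which is the guarantee plain S3 writes do not provide.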
The implementation adds an abstraction layer for the DynamoDB LogStore (io.delta.storage.BaseDynamoDBLogStore) and introduces two implementations:

  1. io.delta.storage.S3DynamoDBLogStore - for DynamoDB (no configuration changes for the DynamoDB implementation).
  2. io.delta.storage.S3ScyllaDBLogStore - for ScyllaDB.

The configuration details for ScyllaDB are as follows:

    spark.delta.logStore.s3a.impl=io.delta.storage.S3ScyllaDBLogStore
    spark.io.delta.storage.S3ScyllaDBLogStore.ddb.endpoint=<ScyllaDB's Alternator cluster endpoint>
    spark.io.delta.storage.S3ScyllaDBLogStore.credentials.provider=<The AWSCredentialsProvider used by the client, default DefaultAWSCredentialsProviderChain>
    spark.io.delta.storage.S3ScyllaDBLogStore.ddb.tableName=<The name of the Scylla table to use, default delta_log>
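
As a rough illustration of how these properties might be assembled, the sketch below builds them as a plain dict and flattens them into `--conf` arguments for spark-submit. The helper names `scylladb_logstore_conf` and `to_spark_submit_args` are hypothetical, and treating the endpoint as a required argument is an assumption of this sketch rather than something the PR states. Only the property keys come from the description above.

```python
def scylladb_logstore_conf(endpoint, table_name=None, credentials_provider=None):
    """Return the Spark properties from the PR description as a dict.

    Optional settings are left out when not given, so the documented
    defaults (delta_log, DefaultAWSCredentialsProviderChain) apply.
    """
    prefix = "spark.io.delta.storage.S3ScyllaDBLogStore"
    conf = {
        "spark.delta.logStore.s3a.impl": "io.delta.storage.S3ScyllaDBLogStore",
        f"{prefix}.ddb.endpoint": endpoint,
    }
    if table_name is not None:
        conf[f"{prefix}.ddb.tableName"] = table_name
    if credentials_provider is not None:
        conf[f"{prefix}.credentials.provider"] = credentials_provider
    return conf


def to_spark_submit_args(conf):
    """Flatten a properties dict into --conf arguments for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # The endpoint below is a placeholder, not a real cluster address.
    conf = scylladb_logstore_conf("http://alternator.example.internal:8000")
    print(" ".join(to_spark_submit_args(conf)))
```

The same dict could also be passed key by key to a SparkSession builder; keeping it as data makes it easy to switch between the DynamoDB and ScyllaDB variants in tests.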

Resolves #2411, #1336, #1441

How was this patch tested?

Unit test - SUCCEEDED
Manual test:

  • Set up ScyllaDB's Alternator cluster on K8s.
  • Write to and read from a Delta table on S3 storage with the specified configuration:
    spark.delta.logStore.s3a.impl=io.delta.storage.S3ScyllaDBLogStore
    spark.io.delta.storage.S3ScyllaDBLogStore.ddb.endpoint=<ScyllaDB's Alternator cluster endpoint>
  • Verify that the delta_log table is created in ScyllaDB and that Delta uses this table for reading and writing the logs.
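
The verification step could be scripted along these lines. This is a sketch, not part of the PR: `table_exists` is a hypothetical helper that accepts any client exposing the DynamoDB `ListTables` call in the shape boto3 uses (`list_tables`, `TableNames`, `LastEvaluatedTableName`, `ExclusiveStartTableName`), and it assumes the Alternator endpoint answers that call the same way DynamoDB does.

```python
def table_exists(client, table_name):
    """Return True if table_name is listed by a DynamoDB-compatible client.

    `client` is assumed to behave like a boto3 DynamoDB client: list_tables()
    returns a dict with "TableNames" and, when more pages remain,
    "LastEvaluatedTableName".
    """
    kwargs = {}
    while True:
        page = client.list_tables(**kwargs)
        if table_name in page.get("TableNames", []):
            return True
        last = page.get("LastEvaluatedTableName")
        if not last:
            return False  # no more pages and the table was not found
        # Ask for the next page, starting after the last name we saw.
        kwargs = {"ExclusiveStartTableName": last}
```

With boto3 installed, a client pointed at the cluster could be created with `boto3.client("dynamodb", endpoint_url=<Alternator endpoint>, ...)` and passed in together with the table name configured through `ddb.tableName`.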

Does this PR introduce any user-facing changes?

No

…ator.

Adding `spark.io.delta.storage.S3DynamoDBLogStore.ddb.endpoint` configuration (not mandatory).
rbushrian added 2 commits January 2, 2024 16:09
… ScyllaDB for S3DynamoDBLogStore log store.
… ScyllaDB for S3DynamoDBLogStore log store.
@ItaiYaffeAkamai

@scottsand-db, @mrk-its - there have been numerous discussions on the Delta OSS Slack about adding S3 concurrent-write support that relies on an open-source database rather than AWS DynamoDB (including one I started some months ago: https://delta-users.slack.com/archives/CJ70UCSHM/p1689589392090319).
I think @rbushri has created a very simple and elegant solution that could help many Delta OSS users.

Who would be the right person to review and approve this PR?
Thanks in advance!

@scottsand-db
Collaborator

Hi @rbushri - this looks great! Seems like we should rename S3DynamoDBLogStore to something more top-level then, eh? Since ScyllaDB doesn't seem to be a child of DynamoDB, right?

Seems like we could have some generic abstract parent class, and keep the child class name S3DynamoDBLogStore for existing clients, and create a new S3ScyllaDBLogStore? What do you think?
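
A rough sketch of the shape being suggested here, written in Python for brevity even though the actual classes are Java. The class names mirror the ones discussed in this thread, but the bodies, the `conf_prefix` attribute, and the `setting` helper are hypothetical and only illustrate how shared logic could live in the parent while each child contributes its own configuration namespace.

```python
from abc import ABC, abstractmethod


class BaseDynamoDBLogStore(ABC):
    """Hypothetical parent holding the logic shared by both backends."""

    def __init__(self, conf):
        self.conf = conf  # plain dict of Spark-style properties

    @property
    @abstractmethod
    def conf_prefix(self):
        """Configuration namespace; each child supplies its own."""

    def setting(self, name, default=None):
        # Look up "<prefix>.<name>", falling back to the default.
        return self.conf.get(f"{self.conf_prefix}.{name}", default)

    def table_name(self):
        return self.setting("ddb.tableName", "delta_log")


class S3DynamoDBLogStore(BaseDynamoDBLogStore):
    """Keeps the existing name and prefix so current users change nothing."""

    conf_prefix = "spark.io.delta.storage.S3DynamoDBLogStore"


class S3ScyllaDBLogStore(BaseDynamoDBLogStore):
    """New child for ScyllaDB's Alternator with its own prefix."""

    conf_prefix = "spark.io.delta.storage.S3ScyllaDBLogStore"
```

For example, `S3ScyllaDBLogStore({}).table_name()` falls back to `delta_log`, while a dict containing the ScyllaDB-prefixed `ddb.tableName` key overrides it without affecting the DynamoDB child.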

@rbushri
Contributor Author

rbushri commented Jan 10, 2024

@scottsand-db, thank you for your review. I've implemented the changes you suggested and updated the PR description. If these changes are acceptable to you, I'll proceed to update the documentation.

@rbushri
Contributor Author

rbushri commented Jan 29, 2024

@scottsand-db, would you kindly consider reviewing the changes I made?

@scottsand-db
Collaborator

@rbushri - yes! Sorry, things have been very busy with the focus on the Delta 3.1 release! Will take a look.

@rbushri
Contributor Author

rbushri commented Feb 8, 2024

Thanks @scottsand-db! I truly appreciate your review.

@rbushri rbushri requested a review from scottsand-db February 29, 2024 21:58
@rbushri
Contributor Author

rbushri commented Mar 13, 2024

@scottsand-db, I've addressed your comments and updated the documentation. Could you review it, please?

@chris-aeviator

chris-aeviator commented Mar 28, 2024

@rbushri I'm impatiently awaiting this feature getting merged 🙏. Does this include using Delta with write locks via Scylla when interfacing with Python (via delta-rs)?

UPDATE: I managed to patch my custom endpoint into delta-rs and will contribute this to their repo once the code is cleaner.

@rbushri
Contributor Author

rbushri commented Apr 2, 2024

@scottsand-db, Would you kindly review the changes I've made?

@hattajr

hattajr commented Jul 11, 2024

It's been 4 months. Any updates on this PR?

@scottsand-db
Collaborator

Hey all, thanks for your patience. Will try to find time to review this in the coming week. Thanks!

@obakamai

@scottsand-db Any progress with this? It would be much appreciated if you could push this forward.


Successfully merging this pull request may close these issues.

Adjust S3DynamoDBLogStore to ScyllaDB's Alternator
6 participants