Multiple concurrent spark streaming writes with Minio and HMS with transactional guarantees #1336

Open
KhASQ opened this issue Aug 15, 2022 · 2 comments

KhASQ commented Aug 15, 2022

Hi

I am enjoying working with the delta format and I believe it is the right table format for my use case.

I have a question about the transactional guarantees for concurrent Spark streaming writes to MinIO with HMS.

The pipeline is like this:
1- Multiple Spark streaming jobs upsert into a single Delta table stored in MinIO (an S3-compatible object store)
2- The Delta tables are queried using Trino with HMS

I am worried about this note in the Delta docs:
https://docs.delta.io/latest/delta-storage.html#-delta-storage-s3

“This multi-cluster writing solution is only safe when all writers use this LogStore implementation as well as the same DynamoDB table and region. If some drivers use out-of-the-box Delta Lake while others use this experimental LogStore, then data loss can occur.”

How can I implement the multi-cluster setup in my environment without DynamoDB and still get transactional guarantees?

@nkarpov nkarpov self-assigned this Aug 16, 2022

nkarpov (Collaborator) commented Aug 16, 2022

Hi @KhASQ - only S3+DynamoDB is supported today, but other methods for providing mutual exclusion are a great ask.

This will likely require an additional LogStore implementation, as mentioned on the storage configuration page, that integrates with whichever external system is responsible for providing the mutual exclusion.
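
To make the mutual-exclusion requirement concrete, here is a minimal sketch of the "create only if absent" primitive that two writers racing for the same log version rely on. It is an illustration on a local filesystem, not Delta's LogStore interface: the function names `commit_path`, `try_commit`, and `commit_with_retry` are hypothetical, and on an S3-compatible store the exclusive-create guarantee would have to come from the external system.

```python
import os


def commit_path(log_dir: str, version: int) -> str:
    # Delta log entries are named by a zero-padded 20-digit version number.
    return os.path.join(log_dir, f"{version:020d}.json")


def try_commit(log_dir: str, version: int, payload: bytes) -> bool:
    """Write the commit for `version` only if no other writer has claimed it."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    try:
        # O_CREAT | O_EXCL makes the create fail if the file already exists,
        # so at most one writer can win a given version.
        fd = os.open(commit_path(log_dir, version), flags)
    except FileExistsError:
        return False
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    return True


def commit_with_retry(log_dir: str, first_version: int, payload: bytes,
                      max_attempts: int = 10) -> int:
    """Try successive versions until one is free; return the version written."""
    version = first_version
    for _ in range(max_attempts):
        if try_commit(log_dir, version, payload):
            return version
        # Assumption: a real writer would re-read the log and check for
        # logical conflicts before retrying at the next version.
        version += 1
    raise RuntimeError("could not commit within max_attempts")
```

A writer that loses the race gets `False` back instead of silently overwriting the winner's commit, which is the behavior the external system has to guarantee across clusters.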

Could you please share more details about your environment?

  1. Where is this deployed: self-hosted/on-prem, a cloud provider, etc.?
  2. What additional services/databases exist, if not Dynamo (metastores of any kind, etc.)?
  3. Would you be open to helping/contributing to the solution?

@ChristianPfarr

I'm also experimenting with Delta + MinIO and realized that I need DynamoDB to cover all scenarios in the wild.

I found ScyllaDB with Alternator as a replacement for DynamoDB
-> https://www.scylladb.com/alternator/
but I can't find a way to specify endpoints in the Delta configs for this LogStore.

I think we would need an additional parameter, or a change to the client builder, here
-> https://github.com/delta-io/delta/blob/master/storage-s3-dynamodb/src/main/java/io/delta/storage/S3DynamoDBLogStore.java#L306
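
As a rough illustration of the kind of parameter I mean, here is a small Python sketch (not the Java LogStore code) of an optional endpoint override. The helper `dynamodb_client_kwargs` and the `http://localhost:8000` URL are hypothetical; `region_name` and `endpoint_url` are the keyword arguments that boto3's `client()` accepts.

```python
from typing import Optional


def dynamodb_client_kwargs(region: str, endpoint: Optional[str] = None) -> dict:
    """Build keyword arguments for a DynamoDB client, with an optional endpoint."""
    kwargs = {"region_name": region}
    if endpoint:
        # Point the client at a DynamoDB-compatible service instead of AWS.
        kwargs["endpoint_url"] = endpoint
    return kwargs


# Usage with boto3 (not imported here so the sketch has no dependencies):
#   import boto3
#   client = boto3.client(
#       "dynamodb",
#       **dynamodb_client_kwargs("us-west-2", "http://localhost:8000"),
#   )
```

When no endpoint is configured, the client keeps its default AWS endpoint, so existing setups would not change.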

What do you think?
