DynamoDBLogStore #339
Do you have any testing framework that can be run as unit tests in the PR? I believe that with appropriate separation of the contention-handling logic from DynamoDB, each can be tested separately. For example, the contention-handling logic can be tested with a pluggable in-memory store (that is, independent of DynamoDB). Then we can ensure correctness independent of any specific KV store.
Thanks for taking a look!
Not yet.
You are right, I thought about that. BaseExternalLogStore logic can be tested independently of any KV store by subclassing it in tests. I'll work on that.
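To illustrate the idea discussed above, here is a minimal Python sketch of testing contention handling against an in-memory store. It is not the PR's Scala code: `InMemoryStore`, `try_commit`, and `race_for_version` are hypothetical names, and the only behavior assumed is a put-if-absent primitive that at most one writer can win per version.

```python
import threading


class InMemoryStore:
    """Hypothetical stand-in for the external KV store; not part of the PR."""

    def __init__(self):
        self._entries = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key, value):
        """Store value under key only if key is unused; return True on success."""
        with self._lock:
            if key in self._entries:
                return False
            self._entries[key] = value
            return True

    def get(self, key):
        """Return the stored value for key, or None if the key is unused."""
        with self._lock:
            return self._entries.get(key)


def try_commit(store, version, writer_id):
    """Contention logic under test: a writer claims a version only if it is free."""
    return store.put_if_absent(version, writer_id)


def race_for_version(store, version, writers):
    """Let several writers race for one version; return the ids that succeeded."""
    winners = []
    winners_lock = threading.Lock()

    def attempt(writer_id):
        if try_commit(store, version, writer_id):
            with winners_lock:
                winners.append(writer_id)

    threads = [threading.Thread(target=attempt, args=(w,)) for w in writers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners
```

A unit test can then assert that `race_for_version` returns exactly one winner, without touching DynamoDB at all.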
Hi @tdas, we implemented the recommended changes and added tests. Can you please take a look and tag appropriate reviewers?
Hi, can we get some reviews on this PR?
Is it possible to migrate an existing delta table to this format? Or does it have to be created from scratch with this implementation?
It is perfectly fine to start with existing delta table and empty DynamoDB table, you don't need to do any migration.
Sorry for the delay. We are working on adding a
Hi @zsxwing, thanks for the reply. Can you share some rough estimates on when you'll be releasing those changes? Is it 1-2 months or rather some time in Q3 or Q4? We're currently running our prod workloads on the forked delta and I'd like to plan for the switch.
@mmazek I would avoid changing the build script as it might break the release script and block 0.7.0. So it will likely happen in Q3.
Hello, do you have any update on this?
Apologies for leaving this PR open for so long. We are currently refactoring the project to support merging LogStore implementations in a separate module, so that we don't need to pull lots of unnecessary dependencies into the core project. In addition, we are working on a stable public LogStore API so that developers do not have to build on top of private APIs. This should be done soon, as we are marking these tasks as release blockers for the next release. Will ping you when they are ready. Thanks again for your contribution.
Any news regarding this PR? Is it part of 1.0.0?
@emanuelh-cloud I'm going to continue work on that soon, stay tuned!
I've moved code to contribs - could you take a look? Regarding tests:
contribs/src/main/scala/io/delta/storage/DynamoDBLogStore.scala
contribs/src/main/scala/io/delta/storage/BaseExternalLogStore.scala
Merged commit 3af698d into delta-io:dynamodb_logstore_scala_feature_branch
Any updates? #1498 Thanks
Of course it is very old (based on Delta 0.5), but adapting it to the current version of Delta should be possible.
## Description

Taking inspiration from #339, this PR adds a Commit Owner Client which uses DynamoDB as the backend. Each Delta table managed by a DynamoDB instance will have one corresponding entry in a DynamoDB table. The table schema is as follows:

* tableId: String --- The unique identifier for the entry. This is a UUID.
* path: String --- The fully qualified path of the table in the file system. e.g. s3://bucket/path.
* acceptingCommits: Boolean --- Whether the commit owner is accepting new commits. This will only be set to false when the table is converted from managed commits to file system commits.
* tableVersion: Number --- The version of the latest commit.
* tableTimestamp: Number --- The inCommitTimestamp of the latest commit.
* schemaVersion: Number --- The version of the schema used to store the data.
* commits: --- The list of unbackfilled commits.
  - version: Number --- The version of the commit.
  - inCommitTimestamp: Number --- The inCommitTimestamp of the commit.
  - fsName: String --- The name of the unbackfilled file.
  - fsLength: Number --- The length of the unbackfilled file.
  - fsTimestamp: Number --- The modification time of the unbackfilled file.

For a table to be managed by DynamoDB, `registerTable` must be called for that Delta table. This will create a new entry in the db for this Delta table. Every `commit` invocation appends the UUID delta file status to the `commits` list in the table entry. `commit` is performed through a conditional write in DynamoDB.

## How was this patch tested?

Added a new suite called `DynamoDBCommitOwnerClient5BackfillSuite` which uses a mock DynamoDB client. + plus manual testing against a DynamoDB instance.
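The conditional-write commit described above can be modelled in memory to show why concurrent writers cannot both claim the same version. This is only a sketch, not the referenced PR's implementation: the `TableEntry` class, the `CommitConflict` exception, the example file names, the initial version of -1, and the specific condition (the new version must equal `tableVersion + 1`) are all assumptions made for illustration. The description only says that `commit` uses a conditional write.

```python
import threading


class CommitConflict(Exception):
    """Raised when the (modelled) conditional write is rejected."""


class TableEntry:
    """In-memory model of one table entry; field names follow the schema above."""

    def __init__(self, table_id, path):
        self.table_id = table_id
        self.path = path
        self.accepting_commits = True
        self.table_version = -1  # assumption: -1 means "no commits yet"
        self.table_timestamp = None
        self.commits = []  # list of unbackfilled commits
        self._lock = threading.Lock()  # stands in for DynamoDB's atomic update

    def commit(self, version, in_commit_timestamp, fs_name, fs_length, fs_timestamp):
        """Append a commit only if the assumed condition holds, else raise."""
        with self._lock:
            if not self.accepting_commits:
                raise CommitConflict("entry is not accepting commits")
            # Assumed condition: the writer must target exactly the next version.
            if version != self.table_version + 1:
                raise CommitConflict(
                    f"expected version {self.table_version + 1}, got {version}"
                )
            self.commits.append(
                {
                    "version": version,
                    "inCommitTimestamp": in_commit_timestamp,
                    "fsName": fs_name,
                    "fsLength": fs_length,
                    "fsTimestamp": fs_timestamp,
                }
            )
            self.table_version = version
            self.table_timestamp = in_commit_timestamp


entry = TableEntry("example-uuid", "s3://bucket/path")
entry.commit(0, 100, "0.example.json", 10, 100)  # hypothetical file name
try:
    entry.commit(0, 150, "0.other.json", 11, 150)  # second writer loses the race
except CommitConflict as err:
    print("rejected:", err)
```

In a real DynamoDB-backed client the check and the append would have to happen in a single conditional update so that no other writer can interleave between them; the lock here only imitates that atomicity.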
This PR addresses issue #41 - Support for AWS S3 (multiple clusters/drivers/JVMs)
It implements a few ideas from the #41 discussion:
in external DB. This class may be easily extended for a specific DB backend
to be able to finish an incomplete write operation while reading
(ZooKeeper implementation is almost ready, I can create a separate PR if anyone is interested)
DynamoDBLogStore requirements:
To enable DynamoDBLogStore, set the following Spark property:
spark.delta.logStore.class=io.delta.storage.DynamoDBLogStore
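
One way to pass this property is on the `spark-submit` command line with `--conf`. The small Python helper below is a hypothetical convenience (the function name is not part of the PR) that turns a dictionary of Spark properties into `--conf key=value` arguments; only the property name and class come from the description above.

```python
def spark_submit_conf_args(properties):
    """Turn a dict of Spark properties into spark-submit `--conf key=value` args."""
    args = []
    for key, value in properties.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


log_store_conf = {
    "spark.delta.logStore.class": "io.delta.storage.DynamoDBLogStore",
}

# Prints: --conf spark.delta.logStore.class=io.delta.storage.DynamoDBLogStore
print(" ".join(spark_submit_conf_args(log_store_conf)))
```
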
A single DynamoDB table is required. The default table name is 'delta_log';
it may be changed by setting a Spark property.
Required key schema:
The table may be created with the following AWS CLI command:
The following Spark properties are recognized:
Testing
A Python integration test is included: examples/python/dynamodb_logstore.py
This solution has also been stress-tested on an Amazon EMR cluster
(multiple test jobs writing thousands of parallel transactions to a single Delta table),
and no data loss has been observed so far.