[Storage System] Support for AWS S3 (single cluster/driver) #39

Closed
tdas opened this issue May 10, 2019 · 7 comments
Labels
enhancement New feature or request
Milestone

Comments

@tdas
Contributor

tdas commented May 10, 2019

This is the official issue for tracking support for AWS S3. Specifically, we want to enable Delta Lake to operate on AWS S3 with transactional guarantees when all writes go through a single Spark driver (that is, it must be a single SparkContext in a single Spark driver JVM process).

The major challenges for operating on S3 with transactional guarantees are as follows:

  1. Lack of atomic "put if not present" - Delta Lake's atomic visibility of transactional changes depends on committing to the transaction log atomically by creating version file X only if it is not already present. The S3 file system does not provide a way to perform "put if absent", so multiple concurrent writers can easily commit the same version file multiple times, with each write overwriting another writer's set of changes.

  2. Lack of consistent directory listing - Delta Lake relies on directory listing to find the latest version file in the transaction log. S3 object listing does not guarantee that a listing attempt will return all the files written to a directory. This, coupled with challenge 1, can further lead to overwriting the same version.

In this issue, we are going to solve the above problems for a single Spark cluster: if all the concurrent writes to a table go through a single cluster, then we can do the necessary locking and track the latest version, as needed to avoid the above issues.
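The single-driver idea described above can be sketched as follows. This is a minimal illustration, not Delta Lake's actual implementation: the class name `SingleDriverLogStore` and its methods are hypothetical, and a local directory stands in for the S3 bucket. An in-process lock makes the "check, then write" step atomic within one driver, and a set of versions written by this process compensates for a listing that might omit recent files.

```python
import os
import threading


class SingleDriverLogStore:
    """Hypothetical sketch: serialize log commits inside one driver process."""

    def __init__(self, log_dir):
        self.log_dir = log_dir
        self._lock = threading.Lock()  # guards the check-then-write sequence
        self._written = set()          # versions committed by this process

    def _path(self, version):
        # Zero-padded file name so that names sort in version order.
        return os.path.join(self.log_dir, f"{version:020d}.json")

    def _listed_versions(self):
        # Stand-in for an S3 listing, which may miss recently written files.
        names = os.listdir(self.log_dir)
        return {int(n.split(".")[0]) for n in names if n.endswith(".json")}

    def latest_version(self):
        # Merge the (possibly stale) listing with what this process wrote.
        with self._lock:
            versions = self._listed_versions() | self._written
        return max(versions, default=-1)

    def commit(self, version, payload):
        """Write version file only if absent; raise FileExistsError otherwise."""
        path = self._path(version)
        with self._lock:
            if version in self._written or os.path.exists(path):
                raise FileExistsError(path)
            with open(path, "w") as f:
                f.write(payload)
            self._written.add(version)
```

The guarantee holds only while every writer shares the same process, which is why the scope of this issue is limited to a single driver.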

@tdas tdas added this to the 0.2.0 milestone May 10, 2019
@tdas tdas pinned this issue May 10, 2019
@tdas tdas added the enhancement New feature or request label May 10, 2019
This was referenced May 10, 2019
@tdas tdas changed the title Storage Support for AWS S3 (single cluster) [Storage System] Support for AWS S3 (single cluster) May 11, 2019
@binary132

binary132 commented May 14, 2019

What about going through a locking interface which could be implemented within Spark or by a service, such as DynamoDB? Then future work could be merged to enable multi-cluster ACID.
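One way to picture the locking interface suggested here is sketched below. All names (`CommitLock`, `InProcessLock`, `held`) are hypothetical and do not come from Delta Lake. The in-process implementation covers the single-driver case; an implementation backed by an external service such as DynamoDB would have to satisfy the same interface and is not shown.

```python
import threading
from abc import ABC, abstractmethod
from contextlib import contextmanager


class CommitLock(ABC):
    """Hypothetical interface: mutual exclusion per table path."""

    @abstractmethod
    def acquire(self, table_path):
        """Block until the lock for table_path is held."""

    @abstractmethod
    def release(self, table_path):
        """Release the lock for table_path."""

    @contextmanager
    def held(self, table_path):
        # Convenience wrapper so callers can write `with lock.held(path):`.
        self.acquire(table_path)
        try:
            yield
        finally:
            self.release(table_path)


class InProcessLock(CommitLock):
    """Single-driver implementation: one threading.Lock per table path."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the dictionary below
        self._locks = {}

    def _lock_for(self, table_path):
        with self._guard:
            return self._locks.setdefault(table_path, threading.Lock())

    def acquire(self, table_path):
        self._lock_for(table_path).acquire()

    def release(self, table_path):
        self._lock_for(table_path).release()
```

A commit path written against `CommitLock` would not need to change if a multi-cluster implementation were swapped in later.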

@tdas
Contributor Author

tdas commented May 14, 2019

Using something like DynamoDB as a locking service is definitely one of the ideas we will be playing with to implement the multi-cluster mode (tracked by #41). For this issue, we are focusing on releasing something quickly that enables the community to start using Delta Lake with S3.

@tdas tdas changed the title [Storage System] Support for AWS S3 (single cluster) [Storage System] Support for AWS S3 (single cluster/driver) May 22, 2019
@gourav-sg

With EMR 5.24.0 we can have multiple master nodes (obviously they are fail-safe ones, but it will be interesting to test).

@zsxwing
Member

zsxwing commented Jun 12, 2019

This is resolved by c8169bd

@zsxwing zsxwing closed this as completed Jun 12, 2019
@gourav-sg

What would be the best way to start testing this on AWS S3? I can also start updating some of the documentation around it so that others can refer to it as well. Please do let me know.

@zsxwing
Member

zsxwing commented Jun 13, 2019

@gourav-sg We are working on the documentation right now. Once we update it, I will ping you in this ticket. Thanks!

@wendigo

wendigo commented Sep 4, 2024

Since S3 now supports conditional writes, is there a plan to support them as a reconciliation mechanism in Delta?
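For context, a conditional write maps onto the "put if absent" primitive discussed at the top of this issue. The sketch below assumes a boto3 S3 client recent enough to accept the `IfNoneMatch` parameter on `put_object`; the helper name `put_if_absent` is hypothetical, and nothing here reflects how Delta Lake does or will handle this.

```python
def put_if_absent(s3_client, bucket, key, body):
    """Return True if the object was created, False if the key already existed.

    Assumes s3_client is a boto3 S3 client (created by the caller) whose
    put_object accepts IfNoneMatch. Errors are inspected through the
    `response` attribute that botocore's ClientError carries.
    """
    try:
        # "If-None-Match: *" asks S3 to reject the write if the key exists.
        s3_client.put_object(Bucket=bucket, Key=key, Body=body, IfNoneMatch="*")
        return True
    except Exception as exc:
        error = getattr(exc, "response", {}).get("Error", {})
        if error.get("Code") in ("PreconditionFailed", "412"):
            return False  # another writer already committed this key
        raise  # anything else is unexpected; let the caller see it
```

A writer that gets `False` back would treat the version as taken and retry with the next version number, rather than overwriting.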


5 participants