[Storage System] Support for AWS S3 (single cluster/driver) #39
What about going through a locking interface that could be implemented within Spark or by a service such as DynamoDB? Then future work could be merged to enable multi-cluster ACID.
Using something like DynamoDB as a locking service is definitely one of the ideas we will be playing with to implement the multi-cluster mode (tracked by #41). For this issue, we are focusing on releasing something quickly that enables the community to start using Delta Lake with S3.
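To make the idea concrete, here is a minimal sketch of how DynamoDB conditional writes could supply the "put if absent" primitive that S3 lacks. This is not the eventual implementation; the table name, key schema, and helper function are hypothetical, assuming boto3.

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def try_commit(table_path: str, version: int) -> bool:
    """Record that `version` of the Delta log at `table_path` was committed,
    failing if another writer already claimed that version.
    (Hypothetical table name and key schema, for illustration only.)"""
    try:
        dynamodb.put_item(
            TableName="delta_log_commits",  # hypothetical lock/commit table
            Item={
                "table_path": {"S": table_path},
                "version": {"N": str(version)},
            },
            # The write succeeds only if no item with this key exists yet,
            # i.e. an atomic "put if absent".
            ConditionExpression="attribute_not_exists(table_path)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another writer committed this version first
        raise
```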
With EMR 5.24.0 we can have multiple master nodes (they exist for failover, of course, but it will be interesting to test).
This is resolved by c8169bd
What would be the best way to start testing this on AWS S3? I can also start updating some documentation around it so that others can refer to it as well; please let me know.
@gourav-sg We are working on the documentation right now. Once we update it, I will ping you in this ticket. Thanks!
Since S3 now supports conditional writes, is there a plan to support them as a reconciliation mechanism in Delta?
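For reference, a minimal sketch of what a commit built on S3 conditional writes could look like, assuming a recent boto3 that exposes the `IfNoneMatch` parameter; bucket and key names are made up, and whether Delta adopts this mechanism is up to the maintainers:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def put_if_absent(bucket: str, key: str, body: bytes) -> bool:
    """Write the object only if no object with this key exists yet,
    using S3's conditional write support (If-None-Match: *)."""
    try:
        s3.put_object(Bucket=bucket, Key=key, Body=body, IfNoneMatch="*")
        return True
    except ClientError as e:
        # S3 rejects the write with HTTP 412 if the key already exists.
        if e.response["Error"]["Code"] == "PreconditionFailed":
            return False
        raise

# e.g. committing version 12 of a table's Delta log (hypothetical path):
# put_if_absent("my-bucket",
#               "tables/events/_delta_log/00000000000000000012.json",
#               payload)
```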
This is the official issue for tracking support for AWS S3. Specifically, we want to enable Delta Lake to operate on AWS S3 with transactional guarantees when all writes go through a single Spark driver (that is, all writes must come from a single SparkContext in a single Spark driver JVM process).
The major challenges for operating on S3 with transactional guarantees are as follows:
1. Lack of atomic "put if not present" - Delta Lake's atomic visibility of transactional changes depends on committing to the transaction log atomically by creating a version file X only if it is not already present. S3 does not provide a way to perform "put if absent", so multiple concurrent writers can easily commit the same version file multiple times, overwriting another writer's set of changes (a sketch of this race follows the list).
2. Lack of consistent directory listing - Delta Lake relies on directory listing to find the latest version file in the transaction log. S3 object listing does not guarantee that a listing will return all the files written to a directory. This, coupled with 1., can further lead to the same version being overwritten.
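To make the first problem concrete, here is a hedged sketch (boto3, hypothetical bucket and key names) of the naive "check, then put" commit that plain S3 forces on a writer, and where it races:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def naive_commit(bucket: str, log_key: str, body: bytes) -> None:
    """A commit built from plain S3 primitives: check for the version file,
    then write it. The check and the write are NOT atomic, so two drivers
    can both observe "not found" and both write version N, silently
    overwriting each other's transaction."""
    try:
        s3.head_object(Bucket=bucket, Key=log_key)
        raise FileExistsError(f"version file {log_key} already exists")
    except ClientError as e:
        if e.response["Error"]["Code"] != "404":
            raise
    # <-- another writer can create log_key right here
    s3.put_object(Bucket=bucket, Key=log_key, Body=body)  # last writer wins
```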
In this issue, we are going to solve the above problems for a single Spark cluster: if all concurrent writes to a table go through a single cluster, then we can do the necessary locking and latest-version tracking needed to avoid the above issues. A rough sketch of that idea follows.
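A minimal sketch of the single-cluster approach, in Python-style pseudocode rather than the Scala LogStore code where the real change would live; the class and method names are invented. Commits from the one driver process are serialized with a lock, and the highest version written is remembered so that a stale or incomplete S3 listing can never cause a version number to be reused:

```python
import threading

class SingleDriverCommitter:
    """Serializes all Delta log writes issued from one driver process and
    remembers the latest version written per table, so writers in this
    process can neither reuse nor overwrite a version even though S3 offers
    no "put if absent" and no consistent listing. Illustrative only; safety
    depends on ALL writes to the table going through this one process."""

    def __init__(self, write_file):
        self._write_file = write_file  # plain S3 put: (table, version, body) -> None
        self._lock = threading.Lock()
        self._latest = {}              # table_path -> last version written here

    def commit(self, table_path: str, version: int, body: bytes) -> None:
        with self._lock:
            last = self._latest.get(table_path, -1)
            if version <= last:
                raise RuntimeError(
                    f"{table_path}: version {version} already committed "
                    f"(latest is {last})")
            self._write_file(table_path, version, body)
            self._latest[table_path] = version

    def latest_version(self, table_path: str) -> int:
        # Readers in this process consult the cache instead of trusting an
        # S3 listing that may lag behind recent writes.
        return self._latest.get(table_path, -1)
```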