[Storage System] Support for AWS S3 (single cluster/driver) #39
What about going through a locking interface that could be implemented within Spark or by a service such as DynamoDB? Then future work could be merged to enable multi-cluster ACID.
Using something like DynamoDB as a locking service is definitely one of the ideas we will be playing with to implement the multi-cluster mode (tracked by #41). For this issue, we are focusing on releasing something quickly that enables the community to start using Delta Lake with S3.
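To make the idea concrete, here is a minimal sketch of how DynamoDB conditional writes could supply the "put if absent" primitive that S3 lacks. This is not the eventual implementation; the table name, key schema, and helper function are hypothetical, assuming boto3.

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def try_commit(table_path: str, version: int) -> bool:
    """Record that `version` of the Delta log at `table_path` was committed,
    failing if another writer already claimed that version.
    (Hypothetical table name and key schema, for illustration only.)"""
    try:
        dynamodb.put_item(
            TableName="delta_log_commits",  # hypothetical lock/commit table
            Item={
                "table_path": {"S": table_path},
                "version": {"N": str(version)},
            },
            # The write succeeds only if no item with this key exists yet,
            # i.e. an atomic "put if absent".
            ConditionExpression="attribute_not_exists(table_path)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another writer committed this version first
        raise
```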
With EMR 5.24.0 we can have multiple master nodes (they exist for failover, of course, but it will be interesting to test).
This is resolved by c8169bd
What would be the best way to start testing this on AWS S3? I can also start updating some documentation around it so that others can refer to it as well; please let me know.
@gourav-sg We are working on the documentation right now. Once we update it, I will ping you in this ticket. Thanks!
Since S3 now supports conditional writes, is there a plan to support them as a reconciliation mechanism in Delta?
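For reference, a minimal sketch of what a commit built on S3 conditional writes could look like, assuming a recent boto3 that exposes the `IfNoneMatch` parameter; bucket and key names are made up, and whether Delta adopts this mechanism is up to the maintainers:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def put_if_absent(bucket: str, key: str, body: bytes) -> bool:
    """Write the object only if no object with this key exists yet,
    using S3's conditional write support (If-None-Match: *)."""
    try:
        s3.put_object(Bucket=bucket, Key=key, Body=body, IfNoneMatch="*")
        return True
    except ClientError as e:
        # S3 rejects the write with HTTP 412 if the key already exists.
        if e.response["Error"]["Code"] == "PreconditionFailed":
            return False
        raise

# e.g. committing version 12 of a table's Delta log (hypothetical path):
# put_if_absent("my-bucket",
#               "tables/events/_delta_log/00000000000000000012.json",
#               payload)
```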
This is the official issue for tracking support for AWS S3. Specifically, we want to enable Delta Lake to operate on AWS S3 with transactional guarantees when all writes go through a single Spark driver (that is, all writes must come from a single SparkContext in a single Spark driver JVM process).
The major challenges for operating on S3 with transactional guarantees are as follows:
1. Lack of atomic "put if not present" - Delta Lake's atomic visibility of transactional changes depends on committing to the transaction log atomically by creating a version file X only if it is not already present. S3 does not provide a way to perform "put if absent", so multiple concurrent writers can easily commit the same version file multiple times, overwriting another writer's set of changes (a sketch of this race follows the list).
2. Lack of consistent directory listing - Delta Lake relies on directory listing to find the latest version file in the transaction log. S3 object listing does not guarantee that a listing will return all the files written to a directory. This, coupled with 1., can further lead to the same version being overwritten.
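To make the first problem concrete, here is a hedged sketch (boto3, hypothetical bucket and key names) of the naive "check, then put" commit that plain S3 forces on a writer, and where it races:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def naive_commit(bucket: str, log_key: str, body: bytes) -> None:
    """A commit built from plain S3 primitives: check for the version file,
    then write it. The check and the write are NOT atomic, so two drivers
    can both observe "not found" and both write version N, silently
    overwriting each other's transaction."""
    try:
        s3.head_object(Bucket=bucket, Key=log_key)
        raise FileExistsError(f"version file {log_key} already exists")
    except ClientError as e:
        if e.response["Error"]["Code"] != "404":
            raise
    # <-- another writer can create log_key right here
    s3.put_object(Bucket=bucket, Key=log_key, Body=body)  # last writer wins
```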
In this issue, we are going to solve the above problems for a single Spark cluster: if all concurrent writes to a table go through a single cluster, then we can do the necessary locking and latest-version tracking needed to avoid the above issues. A rough sketch of that idea follows.
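A minimal sketch of the single-cluster approach, in Python-style pseudocode rather than the Scala LogStore code where the real change would live; the class and method names are invented. Commits from the one driver process are serialized with a lock, and the highest version written is remembered so that a stale or incomplete S3 listing can never cause a version number to be reused:

```python
import threading

class SingleDriverCommitter:
    """Serializes all Delta log writes issued from one driver process and
    remembers the latest version written per table, so writers in this
    process can neither reuse nor overwrite a version even though S3 offers
    no "put if absent" and no consistent listing. Illustrative only; safety
    depends on ALL writes to the table going through this one process."""

    def __init__(self, write_file):
        self._write_file = write_file  # plain S3 put: (table, version, body) -> None
        self._lock = threading.Lock()
        self._latest = {}              # table_path -> last version written here

    def commit(self, table_path: str, version: int, body: bytes) -> None:
        with self._lock:
            last = self._latest.get(table_path, -1)
            if version <= last:
                raise RuntimeError(
                    f"{table_path}: version {version} already committed "
                    f"(latest is {last})")
            self._write_file(table_path, version, body)
            self._latest[table_path] = version

    def latest_version(self, table_path: str) -> int:
        # Readers in this process consult the cache instead of trusting an
        # S3 listing that may lag behind recent writes.
        return self._latest.get(table_path, -1)
```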