docs: clarify locking mechanism requirement for S3 (#2558)
- It was unclear to me that concurrent writing was available by default
for non-S3 backends, so I am making the language clearer.
- I have also added an extra section showing that R2 and maybe MinIO can
enable concurrent writing.
- Fixed a couple of unrelated formatting issues in the page I edited.

closes #2556 

#2069 also had the same confusion
inigohidalgo authored Jun 1, 2024
1 parent fa4c3d8 commit d42b68d
Showing 2 changed files with 35 additions and 7 deletions.
36 changes: 32 additions & 4 deletions docs/usage/writing/writing-to-s3-with-locking-provider.md
@@ -1,7 +1,10 @@
# Writing to S3 with a locking provider

-A locking mechanism is needed to prevent unsafe concurrent writes to a
-delta lake directory when writing to S3.
+Delta lake guarantees [ACID transactions](../../how-delta-lake-works/delta-lake-acid-transactions.md) when writing data. This is done by default when writing to all supported object stores except AWS S3. (Some S3 clients like CloudFlare R2 or MinIO may enable concurrent writing without a locking provider; refer to [this section](#enabling-concurrent-writes-for-alternative-clients) for more information.)

+When writing to S3, delta-rs provides a locking mechanism to ensure that concurrent writes are safe. This is done by default when writing to S3, but you can opt out by setting the `AWS_S3_ALLOW_UNSAFE_RENAME` variable to ``true``.

+To enable safe concurrent writes to AWS S3, we must provide an external locking mechanism.

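As a minimal sketch of the opt-out mentioned above (the bucket path is a placeholder; this is only appropriate when a single writer is guaranteed):

```python
import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({"x": [1, 2, 3]})

# Disable the locking requirement entirely; safe only with a single writer.
write_deltalake(
    "s3://my-bucket/my-table",  # placeholder path
    df,
    storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": "true"},
)
```
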
### DynamoDB
DynamoDB is currently the only locking provider available in delta-rs. To enable DynamoDB as the locking provider, set ``AWS_S3_LOCKING_PROVIDER`` to 'dynamodb' in ``storage_options`` or as an environment variable.
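For illustration, here is the environment-variable form of the same configuration; a sketch in which the ``delta_log`` default table name is an assumption rather than something this diff states:

```python
import os

# Equivalent environment-variable form of the storage_options shown below.
os.environ["AWS_S3_LOCKING_PROVIDER"] = "dynamodb"
# Optional override; delta-rs otherwise falls back to its default lock table
# (assumed here to be "delta_log").
os.environ["DELTA_DYNAMO_TABLE_NAME"] = "custom_table_name"
```
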
@@ -43,8 +46,15 @@ Here is an example writing to s3 using this mechanism:
```python
from deltalake import write_deltalake
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
-storage_options = {'AWS_S3_LOCKING_PROVIDER': 'dynamodb', 'DELTA_DYNAMO_TABLE_NAME': 'custom_table_name'}
-write_deltalake('s3a://path/to/table', df, 'storage_options'= storage_options)
+storage_options = {
+    'AWS_S3_LOCKING_PROVIDER': 'dynamodb',
+    'DELTA_DYNAMO_TABLE_NAME': 'custom_table_name'
+}
+write_deltalake(
+    's3a://path/to/table',
+    df,
+    storage_options=storage_options
+)
```

This locking mechanism is compatible with the one used by Apache Spark. The `tablePath` property, denoting the root URL of the delta table itself, is part of the primary key, and all writers intending to write to the same table must match this property precisely. In Spark, S3 URLs are prefixed with `s3a://`, and a table in delta-rs must be configured accordingly.
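Here is a sketch of provisioning the lock table with boto3. The ``delta_log`` name and the exact key schema (`tablePath` as partition key, `fileName` as sort key) are assumptions consistent with the primary-key description above, not taken verbatim from this diff:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Create the DynamoDB lock table delta-rs will use. The key schema assumes
# tablePath as the partition key and fileName as the sort key.
dynamodb.create_table(
    TableName="delta_log",
    AttributeDefinitions=[
        {"AttributeName": "tablePath", "AttributeType": "S"},
        {"AttributeName": "fileName", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "tablePath", "KeyType": "HASH"},
        {"AttributeName": "fileName", "KeyType": "RANGE"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
```
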
@@ -71,12 +81,30 @@ choose to set the `AWS_S3_ALLOW_UNSAFE_RENAME` variable to ``true`` in order to
You need to have permissions to get, put and delete objects in the S3 bucket you're storing your data in. Please note that you must be allowed to delete objects even if you're just appending to the deltalake, because there are temporary files in the log folder that are deleted after usage.

In AWS, those would be the required permissions:

- s3:GetObject
- s3:PutObject
- s3:DeleteObject

In DynamoDB, you need those permissions:

- dynamodb:GetItem
- dynamodb:Query
- dynamodb:PutItem
- dynamodb:UpdateItem

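To make the two permission lists above concrete, here is a sketch of an IAM policy document expressed as a Python dict; the Resource ARNs and the lock-table name are placeholders:

```python
# A sketch of an IAM policy granting the permissions listed above.
# Resource ARNs are placeholders; scope them to your bucket and lock table.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::my-bucket/*",
        },
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:GetItem",
                "dynamodb:Query",
                "dynamodb:PutItem",
                "dynamodb:UpdateItem",
            ],
            "Resource": "arn:aws:dynamodb:*:*:table/delta_log",
        },
    ],
}
```
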
+### Enabling concurrent writes for alternative clients
+
+Unlike AWS S3, some S3 clients support atomic renames by passing specific
+headers in requests.
+
+For CloudFlare R2, passing this in the `storage_options` will enable concurrent writes:
+
+```python
+storage_options = {
+    "copy_if_not_exists": "header: cf-copy-destination-if-none-match: *",
+}
+```
+
+Something similar can be done with MinIO, but the header to pass should be
+verified in the MinIO documentation.
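As a usage sketch of the R2 option above: the endpoint URL, bucket, and path are hypothetical placeholders, and ``AWS_ENDPOINT_URL`` is a standard delta-rs/object_store configuration key assumed to apply here; verify against your setup.

```python
import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({"x": [1, 2, 3]})

storage_options = {
    # Placeholder R2 endpoint; substitute your Cloudflare account ID.
    "AWS_ENDPOINT_URL": "https://<account_id>.r2.cloudflarestorage.com",
    # R2 supports atomic copy-if-not-exists via this header, which lets
    # delta-rs write safely without a DynamoDB lock.
    "copy_if_not_exists": "header: cf-copy-destination-if-none-match: *",
}

write_deltalake("s3://my-bucket/my-table", df, storage_options=storage_options)
```
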
6 changes: 3 additions & 3 deletions python/deltalake/writer.py
@@ -208,9 +208,9 @@ def write_deltalake(
For higher protocol support use engine='rust'; this will become the default
eventually.
-A locking mechanism is needed to prevent unsafe concurrent writes to a
-delta lake directory when writing to S3. For more information on the setup, follow
-this usage guide: https://delta-io.github.io/delta-rs/usage/writing/writing-to-s3-with-locking-provider/
+To enable safe concurrent writes when writing to S3, an additional locking
+mechanism must be supplied. For more information on enabling concurrent writing to S3, follow
+[this guide](https://delta-io.github.io/delta-rs/usage/writing/writing-to-s3-with-locking-provider/)
Args:
table_or_uri: URI of a table or a DeltaTable object.
