docs: move dynamo docs into new docs page (delta-io#2093)
# Description
Adds the DynamoDB docs to our new docs site. In the Python
`write_deltalake` docstring I am pointing to the guide instead, since it's
quite extensive and only relevant for S3 users.


@rtyler @dispanser
ion-elgreco authored Jan 21, 2024
1 parent 5d020d4 commit 61ca275
Showing 5 changed files with 78 additions and 25 deletions.
File renamed without changes.
67 changes: 67 additions & 0 deletions docs/usage/writing/writing-to-s3-with-locking-provider.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
# Writing to S3 with a locking provider

When writing to S3, a locking mechanism is needed to prevent unsafe
concurrent writes to a Delta Lake table directory.

### DynamoDB
DynamoDB is currently the only locking provider available in delta-rs. To enable it, set ``AWS_S3_LOCKING_PROVIDER`` to ``dynamodb``, either in ``storage_options`` or as an environment variable.
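Both configuration routes can be sketched as follows (a minimal illustration; the environment variable takes effect process-wide, while ``storage_options`` applies per call):

```python
import os

# Option 1: environment variable, read by delta-rs at write time.
os.environ["AWS_S3_LOCKING_PROVIDER"] = "dynamodb"

# Option 2: per-call storage_options passed to write_deltalake.
storage_options = {"AWS_S3_LOCKING_PROVIDER": "dynamodb"}
```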

Additionally, you must create a DynamoDB table with the name ``delta_log``
so that it can be automatically recognized by delta-rs. Alternatively, you can
use a table name of your choice, but you must set the ``DELTA_DYNAMO_TABLE_NAME``
variable to match your chosen table name. The required schema for the DynamoDB
table is as follows:

```json
"Table": {
"AttributeDefinitions": [
{
"AttributeName": "fileName",
"AttributeType": "S"
},
{
"AttributeName": "tablePath",
"AttributeType": "S"
}
],
"TableName": "delta_log",
"KeySchema": [
{
"AttributeName": "tablePath",
"KeyType": "HASH"
},
{
"AttributeName": "fileName",
"KeyType": "RANGE"
}
],
}
```
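The same table spec can be expressed programmatically. A minimal boto3 sketch follows (boto3 and configured AWS credentials are assumptions here; the `create_table` call is left commented out so the spec can be inspected offline):

```python
# Table spec mirroring the JSON schema above.
table_params = {
    "TableName": "delta_log",
    "AttributeDefinitions": [
        {"AttributeName": "fileName", "AttributeType": "S"},
        {"AttributeName": "tablePath", "AttributeType": "S"},
    ],
    "KeySchema": [
        {"AttributeName": "tablePath", "KeyType": "HASH"},
        {"AttributeName": "fileName", "KeyType": "RANGE"},
    ],
    "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
}

# Uncomment to actually create the table:
# import boto3
# boto3.client("dynamodb").create_table(**table_params)
```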

Here is an example of writing to S3 using this mechanism:

```python
import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({'x': [1, 2, 3]})
storage_options = {'AWS_S3_LOCKING_PROVIDER': 'dynamodb', 'DELTA_DYNAMO_TABLE_NAME': 'custom_table_name'}
write_deltalake('s3a://path/to/table', df, storage_options=storage_options)
```

This locking mechanism is compatible with the one used by Apache Spark. The `tablePath` attribute, which holds the root URL of the Delta table itself, is part of the primary key, and all writers intending to write to the same table must match this value exactly. In Spark, S3 URLs are prefixed with `s3a://`, so a table written from delta-rs must be configured accordingly.

The following command creates the necessary table from the AWS CLI:

```sh
aws dynamodb create-table \
--table-name delta_log \
--attribute-definitions AttributeName=tablePath,AttributeType=S AttributeName=fileName,AttributeType=S \
--key-schema AttributeName=tablePath,KeyType=HASH AttributeName=fileName,KeyType=RANGE \
--provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5
```

You can find additional information in the [Delta Lake documentation](https://docs.delta.io/latest/delta-storage.html#multi-cluster-setup), which also includes recommendations on configuring a time-to-live (TTL) for the table to prevent it from growing indefinitely.
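If you prefer to configure the TTL from code rather than the console, a hedged boto3 sketch follows; the `expireTime` attribute name is an assumption taken from the linked documentation, so verify it against your setup before enabling:

```python
# TTL specification for the lock table: writers stamp an expiry timestamp
# into this attribute, and DynamoDB purges expired items automatically.
# NOTE: "expireTime" is an assumed attribute name -- check your setup.
ttl_spec = {"Enabled": True, "AttributeName": "expireTime"}

# Uncomment to apply (requires boto3 and AWS credentials):
# import boto3
# boto3.client("dynamodb").update_time_to_live(
#     TableName="delta_log", TimeToLiveSpecification=ttl_spec
# )
```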


### Enable unsafe writes in S3 (opt-in)
If you don't want to use DynamoDB as your locking mechanism, you can
set the `AWS_S3_ALLOW_UNSAFE_RENAME` variable to ``true`` to enable unsafe S3 writes.
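As a rough sketch, the opt-in looks like this (the write call is commented out because it targets a live bucket, and `s3a://path/to/table` is a placeholder):

```python
# Opt in to unsafe S3 writes instead of a locking provider. This is only
# safe with a single writer; concurrent writers can silently lose data.
storage_options = {"AWS_S3_ALLOW_UNSAFE_RENAME": "true"}

# from deltalake import write_deltalake
# write_deltalake("s3a://path/to/table", df, storage_options=storage_options)
```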
4 changes: 3 additions & 1 deletion mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,9 @@ nav:
- Examining a table: usage/examining-table.md
- Querying a table: usage/querying-delta-tables.md
- Managing a table: usage/managing-tables.md
- Writing a table: usage/writing-delta-tables.md
- Writing a table:
- usage/writing/index.md
- usage/writing/writing-to-s3-with-locking-provider.md
- Deleting rows from a table: usage/deleting-rows-from-delta-lake-table.md
- Optimize:
- Small file compaction: usage/optimize/small-file-compaction-with-optimize.md
Expand Down
6 changes: 3 additions & 3 deletions python/README.md
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
# Deltalake-python

[![PyPI](https://img.shields.io/pypi/v/deltalake.svg?style=flat-square)](https://pypi.org/project/deltalake/)
[![userdoc](https://img.shields.io/badge/docs-user-blue)](https://delta-io.github.io/delta-rs/python/)
[![apidoc](https://img.shields.io/badge/docs-api-blue)](https://delta-io.github.io/delta-rs/python/api_reference.html)
[![userdoc](https://img.shields.io/badge/docs-user-blue)](https://delta-io.github.io/delta-rs/)
[![apidoc](https://img.shields.io/badge/docs-api-blue)](https://delta-io.github.io/delta-rs/api/delta_table/)

Native [Delta Lake](https://delta.io/) Python binding based on
[delta-rs](https://github.com/delta-io/delta-rs) with
Expand All @@ -22,7 +22,7 @@ dt.files()
'part-00001-c373a5bd-85f0-4758-815e-7eb62007a15c-c000.snappy.parquet']
```

See the [user guide](https://delta-io.github.io/delta-rs/python/usage.html) for more examples.
See the [user guide](https://delta-io.github.io/delta-rs/usage/installation/) for more examples.

## Installation

Expand Down
26 changes: 5 additions & 21 deletions python/deltalake/writer.py
Original file line number Diff line number Diff line change
Expand Up @@ -171,29 +171,13 @@ def write_deltalake(
If the table does not already exist, it will be created.
This function only supports writer protocol version 2 currently. When
attempting to write to an existing table with a higher min_writer_version,
this function will throw DeltaProtocolError.
Note that this function does NOT register this table in a data catalog.
The pyarrow writer supports protocol version 2 currently and won't be updated.
For higher protocol support use engine='rust', this will become the default
eventually.
A locking mechanism is needed to prevent unsafe concurrent writes to a
delta lake directory when writing to S3. DynamoDB is the only available
locking provider at the moment in delta-rs. To enable DynamoDB as the
locking provider, you need to set the `AWS_S3_LOCKING_PROVIDER` to 'dynamodb'
as a storage_option or as an environment variable.
Additionally, you must create a DynamoDB table with the name 'delta_rs_lock_table'
so that it can be automatically discovered by delta-rs. Alternatively, you can
use a table name of your choice, but you must set the `DELTA_DYNAMO_TABLE_NAME`
variable to match your chosen table name. The required schema for the DynamoDB
table is as follows:
- Key Schema: AttributeName=key, KeyType=HASH
- Attribute Definitions: AttributeName=key, AttributeType=S
Please note that this locking mechanism is not compatible with any other
locking mechanisms, including the one used by Spark.
delta lake directory when writing to S3. For more information on the setup, follow
this usage guide: https://delta-io.github.io/delta-rs/usage/writing/writing-to-s3-with-locking-provider/
Args:
table_or_uri: URI of a table or a DeltaTable object.
Expand Down
