docs: improve S3 access docs #2589

Merged
merged 5 commits into from
Jun 12, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 14 additions & 12 deletions docs/usage/loading-table.md
[azure
options](https://docs.rs/object_store/latest/object_store/azure/enum.AzureConfigKey.html#variants),
[gcs
options](https://docs.rs/object_store/latest/object_store/gcp/enum.GoogleConfigKey.html#variants).

```python
>>> storage_options = {"AWS_ACCESS_KEY_ID": "THE_AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY":"THE_AWS_SECRET_ACCESS_KEY"}
>>> dt = DeltaTable("../rust/tests/data/delta-0.2.0", storage_options=storage_options)
```
properties.

**S3**:

> - s3://\<bucket\>/\<path\>
> - s3a://\<bucket\>/\<path\>

Note that `delta-rs` does not read credentials from a local `.aws/config` or `.aws/creds` file. Credentials can be provided via environment variables, EC2 metadata, profiles, or web identity. You can also pass credentials to `storage_options` using `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.

**Azure**:

> - az://\<container\>/\<path\>
> - adl://\<container\>/\<path\>
> - abfs://\<container\>/\<path\>

**GCS**:

> - gs://\<bucket\>/\<path\>
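
For example, a minimal sketch (with a hypothetical bucket name, and credentials already configured in the environment) of loading a table straight from an S3 URL:

```python
from deltalake import DeltaTable

# "my-bucket/path/to/table" is hypothetical; the other schemes listed
# above (s3a://, az://, gs://, ...) are used the same way.
dt = DeltaTable("s3://my-bucket/path/to/table")
```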

Alternatively, if you have a data catalog, you can load a table by referencing
a database and table name. Currently only AWS Glue is supported.

For the AWS Glue catalog, use AWS environment variables to authenticate.

```python
>>> from deltalake import DeltaTable
>>> from deltalake import DataCatalog
>>> database_name = "simple_database"
>>> table_name = "simple_table"
>>> data_catalog = DataCatalog.AWS
>>> dt = DeltaTable.from_data_catalog(data_catalog=data_catalog, database_name=database_name, table_name=table_name)
```

customize the storage interface used for reading the bulk data.

`deltalake` will work with any storage compliant with `pyarrow.fs.FileSystem`; however, the root of the filesystem has to be adjusted to point at the root of the Delta table. We can achieve this by wrapping the custom filesystem in a `pyarrow.fs.SubTreeFileSystem`.

```python
import pyarrow.fs as fs
from deltalake import DeltaTable

path = "<path/to/table>"
# wrap the custom filesystem so its root is the root of the Delta table
filesystem = fs.SubTreeFileSystem(path, fs.LocalFileSystem())

dt = DeltaTable(path)
ds = dt.to_pyarrow_dataset(filesystem=filesystem)
```

When using the pyarrow factory method for file systems, the normalized
path is provided on creation. In the case of S3, this would look something
like:

```python
import pyarrow.fs as fs
from deltalake import DeltaTable

table_uri = "s3://<bucket>/<path>"
# FileSystem.from_uri returns the filesystem plus the normalized path within it
raw_fs, normalized_path = fs.FileSystem.from_uri(table_uri)
filesystem = fs.SubTreeFileSystem(normalized_path, raw_fs)

dt = DeltaTable(table_uri)
ds = dt.to_pyarrow_dataset(filesystem=filesystem)
```

To load previous table states, you can provide the version number you
wish to load:

```python
>>> dt = DeltaTable("../rust/tests/data/simple_table", version=2)
```

Once you've loaded a table, you can also change versions using either a
version number or datetime string:

```python
>>> dt.load_version(1)
>>> dt.load_with_datetime("2021-11-04 00:05:23.283+00:00")
```
23 changes: 13 additions & 10 deletions docs/usage/writing/writing-to-s3-with-locking-provider.md

Delta Lake guarantees [ACID transactions](../../how-delta-lake-works/delta-lake-acid-transactions.md) when writing data. This is done by default when writing to all supported object stores except AWS S3. (Some S3 clients, like CloudFlare R2 or MinIO, may enable concurrent writing without a locking provider; refer to [this section](#enabling-concurrent-writes-for-alternative-clients) for more information.)

When writing to S3, delta-rs provides a locking mechanism to ensure that concurrent writes are safe. This is done by default when writing to S3, but you can opt out by setting the `AWS_S3_ALLOW_UNSAFE_RENAME` variable to `true`.

To enable safe concurrent writes to AWS S3, we must provide an external locking mechanism.

### DynamoDB

DynamoDB is currently the only locking provider available in delta-rs. To enable it, set `AWS_S3_LOCKING_PROVIDER` to `dynamodb`, either in `storage_options` or as an environment variable.

Additionally, you must create a DynamoDB table with the name `delta_log`
so that it can be automatically recognized by delta-rs. Alternatively, you can
use a table name of your choice, but you must set the `DELTA_DYNAMO_TABLE_NAME`
variable to match your chosen table name. The required schema for the DynamoDB
table is shown in the `aws dynamodb create-table` command below.

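As a minimal sketch (assuming pandas, and a hypothetical bucket and table name), a write with the DynamoDB locking provider enabled might look like this:

```python
import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({"id": [1, 2, 3]})

storage_options = {
    "AWS_S3_LOCKING_PROVIDER": "dynamodb",
    # Only needed if your lock table is not named `delta_log`:
    "DELTA_DYNAMO_TABLE_NAME": "custom_table_name",
}

write_deltalake(
    "s3a://my-bucket/my-table",  # hypothetical table URL
    df,
    storage_options=storage_options,
)
```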

This locking mechanism is compatible with the one used by Apache Spark. The `tablePath` property, denoting the root URL of the Delta table itself, is part of the primary key, and all writers intending to write to the same table must match this property precisely. In Spark, S3 URLs are prefixed with `s3a://`, and a table in delta-rs must be configured accordingly.

Note that `delta-rs` does not read credentials from your local `.aws/config` or `.aws/creds` file. Credentials can be provided via environment variables, EC2 metadata, profiles, or web identity. You can pass credentials to `storage_options` using `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.

The following command creates the necessary DynamoDB table from the AWS CLI:

```sh
aws dynamodb create-table \
    --table-name delta_log \
    --attribute-definitions AttributeName=tablePath,AttributeType=S AttributeName=fileName,AttributeType=S \
    --key-schema AttributeName=tablePath,KeyType=HASH AttributeName=fileName,KeyType=RANGE \
    --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5
```

You can find additional information in the [Delta Lake documentation](https://docs.delta.io/latest/delta-storage.html#multi-cluster-setup), which also includes recommendations on configuring a time-to-live (TTL) for the table to avoid growing the table indefinitely.

### Enable unsafe writes in S3 (opt-in)

If for some reason you don't want to use DynamoDB as your locking mechanism, you can
choose to set the `AWS_S3_ALLOW_UNSAFE_RENAME` variable to `true` in order to enable unsafe S3 writes.
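
As a short sketch (hypothetical bucket; `df` being a pandas DataFrame as in the earlier example), opting into unsafe writes looks like:

```python
from deltalake import write_deltalake

# Opt out of locking entirely; concurrent writers can corrupt the table.
storage_options = {"AWS_S3_ALLOW_UNSAFE_RENAME": "true"}

write_deltalake("s3a://my-bucket/my-table", df, storage_options=storage_options)
```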

### Required permissions

You need permissions to get, put, and delete objects in the S3 bucket you're storing your data in. Note that you must be allowed to delete objects even if you're only appending to the Delta table, because temporary files in the log folder are deleted after use.

In AWS, those would be the required permissions:

- `s3:GetObject`
- `s3:PutObject`
- `s3:DeleteObject`

When using the DynamoDB locking provider, the writer additionally needs read and write access to the lock table.
### Enabling concurrent writes for alternative clients

Some S3 clients, such as CloudFlare R2, support conditional copy headers that make concurrent writes safe without a locking provider. For CloudFlare R2, this can be enabled by passing the `copy_if_not_exists` option in `storage_options`:

```python
storage_options = {
    "copy_if_not_exists": "header: cf-copy-destination-if-none-match: *",
}
```

Something similar can be done with MinIO, but the header to pass should be verified
in the MinIO documentation.