docs(python): update docs (#1155)
# Description
The description of the main changes of your pull request

# Related Issue(s)

- closes #715
- closes #373


# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: Robert Pack <42610831+roeap@users.noreply.github.com>
wjones127 and roeap authored Mar 3, 2023
1 parent 8dcf46e commit 88bdb55
Showing 6 changed files with 206 additions and 57 deletions.
64 changes: 64 additions & 0 deletions python/CONTRIBUTING.md
@@ -0,0 +1,64 @@
# Contributing to Python deltalake package

## Workflow

Most of the workflow is based on the `Makefile` and the `maturin` CLI tool.

#### Set up your local environment with virtualenv

```bash
$ make setup-venv
```

#### Activate it
```bash
$ source ./venv/bin/activate
```

#### Ready to develop with maturin

[maturin](https://github.com/PyO3/maturin) is used to build the Python package.
Install delta-rs in the current virtualenv:

```bash
$ make develop
```

Then, list all the available tasks

```bash
$ make help
```

Format:

```bash
make format
```

Check:

```bash
make check-python
```

Unit test:

```bash
make unit-test
```

## Release process

1. Make a new PR to update the version in pyproject.toml.
2. Once merged, push a tag of the format `python-vX.Y.Z`. This will trigger CI
to create and publish release artifacts.
3. In GitHub, create a new release based on the new tag. For release notes,
use the generator as a starting point, but please revise them for brevity.
Remove anything that is dev-facing only (chores), and bring all important
changes to the top, leaving less important changes (such as dependabot
updates) at the bottom.
4. Once the artifacts are showing up in PyPI, announce the release in the delta-rs
Slack channel. Be sure to give a shout-out to the new contributors.
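
The tag format in step 2 can be checked before pushing. The following is a minimal sketch, not part of the repository's tooling: the helper name `parse_release_tag` is hypothetical, and it assumes `X`, `Y`, and `Z` are plain non-negative integers with no pre-release suffix.

```python
import re
from typing import Tuple

# Assumption: the tag is exactly "python-v" followed by three dot-separated
# integers. Pre-release suffixes (e.g. "-rc1") are deliberately rejected.
TAG_PATTERN = re.compile(r"python-v(\d+)\.(\d+)\.(\d+)")


def parse_release_tag(tag: str) -> Tuple[int, int, int]:
    """Return (major, minor, patch) for a tag such as 'python-v0.8.1'."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"tag {tag!r} does not match python-vX.Y.Z")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


print(parse_release_tag("python-v0.8.1"))  # (0, 8, 1)
```

A tag that fails this check would presumably not match the CI trigger either, so validating locally avoids pushing a tag that does nothing.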


99 changes: 64 additions & 35 deletions python/README.md
@@ -1,5 +1,4 @@
Deltalake-python
================
# Deltalake-python

[![PyPI](https://img.shields.io/pypi/v/deltalake.svg?style=flat-square)](https://pypi.org/project/deltalake/)
[![userdoc](https://img.shields.io/badge/docs-user-blue)](https://delta-io.github.io/delta-rs/python/)
@@ -10,8 +9,22 @@ Native [Delta Lake](https://delta.io/) Python binding based on
[Pandas](https://pandas.pydata.org/) integration.


Installation
------------
## Example

```python
from deltalake import DeltaTable
dt = DeltaTable("../rust/tests/data/delta-0.2.0")
dt.version()
3
dt.files()
['part-00000-cb6b150b-30b8-4662-ad28-ff32ddab96d2-c000.snappy.parquet',
'part-00000-7c2deba3-1994-4fb8-bc07-d46c948aa415-c000.snappy.parquet',
'part-00001-c373a5bd-85f0-4758-815e-7eb62007a15c-c000.snappy.parquet']
```

See the [user guide](https://delta-io.github.io/delta-rs/python/usage.html) for more examples.

## Installation

```bash
pip install deltalake
@@ -22,49 +35,65 @@ object store communication. Please file a GitHub issue to request a critical
OpenSSL upgrade.


Develop
-------
## Build custom wheels

#### Setup your local environment with virtualenv
```bash
$ make setup-venv
```
Sometimes you may wish to build custom wheels. Maybe you want to try out some
unreleased features. Or maybe you want to tweak the optimization of the Rust code.

#### Activate it
```bash
$ source ./venv/bin/activate
To compile the package, you will need the Rust compiler and [maturin](https://github.com/PyO3/maturin):

```sh
curl https://sh.rustup.rs -sSf | sh -s
pip install maturin
```

#### Ready to develop with maturin
Then you can build wheels for your own platform like so:

[maturin](https://github.com/PyO3/maturin) is used to build the python package.
Install delta-rs in the current virtualenv
```sh
maturin build --release --out wheels
```

```bash
$ make develop
For a build that is optimized for the system you are on (but sacrificing portability):

```sh
RUSTFLAGS="-C target-cpu=native" maturin build --release --out wheels
```

Then, list all the available tasks
#### Cross compilation

```bash
$ make help
The above command only works for your current platform. To create wheels for other
platforms, you'll need to cross compile. Cross compilation requires installing
two additional components: to cross compile Rust code, you will need to install
the target with `rustup`; to cross compile the Python bindings, you will need
to install `ziglang`.

The following example is for manylinux2014. Other targets will require different
Rust `target` and Python `compatibility` tags.

```sh
rustup target add x86_64-unknown-linux-gnu
pip install ziglang
```

Build manylinux wheels
----------------------
Then you can build the wheel with:

```bash
docker run -e PKG_CONFIG_PATH=/usr/local/lib64/pkgconfig -it -v `pwd`:/io apache/arrow-dev:amd64-centos-6.10-python-manylinux2010 bash
curl https://sh.rustup.rs -sSf | sh -s -- -y
source $HOME/.cargo/env
rustup default stable
cargo install --git https://github.com/PyO3/maturin.git --rev 98636cea89c328b3eba4ebb548124f75c8018200 maturin
cd /io/python
export PATH=/opt/python/cp37-cp37m/bin:/opt/python/cp38-cp38/bin:$PATH
maturin publish -b pyo3 --target x86_64-unknown-linux-gnu --no-sdist
```sh
maturin build --release --zig \
--target x86_64-unknown-linux-gnu \
--compatibility manylinux2014 \
--out wheels
```

#### PyPI release
If you expect to run only on more modern systems, you can pass a newer `target-cpu`
flag to Rust and use a newer compatibility tag for Linux. For example, here
we target CPUs that are Haswell (2013) or newer and Linux systems with
glibc version 2.24 or later:

```sh
RUSTFLAGS="-C target-cpu=haswell" maturin build --release --zig \
--target x86_64-unknown-linux-gnu \
--compatibility manylinux_2_24 \
--out wheels
```

Publish a new GitHub release with name and tag version set to `python-vx.y.z`.
This will trigger our automated release pipeline.
See the note about `RUSTFLAGS` in [the arrow-rs readme](https://github.com/apache/arrow-rs/blob/master/arrow/README.md#performance-tips).
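
When the same build is run for several targets, it can help to assemble the command in one place. The sketch below only builds the argument list and environment from the flags shown above (`--release`, `--zig`, `--target`, `--compatibility`, `--out`, and `RUSTFLAGS`). The function name `maturin_build_command` and its parameters are hypothetical and are not part of maturin or this repository.

```python
import os
from typing import Dict, List, Optional, Tuple


def maturin_build_command(
    target: Optional[str] = None,
    compatibility: Optional[str] = None,
    target_cpu: Optional[str] = None,
    out_dir: str = "wheels",
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for a release wheel build."""
    argv = ["maturin", "build", "--release"]
    if target is not None:
        # Cross compilation as described above goes through zig.
        argv += ["--zig", "--target", target]
    if compatibility is not None:
        argv += ["--compatibility", compatibility]
    argv += ["--out", out_dir]

    env = dict(os.environ)
    if target_cpu is not None:
        # Note: this replaces any RUSTFLAGS already set in the environment.
        env["RUSTFLAGS"] = f"-C target-cpu={target_cpu}"
    return argv, env


argv, env = maturin_build_command(
    target="x86_64-unknown-linux-gnu",
    compatibility="manylinux2014",
)
print(" ".join(argv))
# To actually run the build: subprocess.run(argv, env=env, check=True)
```

Returning the command instead of running it keeps the sketch usable without maturin installed and makes the chosen flags easy to inspect.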
18 changes: 9 additions & 9 deletions python/deltalake/writer.py
@@ -85,15 +85,15 @@ def write_deltalake(
configuration: Optional[Mapping[str, Optional[str]]] = None,
overwrite_schema: bool = False,
storage_options: Optional[Dict[str, str]] = None,
partitions_filters: Optional[List[Tuple[str, str, Any]]] = None,
partition_filters: Optional[List[Tuple[str, str, Any]]] = None,
) -> None:
"""Write to a Delta Lake table (Experimental)
"""Write to a Delta Lake table
If the table does not already exist, it will be created.
This function only supports protocol version 1 currently. If an attempting
to write to an existing table with a higher min_writer_version, this
function will throw DeltaTableProtocolError.
This function only supports writer protocol version 2 currently. When
attempting to write to an existing table with a higher min_writer_version,
this function will throw DeltaTableProtocolError.
Note that this function does NOT register this table in a data catalog.
@@ -133,7 +133,7 @@ def write_deltalake(
:param configuration: A map containing configuration options for the metadata action.
:param overwrite_schema: If True, allows updating the schema of the table.
:param storage_options: options passed to the native delta filesystem. Unused if 'filesystem' is defined.
:param partitions_filters: the partition filters that will be used for partition overwrite.
:param partition_filters: the partition filters that will be used for partition overwrite.
"""
if _has_pandas and isinstance(data, pd.DataFrame):
if schema is not None:
@@ -234,7 +234,7 @@ def check_data_is_aligned_with_partition_filtering(
if table is None:
return
existed_partitions = table._table.get_active_partitions()
allowed_partitions = table._table.get_active_partitions(partitions_filters)
allowed_partitions = table._table.get_active_partitions(partition_filters)
for column_index, column_name in enumerate(batch.schema.names):
if column_name in table.metadata().partition_columns:
for value in batch.column(column_index).unique():
@@ -254,7 +254,7 @@ def check_data_is_aligned_with_partition_filtering(
def validate_batch(batch: pa.RecordBatch) -> pa.RecordBatch:
checker.check_batch(batch)

if mode == "overwrite" and partitions_filters:
if mode == "overwrite" and partition_filters:
check_data_is_aligned_with_partition_filtering(batch)

return batch
@@ -308,7 +308,7 @@ def validate_batch(batch: pa.RecordBatch) -> pa.RecordBatch:
mode,
partition_by or [],
schema,
partitions_filters,
partition_filters,
)


9 changes: 6 additions & 3 deletions python/docs/source/index.rst
@@ -1,7 +1,10 @@
Python bindings documentation of delta-rs
Python deltalake package
=========================================================

This is the documentation of the Python bindings of delta-rs, ``deltalake``.
This is the documentation for the native Python implementation of deltalake. It
is based on the delta-rs Rust library and requires no Spark or JVM dependencies.
For the PySpark implementation, see `delta-spark`_ instead.

This module provides the capability to read, write, and manage `Delta Lake`_
tables from Python without Spark or Java. It uses `Apache Arrow`_ under the hood,
so it is compatible with other Arrow-native or integrated libraries such as
@@ -13,7 +16,7 @@ Pandas_, DuckDB_, and Polars_.
It is not yet as feature-complete as the PySpark implementation of Delta
Lake. If you encounter a bug, please let us know in our `GitHub repo`_.


.. _delta-spark: https://docs.delta.io/latest/api/python/index.html
.. _Delta Lake: https://delta.io/
.. _Apache Arrow: https://arrow.apache.org/
.. _Pandas: https://pandas.pydata.org/
61 changes: 57 additions & 4 deletions python/docs/source/usage.rst
@@ -220,6 +220,34 @@ To view the available history, use :meth:`DeltaTable.history`:
{'timestamp': 1587968586154, 'operation': 'WRITE', 'operationParameters': {'mode': 'ErrorIfExists', 'partitionBy': '[]'}, 'isBlindAppend': True}]
Current Add Actions
~~~~~~~~~~~~~~~~~~~

The active state for a delta table is determined by the Add actions, which
provide the list of files that are part of the table and metadata about them,
such as creation time, size, and statistics. You can get a data frame of
the add actions data using :meth:`DeltaTable.get_add_actions`:

.. code-block:: python

    >>> from deltalake import DeltaTable
    >>> dt = DeltaTable("../rust/tests/data/delta-0.8.0")
    >>> dt.get_add_actions(flatten=True).to_pandas()
       path                                               size_bytes  modification_time    data_change  num_records  null_count.value  min.value  max.value
    0  part-00000-c9b90f86-73e6-46c8-93ba-ff6bfaf892a...  440         2021-03-06 15:16:07  True         2            0                 0          2
    1  part-00000-04ec9591-0b73-459e-8d18-ba5711d6cbe...  440         2021-03-06 15:16:16  True         2            0                 2          4

This works even with past versions of the table:

.. code-block:: python

    >>> dt = DeltaTable("../rust/tests/data/delta-0.8.0", version=0)
    >>> dt.get_add_actions(flatten=True).to_pandas()
       path                                               size_bytes  modification_time    data_change  num_records  null_count.value  min.value  max.value
    0  part-00000-c9b90f86-73e6-46c8-93ba-ff6bfaf892a...  440         2021-03-06 15:16:07  True         2            0                 0          2
    1  part-00001-911a94a2-43f6-4acb-8620-5e68c265498...  445         2021-03-06 15:16:07  True         3            0                 2          4
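
The two outputs above differ because a later commit replaced one file with another. The idea can be sketched in plain Python as a replay of the transaction log. This is a simplified model for illustration only, not how ``deltalake`` is implemented: the function ``active_files``, the dictionary layout, and the file names are all hypothetical, and real Add actions carry metadata in addition to the path.

```python
from typing import Dict, List, Set


def active_files(commits: List[Dict[str, List[str]]], version: int) -> Set[str]:
    """Replay commits 0..version and return the files still in the table."""
    files: Set[str] = set()
    for commit in commits[: version + 1]:
        # Apply removals first, then additions, within each commit.
        files -= set(commit.get("remove", []))
        files |= set(commit.get("add", []))
    return files


# Hypothetical two-commit log: version 1 replaces b.parquet with c.parquet.
commits = [
    {"add": ["a.parquet", "b.parquet"]},
    {"remove": ["b.parquet"], "add": ["c.parquet"]},
]

print(sorted(active_files(commits, version=0)))  # ['a.parquet', 'b.parquet']
print(sorted(active_files(commits, version=1)))  # ['a.parquet', 'c.parquet']
```

Loading a table at an earlier version corresponds to stopping the replay early, which is why the list of add actions changes with the version.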
Querying Delta Tables
---------------------

@@ -378,10 +406,6 @@ Writing Delta Tables

.. py:currentmodule:: deltalake
.. warning::
The writer is currently *experimental*. Please use on test data first, not
on production data. Report any issues at https://github.com/delta-io/delta-rs/issues.

For overwrites and appends, use :py:func:`write_deltalake`. If the table does not
already exist, it will be created. The ``data`` parameter will accept a Pandas
DataFrame, a PyArrow Table, or an iterator of PyArrow Record Batches.
@@ -409,3 +433,32 @@ to append pass in ``mode='append'``:
:py:meth:`write_deltalake` will raise :py:exc:`ValueError` if the schema of
the data passed to it differs from the existing table's schema. If you wish to
alter the schema as part of an overwrite pass in ``overwrite_schema=True``.


Overwriting a partition
~~~~~~~~~~~~~~~~~~~~~~~

You can overwrite a specific partition by using ``mode="overwrite"`` together
with ``partition_filters``. This will remove all files within the matching
partition and insert your data as new files. This can only be done on one
partition at a time. All of the input data must belong to that partition or else
the method will raise an error.

.. code-block:: python

    >>> import pandas as pd
    >>> from deltalake import DeltaTable
    >>> from deltalake.writer import write_deltalake
    >>> df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'a', 'b']})
    >>> write_deltalake('path/to/table', df, partition_by=['y'])
    >>> table = DeltaTable('path/to/table')
    >>> df2 = pd.DataFrame({'x': [100], 'y': ['b']})
    >>> write_deltalake(table, df2, partition_filters=[('y', '=', 'b')], mode="overwrite")
    >>> table.to_pandas()
         x  y
    0    1  a
    1    2  a
    2  100  b

This method can also be used to insert a new partition if one doesn't already
exist, making this operation idempotent.
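
The semantics described above (replace everything in the matching partition, reject input rows from other partitions, and stay idempotent) can be modelled without any Delta table. The following is a rough in-memory sketch under stated assumptions, not the library's implementation: ``overwrite_partition`` is a hypothetical helper, rows are plain dictionaries, and only a single equality filter on one column is modelled.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def overwrite_partition(
    table: List[Row], new_rows: List[Row], column: str, value: Any
) -> List[Row]:
    """Replace every row where row[column] == value with new_rows."""
    # All input data must belong to the partition being overwritten.
    stray = [row for row in new_rows if row[column] != value]
    if stray:
        raise ValueError(f"rows outside partition {column}={value!r}: {stray}")
    # Keep rows from other partitions untouched, then add the new data.
    kept = [row for row in table if row[column] != value]
    return kept + new_rows


table = [{"x": 1, "y": "a"}, {"x": 2, "y": "a"}, {"x": 3, "y": "b"}]
new_b = [{"x": 100, "y": "b"}]

once = overwrite_partition(table, new_b, "y", "b")
twice = overwrite_partition(once, new_b, "y", "b")
print(once)           # [{'x': 1, 'y': 'a'}, {'x': 2, 'y': 'a'}, {'x': 100, 'y': 'b'}]
print(once == twice)  # True
```

Running the same overwrite twice yields the same rows, which is the sense in which the operation is idempotent. Overwriting a partition value that does not exist yet simply adds the new rows.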