docs: explain the value of deltalake on first page of docs #3017

Merged · 5 commits · Nov 22, 2024

Changes from all commits
78 changes: 67 additions & 11 deletions docs/index.md
@@ -1,22 +1,78 @@
# The deltalake package
`deltalake` is an open source library that makes working with tabular datasets easier, more robust and more performant. With `deltalake` you can add, remove or update rows in a dataset as new data arrives. You can time travel back to earlier versions of a dataset. You can optimize dataset storage by compacting many small files into fewer large files.

`deltalake` can be used to manage data stored on a local file system or in the cloud. Under the hood it uses [Apache Arrow](https://arrow.apache.org/), so it integrates with Arrow-native data manipulation libraries such as [pandas](https://pandas.pydata.org/), [Polars](https://www.pola.rs/), [DuckDB](https://duckdb.org/) and DataFusion.

`deltalake` uses a lakehouse framework for managing datasets. With this lakehouse approach you manage your datasets with a `DeltaTable` object and `deltalake` takes care of the underlying files. Within a `DeltaTable` your data is stored in high-performance Parquet files, while metadata is stored in a set of JSON files called the transaction log.

`deltalake` is a Rust-based re-implementation of the [Delta Lake](https://delta.io/) protocol originally developed at Databricks. The `deltalake` library has APIs in Rust and Python and has no dependencies on Java, Spark or Databricks. For the PySpark implementation, see [delta-spark](https://docs.delta.io/latest/api/python/spark/index.html) instead.

## Important terminology

* `deltalake` refers to the Rust or Python API of delta-rs (no Spark dependency)
* "Delta Spark" refers to the Scala implementation of the Delta Lake transaction log protocol. It depends on Spark and Java.

## Why implement the Delta Lake transaction log protocol in Rust?

Delta Spark depends on Java and Spark, which is fine for many use cases, but not all Delta Lake users want to depend on these libraries. `deltalake` lets you manage your datasets with a Delta Lake approach without any Java or Spark dependencies, so it can be used from Rust, Python and other native projects where a JVM is not an option.

A `DeltaTable` on disk is simply a directory that stores metadata in JSON files and data in Parquet files.

## Quick start

You can install `deltalake` in Python with `pip`:
```bash
pip install deltalake
```
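The examples below also use pandas, Polars, DuckDB and DataFusion as client libraries. If you want to run all of them, one way to get the extra packages (assuming a recent Python environment) is:
```bash
pip install pandas polars duckdb datafusion
```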
We create a Pandas `DataFrame` and write it to a `DeltaTable`:
```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["Aadhya", "Bob", "Chen"],
    }
)

write_deltalake(
    table_or_uri="delta_table_dir",
    data=df,
)
```
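As noted above, the table is just a directory on disk: the data is stored in Parquet files and the transaction log lives under `_delta_log/`. A quick way to see this (the exact Parquet file names will differ on your machine):
```python
import os

# The table directory holds Parquet data files plus the _delta_log/
# directory that contains the JSON transaction log.
for name in sorted(os.listdir("delta_table_dir")):
    print(name)
# _delta_log
# part-00001-<uuid>-c000.snappy.parquet  (name will vary)
```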
We create a `DeltaTable` object that holds the metadata for the Delta table:
```python
dt = DeltaTable("delta_table_dir")
```
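The `DeltaTable` object reads the transaction log rather than the data, so inspecting the table is cheap. A few of the inspection methods (see the API reference for the full list):
```python
print(dt.version())   # current table version, e.g. 0
print(dt.files())     # Parquet files that make up this version
print(dt.schema())    # column names and types
print(dt.history())   # commit history recorded in the transaction log
```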
We load the data back into a Pandas `DataFrame` with the `to_pandas` method on the `DeltaTable`:
```python
new_df = dt.to_pandas()
```
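If you only need part of the table, `to_pandas` can also take a column selection so less data is read (a minimal sketch; see the API reference for the full set of arguments):
```python
# Read only the "id" column rather than the whole table.
ids_df = dt.to_pandas(columns=["id"])
```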

Or we can load the data into a Polars `DataFrame` with `pl.read_delta`:
```python
import polars as pl
new_df = pl.read_delta("delta_table_dir")
```
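Polars can also scan the table lazily, so filters and column selections are pushed down before any data is materialized (this assumes a recent Polars version that provides `pl.scan_delta`):
```python
import polars as pl

# Lazily scan the Delta table and only materialize the matching rows.
lazy_df = pl.scan_delta("delta_table_dir").filter(pl.col("id") > 1)
print(lazy_df.collect())
```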

Or we can load the data with DuckDB:
```python
import duckdb
duckdb.query("SELECT * FROM delta_scan('./delta_table_dir')")
```
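`delta_scan` is provided by DuckDB's `delta` extension; recent DuckDB releases can load it automatically, otherwise run `INSTALL delta; LOAD delta;` first. The call above builds a relation; to materialize the result, for example as a Pandas `DataFrame`:
```python
import duckdb

# Materialize the query result as a pandas DataFrame.
result_df = duckdb.query("SELECT * FROM delta_scan('./delta_table_dir')").df()
print(result_df)
```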

Or we can load the data with DataFusion:
```python
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_dataset("my_delta_table", dt.to_pyarrow_dataset())
ctx.sql("select * from my_delta_table")
```
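The operations described at the top of this page, such as appending new rows and time travelling back to an earlier version, only take a couple of lines. A minimal sketch, assuming the table was written as in the Quick start above:
```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Append new rows as a second commit in the transaction log.
new_rows = pd.DataFrame({"id": [4], "name": ["Dana"]})
write_deltalake("delta_table_dir", new_rows, mode="append")

# Time travel: load the table as it was before the append (version 0).
dt_v0 = DeltaTable("delta_table_dir", version=0)
print(dt_v0.to_pandas())  # only the original three rows
```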


## Contributing
