diff --git a/docs/index.md b/docs/index.md
index a6ac3271da..19ef941e60 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,22 +1,78 @@
-# The deltalake package
+`deltalake` is an open source library that makes working with tabular datasets easier, more robust and more performant. With `deltalake` you can add, remove or update rows in a dataset as new data arrives. You can time travel back to earlier versions of a dataset. You can compact a dataset's storage from many small files into fewer large files.
+
-This is the documentation for the native Rust/Python implementation of Delta Lake. It is based on the delta-rs Rust library and requires no Spark or JVM dependencies. For the PySpark implementation, see [delta-spark](https://docs.delta.io/latest/api/python/spark/index.html) instead.
+`deltalake` can be used to manage data stored on a local file system or in the cloud. It integrates with data manipulation libraries such as Pandas, Polars, DuckDB and DataFusion.
+
-This module provides the capability to read, write, and manage [Delta Lake](https://delta.io/) tables with Python or Rust without Spark or Java. It uses [Apache Arrow](https://arrow.apache.org/) under the hood, so is compatible with other Arrow-native or integrated libraries such as [pandas](https://pandas.pydata.org/), [DuckDB](https://duckdb.org/), and [Polars](https://www.pola.rs/).
+`deltalake` uses a lakehouse framework for managing datasets. With this lakehouse approach you manage your datasets through a `DeltaTable` object and `deltalake` takes care of the underlying files. Within a `DeltaTable`, your data is stored in high-performance Parquet files while metadata is stored in a set of JSON files called the transaction log.
+
-## Important terminology
+`deltalake` is a Rust-based re-implementation of the Delta Lake protocol originally developed at Databricks. The `deltalake` library has APIs in Rust and Python and has no dependencies on Java, Spark or Databricks.
+
-* "Rust deltalake" refers to the Rust API of delta-rs (no Spark dependency)
-* "Python deltalake" refers to the Python API of delta-rs (no Spark dependency)
-* "Delta Spark" refers to the Scala implementation of the Delta Lake transaction log protocol. This depends on Spark and Java.
-## Why implement the Delta Lake transaction log protocol in Rust and Scala?
+## Important terminology
+
-Delta Spark depends on Java and Spark, which is fine for many use cases, but not all Delta Lake users want to depend on these libraries. delta-rs allows using Delta Lake in Rust or other native projects when using a JVM is often not an option.
+* `deltalake` refers to the Rust or Python API of delta-rs
+* "Delta Spark" refers to the Scala implementation of the Delta Lake transaction log protocol. This depends on Spark and Java.
+
-Python deltalake lets you query Delta tables without depending on Java/Scala.
+## Why implement the Delta Lake transaction log protocol in Rust?
+
+Delta Spark depends on Java and Spark, which is fine for many use cases, but not all Delta Lake users want to depend on these libraries. `deltalake` allows you to manage your dataset using a Delta Lake approach without any Java or Spark dependencies.
+
+A `DeltaTable` on disk is simply a directory that stores metadata in JSON files and data in Parquet files.
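+
+To make this concrete, here is a minimal sketch that writes a tiny table and then walks the resulting directory; the directory name `layout_demo_dir` is just an illustrative placeholder:
+```python
+import os
+
+import pandas as pd
+from deltalake import write_deltalake
+
+# Write a small DataFrame to a new Delta table ("layout_demo_dir" is a placeholder path)
+write_deltalake("layout_demo_dir", pd.DataFrame({"id": [1, 2, 3]}))
+
+# Walk the table directory: the rows live in Parquet files at the top level,
+# and the transaction log lives in JSON files under _delta_log/
+for root, _dirs, files in os.walk("layout_demo_dir"):
+    for name in files:
+        print(os.path.join(root, name))
+```
+On a fresh table this prints a single Parquet data file plus the first commit file, `_delta_log/00000000000000000000.json`.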
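+
+The transaction log is also what makes the time travel mentioned above possible: every write adds a new commit, and any committed version can be read back later. Here is a minimal sketch, again with a placeholder directory name:
+```python
+import pandas as pd
+from deltalake import DeltaTable, write_deltalake
+
+# Version 0: the initial write ("versions_demo_dir" is a placeholder path)
+write_deltalake("versions_demo_dir", pd.DataFrame({"id": [1, 2, 3]}))
+
+# Version 1: replace the table contents entirely
+write_deltalake(
+    "versions_demo_dir",
+    pd.DataFrame({"id": [4, 5, 6]}),
+    mode="overwrite",
+)
+
+# Time travel: load the table as it looked at version 0
+dt = DeltaTable("versions_demo_dir", version=0)
+print(dt.to_pandas())  # the original rows: 1, 2, 3
+```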
+
+## Quick start
+
+You can install `deltalake` in Python with `pip`:
+```bash
+pip install deltalake
+```
+We create a Pandas `DataFrame` and write it to a Delta table:
+```python
+import pandas as pd
+from deltalake import DeltaTable, write_deltalake
+
+df = pd.DataFrame(
+    {
+        "id": [1, 2, 3],
+        "name": ["Aadhya", "Bob", "Chen"],
+    }
+)
+
+# Write the DataFrame to a new Delta table on local disk
+write_deltalake(
+    table_or_uri="delta_table_dir",
+    data=df,
+)
+```
+We create a `DeltaTable` object that holds the metadata for the Delta table:
+```python
+dt = DeltaTable("delta_table_dir")
+```
+We load the table back into a Pandas `DataFrame` with `to_pandas`:
+```python
+new_df = dt.to_pandas()
+```
+
+Or we can load the data into a Polars `DataFrame` with `pl.read_delta`:
+```python
+import polars as pl
+new_df = pl.read_delta("delta_table_dir")
+```
+
+Or we can load the data with DuckDB, whose `delta` extension provides the `delta_scan` table function:
+```python
+import duckdb
+duckdb.query("SELECT * FROM delta_scan('./delta_table_dir')")
+```
+
+Or we can load the data with DataFusion:
+```python
+from datafusion import SessionContext
+
+ctx = SessionContext()
+ctx.register_dataset("my_delta_table", dt.to_pyarrow_dataset())
+ctx.sql("SELECT * FROM my_delta_table")
+```
-Suppose you want to query a Delta table with pandas on your local machine. Python deltalake makes it easy to query the table with a simple `pip install` command - no need to install Java.
 
 ## Contributing