
Answer "Why use DeltaLake?" question in the docs #2996

Closed
braaannigan opened this issue Nov 15, 2024 · 1 comment · Fixed by #3017
Labels
enhancement New feature or request

Comments

@braaannigan (Contributor)

Description

The docs should open by explaining the basic use cases for a lakehouse, and for delta-rs in particular. At present we have:

This is the documentation for the native Rust/Python implementation of Delta Lake. It is based on the delta-rs Rust library and requires no Spark or JVM dependencies. For the PySpark implementation, see delta-spark instead.

This module provides the capability to read, write, and manage Delta Lake tables with Python or Rust without Spark or Java. It uses Apache Arrow under the hood, so is compatible with other Arrow-native or integrated libraries such as pandas, DuckDB, and Polars.

This assumes prior knowledge of Delta Lake and, indeed, of what a lakehouse is.

Proposed approach

The docs should open with a succinct paragraph that explains what deltalake is in a way that is understandable to anyone. Polars got a lot of feedback on their intro being too technical and ended up (after a lot of thought) with this:
Polars is an open-source library for data manipulation, known for being one of the fastest data processing solutions on a single machine. It features a well-structured, typed API that is both expressive and easy to use.

As a first draft I propose these as the opening paragraphs for deltalake:

deltalake is an open-source library for managing tabular datasets that evolve over time. With deltalake you can add, delete, or overwrite rows in a dataset as new data arrives, and even time travel back to previous versions of a dataset. deltalake can be used to manage data stored on a local file system or in the cloud, and it integrates with data manipulation libraries such as pandas, Polars, DuckDB, and DataFusion.

deltalake is an example of a lakehouse approach to managing data storage. With the lakehouse approach you manage your datasets through a DeltaTable object, and deltalake manages the underlying files. With a DeltaTable your data is stored in Parquet files, while deltalake stores metadata about the DeltaTable in a set of JSON files called a transaction log.

deltalake is a Rust-based re-implementation of the Delta Lake lakehouse protocol developed by Databricks. The deltalake library has APIs in Rust and Python, and its implementation has no dependencies on Java, Spark, or Databricks.

braaannigan added the enhancement label on Nov 15, 2024
@ion-elgreco (Collaborator)

@braaannigan feel free to open a PR to add this
