
Answer "Why use DeltaLake?" question in the docs #2996

Closed
braaannigan opened this issue Nov 15, 2024 · 1 comment · Fixed by #3017
Labels
enhancement New feature or request

Comments

@braaannigan (Contributor)

Description

The docs should open by explaining the basic use cases for a lakehouse, and for delta-rs in particular. At present we have:

This is the documentation for the native Rust/Python implementation of Delta Lake. It is based on the delta-rs Rust library and requires no Spark or JVM dependencies. For the PySpark implementation, see delta-spark instead.

This module provides the capability to read, write, and manage Delta Lake tables with Python or Rust without Spark or Java. It uses Apache Arrow under the hood, so is compatible with other Arrow-native or integrated libraries such as pandas, DuckDB, and Polars.

This assumes prior knowledge of Delta Lake and, indeed, of what a lakehouse is.

Proposed approach

The docs should open with a succinct paragraph that explains what deltalake is in a way that is understandable to anyone. Polars got a lot of feedback on their intro being too technical and ended up (after a lot of thought) with this:
Polars is an open-source library for data manipulation, known for being one of the fastest data processing solutions on a single machine. It features a well-structured, typed API that is both expressive and easy to use.

As a first draft I propose these as the opening paragraphs for deltalake:

deltalake is an open-source library for managing tabular datasets that evolve over time. With deltalake you can add, delete, or overwrite rows in a dataset as new data arrives, and even time travel back to previous versions of a dataset. deltalake can be used to manage data stored on a local file system or in the cloud, and it integrates with data manipulation libraries such as pandas, Polars, DuckDB, and DataFusion.

deltalake is an example of a lakehouse approach to managing data storage. With the lakehouse approach you manage your datasets through a DeltaTable object, and deltalake manages the underlying files. With a DeltaTable your data is stored in Parquet files, while deltalake stores metadata about the DeltaTable in a set of JSON files called a transaction log.

deltalake is a Rust-based re-implementation of the Delta Lake lakehouse protocol developed by Databricks. The deltalake library has APIs in Rust and Python, and its implementation has no dependencies on Java, Spark, or Databricks.

braaannigan added the enhancement label on Nov 15, 2024
@ion-elgreco (Collaborator)

@braaannigan feel free to open a PR to add this
