feat: create benchmarks for merge (delta-io#1857)
# Description
Implements benchmarks that are similar to Spark's Delta benchmarks.

This gives us a standard benchmark for measuring improvements to merge.
Some pieces can later be factored out to build a framework for
benchmarking Delta workflows.
Blajda authored Nov 20, 2023
1 parent 8a66343 commit 2c8c0ec
Showing 3 changed files with 748 additions and 0 deletions.
46 changes: 46 additions & 0 deletions crates/benchmarks/Cargo.toml
@@ -0,0 +1,46 @@
[package]
name = "delta-benchmarks"
version = "0.0.1"
authors = ["David Blajda <db@davidblajda.com>"]
homepage = "https://github.com/delta-io/delta.rs"
license = "Apache-2.0"
keywords = ["deltalake", "delta", "datalake"]
description = "Delta-rs Benchmarks"
edition = "2021"

[dependencies]
clap = { version = "4", features = [ "derive" ] }
chrono = { version = "0.4.31", default-features = false, features = ["clock"] }
tokio = { version = "1", features = ["fs", "macros", "rt", "io-util"] }
env_logger = "0"

# arrow
arrow = { workspace = true }
arrow-array = { workspace = true }
arrow-buffer = { workspace = true }
arrow-cast = { workspace = true }
arrow-ord = { workspace = true }
arrow-row = { workspace = true }
arrow-schema = { workspace = true, features = ["serde"] }
arrow-select = { workspace = true }
parquet = { workspace = true, features = [
"async",
"object_store",
] }

# serde
serde = { workspace = true, features = ["derive"] }
serde_json = { workspace = true }

# datafusion
datafusion = { workspace = true }
datafusion-expr = { workspace = true }
datafusion-common = { workspace = true }
datafusion-proto = { workspace = true }
datafusion-sql = { workspace = true }
datafusion-physical-expr = { workspace = true }

[dependencies.deltalake-core]
path = "../deltalake-core"
version = "0"
features = ["datafusion"]
55 changes: 55 additions & 0 deletions crates/benchmarks/README.md
@@ -0,0 +1,55 @@
# Merge
The merge benchmarks are similar to the ones used by [Delta Spark](https://github.com/delta-io/delta/pull/1835).


## Dataset

Databricks maintains a public S3 bucket of the TPC-DS dataset at various scale factors. The bucket is requester-pays, so requesters must pay to download the dataset. Below is an example of how to list the `web_returns` files for the 1 GB scale factor:

```
aws s3api list-objects --bucket devrel-delta-datasets --request-payer requester --prefix tpcds-2.13/tpcds_sf1_parquet/web_returns/
```

You can generate the TPC-DS dataset yourself by downloading and compiling [the generator](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp).
You may need to add `-fcommon` to the CFLAGS to compile on newer versions of GCC.

## Commands
These commands can be executed from the root of the benchmark crate. Some commands depend on the TPC-DS dataset being present.

### Convert
Converts a TPC-DS `web_returns` CSV file into a Delta table.
Assumes the dataset is pipe-delimited and that records do not have a trailing delimiter.

```
cargo run --release --bin merge -- convert data/tpcds/web_returns.dat data/web_returns
```
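
The delimiter assumption can be illustrated with a small sketch. This is not the converter's actual code; `split_record` is a hypothetical helper that only shows how a pipe-delimited record without a trailing delimiter maps to fields, and why a trailing `|` would produce an extra empty column.

```rust
/// Split one pipe-delimited record into its fields.
/// Hypothetical helper for illustration; not part of the benchmark crate.
/// Line endings are stripped first, then the record is split on '|'.
/// Empty fields (as in "a||c") are kept as empty strings.
fn split_record(line: &str) -> Vec<&str> {
    line.trim_end_matches(|c: char| c == '\r' || c == '\n')
        .split('|')
        .collect()
}

/// Number of columns a record would yield under the same rules.
fn column_count(line: &str) -> usize {
    split_record(line).len()
}
```

Under these rules `"1|2||4"` yields four fields, while `"a|b|"` yields three because the trailing delimiter adds an empty final field. That extra column is why the input is expected not to end each record with `|`.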

### Standard
Execute the standard merge bench suite.
Results can be saved to a delta table for further analysis.
This table has the following schema:

- `group_id`: Groups all tests executed as part of one invocation. Defaults to the timestamp of execution.
- `name`: The name of the benchmark that was executed.
- `sample`: The iteration number for a given benchmark name.
- `duration_ms`: How long the benchmark took, in milliseconds.
- `data`: Free-form field for any additional data.

```
cargo run --release --bin merge -- standard data/web_returns 1 data/merge_results
```
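
As a rough sketch of how rows with this schema could be produced, the snippet below times a closure several times and records one row per iteration. The column types, the zero-based `sample` numbering, and the names `BenchResult` and `run_samples` are assumptions made for illustration; the crate's real implementation may differ.

```rust
use std::time::Instant;

/// One row of the results table. Field names follow the schema above;
/// the Rust types are assumed, not taken from the crate.
#[derive(Debug, Clone, PartialEq)]
struct BenchResult {
    group_id: String,
    name: String,
    sample: u32,
    duration_ms: u64,
    data: String,
}

/// Run `work` `samples` times and record one row per iteration.
/// `sample` is numbered from 0 here (an assumption).
fn run_samples<F: FnMut()>(
    group_id: &str,
    name: &str,
    samples: u32,
    mut work: F,
) -> Vec<BenchResult> {
    let mut rows = Vec::new();
    for sample in 0..samples {
        let start = Instant::now();
        work();
        let duration_ms = start.elapsed().as_millis() as u64;
        rows.push(BenchResult {
            group_id: group_id.to_string(),
            name: name.to_string(),
            sample,
            duration_ms,
            data: String::new(), // free-form field, left empty in this sketch
        });
    }
    rows
}
```

Keeping one row per sample, rather than a pre-aggregated mean, leaves the choice of statistic to whoever analyses the table later.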

### Compare
Compare the results of two different runs.
Provide the Delta table path and the `group_id` of each run to obtain the speedup for each test case.

```
cargo run --release --bin merge -- compare data/benchmarks/ 1698636172801 data/benchmarks/ 1699759539902
```
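
The README does not define how the speedup is computed. The sketch below assumes it is the mean duration of the first run divided by the mean duration of the second run, per benchmark name, so values above 1.0 mean the second run was faster. The function names and the benchmark names in the usage note are hypothetical.

```rust
use std::collections::BTreeMap;

/// Mean duration in milliseconds per benchmark name,
/// computed from (name, duration_ms) samples.
fn mean_by_name(samples: &[(&str, u64)]) -> BTreeMap<String, f64> {
    let mut acc: BTreeMap<String, (u64, u64)> = BTreeMap::new();
    for &(name, ms) in samples {
        let entry = acc.entry(name.to_string()).or_insert((0, 0));
        entry.0 += ms; // running sum of durations
        entry.1 += 1; // number of samples seen
    }
    acc.into_iter()
        .map(|(name, (sum, n))| (name, sum as f64 / n as f64))
        .collect()
}

/// Assumed definition: speedup = mean(before) / mean(after).
/// Only names present in both runs are reported; a zero mean in the
/// second run is skipped to avoid dividing by zero.
fn speedups(before: &[(&str, u64)], after: &[(&str, u64)]) -> BTreeMap<String, f64> {
    let before_means = mean_by_name(before);
    let after_means = mean_by_name(after);
    let mut out = BTreeMap::new();
    for (name, before_ms) in before_means {
        if let Some(&after_ms) = after_means.get(&name) {
            if after_ms > 0.0 {
                out.insert(name, before_ms / after_ms);
            }
        }
    }
    out
}
```

For example, samples of 200 ms and 400 ms for a case named `upsert` in the first run and 150 ms in the second give a speedup of 2.0 under this definition.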

### Show
Show all benchmark results from a Delta table.

```
cargo run --release --bin merge -- show data/benchmark
```