# feat: create benchmarks for merge (delta-io#1857)

## Description

Implements benchmarks similar to Spark's Delta benchmarks. This gives us a standard benchmark for measuring improvements to merge, and some pieces can be factored out to build a framework for benchmarking delta workflows.
Showing 3 changed files with 748 additions and 0 deletions.
The new crate's manifest (`Cargo.toml`):

```
[package]
name = "delta-benchmarks"
version = "0.0.1"
authors = ["David Blajda <db@davidblajda.com>"]
homepage = "https://github.com/delta-io/delta.rs"
license = "Apache-2.0"
keywords = ["deltalake", "delta", "datalake"]
description = "Delta-rs Benchmarks"
edition = "2021"

[dependencies]
clap = { version = "4", features = [ "derive" ] }
chrono = { version = "0.4.31", default-features = false, features = ["clock"] }
tokio = { version = "1", features = ["fs", "macros", "rt", "io-util"] }
env_logger = "0"

# arrow
arrow = { workspace = true }
arrow-array = { workspace = true }
arrow-buffer = { workspace = true }
arrow-cast = { workspace = true }
arrow-ord = { workspace = true }
arrow-row = { workspace = true }
arrow-schema = { workspace = true, features = ["serde"] }
arrow-select = { workspace = true }
parquet = { workspace = true, features = [
    "async",
    "object_store",
] }

# serde
serde = { workspace = true, features = ["derive"] }
serde_json = { workspace = true }

# datafusion
datafusion = { workspace = true }
datafusion-expr = { workspace = true }
datafusion-common = { workspace = true }
datafusion-proto = { workspace = true }
datafusion-sql = { workspace = true }
datafusion-physical-expr = { workspace = true }

[dependencies.deltalake-core]
path = "../deltalake-core"
version = "0"
features = ["datafusion"]
```
The merge benchmarks README:
# Merge

The merge benchmarks are similar to the ones used by [Delta Spark](https://github.com/delta-io/delta/pull/1835).

## Dataset

Databricks maintains a public S3 bucket of the TPC-DS dataset at various scale factors; requesters must pay to download it. Below is an example of how to list the 1 GB scale factor:
```
aws s3api list-objects --bucket devrel-delta-datasets --request-payer requester --prefix tpcds-2.13/tpcds_sf1_parquet/web_returns/
```
You can generate the TPC-DS dataset yourself by downloading and compiling [the generator](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp).
You may need to update the CFLAGS to include `-fcommon` to compile on newer versions of GCC.
## Commands

These commands can be executed from the root of the benchmark crate. Some commands depend on the TPC-DS dataset being present.
### Convert

Converts a TPC-DS web_returns CSV into a Delta table.
Assumes the dataset is pipe-delimited and records do not have a trailing delimiter.
```
cargo run --release --bin merge -- convert data/tpcds/web_returns.dat data/web_returns
```
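Because records carry no trailing delimiter, a plain split yields exactly one entry per column, with empty strings standing in for NULLs. A minimal sketch of that parsing step (the function name and NULL handling are illustrative, not the crate's actual conversion code):

```rust
// Sketch only: split one pipe-delimited web_returns record into nullable
// fields. With no trailing delimiter, `split('|')` produces exactly one
// entry per column; empty fields are treated as NULL.
fn split_record(line: &str) -> Vec<Option<&str>> {
    line.split('|')
        .map(|field| if field.is_empty() { None } else { Some(field) })
        .collect()
}

fn main() {
    // An abbreviated, made-up row: the empty third field becomes None.
    let fields = split_record("2450820|40617||4");
    println!("{fields:?}");
}
```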
### Standard

Execute the standard merge bench suite.
Results can be saved to a Delta table for further analysis.
This table has the following schema:

- `group_id`: groups all tests executed as part of this call; defaults to the timestamp of execution
- `name`: the benchmark name that was executed
- `sample`: the iteration number for a given benchmark name
- `duration_ms`: how long the benchmark took, in milliseconds
- `data`: free field to pack any additional data
```
cargo run --release --bin merge -- standard data/web_returns 1 data/merge_results
```
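For downstream analysis, one row of the results table described above can be modeled in Rust roughly as follows. This struct and helper are a sketch for working with saved results, not the crate's actual types:

```rust
// Illustrative shape of one row in the benchmark results table; the field
// names mirror the schema above, but this is not the crate's actual type.
#[derive(Debug, Clone, PartialEq)]
struct BenchResult {
    group_id: String,  // groups one invocation's tests; defaults to a timestamp
    name: String,      // benchmark name that was executed
    sample: u32,       // iteration number for this benchmark name
    duration_ms: u64,  // wall-clock duration in milliseconds
    data: String,      // free field for any additional data
}

// Mean duration across all samples of one benchmark name.
fn mean_duration_ms(results: &[BenchResult]) -> f64 {
    let total: u64 = results.iter().map(|r| r.duration_ms).sum();
    total as f64 / results.len() as f64
}

fn main() {
    // Made-up samples of a single benchmark.
    let runs = vec![
        BenchResult { group_id: "1698636172801".into(), name: "merge".into(), sample: 1, duration_ms: 100, data: String::new() },
        BenchResult { group_id: "1698636172801".into(), name: "merge".into(), sample: 2, duration_ms: 120, data: String::new() },
    ];
    println!("mean: {} ms", mean_duration_ms(&runs));
}
```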
### Compare

Compare the results of two different runs.
Provide the Delta table path and the `group_id` of each run to obtain the speedup for each test case.
```
cargo run --release --bin merge -- compare data/benchmarks/ 1698636172801 data/benchmarks/ 1699759539902
```
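The comparison amounts to joining the two runs by benchmark name and dividing the durations. A hypothetical sketch of that computation (the join key and speedup definition here are assumptions, not the crate's actual code):

```rust
use std::collections::HashMap;

// Hypothetical sketch of what `compare` computes: join two runs by benchmark
// name and report speedup = before_ms / after_ms for each test case, so a
// value above 1.0 means the second run was faster.
fn speedups(before: &HashMap<String, f64>, after: &HashMap<String, f64>) -> HashMap<String, f64> {
    before
        .iter()
        .filter_map(|(name, b)| after.get(name).map(|a| (name.clone(), b / a)))
        .collect()
}

fn main() {
    // Made-up test name and timings for illustration.
    let mut before = HashMap::new();
    let mut after = HashMap::new();
    before.insert("merge_upsert".to_string(), 2000.0);
    after.insert("merge_upsert".to_string(), 1000.0);
    println!("{:?}", speedups(&before, &after)); // 2x speedup
}
```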
### Show

Show all benchmark results from a Delta table.
```
cargo run --release --bin merge -- show data/benchmark
```
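A listing like this boils down to rendering each results row in fixed-width columns. A purely illustrative formatting helper (the column layout is an assumption, not the crate's actual output):

```rust
// Hypothetical helper for a `show`-style listing: format one results row.
// Column names mirror the results schema; widths are arbitrary choices.
fn format_row(group_id: &str, name: &str, sample: u32, duration_ms: u64) -> String {
    format!("{group_id:>15}  {name:<30} {sample:>6} {duration_ms:>10} ms")
}

fn main() {
    // Made-up values for illustration.
    println!("{}", format_row("1698636172801", "merge_upsert", 1, 1234));
}
```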