Feature/benchmark config from env #15782

ctsk · 2025-04-20T16:26:09Z

Which issue does this PR close?

Closes benchmarks: Read SessionConfig from Environment #15684

Rationale for this change

When benchmarking, it is convenient to change individual DataFusion settings between runs and compare the results.

What changes are included in this PR?

This PR makes the benchmark executables read DataFusion session configuration from environment variables. It also changes the target_partitions and batch_size options to fall back on the DataFusion defaults when they are not set; this is necessary so that environment variables can take effect for those options when they are not passed on the command line.
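
The fallback described above amounts to a three-level precedence: an explicit command-line option wins, then the environment variable, then the default. A minimal sketch of that idea follows. `resolve_setting` is a hypothetical helper written for illustration, not a function from this PR or from DataFusion, and the default value used in `main` is only illustrative.

```rust
/// Hypothetical helper: pick a setting by precedence.
/// Command-line option > environment variable > built-in default.
fn resolve_setting(cli: Option<usize>, env: Option<&str>, default: usize) -> usize {
    if let Some(value) = cli {
        return value;
    }
    // In this sketch an unparsable environment value falls through to the default.
    env.and_then(|raw| raw.trim().parse::<usize>().ok())
        .unwrap_or(default)
}

fn main() {
    let default_batch_size = 8192; // illustrative default for this sketch
    // Read the environment variable; it may be unset.
    let env_value = std::env::var("DATAFUSION_EXECUTION_BATCH_SIZE").ok();
    // `None` stands for "option not given on the command line".
    let batch_size = resolve_setting(None, env_value.as_deref(), default_batch_size);
    println!("batch_size = {batch_size}");
}
```

Modelling the command-line option as an `Option` is what lets "not provided" be told apart from "provided with the default value", which is why the options can no longer carry a hard-coded default of their own.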

Are these changes tested?

No.

Are there any user-facing changes?

Only developers of datafusion are affected.

2010YOUY01 · 2025-04-21T04:09:08Z

Thanks, I think it's good to go after adding some simple docs for such configurations, in:
https://github.com/apache/datafusion/tree/main/benchmarks
and also the output of ./bench.sh help

I think it would also be great to print the configurations that were actually applied, for example:

export DATAFUSION_EXECUTION_TARGET_PARTITIONS=1
export DATAFUSION_EXECUTION_FOO=1 
./bench.sh run tpch

Terminal output:

Set datafusion.execution.target_partitions to 1 from the environment variable

... start running the benchmark

We can just open an issue for it now, and leave it as a follow-up.

Orchestrates running benchmarks against DataFusion checkouts

Usage:
./bench.sh data [benchmark] [query]
./bench.sh run [benchmark]
./bench.sh compare <branch1> <branch2>
./bench.sh venv

**********
Examples:
**********
# Create the datasets for all benchmarks in /Users/christian/MA/datafusion/benchmarks/data
./bench.sh data

# Run the 'tpch' benchmark on the datafusion checkout in /source/datafusion
DATAFUSION_DIR=/source/datafusion ./bench.sh run tpch

**********
* Commands
**********
data:            Generates or downloads data needed for benchmarking
run:             Runs the named benchmark
compare:         Compares results from benchmark runs
venv:            Creates new venv (unless already exists) and installs compare's requirements into it

**********
* Benchmarks
**********
all(default):           Data/Run/Compare for all benchmarks
tpch:                   TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), single parquet file per table, hash join
tpch_mem:               TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), query from memory
tpch10:                 TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), single parquet file per table, hash join
tpch_mem10:             TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), query from memory
cancellation:           How long cancelling a query takes
parquet:                Benchmark of parquet reader's filtering speed
sort:                   Benchmark of sorting speed
sort_tpch:              Benchmark of sorting speed for end-to-end sort queries on TPCH dataset
clickbench_1:           ClickBench queries against a single parquet file
clickbench_partitioned: ClickBench queries against a partitioned (100 files) parquet
clickbench_extended:    ClickBench "inspired" queries against a single parquet (DataFusion specific)
external_aggr:          External aggregation benchmark
h2o_small:              h2oai benchmark with small dataset (1e7 rows) for groupby, default file format is csv
h2o_medium:             h2oai benchmark with medium dataset (1e8 rows) for groupby, default file format is csv
h2o_big:                h2oai benchmark with large dataset (1e9 rows) for groupby, default file format is csv
h2o_small_join:         h2oai benchmark with small dataset (1e7 rows) for join, default file format is csv
h2o_medium_join:        h2oai benchmark with medium dataset (1e8 rows) for join, default file format is csv
h2o_big_join:           h2oai benchmark with large dataset (1e9 rows) for join, default file format is csv
imdb:                   Join Order Benchmark (JOB) using the IMDB dataset converted to parquet

**********
* Supported Configuration (Environment Variables)
**********
DATA_DIR            directory to store datasets
CARGO_COMMAND       command that runs the benchmark binary
DATAFUSION_DIR      directory to use (default /Users/christian/MA/datafusion/benchmarks/..)
RESULTS_NAME        folder where the benchmark files are stored
PREFER_HASH_JOIN    Prefer hash join algorithm (default true)
VENV_PATH           Python venv to use for compare and venv commands (default ./venv, override by <your-venv>/bin/activate)
DATAFUSION_*        Set the given datafusion configuration

ctsk · 2025-04-22T08:17:19Z

I can't see an easy way to check which environment variables were actually picked up, so I've opted to add some logging to ConfigOptions::from_env.
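
A rough sketch of what such logging could look like. It assumes that each configuration key maps to an environment variable by upper-casing the key and replacing dots with underscores, which is consistent with the `datafusion.execution.target_partitions` / `DATAFUSION_EXECUTION_TARGET_PARTITIONS` pair in the example above. The function names and the list of keys are hypothetical, and this is not the actual `ConfigOptions::from_env` code.

```rust
use std::collections::HashMap;

/// Map a config key to its environment variable name (assumed convention).
fn env_var_name(key: &str) -> String {
    key.to_uppercase().replace('.', "_")
}

/// For every known key, look up its variable in `env` and return one log line per hit.
fn describe_overrides(known_keys: &[&str], env: &HashMap<String, String>) -> Vec<String> {
    known_keys
        .iter()
        .filter_map(|key| {
            env.get(&env_var_name(key))
                .map(|value| format!("Set {key} to {value} from the environment variable"))
        })
        .collect()
}

fn main() {
    // Hypothetical subset of keys; a real implementation would enumerate all of them.
    let known_keys = [
        "datafusion.execution.target_partitions",
        "datafusion.execution.batch_size",
    ];
    // Snapshot of the process environment, skipping entries that are not valid Unicode.
    let env: HashMap<String, String> = std::env::vars_os()
        .filter_map(|(k, v)| Some((k.into_string().ok()?, v.into_string().ok()?)))
        .collect();
    for line in describe_overrides(&known_keys, &env) {
        println!("{line}");
    }
}
```

Because the sketch iterates over known keys rather than over the environment, a variable that matches no key, such as `DATAFUSION_EXECUTION_FOO`, produces no message here. Reporting unrecognized `DATAFUSION_*` variables would need a separate pass over the environment.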

* Read benchmark SessionConfig from env
* Set target partitions from env by default

  fix

* Set batch size from env by default
* Fix batch size option for tpch ci
* Log environment variable configuration
* Document benchmarking env variable config
* Add DATAFUSION_* env config to Error: unknown command: help [followed by the ./bench.sh usage text, identical to the output shown above]
* fmt

ctsk added 3 commits April 20, 2025 17:39

Read benchmark SessionConfig from env

b8accb7

Set target partitions from env by default

cfb7800

fix

Set batch size from env by default

beb3a0d

ctsk force-pushed the feature/benchmark-config-from-env branch from 571965f to beb3a0d on April 20, 2025 16:27

Fix batch size option for tpch ci

3e71b54

ctsk added 3 commits April 22, 2025 09:52

Log environment variable configuration

9615753

Document benchmarking env variable config

e80010c

github-actions bot added the common label Apr 22, 2025

fmt

e691aaa

2010YOUY01 approved these changes Apr 23, 2025

comphead merged commit 11088b9 into apache:main Apr 24, 2025
27 checks passed
