Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 1 addition & 18 deletions benchmarks/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -243,28 +243,11 @@ See the help for more details.
You can enable `mimalloc` or `snmalloc` (to use either the mimalloc or snmalloc allocator) as features by passing them in as `--features`. For example:

```shell
cargo run --release --features "mimalloc" --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096
```

The benchmark program also supports CSV and Parquet input file formats and a utility is provided to convert from `tbl`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the data is generated directly using tpchgen-cli now, no need to convert data with another pass

(generated by the `dbgen` utility) to CSV and Parquet.

```bash
cargo run --release --bin tpch -- convert --input ./data --output /mnt/tpch-parquet --format parquet
cargo run --release --features "mimalloc" --bin dfbench tpch --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096
```

Or if you want to verify and run all the queries in the benchmark, you can just run `cargo test`.

#### Sorted Conversion

The TPCH tables generated by the dbgen utility are sorted by their first column (their primary key for most tables, the `l_orderkey` column for the `lineitem` table.)

To preserve this sorted order information during conversion (useful for benchmarking execution on pre-sorted data) include the `--sort` flag:

```bash
cargo run --release --bin tpch -- convert --input ./data --output /mnt/tpch-sorted-parquet --format parquet --sort
```

### Comparing results between runs

Any `dfbench` execution with `-o <dir>` argument will produce a
Expand Down
98 changes: 51 additions & 47 deletions benchmarks/bench.sh
Original file line number Diff line number Diff line change
Expand Up @@ -189,8 +189,8 @@ main() {
echo "***************************"
case "$BENCHMARK" in
all)
data_tpch "1"
data_tpch "10"
data_tpch "1" "parquet"
data_tpch "10" "parquet"
data_h2o "SMALL"
data_h2o "MEDIUM"
data_h2o "BIG"
Expand All @@ -203,26 +203,22 @@ main() {
# nlj uses range() function, no data generation needed
;;
tpch)
data_tpch "1"
data_tpch "1" "parquet"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per @comphead 's suggestion, now only the format required is made, rather than always creating all three formats

;;
tpch_mem)
# same data as for tpch
data_tpch "1"
data_tpch "1" "parquet"
;;
tpch_csv)
# same data as for tpch
data_tpch "1"
data_tpch "1" "csv"
;;
tpch10)
data_tpch "10"
data_tpch "10" "parquet"
;;
tpch_mem10)
# same data as for tpch10
data_tpch "10"
data_tpch "10" "parquet"
;;
tpch_csv10)
# same data as for tpch10
data_tpch "10"
data_tpch "10" "csv"
;;
clickbench_1)
data_clickbench_1
Expand Down Expand Up @@ -297,19 +293,19 @@ main() {
;;
external_aggr)
# same data as for tpch
data_tpch "1"
data_tpch "1" "parquet"
;;
sort_tpch)
# same data as for tpch
data_tpch "1"
data_tpch "1" "parquet"
;;
sort_tpch10)
# same data as for tpch10
data_tpch "10"
data_tpch "10" "parquet"
;;
topk_tpch)
# same data as for tpch
data_tpch "1"
data_tpch "1" "parquet"
;;
nlj)
# nlj uses range() function, no data generation needed
Expand All @@ -320,7 +316,7 @@ main() {
echo "HJ benchmark does not require data generation"
;;
compile_profile)
data_tpch "1"
data_tpch "1" "parquet"
;;
*)
echo "Error: unknown benchmark '$BENCHMARK' for data generation"
Expand Down Expand Up @@ -537,7 +533,7 @@ main() {
# Creates TPCH data at a certain scale factor, if it doesn't already
# exist
#
# call like: data_tpch($scale_factor)
# call like: data_tpch($scale_factor, format)
#
# Creates data in $DATA_DIR/tpch_sf1 for scale factor 1
# Creates data in $DATA_DIR/tpch_sf10 for scale factor 10
Expand All @@ -548,20 +544,23 @@ data_tpch() {
echo "Internal error: Scale factor not specified"
exit 1
fi
FORMAT=$2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
FORMAT=$2
FORMAT=${2:-parquet}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand, that would default the argument to parquet -- what is the rationale for doing so?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #19035 (comment)
There are two calls of data_tpch there which do not pass the format.

https://github.com/apache/datafusion/pull/19035/files/907bce3e16352148eade3b7cf512091a9aab4232#diff-1769f5787dc11c8b1f1b48288cdf3c89d25a5b5cbc6be4740bfcc70a6313ba99R550 will print Creating tpch <EMPTY> dataset at Scale Factor, where <EMPTY> is an empty string.

And the third reason why I proposed parquet as default is:

Also @comphead pointed out on https://github.com/apache/datafusion/pull/19034#pullrequestreview-3526952491 that the bench.sh data tpch generated both csv and parquet files when it only really needs parquet.

This sounds like parquet is the needed format most of the time.

But data_h2o() uses CSV as a default format:
https://github.com/alamb/datafusion/blob/907bce3e16352148eade3b7cf512091a9aab4232/benchmarks/bench.sh#L853

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what you are saying now - there are several other calls to data_tpch that don't pass the format argument here

https://github.com/alamb/datafusion/blob/21a0237a1b96cf42bd96e93e6fc184a8e320138f/benchmarks/bench.sh#L298-L308

eg

                sort_tpch)
                    # same data as for tpch
                    data_tpch "1"
                    ;;
                sort_tpch10)
                    # same data as for tpch10
                    data_tpch "10"
                    ;;
                topk_tpch)
                    # same data as for tpch
                    data_tpch "1"
                    ;;
                nlj)

I will update those calls to explicitly pass in a format as I think that will make it clear what is going on

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in 9a5a3b0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻
Just FYI: There is a shorter syntax for this check too: FORMAT=${1:?Internal error: Format not specified}

if [ -z "$FORMAT" ] ; then
echo "Internal error: Format not specified"
exit 1
fi

TPCH_DIR="${DATA_DIR}/tpch_sf${SCALE_FACTOR}"
echo "Creating tpch dataset at Scale Factor ${SCALE_FACTOR} in ${TPCH_DIR}..."
echo "Creating tpch $FORMAT dataset at Scale Factor ${SCALE_FACTOR} in ${TPCH_DIR}..."

# Ensure the target data directory exists
mkdir -p "${TPCH_DIR}"

# Create 'tbl' (CSV format) data into $DATA_DIR if it does not already exist
FILE="${TPCH_DIR}/supplier.tbl"
if test -f "${FILE}"; then
echo " tbl files exist ($FILE exists)."
else
echo " creating tbl files with tpch_dbgen..."
docker run -v "${TPCH_DIR}":/data -it --rm ghcr.io/scalytics/tpch-docker:main -vf -s "${SCALE_FACTOR}"
# check if tpchgen-cli is installed
if ! command -v tpchgen-cli &> /dev/null
then
echo "tpchgen-cli could not be found, please install it via 'cargo install tpchgen-cli'"
exit 1
fi

# Copy expected answers into the ./data/answers directory if it does not already exist
Expand All @@ -574,27 +573,32 @@ data_tpch() {
docker run -v "${TPCH_DIR}":/data -it --entrypoint /bin/bash --rm ghcr.io/scalytics/tpch-docker:main -c "cp -f /opt/tpch/2.18.0_rc2/dbgen/answers/* /data/answers/"
fi

# Create 'parquet' files from tbl
FILE="${TPCH_DIR}/supplier"
if test -d "${FILE}"; then
echo " parquet files exist ($FILE exists)."
else
echo " creating parquet files using benchmark binary ..."
pushd "${SCRIPT_DIR}" > /dev/null
$CARGO_COMMAND --bin tpch -- convert --input "${TPCH_DIR}" --output "${TPCH_DIR}" --format parquet
popd > /dev/null
if [ "$FORMAT" = "parquet" ]; then
# Create 'parquet' files, one directory per file
FILE="${TPCH_DIR}/supplier"
if test -d "${FILE}"; then
echo " parquet files exist ($FILE exists)."
else
echo " creating parquet files using tpchgen-cli ..."
tpchgen-cli --scale-factor "${SCALE_FACTOR}" --format parquet --parquet-compression='ZSTD(1)' --parts=1 --output-dir "${TPCH_DIR}"
fi
return
fi

# Create 'csv' files from tbl
FILE="${TPCH_DIR}/csv/supplier"
if test -d "${FILE}"; then
echo " csv files exist ($FILE exists)."
else
echo " creating csv files using benchmark binary ..."
pushd "${SCRIPT_DIR}" > /dev/null
$CARGO_COMMAND --bin tpch -- convert --input "${TPCH_DIR}" --output "${TPCH_DIR}/csv" --format csv
popd > /dev/null
# Create 'csv' files, one directory per file
if [ "$FORMAT" = "csv" ]; then
FILE="${TPCH_DIR}/csv/supplier"
if test -d "${FILE}"; then
echo " csv files exist ($FILE exists)."
else
echo " creating csv files using tpchgen-cli binary ..."
tpchgen-cli --scale-factor "${SCALE_FACTOR}" --format csv --parts=1 --output-dir "${TPCH_DIR}/csv"
fi
return
fi

echo "Error: unknown format '$FORMAT' for tpch data generation, expected 'parquet' or 'csv'"
exit 1
}

# Runs the tpch benchmark
Expand All @@ -611,10 +615,10 @@ run_tpch() {
echo "Running tpch benchmark..."

FORMAT=$2
debug_run $CARGO_COMMAND --bin tpch -- benchmark datafusion --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" --format ${FORMAT} -o "${RESULTS_FILE}" ${QUERY_ARG}
debug_run $CARGO_COMMAND --bin dfbench -- tpch --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" --format ${FORMAT} -o "${RESULTS_FILE}" ${QUERY_ARG}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this updates bench.sh to use dfbench directly rather than the alternate tpch binary

}

# Runs the tpch in memory
# Runs the tpch in memory (needs tpch parquet data)
run_tpch_mem() {
SCALE_FACTOR=$1
if [ -z "$SCALE_FACTOR" ] ; then
Expand All @@ -627,7 +631,7 @@ run_tpch_mem() {
echo "RESULTS_FILE: ${RESULTS_FILE}"
echo "Running tpch_mem benchmark..."
# -m means in memory
debug_run $CARGO_COMMAND --bin tpch -- benchmark datafusion --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" -m --format parquet -o "${RESULTS_FILE}" ${QUERY_ARG}
debug_run $CARGO_COMMAND --bin dfbench -- tpch --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" -m --format parquet -o "${RESULTS_FILE}" ${QUERY_ARG}
}

# Runs the compile profile benchmark helper
Expand Down
2 changes: 0 additions & 2 deletions benchmarks/src/bin/dfbench.rs
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,6 @@ enum Options {
Nlj(nlj::RunOpt),
SortTpch(sort_tpch::RunOpt),
Tpch(tpch::RunOpt),
TpchConvert(tpch::ConvertOpt),
}

// Main benchmark runner entrypoint
Expand All @@ -65,6 +64,5 @@ pub async fn main() -> Result<()> {
Options::Nlj(opt) => opt.run().await,
Options::SortTpch(opt) => opt.run().await,
Options::Tpch(opt) => Box::pin(opt.run()).await,
Options::TpchConvert(opt) => opt.run().await,
}
}
65 changes: 0 additions & 65 deletions benchmarks/src/bin/tpch.rs

This file was deleted.

Loading