-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Use tpchgen-cli to generate tpch data in bench.sh
#19035
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
e0a9157
907bce3
21a0237
1790863
9a5a3b0
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -189,8 +189,8 @@ main() { | |||||
| echo "***************************" | ||||||
| case "$BENCHMARK" in | ||||||
| all) | ||||||
| data_tpch "1" | ||||||
| data_tpch "10" | ||||||
| data_tpch "1" "parquet" | ||||||
| data_tpch "10" "parquet" | ||||||
| data_h2o "SMALL" | ||||||
| data_h2o "MEDIUM" | ||||||
| data_h2o "BIG" | ||||||
|
|
@@ -203,26 +203,22 @@ main() { | |||||
| # nlj uses range() function, no data generation needed | ||||||
| ;; | ||||||
| tpch) | ||||||
| data_tpch "1" | ||||||
| data_tpch "1" "parquet" | ||||||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. per @comphead 's suggestion, now only the format required is made, rather than always creating all three formats |
||||||
| ;; | ||||||
| tpch_mem) | ||||||
| # same data as for tpch | ||||||
| data_tpch "1" | ||||||
| data_tpch "1" "parquet" | ||||||
| ;; | ||||||
| tpch_csv) | ||||||
| # same data as for tpch | ||||||
| data_tpch "1" | ||||||
| data_tpch "1" "csv" | ||||||
| ;; | ||||||
| tpch10) | ||||||
| data_tpch "10" | ||||||
| data_tpch "10" "parquet" | ||||||
| ;; | ||||||
| tpch_mem10) | ||||||
| # same data as for tpch10 | ||||||
| data_tpch "10" | ||||||
| data_tpch "10" "parquet" | ||||||
| ;; | ||||||
| tpch_csv10) | ||||||
| # same data as for tpch10 | ||||||
| data_tpch "10" | ||||||
| data_tpch "10" "csv" | ||||||
| ;; | ||||||
| clickbench_1) | ||||||
| data_clickbench_1 | ||||||
|
|
@@ -297,19 +293,19 @@ main() { | |||||
| ;; | ||||||
| external_aggr) | ||||||
| # same data as for tpch | ||||||
| data_tpch "1" | ||||||
| data_tpch "1" "parquet" | ||||||
| ;; | ||||||
| sort_tpch) | ||||||
| # same data as for tpch | ||||||
| data_tpch "1" | ||||||
| data_tpch "1" "parquet" | ||||||
| ;; | ||||||
| sort_tpch10) | ||||||
| # same data as for tpch10 | ||||||
| data_tpch "10" | ||||||
| data_tpch "10" "parquet" | ||||||
| ;; | ||||||
| topk_tpch) | ||||||
| # same data as for tpch | ||||||
| data_tpch "1" | ||||||
| data_tpch "1" "parquet" | ||||||
| ;; | ||||||
| nlj) | ||||||
| # nlj uses range() function, no data generation needed | ||||||
|
|
@@ -320,7 +316,7 @@ main() { | |||||
| echo "HJ benchmark does not require data generation" | ||||||
| ;; | ||||||
| compile_profile) | ||||||
| data_tpch "1" | ||||||
| data_tpch "1" "parquet" | ||||||
| ;; | ||||||
| *) | ||||||
| echo "Error: unknown benchmark '$BENCHMARK' for data generation" | ||||||
|
|
@@ -537,7 +533,7 @@ main() { | |||||
| # Creates TPCH data at a certain scale factor, if it doesn't already | ||||||
| # exist | ||||||
| # | ||||||
| # call like: data_tpch($scale_factor) | ||||||
| # call like: data_tpch($scale_factor, format) | ||||||
| # | ||||||
| # Creates data in $DATA_DIR/tpch_sf1 for scale factor 1 | ||||||
| # Creates data in $DATA_DIR/tpch_sf10 for scale factor 10 | ||||||
|
|
@@ -548,20 +544,23 @@ data_tpch() { | |||||
| echo "Internal error: Scale factor not specified" | ||||||
| exit 1 | ||||||
| fi | ||||||
| FORMAT=$2 | ||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As I understand, that would default the argument to
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See #19035 (comment) https://github.com/apache/datafusion/pull/19035/files/907bce3e16352148eade3b7cf512091a9aab4232#diff-1769f5787dc11c8b1f1b48288cdf3c89d25a5b5cbc6be4740bfcc70a6313ba99R550 will print And the third reason why I proposed This sounds like parquet is the needed format most of the time. But data_h2o() uses CSV as a default format:
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, I see what you are saying now - there are several other calls to data_tpch that don't pass the format argument here eg sort_tpch)
# same data as for tpch
data_tpch "1"
;;
sort_tpch10)
# same data as for tpch10
data_tpch "10"
;;
topk_tpch)
# same data as for tpch
data_tpch "1"
;;
nlj)I will update those calls to explicitly pass in a format as I think that will make it clear what is going on
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. in 9a5a3b0
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 👍🏻 |
||||||
| if [ -z "$FORMAT" ] ; then | ||||||
| echo "Internal error: Format not specified" | ||||||
| exit 1 | ||||||
| fi | ||||||
|
|
||||||
| TPCH_DIR="${DATA_DIR}/tpch_sf${SCALE_FACTOR}" | ||||||
| echo "Creating tpch dataset at Scale Factor ${SCALE_FACTOR} in ${TPCH_DIR}..." | ||||||
| echo "Creating tpch $FORMAT dataset at Scale Factor ${SCALE_FACTOR} in ${TPCH_DIR}..." | ||||||
|
|
||||||
| # Ensure the target data directory exists | ||||||
| mkdir -p "${TPCH_DIR}" | ||||||
|
|
||||||
| # Create 'tbl' (CSV format) data into $DATA_DIR if it does not already exist | ||||||
| FILE="${TPCH_DIR}/supplier.tbl" | ||||||
| if test -f "${FILE}"; then | ||||||
| echo " tbl files exist ($FILE exists)." | ||||||
| else | ||||||
| echo " creating tbl files with tpch_dbgen..." | ||||||
| docker run -v "${TPCH_DIR}":/data -it --rm ghcr.io/scalytics/tpch-docker:main -vf -s "${SCALE_FACTOR}" | ||||||
| # check if tpchgen-cli is installed | ||||||
| if ! command -v tpchgen-cli &> /dev/null | ||||||
| then | ||||||
| echo "tpchgen-cli could not be found, please install it via 'cargo install tpchgen-cli'" | ||||||
| exit 1 | ||||||
| fi | ||||||
|
|
||||||
| # Copy expected answers into the ./data/answers directory if it does not already exist | ||||||
|
|
@@ -574,27 +573,32 @@ data_tpch() { | |||||
| docker run -v "${TPCH_DIR}":/data -it --entrypoint /bin/bash --rm ghcr.io/scalytics/tpch-docker:main -c "cp -f /opt/tpch/2.18.0_rc2/dbgen/answers/* /data/answers/" | ||||||
| fi | ||||||
|
|
||||||
| # Create 'parquet' files from tbl | ||||||
| FILE="${TPCH_DIR}/supplier" | ||||||
| if test -d "${FILE}"; then | ||||||
| echo " parquet files exist ($FILE exists)." | ||||||
| else | ||||||
| echo " creating parquet files using benchmark binary ..." | ||||||
| pushd "${SCRIPT_DIR}" > /dev/null | ||||||
| $CARGO_COMMAND --bin tpch -- convert --input "${TPCH_DIR}" --output "${TPCH_DIR}" --format parquet | ||||||
| popd > /dev/null | ||||||
| if [ "$FORMAT" = "parquet" ]; then | ||||||
| # Create 'parquet' files, one directory per file | ||||||
| FILE="${TPCH_DIR}/supplier" | ||||||
| if test -d "${FILE}"; then | ||||||
| echo " parquet files exist ($FILE exists)." | ||||||
| else | ||||||
| echo " creating parquet files using tpchgen-cli ..." | ||||||
| tpchgen-cli --scale-factor "${SCALE_FACTOR}" --format parquet --parquet-compression='ZSTD(1)' --parts=1 --output-dir "${TPCH_DIR}" | ||||||
alamb marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
| fi | ||||||
| return | ||||||
| fi | ||||||
|
|
||||||
| # Create 'csv' files from tbl | ||||||
| FILE="${TPCH_DIR}/csv/supplier" | ||||||
| if test -d "${FILE}"; then | ||||||
| echo " csv files exist ($FILE exists)." | ||||||
| else | ||||||
| echo " creating csv files using benchmark binary ..." | ||||||
| pushd "${SCRIPT_DIR}" > /dev/null | ||||||
| $CARGO_COMMAND --bin tpch -- convert --input "${TPCH_DIR}" --output "${TPCH_DIR}/csv" --format csv | ||||||
| popd > /dev/null | ||||||
| # Create 'csv' files, one directory per file | ||||||
| if [ "$FORMAT" = "csv" ]; then | ||||||
| FILE="${TPCH_DIR}/csv/supplier" | ||||||
| if test -d "${FILE}"; then | ||||||
| echo " csv files exist ($FILE exists)." | ||||||
| else | ||||||
| echo " creating csv files using tpchgen-cli binary ..." | ||||||
| tpchgen-cli --scale-factor "${SCALE_FACTOR}" --format csv --parts=1 --output-dir "${TPCH_DIR}/csv" | ||||||
| fi | ||||||
| return | ||||||
| fi | ||||||
|
|
||||||
| echo "Error: unknown format '$FORMAT' for tpch data generation, expected 'parquet' or 'csv'" | ||||||
| exit 1 | ||||||
| } | ||||||
|
|
||||||
| # Runs the tpch benchmark | ||||||
|
|
@@ -611,10 +615,10 @@ run_tpch() { | |||||
| echo "Running tpch benchmark..." | ||||||
|
|
||||||
| FORMAT=$2 | ||||||
alamb marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
| debug_run $CARGO_COMMAND --bin tpch -- benchmark datafusion --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" --format ${FORMAT} -o "${RESULTS_FILE}" ${QUERY_ARG} | ||||||
| debug_run $CARGO_COMMAND --bin dfbench -- tpch --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" --format ${FORMAT} -o "${RESULTS_FILE}" ${QUERY_ARG} | ||||||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this updates bench.sh to use dfbench directly rather than the alternate |
||||||
| } | ||||||
|
|
||||||
| # Runs the tpch in memory | ||||||
| # Runs the tpch in memory (needs tpch parquet data) | ||||||
| run_tpch_mem() { | ||||||
| SCALE_FACTOR=$1 | ||||||
| if [ -z "$SCALE_FACTOR" ] ; then | ||||||
|
|
@@ -627,7 +631,7 @@ run_tpch_mem() { | |||||
| echo "RESULTS_FILE: ${RESULTS_FILE}" | ||||||
| echo "Running tpch_mem benchmark..." | ||||||
| # -m means in memory | ||||||
| debug_run $CARGO_COMMAND --bin tpch -- benchmark datafusion --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" -m --format parquet -o "${RESULTS_FILE}" ${QUERY_ARG} | ||||||
| debug_run $CARGO_COMMAND --bin dfbench -- tpch --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" -m --format parquet -o "${RESULTS_FILE}" ${QUERY_ARG} | ||||||
| } | ||||||
|
|
||||||
| # Runs the compile profile benchmark helper | ||||||
|
|
||||||
This file was deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the data is generated directly using tpchgen-cli now, no need to convert data with another pass