Perf: load default Utf8View for CSV datatype #16243
🤖: Benchmark completed
It looks like the h2o_window benchmark results show no performance improvement.
Now that I think about it, the h2o benchmark may not have any string columns 🤔 Do the TPCH benchmarks read from CSV? Maybe we could just run some manual benchmarks?
Thank you @alamb, this is a good point. I am investigating the benchmark code now:

# Runs the tpch benchmark
run_tpch() {
    SCALE_FACTOR=$1
    if [ -z "$SCALE_FACTOR" ] ; then
        echo "Internal error: Scale factor not specified"
        exit 1
    fi
    TPCH_DIR="${DATA_DIR}/tpch_sf${SCALE_FACTOR}"
    RESULTS_FILE="${RESULTS_DIR}/tpch_sf${SCALE_FACTOR}.json"
    echo "RESULTS_FILE: ${RESULTS_FILE}"
    echo "Running tpch benchmark..."
    # Optional query filter to run specific query
    QUERY=$([ -n "$ARG3" ] && echo "--query $ARG3" || echo "")
    debug_run $CARGO_COMMAND --bin tpch -- benchmark datafusion --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" --format parquet -o "${RESULTS_FILE}" $QUERY
}

/// File format: `csv` or `parquet`
#[structopt(short = "f", long = "format", default_value = "csv")]
file_format: String,

It looks like the script defaults to parquet for tpch, but the benchmark binary also supports csv. The tpch data generation step already produces CSV-format data, so it is reasonable to support a CSV benchmark as well. I will create a PR soon to support csv through the tpch benchmark parameters. Thanks.

# Create 'tbl' (CSV format) data into $DATA_DIR if it does not already exist
FILE="${TPCH_DIR}/supplier.tbl"
if test -f "${FILE}"; then
    echo " tbl files exist ($FILE exists)."
else
    echo " creating tbl files with tpch_dbgen..."
    docker run -v "${TPCH_DIR}":/data -it --rm ghcr.io/scalytics/tpch-docker:main -vf -s "${SCALE_FACTOR}"
fi
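
As a rough sketch of what passing the format through the benchmark parameters could look like (not the actual follow-up PR): the helper name `tpch_benchmark_args`, the second positional `format` argument, the fallback directory names, and the results file naming are all assumptions made for illustration. Only the `--format` flag and its `csv` / `parquet` values come from the snippets above.

```shell
# Hypothetical helper (not part of bench.sh): prints the tpch benchmark
# arguments for a given scale factor and file format, so the format is
# no longer hard-coded to parquet.
tpch_benchmark_args() {
    local scale_factor="$1"
    local format="${2:-parquet}"   # assumed default, matching the current script

    if [ -z "$scale_factor" ]; then
        echo "Internal error: Scale factor not specified" >&2
        return 1
    fi

    # Only the two formats named in the structopt doc comment are accepted.
    case "$format" in
        csv|parquet) ;;
        *) echo "Unsupported format: $format" >&2; return 1 ;;
    esac

    # Fallback directory names and the results file name are assumptions.
    local tpch_dir="${DATA_DIR:-data}/tpch_sf${scale_factor}"
    local results_file="${RESULTS_DIR:-results}/tpch_${format}_sf${scale_factor}.json"

    echo "benchmark datafusion --iterations 5 --path ${tpch_dir} --prefer_hash_join ${PREFER_HASH_JOIN:-true} --format ${format} -o ${results_file}"
}

# Possible use inside the script (relies on word splitting, so paths
# containing spaces would need an array-based variant instead):
#   debug_run $CARGO_COMMAND --bin tpch -- $(tpch_benchmark_args 1 csv)
```

Putting the format into the results file name keeps CSV and parquet runs from overwriting each other, which matters when comparing the two.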
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look.
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?