Skip to content

Conversation

zhuqi-lucas
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the datasource Changes to the datasource crate label Jun 3, 2025
@github-actions github-actions bot added the core Core DataFusion crate label Jun 3, 2025
@alamb
Copy link
Contributor

alamb commented Jun 3, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing default_utf8_for_unkown_type (7586ca2) to 5b08b84 diff
Benchmarks: h2o_small_window
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Jun 3, 2025

🤖: Benchmark completed

Details

Comparing HEAD and default_utf8_for_unkown_type
--------------------
Benchmark h2o_window.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ default_utf8_for_unkown_type ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │  1907.12ms │                    1874.72ms │    no change │
│ QQuery 2     │  3435.06ms │                    3435.28ms │    no change │
│ QQuery 3     │  3836.65ms │                    4366.66ms │ 1.14x slower │
│ QQuery 4     │  1008.06ms │                     959.61ms │    no change │
│ QQuery 5     │  2238.31ms │                    2291.12ms │    no change │
│ QQuery 6     │  2998.08ms │                    2978.58ms │    no change │
│ QQuery 7     │  2948.78ms │                    2920.08ms │    no change │
│ QQuery 8     │ 11870.28ms │                   12098.03ms │    no change │
│ QQuery 9     │   866.09ms │                     852.79ms │    no change │
│ QQuery 10    │   938.85ms │                     940.99ms │    no change │
│ QQuery 11    │   910.25ms │                     894.20ms │    no change │
│ QQuery 12    │  1769.99ms │                    1741.19ms │    no change │
└──────────────┴────────────┴──────────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                           ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                           │ 34727.52ms │
│ Total Time (default_utf8_for_unkown_type)   │ 35353.24ms │
│ Average Time (HEAD)                         │  2893.96ms │
│ Average Time (default_utf8_for_unkown_type) │  2946.10ms │
│ Queries Faster                              │          0 │
│ Queries Slower                              │          1 │
│ Queries with No Change                      │         11 │
└─────────────────────────────────────────────┴────────────┘

@zhuqi-lucas
Copy link
Contributor Author

It looks like no performance improvement for h2o_window benchmark result...

@alamb
Copy link
Contributor

alamb commented Jun 4, 2025

It looks like no performance improvement for h2o_window benchmark result...

Now that I think about it, the h2o benchmark may not have any string columns 🤔

Do the TPCH benchmarks read from CSV? Maybe we could just get some manual benchmarks ?

@zhuqi-lucas
Copy link
Contributor Author

It looks like no performance improvement for h2o_window benchmark result...

Now that I think about it, the h2o benchmark may not have any string columns 🤔

Do the TPCH benchmarks read from CSV? Maybe we could just get some manual benchmarks ?

Thank you @alamb , this is a good point. Do some investigation from benchmark code now.

# Runs the tpch benchmark
run_tpch() {
    SCALE_FACTOR=$1
    if [ -z "$SCALE_FACTOR" ] ; then
        echo "Internal error: Scale factor not specified"
        exit 1
    fi
    TPCH_DIR="${DATA_DIR}/tpch_sf${SCALE_FACTOR}"

    RESULTS_FILE="${RESULTS_DIR}/tpch_sf${SCALE_FACTOR}.json"
    echo "RESULTS_FILE: ${RESULTS_FILE}"
    echo "Running tpch benchmark..."
    # Optional query filter to run specific query
    QUERY=$([ -n "$ARG3" ] && echo "--query $ARG3" || echo "")
    debug_run $CARGO_COMMAND --bin tpch -- benchmark datafusion --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" --format parquet -o "${RESULTS_FILE}" $QUERY
}
    /// File format: `csv` or `parquet`
    #[structopt(short = "f", long = "format", default_value = "csv")]
    file_format: String,

It looks like we default to parquet for tpch, but it also supports csv, i will try to create a PR to support csv from tpch benchmark parameters.

Because from the generator code for tpch, we also generate the CSV format, so it's reasonable for us to support CSV benchmark also, i will create a PR soon. Thanks

 # Create 'tbl' (CSV format) data into $DATA_DIR if it does not already exist
    FILE="${TPCH_DIR}/supplier.tbl"
    if test -f "${FILE}"; then
        echo " tbl files exist ($FILE exists)."
    else
        echo " creating tbl files with tpch_dbgen..."
        docker run -v "${TPCH_DIR}":/data -it --rm ghcr.io/scalytics/tpch-docker:main -vf -s "${SCALE_FACTOR}"
    fi

@alamb alamb marked this pull request as draft June 11, 2025 03:48
@alamb
Copy link
Contributor

alamb commented Jun 11, 2025

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Jun 11, 2025

Thank you @alamb , i created a ticket for the support for csv tpch benchmark first.
#16370

Updated:

Submitted a PR for review:

#16373

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Aug 12, 2025
@github-actions github-actions bot closed this Aug 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate datasource Changes to the datasource crate Stale PR has not had any activity for some time

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Read from csv default datatype setting to utf8view

2 participants