Add IMDB queries (a.k.a. JOB - Join Order Benchmark) to DataFusion benchmark suite #12311
Comments
@doupache Thanks. It seems promising to integrate the |
I think adding the join order benchmark would be reasonable.
I would personally recommend following the model of the other benchmarks and not trying to incorporate the files directly. Instead, download them on demand. If you do this, I don't think we need any licensing updates.
The benchmarking scripts are here: https://github.com/apache/datafusion/tree/main/benchmarks
I would recommend working on orchestrating the process using
So a benchmark session might look something like:
bench.sh data job
bench.sh run job
TPCH does something similar (convert the |
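The download-on-demand idea suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `download_if_missing`, the destination path, and any URL passed to it are hypothetical and are not taken from the existing bench.sh script.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of bench.sh): fetch a data file only when it
# is not already present locally, so repeated runs do not re-download it.
download_if_missing() {
    local url="$1"   # where the file would be fetched from (placeholder)
    local dest="$2"  # local path for the downloaded file

    if [ -f "$dest" ]; then
        echo "skip: $dest already exists"
        return 0
    fi

    # Create the target directory, then download; -f makes curl fail on HTTP errors.
    mkdir -p "$(dirname "$dest")"
    curl -fsSL -o "$dest" "$url"
    echo "downloaded: $dest"
}
```

A data step could call this helper once per csv.gz file before any conversion runs. Because nothing is checked into the repository, the dataset itself stays outside the source tree.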
Thanks @doupache -- this sounds very cool |
Is your feature request related to a problem or challenge?
JOB (Join Order Benchmark) was proposed by a research team from TUM in the paper "How Good Are Query Optimizers, Really?".
It is also used in HyPer, DuckDB, and CedarDB. It is a good benchmark for testing join ordering and join operators. It is also part of DuckDB's regression test suite.
I think if we add this test suite, it will also help with improvements like those discussed in #7955.
Describe the solution you'd like
JOB uses the IMDB datasets. These datasets are provided in csv.gz format and represent real-world data, making them ideal for testing DataFusion.
Tasks:
csv.gz format to Parquet.
dfbench.
Once everything is set up, we will be able to easily run benchmarks using the following command:
I would like to work on this!
Can someone help me understand the usual process for adding a third-party license in an Apache project?
cc @jayzhan211 @alamb
Describe alternatives you've considered
No response
Additional context
No response