Add IMDB queries (a.k.a. JOB - Join Order Benchmark) to DataFusion benchmark suite #12311

Closed
4 tasks
doupache opened this issue Sep 4, 2024 · 4 comments · Fixed by #12497 or #12529
Labels
enhancement New feature or request

Comments

@doupache
Contributor

doupache commented Sep 4, 2024

Is your feature request related to a problem or challenge?

JOB (Join Order Benchmark) was proposed by a research team from TUM in the paper "How Good Are Query Optimizers, Really?".

It is also used in HyPer, DuckDB, and CedarDB. It is a good benchmark for testing join ordering and join operators. It is also part of DuckDB's regression test suite.

I think if we add this test suite, it will also help with improvements like those discussed in #7955.

Describe the solution you'd like

JOB uses the IMDB dataset. The data is provided in csv.gz format and represents real-world data, which makes it a good fit for testing DataFusion.

Tasks

  • Convert the dataset from csv.gz format to Parquet.
  • Add the IMDB license to the LICENSE.
  • Add the benchmark queries.
  • Integrate the benchmark suite into dfbench.

Once everything is set up, we will be able to easily run benchmarks using the following command:

cargo run --bin dfbench -- imdb --query=5

I would like to work on this!
Can someone help me understand the usual process for adding a third-party license in an Apache project?

cc @jayzhan211 @alamb

Describe alternatives you've considered

No response

Additional context

No response

@doupache doupache added the enhancement New feature or request label Sep 4, 2024
@austin362667
Contributor

@doupache Thanks. Integrating the Join Order Benchmark seems promising. I look forward to taking on the follow-up tasks.

@alamb
Contributor

alamb commented Sep 5, 2024

I think adding the join order benchmark would be reasonable.

Can someone help me understand the usual process for adding a third-party license in an Apache project?

I would personally recommend following the model of the other benchmarks and not trying to incorporate the files directly. Instead, download them on demand. If you do this, I don't think we need any licensing updates.
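As an illustration only, the "download on demand" idea can be sketched as below. This is not how bench.sh does it (the real scripts are shell); it is a minimal Python sketch under stated assumptions. The names `ensure_dataset` and `fetch_url` are hypothetical, and the caller must supply the real dataset URL, which this thread does not give.

```python
import shutil
import urllib.request
from pathlib import Path
from typing import Callable


def fetch_url(url: str, dest: Path) -> None:
    """Download `url` to `dest` with a plain HTTP GET (no retries, no checksum)."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


def ensure_dataset(
    data_dir: Path,
    url: str,
    fetch: Callable[[str, Path], None] = fetch_url,
) -> Path:
    """Return the local path of the archive, downloading it only if it is missing.

    Hypothetical helper: the archive is never committed to the repository,
    it is fetched into `data_dir` the first time a benchmark needs it.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / url.rsplit("/", 1)[-1]
    if target.exists():
        # Already downloaded by an earlier run; nothing to do.
        return target
    # Download to a temporary name first so an interrupted transfer
    # is not mistaken for a complete archive on the next run.
    partial = target.with_name(target.name + ".part")
    fetch(url, partial)
    partial.replace(target)
    return target
```

Because the data is fetched at run time and never stored in the repository, only the download location has to be documented, which is the point of the recommendation above.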

The benchmarking scripts are here: https://github.com/apache/datafusion/tree/main/benchmarks

I would recommend working on orchestrating the process using
https://github.com/apache/datafusion/blob/main/benchmarks/bench.sh

So a benchmark session might look something like

bench.sh data job
bench.sh run job

Convert the dataset from csv.gz format to Parquet.

TPCH does something similar (it converts the tsv output of the tpch data generator to Parquet).
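For illustration, the bookkeeping around such a conversion step might look like the sketch below. It uses only the Python standard library, so the Parquet encoding itself is left to a caller-supplied `write_parquet` function. That function, `convert_all`, and `read_csv_gz` are all hypothetical names, not part of DataFusion or its benchmark scripts. The sketch also assumes the default `csv` dialect; the real IMDB files may need different quoting or escaping settings.

```python
import csv
import gzip
from pathlib import Path
from typing import Callable, Iterator, List

Row = List[str]
# Hypothetical writer signature: consumes an iterator of rows and writes `out`.
Writer = Callable[[Iterator[Row], Path], None]


def read_csv_gz(path: Path) -> Iterator[Row]:
    """Stream rows from a gzip-compressed CSV file without unpacking it to disk."""
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        yield from csv.reader(handle)


def convert_all(src_dir: Path, out_dir: Path, write_parquet: Writer) -> List[Path]:
    """Convert every `*.csv.gz` in `src_dir`, skipping outputs that already exist."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for src in sorted(src_dir.glob("*.csv.gz")):
        out = out_dir / (src.name[: -len(".csv.gz")] + ".parquet")
        if out.exists():
            # Converted on a previous run; keep the step idempotent.
            continue
        write_parquet(read_csv_gz(src), out)
        written.append(out)
    return written
```

Skipping existing outputs keeps a repeated data-preparation run cheap, since only tables that have not been converted yet are processed again.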

@alamb
Contributor

alamb commented Sep 5, 2024

Thanks @doupache -- this sounds very cool

@doupache
Contributor Author

doupache commented Sep 17, 2024
