
Add JOB benchmark dataset [1/N] (imdb dataset) #12497

Open
wants to merge 4 commits into main

Conversation

doupache

@doupache doupache commented Sep 17, 2024

Which issue does this PR close?

Partially closes #12311.

cd benchmarks/   
./bench.sh data imdb

All IMDB tables are now generated in benchmarks/data/imdb/*.parquet
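For reference, the layout produced by the command above can be sketched as one Parquet file per table. This is an illustrative Python sketch, not code from the PR; the 21 table names are assumed from the standard JOB (Join Order Benchmark) dataset:

```python
# Hypothetical sketch: map each IMDB/JOB table to its expected Parquet output
# path under benchmarks/data/imdb/. Table names assumed from the standard
# JOB dataset, not taken from this PR's source.
IMDB_TABLES = [
    "aka_name", "aka_title", "cast_info", "char_name", "comp_cast_type",
    "company_name", "company_type", "complete_cast", "info_type", "keyword",
    "kind_type", "link_type", "movie_companies", "movie_info", "movie_info_idx",
    "movie_keyword", "movie_link", "name", "person_info", "role_type", "title",
]

def parquet_path(table: str, data_dir: str = "benchmarks/data/imdb") -> str:
    """Return the expected output path for one converted table."""
    return f"{data_dir}/{table}.parquet"

print(len(IMDB_TABLES))        # 21
print(parquet_path("name"))    # benchmarks/data/imdb/name.parquet
```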

Rationale for this change

Add the IMDB dataset for JOB benchmarking.

What changes are included in this PR?

Download the dataset and convert it to Parquet files.

Are these changes tested?

Just like generating the TPC-H dataset with:
./bench.sh data tpch

run:
./bench.sh data imdb

Are there any user-facing changes?

no

@doupache
Author

Unlike TPCH, which uses table-named folders with partitioned parquet files, IMDB has smaller tables (largest is 360MB).
We can convert each IMDB table to a single, non-partitioned parquet file.
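The layout decision above can be illustrated with a small sketch. The 360 MB figure comes from the comment; the size threshold below is purely hypothetical, chosen only to make the contrast with TPC-H's partitioned directories concrete:

```python
# Illustrative sketch of the layout choice described above: TPC-H uses a
# table-named directory of partitioned Parquet files, while the smaller
# IMDB tables (largest ~360 MB) each fit in a single Parquet file.
# SINGLE_FILE_THRESHOLD_MB is a hypothetical value, not from the PR.
SINGLE_FILE_THRESHOLD_MB = 512

def output_layout(table: str, size_mb: float) -> str:
    """Choose a flat file for small tables, a partitioned dir otherwise."""
    if size_mb <= SINGLE_FILE_THRESHOLD_MB:
        return f"{table}.parquet"          # single, non-partitioned file
    return f"{table}/part-*.parquet"       # partitioned directory, TPC-H style

print(output_layout("cast_info", 360))    # cast_info.parquet
print(output_layout("lineitem", 1500))    # lineitem/part-*.parquet
```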


@austin362667
Contributor

Thanks @doupache for paving the way! I have a few nit suggestions.

  1. Do we prefer the benchmark name imdb or job?
  2. Could you add [1/N] at the beginning of the PR title to help us track the follow-up progress?

@doupache
Author

Thanks @austin362667 for the suggestions. IMDB is more suitable than JOB, since it's specific and avoids confusion; "job" can be used in many different contexts.

Adding the progress marker to the title is also a good idea 👍

@doupache doupache changed the title Add JOB benchmark dataset (imdb dataset) Add JOB benchmark dataset 1/N (imdb dataset) Sep 17, 2024
@doupache doupache changed the title Add JOB benchmark dataset 1/N (imdb dataset) Add JOB benchmark dataset [1/N] (imdb dataset) Sep 17, 2024
Review thread on benchmarks/src/imdb/convert.rs (outdated, resolved)
@alamb
Contributor

alamb commented Sep 19, 2024

Thanks @doupache -- I started the CI jobs, and I will try and test this out manually locally over the next few days

@doupache
Author

doupache commented Sep 20, 2024

Thanks @austin362667 and @alamb.

I have updated the PR and learned some Cargo tips from @austin362667.
Using a debug build during development is much faster.

# 1. build
cd benchmarks && cargo build

# 2. convert
cargo run --bin imdb -- convert --input ./data/imdb/ --output ./data/imdb/ --format parquet
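The converter's command line can be sketched with a minimal argument parser. The flag names mirror the invocation above; everything else (parser structure, defaults) is illustrative, and the real implementation lives in benchmarks/src/imdb/convert.rs:

```python
import argparse

# Minimal sketch of the converter's CLI surface, mirroring the flags in the
# `cargo run --bin imdb -- convert ...` invocation above. Parser layout and
# the format choices are assumptions for illustration only.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="imdb")
    sub = parser.add_subparsers(dest="command", required=True)
    convert = sub.add_parser("convert", help="convert raw IMDB data to Parquet")
    convert.add_argument("--input", required=True)
    convert.add_argument("--output", required=True)
    convert.add_argument("--format", choices=["parquet", "csv"], default="parquet")
    return parser

args = build_parser().parse_args(
    ["convert", "--input", "./data/imdb/", "--output", "./data/imdb/", "--format", "parquet"]
)
print(args.command, args.format)   # convert parquet
```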

I also tested all 21 Parquet files like the following; the schema is from the original dataset.

-- create table
CREATE EXTERNAL TABLE name (
    id INTEGER NOT NULL PRIMARY KEY,
    name STRING NOT NULL,
    imdb_index STRING,
    imdb_id INTEGER,
    gender STRING,
    name_pcode_cf STRING,
    name_pcode_nf STRING,
    surname_pcode STRING,
    md5sum STRING
)
STORED AS PARQUET
LOCATION '../benchmarks/data/imdb/name.parquet';

-- read
SELECT * FROM name LIMIT 5;

@austin362667
Contributor

Unlike TPCH, which uses table-named folders with partitioned parquet files, IMDB has smaller tables (largest is 360MB).
We can convert each IMDB table to a single, non-partitioned parquet file.

Sure! I think it may be for historical reasons: ParquetExec didn't support parallel execution over a single Parquet file back then. That is now supported as of #5057.

Development

Successfully merging this pull request may close these issues.

Add IMDB queries (a.k.a. JOB - Join Order Benchmark) to DataFusion benchmark suite
3 participants