
Add JOB benchmark dataset [1/N] (imdb dataset) #12497

Open
wants to merge 4 commits into main

Conversation

doupache

@doupache doupache commented Sep 17, 2024

Which issue does this PR close?

Partially closes #12311.

cd benchmarks/   
./bench.sh data imdb

All IMDB tables are now generated in benchmarks/data/imdb/*.parquet
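For reference, the layout produced by the command above can be sketched as one Parquet file per table. This is an illustrative Python sketch, not code from the PR; the 21 table names are assumed from the standard JOB (Join Order Benchmark) dataset:

```python
# Hypothetical sketch: map each IMDB/JOB table to its expected Parquet output
# path under benchmarks/data/imdb/. Table names assumed from the standard
# JOB dataset, not taken from this PR's source.
IMDB_TABLES = [
    "aka_name", "aka_title", "cast_info", "char_name", "comp_cast_type",
    "company_name", "company_type", "complete_cast", "info_type", "keyword",
    "kind_type", "link_type", "movie_companies", "movie_info", "movie_info_idx",
    "movie_keyword", "movie_link", "name", "person_info", "role_type", "title",
]

def parquet_path(table: str, data_dir: str = "benchmarks/data/imdb") -> str:
    """Return the expected output path for one converted table."""
    return f"{data_dir}/{table}.parquet"

print(len(IMDB_TABLES))        # 21
print(parquet_path("name"))    # benchmarks/data/imdb/name.parquet
```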

Rationale for this change

Add the IMDB dataset for JOB benchmarking.

What changes are included in this PR?

Download the dataset and convert it to Parquet files.

Are these changes tested?

Just like generating the TPC-H dataset with:
./bench.sh data tpch

run:
./bench.sh data imdb

Are there any user-facing changes?

no

@doupache
Author

Unlike TPCH, which uses table-named folders with partitioned parquet files, IMDB has smaller tables (largest is 360MB).
We can convert each IMDB table to a single, non-partitioned parquet file.
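The layout decision above can be illustrated with a small sketch. The 360 MB figure comes from the comment; the size threshold below is purely hypothetical, chosen only to make the contrast with TPC-H's partitioned directories concrete:

```python
# Illustrative sketch of the layout choice described above: TPC-H uses a
# table-named directory of partitioned Parquet files, while the smaller
# IMDB tables (largest ~360 MB) each fit in a single Parquet file.
# SINGLE_FILE_THRESHOLD_MB is a hypothetical value, not from the PR.
SINGLE_FILE_THRESHOLD_MB = 512

def output_layout(table: str, size_mb: float) -> str:
    """Choose a flat file for small tables, a partitioned dir otherwise."""
    if size_mb <= SINGLE_FILE_THRESHOLD_MB:
        return f"{table}.parquet"          # single, non-partitioned file
    return f"{table}/part-*.parquet"       # partitioned directory, TPC-H style

print(output_layout("cast_info", 360))    # cast_info.parquet
print(output_layout("lineitem", 1500))    # lineitem/part-*.parquet
```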


@austin362667
Contributor

Thanks @doupache for paving the way! I have a few nit suggestions.

  1. Do we prefer the benchmark name imdb or job?
  2. Could you add [1/N] at the beginning of the PR title to help us track the follow-up progress?

@doupache
Author

Thanks @austin362667 for the suggestions. IMDB is more suitable than JOB, since it's specific and avoids confusion; "job" can be used in many different contexts.

Adding the progress marker to the title is also a good idea 👍

@doupache doupache changed the title Add JOB benchmark dataset (imdb dataset) Add JOB benchmark dataset 1/N (imdb dataset) Sep 17, 2024
@doupache doupache changed the title Add JOB benchmark dataset 1/N (imdb dataset) Add JOB benchmark dataset [1/N] (imdb dataset) Sep 17, 2024
Review thread on benchmarks/src/imdb/convert.rs (outdated, resolved)
@alamb
Contributor

alamb commented Sep 19, 2024

Thanks @doupache -- I started the CI jobs, and I will try and test this out manually locally over the next few days

@doupache
Author

doupache commented Sep 20, 2024

Thanks @austin362667 and @alamb.

I have updated the PR and learned some Cargo tips from @austin362667.
Using a debug build during development is much faster.

# 1. build
cd benchmarks && cargo build

# 2. convert
cargo run --bin imdb -- convert --input ./data/imdb/ --output ./data/imdb/ --format parquet
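The converter's command line can be sketched with a minimal argument parser. The flag names mirror the invocation above; everything else (parser structure, defaults) is illustrative, and the real implementation lives in benchmarks/src/imdb/convert.rs:

```python
import argparse

# Minimal sketch of the converter's CLI surface, mirroring the flags in the
# `cargo run --bin imdb -- convert ...` invocation above. Parser layout and
# the format choices are assumptions for illustration only.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="imdb")
    sub = parser.add_subparsers(dest="command", required=True)
    convert = sub.add_parser("convert", help="convert raw IMDB data to Parquet")
    convert.add_argument("--input", required=True)
    convert.add_argument("--output", required=True)
    convert.add_argument("--format", choices=["parquet", "csv"], default="parquet")
    return parser

args = build_parser().parse_args(
    ["convert", "--input", "./data/imdb/", "--output", "./data/imdb/", "--format", "parquet"]
)
print(args.command, args.format)   # convert parquet
```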

I also tested all 21 Parquet files like the following; the schema is from the original dataset.

-- create table
CREATE EXTERNAL TABLE name (
    id INTEGER NOT NULL PRIMARY KEY,
    name STRING NOT NULL,
    imdb_index STRING,
    imdb_id INTEGER,
    gender STRING,
    name_pcode_cf STRING,
    name_pcode_nf STRING,
    surname_pcode STRING,
    md5sum STRING
)
STORED AS PARQUET
LOCATION '../benchmarks/data/imdb/name.parquet';

-- read
SELECT * FROM name LIMIT 5;

@austin362667
Contributor

Unlike TPCH, which uses table-named folders with partitioned parquet files, IMDB has smaller tables (largest is 360MB).
We can convert each IMDB table to a single, non-partitioned parquet file.

Sure! I think it may be for historical reasons: ParquetExec didn't support parallel execution over a single Parquet file back then. That is now supported as of #5057.

Development

Successfully merging this pull request may close these issues.

Add IMDB queries (a.k.a. JOB - Join Order Benchmark) to DataFusion benchmark suite
3 participants