Is your feature request related to a problem? Please describe.
I wish we had a more flexible tool for studying read_parquet pipelining.
The Parquet reader multithread benchmark is great, but it doesn't let me control the input parquet file, for instance to run against LLM data or to change the compression format. Also, since the benchmark writes new files, they end up in the OS cache, so the test only covers the hot-cache parquet input case. Finally, the benchmark reads only one file per thread, which usually isn't enough work to show stable pipelining.
The parquet_io example is great, but it is single-threaded and single-file. On the plus side, the first read can be controlled to be cold-cache, which is useful.
The brc_pipeline example in #15983 is great, but it relies on the CSV reader which doesn't currently use kvikIO (also see #13916).
Describe the solution you'd like
We could add an example alongside parquet_io to read a list of file names, or maybe all the files in a directory, across a variable number of threads.
We could also read the same file multiple times across threads, but this would make cache-clearing impossible. Even so, that could be OK for studying performance with hot-cache parquet files.
Describe alternatives you've considered
Edit the 1billion example to accept parquet, but this wouldn't be very flexible.
Use the parquet_reader_multithread benchmark and hack in different generation patterns.