[FEA] Add multi-threaded Parquet read example #16717

Closed
GregoryKimball opened this issue Aug 31, 2024 · 0 comments · Fixed by #16828
Labels: cuIO, feature request, libcudf

GregoryKimball commented Aug 31, 2024

Is your feature request related to a problem? Please describe.
I wish we had a more flexible tool for studying read_parquet pipelining.

The Parquet reader multithread benchmark is great, but it doesn't let me control the input Parquet file, for instance to run against LLM data or to change the compression format. Also, because the benchmark writes new files, they end up in the OS cache, so the test only covers the hot-cache Parquet input case. Finally, the benchmark reads only one file per thread, which usually isn't enough work to show stable pipelining.

The parquet_io example is great, and its first read can be controlled to be cold-cache, which is useful, but it is only single-threaded and single-file.

The brc_pipeline example in #15983 is great, but it relies on the CSV reader which doesn't currently use kvikIO (also see #13916).

Describe the solution you'd like
We could add an example alongside parquet_io to read a list of file names, or maybe all the files in a directory, across a variable number of threads.

We could also read the same file multiple times across threads, but this would make cache-clearing impossible. Even that could be OK for studying performance with hot cache parquet files.
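A minimal sketch of the threading pattern described above, using only the C++ standard library. The helper names (`list_parquet_files`, `partition_round_robin`, `read_files_multithreaded`) and the `read_one` callback are hypothetical and chosen for illustration. In a real libcudf example the callback would presumably wrap `cudf::io::read_parquet`; here it is left as a stand-in so the sketch compiles without a GPU.

```cpp
#include <algorithm>
#include <cstddef>
#include <filesystem>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Collect the *.parquet files directly inside `dir` (non-recursive),
// sorted so that the file-to-thread assignment is reproducible.
std::vector<std::string> list_parquet_files(std::filesystem::path const& dir)
{
  std::vector<std::string> files;
  for (auto const& entry : std::filesystem::directory_iterator(dir)) {
    if (entry.is_regular_file() && entry.path().extension() == ".parquet") {
      files.push_back(entry.path().string());
    }
  }
  std::sort(files.begin(), files.end());
  return files;
}

// Assign files round-robin: thread t gets files t, t + n, t + 2n, ...
// A request for zero threads is treated as one thread.
std::vector<std::vector<std::string>> partition_round_robin(
  std::vector<std::string> const& files, std::size_t num_threads)
{
  num_threads = std::max<std::size_t>(1, num_threads);
  std::vector<std::vector<std::string>> parts(num_threads);
  for (std::size_t i = 0; i < files.size(); ++i) {
    parts[i % num_threads].push_back(files[i]);
  }
  return parts;
}

// Call `read_one` on every file, spreading the files over `num_threads`
// worker threads. `read_one` is a stand-in for the actual reader call
// (assumption: something like cudf::io::read_parquet) and must be safe
// to call concurrently. Returns the number of files each thread read.
std::vector<std::size_t> read_files_multithreaded(
  std::vector<std::string> const& files,
  std::size_t num_threads,
  std::function<void(std::string const&)> const& read_one)
{
  auto const parts = partition_round_robin(files, num_threads);
  std::vector<std::size_t> counts(parts.size(), 0);
  std::vector<std::thread> workers;
  workers.reserve(parts.size());
  for (std::size_t t = 0; t < parts.size(); ++t) {
    // Each worker touches only its own slice and its own counter.
    workers.emplace_back([&parts, &counts, &read_one, t] {
      for (auto const& path : parts[t]) {
        read_one(path);
        ++counts[t];
      }
    });
  }
  for (auto& w : workers) { w.join(); }
  return counts;
}
```

Passing the result of `list_parquet_files` covers the "all the files in a directory" case, while passing a vector that repeats a single path gives the hot-cache variant in which every thread reads the same file. Round-robin assignment is only one possible choice; a shared work queue would balance better when file sizes differ widely.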

Describe alternatives you've considered
Edit the 1billion example to accept Parquet input, but this wouldn't be very flexible.
Use parquet_reader_multithread and hack in different generation patterns.
