Is your feature request related to a problem? Please describe.
I wish we had a more flexible tool for studying read_parquet pipelining.
The Parquet reader multithread benchmark is great, but it doesn't let me control the input parquet file, for instance to run against LLM data or to change the compression format. Also, since the benchmark writes new files, they end up in the OS cache, so the test only covers the hot-cache parquet input case. Finally, the benchmark reads only one file per thread, which usually isn't enough work to show stable pipelining.
The parquet_io example is great, but it is single-threaded and single-file. On the plus side, the first read can be controlled to be cold-cache, which is useful.
The brc_pipeline example in #15983 is great, but it relies on the CSV reader which doesn't currently use kvikIO (also see #13916).
Describe the solution you'd like
We could add an example alongside parquet_io to read a list of file names, or maybe all the files in a directory, across a variable number of threads.
We could also read the same file multiple times across threads, but this would make cache-clearing impossible. Even so, that could be OK for studying performance with hot-cache parquet files.
Describe alternatives you've considered
Edit the 1billion example to accept parquet, but this wouldn't be very flexible.
Use the parquet_reader_multithread benchmark and hack in different generation patterns.