
[FEA] Add synchronization for IO between read_parquet calls on different threads #16936

Closed
GregoryKimball opened this issue Sep 26, 2024 · 3 comments
Labels
cuIO, feature request, libcudf

Comments


GregoryKimball commented Sep 26, 2024

Is your feature request related to a problem? Please describe.

When launching a pool of threads, each calling read_parquet, we often find poor pipelining for the first read. We observe that each thread submits IO to the scheduler, and since each thread has the same priority, the scheduler completes the IO for all the threads at about the same time. This means that compute on the data cannot begin until all threads finish IO.

[Profile screenshot: timeline of four threads, each reading three Parquet files]

In this profile, we have 4 threads reading 3 parquet files each. You can see that the first read does not pipeline and instead does all the copying, followed by all the compute. The second read shows better copy-compute pipelining because the threads are staggered when the second read begins. For this example it only takes about 10-20 ms of jitter in the start times to achieve good pipelining.
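For reference, a minimal sketch of this kind of setup, assuming hypothetical file names and downstream compute (in a real pipeline each thread would also typically pass its own CUDA stream to read_parquet):

```cpp
#include <cudf/io/parquet.hpp>

#include <string>
#include <thread>
#include <vector>

int main()
{
  // Hypothetical file layout: 4 threads, 3 parquet files each, as in the profile above.
  std::vector<std::vector<std::string>> files_per_thread = {
    {"t0_0.parquet", "t0_1.parquet", "t0_2.parquet"},
    {"t1_0.parquet", "t1_1.parquet", "t1_2.parquet"},
    {"t2_0.parquet", "t2_1.parquet", "t2_2.parquet"},
    {"t3_0.parquet", "t3_1.parquet", "t3_2.parquet"}};

  std::vector<std::thread> workers;
  for (auto const& files : files_per_thread) {
    workers.emplace_back([&files]() {
      for (auto const& path : files) {
        auto opts =
          cudf::io::parquet_reader_options::builder(cudf::io::source_info{path}).build();
        auto result = cudf::io::read_parquet(opts);  // IO + decompression + decode
        // ... downstream compute on result.tbl ...
      }
    });
  }
  for (auto& t : workers) { t.join(); }
  return 0;
}
```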

Describe the solution you'd like
I would like to enable simple communication between read_parquet calls in the same process. Only one thread should be able to do IO at a time: the first thread completes its IO and launches its decompression kernel, and only then can the second thread begin IO. That way the second thread's copying overlaps with the first thread's compute, and so on for the rest of the threads.

To implement this, we could introduce an "IO lock" using std::condition_variable that lets a thread set a bit before starting IO and then unset the bit when IO completes.
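
A minimal sketch of what that IO lock could look like (the class and names are illustrative, not an existing libcudf API):

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical "IO lock": a thread sets the busy flag before starting IO and
// clears it when its IO completes, so only one read_parquet call does IO at a time.
class io_lock {
 public:
  void acquire()
  {
    std::unique_lock<std::mutex> lock(mtx_);
    cv_.wait(lock, [this] { return !busy_; });
    busy_ = true;
  }
  void release()
  {
    {
      std::lock_guard<std::mutex> lock(mtx_);
      busy_ = false;
    }
    cv_.notify_one();
  }

 private:
  std::mutex mtx_;
  std::condition_variable cv_;
  bool busy_{false};
};

// Inside each reader thread (reader internals are placeholders):
//   lock.acquire();
//   /* submit IO and wait for it to complete */
//   lock.release();
//   /* launch decompression and decode kernels */
```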

Describe alternatives you've considered
We could add sleeps for each thread to help stagger the copies and enable better pipelining.
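For example (the 15 ms step is illustrative, based on the 10-20 ms of jitter noted above):

```cpp
#include <chrono>
#include <thread>

// Hypothetical stagger: delay each worker by ~15 ms per thread index before its
// first read_parquet call so the copies from different threads do not line up.
void stagger_start(int thread_index)
{
  std::this_thread::sleep_for(std::chrono::milliseconds(15 * thread_index));
}
```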

Additional context
Please note, we can still do a kvikIO read with multiple threads (e.g. KVIKIO_NTHREADS=8) within each of the main read_parquet threads. We just want each read_parquet call to complete its IO and launch compute before the other read_parquet calls begin their IO.

For a disk read it is probably also better to do all the sequential reads together for a single file, rather than submitting 4 MB reads from different files back-to-back.


vuule commented Sep 26, 2024

we can still do a kvikIO read with multiple threads (e.g. KVIKIO_NTHREADS=8) for each of the main read_parquet threads

I think the thread pool is per-process, so we'd have 8 threads shared between our read_parquet threads. That makes sense to me, as the optimal IO parallelism is presumably determined by the storage.


vuule commented Oct 10, 2024

Experimented with this:

  • Putting only IO (only initiation of the async copies) in a critical section leads to almost no change in pipelining. IO is no longer interleaved between threads, but the problem is that we perform a few H2D copies before decompression, and these end up blocked on the IO of other threads.
  • Including decompression in the critical section does improve pipelining a bit, as decode is done during IO from other threads. However, total execution time ends up much longer because threads are mostly idle while any thread is in the critical section.

We might be able to get mild improvements with only IO in the critical section + pinned host buffers + MME/SM copy. The tricky part is that even a single H2D copy before decompression eliminates the positive impact.
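
To make the two variants concrete, a rough sketch of where the critical section sat in each experiment, reusing the hypothetical io_lock sketched in the feature description (reader internals are placeholders):

```cpp
io_lock g_io_lock;  // hypothetical lock from the sketch above

// Variant 1: only IO in the critical section. The few H2D copies issued before
// decompression still end up waiting behind the IO of other threads.
void read_one_file_io_only()
{
  g_io_lock.acquire();
  // initiate async reads and wait for the data to land in host memory
  g_io_lock.release();
  // H2D copies, decompression, decode, downstream compute
}

// Variant 2: IO + decompression in the critical section. Pipelining improves a bit,
// but the other threads are mostly idle while one thread holds the lock.
void read_one_file_io_plus_decompression()
{
  g_io_lock.acquire();
  // initiate async reads, wait, H2D copies, launch decompression
  g_io_lock.release();
  // decode and downstream compute
}
```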

@GregoryKimball
Contributor Author

Thank you @vuule for experimenting with this. I think we can close it for now. We will need a better tool (e.g. multi-threaded NDS-H) to study pipelining improvements.
