Allow ParquetExec to parallelize work based on row groups #137

alamb · 2021-04-26T13:25:26Z

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11056

ParquetExec currently parallelizes work by passing individual files to threads. It would be nice to be able to do this in a finer-grained way by assigning row groups and/or column chunks instead. This will be especially important in distributed systems built on DataFusion.
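
To illustrate the idea, here is a minimal sketch of how row groups, rather than whole files, could be spread across partitions. It is not DataFusion's API or the approach taken by any particular PR: `RowGroupTask` and `assign_row_groups` are hypothetical names, and the greedy "largest first, least-loaded partition" strategy is just one assumption about how the work might be balanced.

```rust
/// Hypothetical description of one row group inside a Parquet file.
/// (Illustrative only; not a DataFusion or parquet crate type.)
#[derive(Debug, Clone, PartialEq)]
struct RowGroupTask {
    file: String,
    row_group: usize,
    bytes: u64,
}

/// Assign row groups to `partitions` buckets, always giving the next
/// largest row group to the partition with the fewest bytes so far.
fn assign_row_groups(
    mut tasks: Vec<RowGroupTask>,
    partitions: usize,
) -> Vec<Vec<RowGroupTask>> {
    assert!(partitions > 0, "need at least one partition");

    let mut buckets: Vec<Vec<RowGroupTask>> = vec![Vec::new(); partitions];
    let mut loads = vec![0u64; partitions];

    // Largest row groups first, so small ones can fill the gaps later.
    tasks.sort_by(|a, b| b.bytes.cmp(&a.bytes));

    for task in tasks {
        // Pick the least-loaded partition (the first one wins on ties).
        let (idx, _) = loads
            .iter()
            .enumerate()
            .min_by_key(|(_, load)| **load)
            .expect("partitions > 0");
        loads[idx] += task.bytes;
        buckets[idx].push(task);
    }
    buckets
}
```

With one file holding four 100-byte row groups and another holding a single 50-byte row group, file-level assignment to two threads would give one thread 400 bytes and the other 50, whereas the sketch above splits the work 250 / 200.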

alamb added the datafusion label Apr 26, 2021

houqp added the enhancement and help wanted labels Oct 18, 2021

korowa mentioned this issue Jan 25, 2023

Parquet parallel scan #5057

Merged

alamb closed this as completed in #5057 Jan 30, 2023
