Add a separate configuration setting for parallelism of scanning parquet files #924

Closed
alamb opened this issue Aug 22, 2021 · 5 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Aug 22, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When reading multiple parquet files, DataFusion will sometimes request many file handles from the OS concurrently. This is both inefficient (each file handle takes up memory, requires system calls, etc.) and leads to "too many open files" types of errors.

Depending on how quickly IO completes and on the details of the Tokio scheduler, DataFusion can sometimes have far too many files open at once (it might open 100 input parquet files, for example, even when only 8 cores are available for processing).

Describe the solution you'd like
As described by @Dandandan in https://github.com/apache/arrow-datafusion/pull/706/files#r667508175 it would be nice to decouple the setting for the number of concurrent parquet files scanned from the number of target partitions used for other operators.

So the idea would be to add a new config setting parquet_partitions or perhaps filesource_partitions that would control the number of parquet "partitions" created, and thus the number of file handles used to run DataFusion plans.
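
For illustration, a rough sketch of what such a decoupled setting could look like. The field and method names below (e.g. `parquet_scan_partitions`) are made up for this example and are not existing DataFusion options:

```rust
/// Sketch only: illustrates decoupling scan concurrency from `target_partitions`.
/// The `parquet_scan_partitions` field is hypothetical, not an existing option.
#[derive(Clone)]
pub struct ExecutionConfig {
    /// Parallelism used for repartitioning, joins, aggregates, etc.
    pub target_partitions: usize,
    /// Upper bound on the number of parquet files scanned (and therefore
    /// file handles held open) concurrently.
    pub parquet_scan_partitions: usize,
}

impl ExecutionConfig {
    /// Builder-style setter for the hypothetical scan-parallelism setting.
    pub fn with_parquet_scan_partitions(mut self, n: usize) -> Self {
        self.parquet_scan_partitions = n;
        self
    }
}
```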

Describe alternatives you've considered
@andygrove has mentioned that the Ballista scheduler is more sophisticated in this area; hopefully we can move some of those improvements down into the core DataFusion engine.

Additional context
There are reports in arrow-rs of "too many open files" errors (apache/arrow-rs#47 (comment)) which may also be helped by this feature, though there is probably more work needed as well.

@alamb added the enhancement (New feature or request) label Aug 22, 2021
@Dandandan
Contributor

A good start might be to limit the maximum number of threads used for spawn_blocking code; by default Tokio allows up to 512 concurrent threads for those tasks.

See:

https://docs.rs/tokio/1.10.0/tokio/index.html#cpu-bound-tasks-and-blocking-code
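
A minimal sketch of capping the blocking pool when building the Tokio runtime; the specific limits below are illustrative values, not recommended defaults:

```rust
use tokio::runtime::Builder;

fn main() -> std::io::Result<()> {
    // Cap the blocking pool well below Tokio's default of 512 threads so that
    // spawn_blocking-based file IO cannot open hundreds of files at once.
    let runtime = Builder::new_multi_thread()
        .worker_threads(8)        // async worker threads (illustrative)
        .max_blocking_threads(16) // spawn_blocking pool size (illustrative)
        .enable_all()
        .build()?;

    runtime.block_on(async {
        // run DataFusion queries here
    });
    Ok(())
}
```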

@alamb
Contributor Author

alamb commented Aug 23, 2021

For systems that use the same (global) Tokio executor for DataFusion and possibly other non-DataFusion code, I do think it would also make sense to have DataFusion limit its parallelism internally.
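
One way to bound that internally, independent of how the runtime itself is configured, would be a semaphore that limits concurrent file opens. A rough sketch (the `open_and_scan` function is a placeholder, not a DataFusion API):

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Scan `paths`, but never hold more than `max_open` files open at once,
/// no matter how many scan tasks the scheduler spawns.
async fn scan_files(paths: Vec<String>, max_open: usize) {
    let permits = Arc::new(Semaphore::new(max_open));
    let mut handles = Vec::new();
    for path in paths {
        let permits = Arc::clone(&permits);
        handles.push(tokio::spawn(async move {
            // The permit is held for the lifetime of the scan, capping the
            // number of simultaneously open file handles.
            let _permit = permits.acquire_owned().await.expect("semaphore closed");
            open_and_scan(&path).await;
        }));
    }
    for handle in handles {
        let _ = handle.await;
    }
}

/// Placeholder for opening a parquet file and decoding it.
async fn open_and_scan(path: &str) {
    let _ = path;
}
```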

@tustvold
Contributor

@alamb
Contributor Author

alamb commented Jan 17, 2022

@yjshen started consolidating such config settings in #1562

@alamb
Contributor Author

alamb commented Nov 28, 2022

I think this is basically no longer an issue, as the parquet scanning parallelism can now be controlled via the configuration of ParquetExec.
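
For anyone landing here later, a hedged sketch of limiting scan parallelism through the session configuration; the API names below reflect DataFusion around this time and may differ in other releases:

```rust
use datafusion::prelude::{ParquetReadOptions, SessionConfig, SessionContext};

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // target_partitions bounds how many partitions (and thus how many
    // concurrently scanned files) a plan uses; 8 is an illustrative value.
    let config = SessionConfig::new().with_target_partitions(8);
    let ctx = SessionContext::with_config(config);

    let df = ctx
        .read_parquet("data/", ParquetReadOptions::default())
        .await?;
    df.show().await?;
    Ok(())
}
```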

@alamb closed this as completed Nov 28, 2022