
Parallel CSV reading #6325

Closed
alamb opened this issue May 10, 2023 · 1 comment · Fixed by #6801
Labels
enhancement New feature or request performance Make DataFusion faster

alamb commented May 10, 2023

Is your feature request related to a problem or challenge?

As part of having a great "out of the box" experience, it is important for DataFusion to use as many cores as possible. Given that modern consumer laptops have 8-16 cores, using multiple cores can literally translate to an order of magnitude faster performance.

While DataFusion offers the ability to read partitioned datasets (i.e. when the input is split across multiple files), often, especially when initially testing out the tool, people will simply run queries on their existing CSV or JSON datasets, and that will be relatively slow.

We already have the great datafusion.optimizer.repartition_file_scans option (see docs) -- added by @korowa in #5057 (👋 !) -- which uses multiple cores to decode Parquet files in parallel. I would like a similar feature for CSV files.

Describe the solution you'd like

One basic approach (following what @korowa did for Parquet) would be:

  1. If the datafusion.optimizer.repartition_file_scans option is set, divide the file into evenly (byte) sized contiguous blocks, probably with some lower limit (like 1MB)
  2. Update CsvExec to process partitions using those subsets of the files

Notes:
Given the vagaries of CSV (e.g. unescaped quoted newlines), it is likely impossible to parallelize CSV reading for all possible files. I think this is fine: as long as parallel reading can be turned off, it is better to have faster out-of-the-box query performance for 99.99% of queries than to always handle bizarre CSV files.

Care will be required to make sure all records are read exactly once, given that the partition splits will likely fall in the middle of rows.

One idea for parsing a partition (offset, len):

  1. Start CSV parsing at the first newline found after the offset byte
  2. Continue CSV parsing through the first newline found after the offset + len byte
        0        A,1,2,3,4,5,6,7,8,9\n                            
        20       A,1,2,3,4,5,6,7,8,9\n                            
        40       A,1,2,3,4,5,6,7,8,9\n ◀─ ─ ─ ─ ─ ─ ─ ─           
        60       A,1,2,3,4,5,6,7,8,9\n                 │          
        80       A,1,2,3,4,5,6,7,8,9\n                            
        100      A,1,2,3,4,5,6,7,8,9\n                 │          
                                                                  
Byte Offset       Lines of CSV Data                    │          
                  (in this case 20                                
                  bytes per line)           Split at byte 50 is in
                                              the middle of this  
                                                     line         
                                                                  

This is similar to what is described in #5205 (comment)
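The boundary rule described above (start after the first newline following `offset`, read through the first newline following `offset + len`) can be sketched roughly as follows. This is a minimal standalone sketch, not DataFusion code: `partition_bounds` and its helper are hypothetical names, the first partition is special-cased to start at byte 0, and quoted newlines (the failure mode noted above) are deliberately ignored.

```rust
// Hypothetical sketch of the partition boundary rule; not a DataFusion API.
fn partition_bounds(data: &[u8], offset: usize, len: usize) -> (usize, usize) {
    // Index just past the first '\n' at or after `pos` (or EOF if none).
    fn after_next_newline(data: &[u8], pos: usize) -> usize {
        data[pos..]
            .iter()
            .position(|&b| b == b'\n')
            .map(|i| pos + i + 1)
            .unwrap_or(data.len())
    }
    // The first partition starts at byte 0; every other partition skips the
    // partial row it lands in (the previous partition reads through it).
    let start = if offset == 0 {
        0
    } else {
        after_next_newline(data, offset)
    };
    // Read through the newline after `offset + len`, so a row straddling the
    // split point is read exactly once, by the earlier partition.
    let end = after_next_newline(data, (offset + len).min(data.len()));
    (start, end)
}

fn main() {
    // Six 20-byte rows, as in the diagram above (120 bytes total).
    let data = "A,1,2,3,4,5,6,7,8,9\n".repeat(6);
    let bytes = data.as_bytes();
    // A split at byte 50 lands mid-row: partition 0 reads through the row
    // boundary at byte 60, and partition 1 starts there.
    assert_eq!(partition_bounds(bytes, 0, 50), (0, 60));
    assert_eq!(partition_bounds(bytes, 50, 50), (60, 120));
    println!("partitions tile the file with no overlap");
}
```

Because every partition ends at the same place the next one starts (the first row boundary after the split point), the byte ranges tile the file and each record is parsed exactly once.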

Describe alternatives you've considered

No response

Additional context

@kmitchener noticed the same thing in: #5205

The duckdb implementation of a similar feature may offer some inspiration: duckdb/duckdb#5194

@tustvold
Contributor

I believe this will require apache/arrow-rs#2241, in particular the ability to do a streaming byte range get. I will add this to my list.
