Parquet parallel scan #5057

korowa · 2023-01-25T13:04:16Z

Which issue does this PR close?

Closes #137.

Rationale for this change

Improved performance for reading single parquet file / parquet files in quantity less than number of cores in multicore runtime

What changes are included in this PR?

repartition_file_scans & repartition_file_min_size optimizer settings - by default repartitioning of file scans disabled, and performed if total size of files to scan greater than 10MB (to avoid splitting small amount of small files)
get_repartitioned in ParquetExec - returns cloned object with ranged PartitionedFiles redistributed file groups in base_config
repartition.rs - now calls get_repartitioned for ParquetExec in case repartitioning is allowed (upstream operator benefit from it / no data ordering violations / etc.)

As any other repartitioning operation, parallelization is applied only in case ParquetExec is underloaded in terms of partitions -- two files scan will be distributed over 4 partitions, but no redistribution will be performed in case of "2 files - 2 target partitions".

Are these changes tested?

Tests for ParquetExec.get_repartitioned added, and more tests added for repartition rule -- mostly copies of existing tests for cases when parallelization should be ignored, to ensure it won't break physical plans.

Are there any user-facing changes?

New configuration settings repartition_file_scans & repartition_file_min_size

…scan

tustvold · 2023-01-25T18:08:58Z

Only had time to take a brief look at this PR, and so I'm likely missing something but please bear with me 😄

This PR modifies ListingTable to pair together PartitionedFile with Vec<Option<FileRange>>, this makes this approach specific to ListingTable and also adds parallelism control to a part of the system that doesn't really have context on how much parallelism is needed, nor what invariants such as sort orders may need to be upheld.

I have two suggestions that may be stupid:

Make this a physical optimizer rule that looks at operators containing FileScanConfig and adds more partitions based on the target_partitions property
Rather than adding a new FileRanges property, instead using the existing range: Option<FileRange> already present on PartitionedFile, the same file with disjoint ranges can then appear in multiple partitions

tustvold · 2023-01-25T18:23:42Z

datafusion/core/src/datasource/file_format/parquet.rs

+    };
+
+    if collect_file_ranges {
+        let file_ranges = parquet_metadata


FWIW the way these ranges are applied in parquet is based on if the row group's midpoint lies within the given range, as a result there is no requirement that these ranges exactly delimit boundaries.

For example you could take a parquet file of 2GB and blindly chop it into 4x 512MB slices. This makes the assumption that there are at least 4 row groups and the row groups are similarly sized, in practice this is probably fine. This is what Spark does and avoids needing the file's metadata to do the optimisation.

True, cutting file on N even parts will allow to read only row groups having their start offset inside corresponding ranges without any duplicate reads or skipped row groups - so, splitting could be much easier without using metadata (except for ObjectMetadata for the size)

korowa · 2023-01-25T18:54:45Z

@tustvold , thank you for the comments!

Initially my intention was to handle scan planning as early as possible, so ListingTable looked like proper place for this - it actually holds parallelism settings in options attribute, and fetches required metadata.

But, yeah, now I see that physical optimizer, and especially its repartitioning rule is much better suited for repatitioning ParquetExec 🤔 .

I guess I'll convert this PR to draft and come up a bit later with updated version of this optimizer rule.

…scan

korowa · 2023-01-27T10:10:39Z

I've reworked this PR by utilizing repartition rule, and updated the description -- it looks way more simple now, and, what's more important, it takes into account any plan-related restrictions for ParquetExec.

@tustvold, thank you for the tip about using physical optimizer!

tustvold

This looks really cool, left some minor comments

datafusion/core/src/physical_optimizer/repartition.rs

tustvold · 2023-01-27T10:58:10Z

datafusion/core/src/physical_optimizer/repartition.rs

@@ -846,6 +871,182 @@ mod tests {
        Ok(())
    }

+    #[test]


Love the test coverage

datafusion/core/src/physical_plan/file_format/parquet.rs

alamb · 2023-01-27T20:54:03Z

I plan to review this carefully either later today or tomorrow. Very exciting @korowa -- thank you

alamb

I went through the code carefully and I really like it. Thank you @korowa -- do you have any performance benchmarks you can share? I think this will mostly help when scanning single large parquet files.

I would like to explore turning this feature on by default (perhaps we can have a separate ticket to track that)

I already feel bad that we don't have other parquet options enabled by default.

datafusion/common/src/config.rs

datafusion/core/src/physical_optimizer/repartition.rs

alamb · 2023-01-28T13:23:41Z

My measurements suggest this setting can improve the performance with single large parquet files significantly (over 2x in my measurement). 👨‍🍳 👌 -- very nice

I tested this out by making a 9G parquet file from https://github.com/tustvold/access-log-gen/

Then using datafusion-cli:

❯ select avg(request_bytes), avg(response_bytes), avg(response_status), host from '/Users/alamb/Software/access-log-gen/logs.9G.parquet' group by host;
...
927 rows in set. Query took 2.313 seconds.

And then I enabled this setting:

❯ set datafusion.optimizer.repartition_file_scans = true;
0 rows in set. Query took 0.000 seconds.
❯ select avg(request_bytes), avg(response_bytes), avg(response_status), host from '/Users/alamb/Software/access-log-gen/logs.9G.parquet' group by host;

927 rows in set. Query took 0.962 seconds.

😮

korowa · 2023-01-28T13:34:35Z

I have only these in my notes

// Single partition
./target/release/examples/clickbench  4.10s user 0.59s system 113% cpu 4.142 total
./target/release/examples/clickbench  4.14s user 0.56s system 136% cpu 3.451 total
./target/release/examples/clickbench  4.13s user 0.55s system 135% cpu 3.446 total
./target/release/examples/clickbench  4.29s user 0.55s system 134% cpu 3.603 total

// Multiple partition
./target/release/examples/clickbench  5.32s user 0.96s system 288% cpu 2.174 total
./target/release/examples/clickbench  5.14s user 0.89s system 392% cpu 1.534 total
./target/release/examples/clickbench  5.24s user 0.92s system 405% cpu 1.517 total
./target/release/examples/clickbench  5.14s user 0.91s system 399% cpu 1.516 total
./target/release/examples/clickbench  5.23s user 0.92s system 405% cpu 1.518 total

which doesn't look like "official" benchmark at all 🙃 . These are results for this query from clickbench over 14GB parquet file on 2.6 GHz 6-Core Intel Core i7

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

korowa · 2023-01-28T14:40:38Z

I think this will mostly help when scanning single large parquet files.

As it works for now - yes, use case is mostly "relatively large files less than number of target_partitions" -- I guess it could be improved / reworked later to something like "perform repartitioning even for target_partitions in case there is significant skew in current partitioning"

I would like to explore turning this feature on by default (perhaps we can have a separate ticket to track that)

I don't mind enabling parallelism by default and it seems to be the fastest way to deliver this feature, but (I'm not sure, just a suggestion) maybe better time for this will be in 1 (or 2) releases after the setting itself will be released?

alamb · 2023-01-28T16:18:53Z

As it works for now - yes, use case is mostly "relatively large files less than number of target_partitions" -- I guess it could be improved / reworked later to something like "perform repartitioning even for target_partitions in case there is significant skew in current partitioning"

I think to make this effective we will need to have more runtime dynamics (aka using a morsel driven scheduler)

I don't mind enabling parallelism by default and it seems to be the fastest way to deliver this feature, but (I'm not sure, just a suggestion) maybe better time for this will be in 1 (or 2) releases after the setting itself will be released?

I agree -- let's get this PR merged in (default to off) and then plan to enable it by default in a few weeks (we just need to remember to do so!)

alamb · 2023-01-28T16:20:23Z

datafusion/core/src/physical_optimizer/repartition.rs

+            "CoalescePartitionsExec",
+            "AggregateExec: mode=Partial, gby=[], aggr=[]",
+            // Multiple source files splitted across partitions
+            "ParquetExec: limit=None, partitions={4 groups: [[x:0..75], [x:75..100, y:0..50], [y:50..125], [y:125..200]]}, projection=[c1]",


that is quite clever that the partitions have different parts of the same file 👍

alamb · 2023-01-28T16:21:21Z

I plan to leave this open for the rest of the weekend so others have a chance to comment if they want, and then merge on Monday

alamb · 2023-01-30T19:18:30Z

Thanks again @korowa ❤️

ursabot · 2023-01-30T19:25:54Z

Benchmark runs are scheduled for baseline = 74b05fa and contender = 67b1da8. 67b1da8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Ted-Jiang · 2023-01-31T05:30:17Z

Thanks @korowa this is really cool! 👍 nice work!
I have a question about multiple row groups from the same file may be read concurrently

from the code

 let target_partition_size =
            (total_size as usize + (target_partitions) - 1) / (target_partitions);

How could we make sure the file is dived into row group? 🤔

alamb · 2023-01-31T11:54:01Z

I filed #5125 to track turning this on by default

tustvold · 2023-01-31T11:55:55Z

How could we make sure the file is dived into row group?

Without the parquet file metadata we can't reliably, but it isn't important to correctness that we do so. Not needing the metadata significantly simplifies the planning and avoids potentially costly round trips to object store whilst planning. FWIW I believe this is a similar approach as taken by Spark.

alamb · 2023-01-31T11:57:14Z

I believe what happens is that the file is divided into byte ranges and then the row groups whose data falls within those ranges are scanned. It would be good to double check this undertanding though

https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/datasource/listing/mod.rs#L52-L63

tustvold · 2023-01-31T11:58:44Z

whose data falls within those ranges are scanned

More specifically it is the row groups with a midpoint that falls into the range, this means that so long as the ranges are disjoint there is no risk of reading the same row group twice

korowa · 2023-01-31T13:54:18Z

@Ted-Jiang here is the exact place where DF decides to read/not to read RowGroup depending on range. So it actually isn't required to split files on ranges with boundaries same as RowGroups boundaries.

Ted-Jiang · 2023-02-01T04:04:02Z

Thanks for all kindly reply ! ❤️

here is the exact place where DF decides to read/not to read RowGroup depending on range. So it actually isn't required to split files on ranges with boundaries same as RowGroups boundaries.

So PartitionedFile range is not used for fetch bytes from objectStore, finally we use rowgroup start, offset to fetch bytest ? am i right?

I think i miss this part cause the misunderstanding😂

korowa · 2023-02-01T08:01:09Z

So PartitionedFile range is not used for fetch bytes from objectStore, finally we use rowgroup start, offset to fetch bytest ? am i right?

Yeah, I can't find any places where range is used, except for row_group pruning

Ted-Jiang · 2023-02-01T09:29:16Z

So PartitionedFile range is not used for fetch bytes from objectStore, finally we use rowgroup start, offset to fetch bytest ? am i right?

Yeah, I can't find any places where range is used, except for row_group pruning

Thanks again for everyone ✌️

korowa added 2 commits January 25, 2023 14:44

parallel parquet scanning

5a6d074

Merge remote-tracking branch 'upstream/master' into parquet_parallel_…

b326dce

…scan

github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jan 25, 2023

tustvold reviewed Jan 25, 2023

View reviewed changes

korowa marked this pull request as draft January 25, 2023 18:54

korowa added 2 commits January 27, 2023 12:12

repartitioning ParquetExec

44b4284

Merge remote-tracking branch 'upstream/master' into parquet_parallel_…

659d9dc

…scan

korowa marked this pull request as ready for review January 27, 2023 10:01

tustvold approved these changes Jan 27, 2023

View reviewed changes

korowa force-pushed the parquet_parallel_scan branch from 9cc44c4 to 9caa62a Compare January 27, 2023 11:04

minor changes & review comments

556b0c6

korowa force-pushed the parquet_parallel_scan branch 2 times, most recently from 53e8fc5 to d6e95f7 Compare January 27, 2023 17:52

settings reorganized

997b63e

korowa force-pushed the parquet_parallel_scan branch from d6e95f7 to 997b63e Compare January 27, 2023 17:56

alamb approved these changes Jan 28, 2023

View reviewed changes

korowa and others added 2 commits January 28, 2023 16:46

Apply suggestions from code review

e829fde

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

additional test case & updated docs

57440a9

alamb reviewed Jan 28, 2023

View reviewed changes

alamb approved these changes Jan 28, 2023

View reviewed changes

alamb merged commit 67b1da8 into apache:master Jan 30, 2023

alamb mentioned this pull request Jan 31, 2023

Enable parquet parallel scans by default #5125

Closed

alamb mentioned this pull request May 10, 2023

Parallel CSV reading #6325

Closed

2010YOUY01 mentioned this pull request Jun 29, 2023

parallel csv scan #6801

Merged

austin362667 mentioned this pull request Sep 20, 2024

Add JOB benchmark dataset [1/N] (imdb dataset) #12497

Merged

austin362667 mentioned this pull request Oct 12, 2024

Update Datafusion Ray architecture docs apache/datafusion-ray#27

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parquet parallel scan #5057

Parquet parallel scan #5057

korowa commented Jan 25, 2023 •

edited

Loading

tustvold commented Jan 25, 2023

tustvold Jan 25, 2023 •

edited

Loading

korowa Jan 25, 2023

korowa commented Jan 25, 2023

korowa commented Jan 27, 2023

tustvold left a comment

tustvold Jan 27, 2023

alamb commented Jan 27, 2023

alamb left a comment

alamb commented Jan 28, 2023

korowa commented Jan 28, 2023 •

edited

Loading

korowa commented Jan 28, 2023 •

edited

Loading

alamb commented Jan 28, 2023 •

edited

Loading

alamb Jan 28, 2023

alamb commented Jan 28, 2023

alamb commented Jan 30, 2023

ursabot commented Jan 30, 2023

Ted-Jiang commented Jan 31, 2023

alamb commented Jan 31, 2023

tustvold commented Jan 31, 2023

alamb commented Jan 31, 2023

tustvold commented Jan 31, 2023

korowa commented Jan 31, 2023 •

edited

Loading

Ted-Jiang commented Feb 1, 2023 •

edited

Loading

korowa commented Feb 1, 2023

Ted-Jiang commented Feb 1, 2023

@@ @@ -846,6 +871,182 @@ mod tests { @@
                       Ok(())
                   }
+                  #[test]

Parquet parallel scan #5057

Parquet parallel scan #5057

Conversation

korowa commented Jan 25, 2023 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

tustvold commented Jan 25, 2023

tustvold Jan 25, 2023 • edited Loading

Choose a reason for hiding this comment

korowa Jan 25, 2023

Choose a reason for hiding this comment

korowa commented Jan 25, 2023

korowa commented Jan 27, 2023

tustvold left a comment

Choose a reason for hiding this comment

tustvold Jan 27, 2023

Choose a reason for hiding this comment

alamb commented Jan 27, 2023

alamb left a comment

Choose a reason for hiding this comment

alamb commented Jan 28, 2023

korowa commented Jan 28, 2023 • edited Loading

korowa commented Jan 28, 2023 • edited Loading

alamb commented Jan 28, 2023 • edited Loading

alamb Jan 28, 2023

Choose a reason for hiding this comment

alamb commented Jan 28, 2023

alamb commented Jan 30, 2023

ursabot commented Jan 30, 2023

Ted-Jiang commented Jan 31, 2023

alamb commented Jan 31, 2023

tustvold commented Jan 31, 2023

alamb commented Jan 31, 2023

tustvold commented Jan 31, 2023

korowa commented Jan 31, 2023 • edited Loading

Ted-Jiang commented Feb 1, 2023 • edited Loading

korowa commented Feb 1, 2023

Ted-Jiang commented Feb 1, 2023

korowa commented Jan 25, 2023 •

edited

Loading

tustvold Jan 25, 2023 •

edited

Loading

korowa commented Jan 28, 2023 •

edited

Loading

korowa commented Jan 28, 2023 •

edited

Loading

alamb commented Jan 28, 2023 •

edited

Loading

korowa commented Jan 31, 2023 •

edited

Loading

Ted-Jiang commented Feb 1, 2023 •

edited

Loading