feat: Parallel Arrow file format reading #8897

my-vegetable-has-exploded · 2024-01-17T14:30:29Z

Which issue does this PR close?

Closes #8503

Rationale for this change

What changes are included in this PR?

If file_meta.range is Some, the record batches are filtered according to that range, and only the matching record batches are scanned.
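A minimal sketch of that filtering step, assuming each record batch is assigned to the range that contains its starting offset. `BlockMeta` and `blocks_in_range` are hypothetical names used for illustration; they are not the types or functions used in this PR or in arrow-rs.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a record batch entry listed in the Arrow file footer.
#[derive(Debug, Clone, PartialEq)]
struct BlockMeta {
    /// Byte offset of the record batch within the file.
    offset: usize,
    /// Length of the record batch in bytes.
    len: usize,
}

/// Keep only the blocks whose starting offset falls inside `range`.
/// With no range, every block is kept (the whole file is scanned).
fn blocks_in_range(blocks: &[BlockMeta], range: Option<&Range<usize>>) -> Vec<BlockMeta> {
    match range {
        None => blocks.to_vec(),
        Some(r) => blocks
            .iter()
            .filter(|b| r.contains(&b.offset))
            .cloned()
            .collect(),
    }
}
```

Under this assumption a batch that straddles a range boundary is read only by the partition that owns its first byte, so every batch is scanned exactly once across all ranges.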

Are these changes tested?

The physical plan for scanning Arrow files changes in repartition_scan.slt.

Are there any user-facing changes?

alamb

The code looks good to me @my-vegetable-has-exploded -- thank you very much.

I think it is likely to not work well on a remote object store given how many requests are made, but I also think that could be handled by a follow-on PR.

My only concern with this PR as written is whether the tests actually exercise the multi-batch reading code, given how small the input files in repartition.slt are.

alamb · 2024-01-25T22:17:47Z

datafusion/sqllogictest/Cargo.toml

@@ -61,6 +61,7 @@ postgres = ["bytes", "chrono", "tokio-postgres", "postgres-types", "postgres-pro
 [dev-dependencies]
 env_logger = { workspace = true }
 num_cpus = { workspace = true }
+tokio = { version = "1.0", features = ["rt-multi-thread"] }


Why is this needed?

alamb · 2024-01-25T22:18:10Z

datafusion/sqllogictest/test_files/repartition_scan.slt

@@ -253,7 +253,16 @@ query TT
 EXPLAIN SELECT * FROM arrow_table
 ----
 logical_plan TableScan: arrow_table projection=[f0, f1, f2]
-physical_plan ArrowExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow]]}, projection=[f0, f1, f2]
+physical_plan ArrowExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:0..461], [WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:461..922], [WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:922..1383], [WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:1383..1842]]}, projection=[f0, f1, f2]


👍 looks good to me -- though I wonder whether this will actually read in parallel (or do these ranges all end up in the same reader)?
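For reference, the four ranges in the expected plan above are consistent with splitting a 1842-byte file into four chunks of ceil(1842 / 4) = 461 bytes each, with the last chunk truncated at the end of the file. A minimal sketch of that arithmetic follows; `split_byte_ranges` is a hypothetical helper written for illustration, not DataFusion's repartitioning code.

```rust
use std::ops::Range;

/// Split `file_size` bytes into contiguous ranges of
/// ceil(file_size / target_partitions) bytes; the last range may be shorter.
/// Hypothetical helper for illustration only.
fn split_byte_ranges(file_size: usize, target_partitions: usize) -> Vec<Range<usize>> {
    if file_size == 0 || target_partitions == 0 {
        return Vec::new();
    }
    // Integer ceiling division.
    let chunk = (file_size + target_partitions - 1) / target_partitions;
    (0..file_size)
        .step_by(chunk)
        .map(|start| start..(start + chunk).min(file_size))
        .collect()
}
```

Calling `split_byte_ranges(1842, 4)` yields 0..461, 461..922, 922..1383 and 1383..1842, matching the file groups in the plan. Whether those groups are then read by separate readers in parallel is a separate question that this arithmetic does not answer.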

github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jan 17, 2024

my-vegetable-has-exploded marked this pull request as ready for review January 18, 2024 03:22

my-vegetable-has-exploded added 5 commits January 21, 2024 12:31

feat: Parallel Arrow file format reading

cd7335d

update slt for arrow scan.

6a04b5c

fix tomlfmt

97177a6

fix tomlfmt

1dfcb2b

update configs.md

9348b09

my-vegetable-has-exploded force-pushed the arrow-repartition branch from 8e4d9f2 to 9348b09 Compare January 21, 2024 04:36

alamb approved these changes Jan 25, 2024


alamb mentioned this pull request Jan 25, 2024

DataFusion weekly project plan (Andrew Lamb) - Jan 22, 2024 #8933

Closed

9 tasks

alamb merged commit 9bf0f68 into apache:main Jan 29, 2024
22 checks passed
