Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Parallel Arrow file format reading #8897

Merged
merged 5 commits into from
Jan 29, 2024

Conversation

my-vegetable-has-exploded
Copy link
Contributor

@my-vegetable-has-exploded my-vegetable-has-exploded commented Jan 17, 2024

Which issue does this PR close?

Closes #8503

Rationale for this change

What changes are included in this PR?

If file_meta.range is some, filter recordbatches according to range, then scan recordbatches.

Are these changes tested?

physical plan of scaning arrow files changes in repartition_scan.slt.

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jan 17, 2024
@my-vegetable-has-exploded my-vegetable-has-exploded marked this pull request as ready for review January 18, 2024 03:22
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me @my-vegetable-has-exploded -- thank you very much.

I think it is likely to not work well on remote object store given how many requests are made but I also think that could be handled by a follow on PR

My only concern with this PR as written is if the tests actually exercise the multi-batch reading code given how small the input files in repartition.slt are

@@ -61,6 +61,7 @@ postgres = ["bytes", "chrono", "tokio-postgres", "postgres-types", "postgres-pro
[dev-dependencies]
env_logger = { workspace = true }
num_cpus = { workspace = true }
tokio = { version = "1.0", features = ["rt-multi-thread"] }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed?

@@ -253,7 +253,16 @@ query TT
EXPLAIN SELECT * FROM arrow_table
----
logical_plan TableScan: arrow_table projection=[f0, f1, f2]
physical_plan ArrowExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow]]}, projection=[f0, f1, f2]
physical_plan ArrowExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:0..461], [WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:461..922], [WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:922..1383], [WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:1383..1842]]}, projection=[f0, f1, f2]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 looks good to me -- though I I wonder will this actually read in parallel (or do these ranges all end up in the same reader)?

@alamb alamb merged commit 9bf0f68 into apache:main Jan 29, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parallel Arrow file format reading
2 participants