feat: Parallel Arrow file format reading #8897
The code looks good to me @my-vegetable-has-exploded -- thank you very much.
I think it is likely not to work well on a remote object store, given how many requests are made, but I also think that could be handled by a follow-on PR.
My only concern with this PR as written is whether the tests actually exercise the multi-batch reading code, given how small the input files in repartition.slt are.
@@ -61,6 +61,7 @@ postgres = ["bytes", "chrono", "tokio-postgres", "postgres-types", "postgres-pro
[dev-dependencies]
env_logger = { workspace = true }
num_cpus = { workspace = true }
tokio = { version = "1.0", features = ["rt-multi-thread"] }
Why is this needed?
@@ -253,7 +253,16 @@ query TT
EXPLAIN SELECT * FROM arrow_table
----
logical_plan TableScan: arrow_table projection=[f0, f1, f2]
physical_plan ArrowExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow]]}, projection=[f0, f1, f2]
physical_plan ArrowExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:0..461], [WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:461..922], [WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:922..1383], [WORKSPACE_ROOT/datafusion/core/tests/data/example.arrow:1383..1842]]}, projection=[f0, f1, f2]
👍 looks good to me -- though I wonder whether this will actually read in parallel (or do these ranges all end up in the same reader)?
Which issue does this PR close?
Closes #8503
Rationale for this change
What changes are included in this PR?
If file_meta.range is Some, filter the record batches according to that range, then scan only the record batches that remain. If it is None, the whole file is scanned as before.
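The filtering step can be sketched roughly as follows. This is a minimal, std-only illustration and not the code in this PR: the `Block` struct and `blocks_for_range` function are hypothetical stand-ins for the record-batch block entries listed in the Arrow file footer, and the rule "a batch belongs to the range that contains its start offset" is an assumption made for the sketch.

```rust
use std::ops::Range;

/// Hypothetical stand-in for one record-batch entry in the Arrow file footer.
#[derive(Debug, Clone, PartialEq)]
struct Block {
    /// Byte offset at which the record batch starts in the file.
    offset: u64,
    /// Total length of the record batch in bytes.
    length: u64,
}

/// Select the record batches a partition should scan.
///
/// Assumption: a batch is assigned to the half-open byte range that contains
/// its start offset, so each batch is read by exactly one partition.
/// With no range, every batch is returned.
fn blocks_for_range(blocks: &[Block], range: Option<&Range<u64>>) -> Vec<Block> {
    match range {
        None => blocks.to_vec(),
        Some(r) => blocks
            .iter()
            .filter(|b| r.contains(&b.offset))
            .cloned()
            .collect(),
    }
}
```

Under this rule a range that contains no batch start offset yields an empty scan, which is why a small input file may leave some partitions with nothing to read.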
Are these changes tested?
Yes. The physical plan for scanning Arrow files changes in repartition_scan.slt.
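The byte ranges in the updated plan (0..461, 461..922, 922..1383, 1383..1842) are consistent with dividing the 1842-byte file evenly across 4 partitions and rounding the chunk size up. The sketch below reproduces that arithmetic; `split_byte_ranges` is a hypothetical helper written for illustration, and the even-split rule is an assumption inferred from the plan output rather than a description of the actual repartitioning code.

```rust
use std::ops::Range;

/// Split a file of `file_size` bytes into at most `target_partitions`
/// contiguous half-open byte ranges of (nearly) equal size.
///
/// Assumption: the chunk size is ceil(file_size / target_partitions),
/// and the last range is truncated at the end of the file.
fn split_byte_ranges(file_size: u64, target_partitions: u64) -> Vec<Range<u64>> {
    if file_size == 0 || target_partitions == 0 {
        return Vec::new();
    }
    // Ceiling division without relying on newer integer helpers.
    let chunk = (file_size + target_partitions - 1) / target_partitions;
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < file_size {
        let end = (start + chunk).min(file_size);
        ranges.push(start..end);
        start = end;
    }
    ranges
}
```

For example, `split_byte_ranges(1842, 4)` uses a chunk size of 461 bytes and yields the four ranges shown in the plan.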
Are there any user-facing changes?