Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor: Extract parquet row group pruning code into its own module #4160

Merged
merged 1 commit into from
Nov 10, 2022

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Nov 9, 2022

Which issue does this PR close?

re #3463

Rationale for this change

This makes the row group pruning code match the rest of the pushdown rules in DataFusion (filtering and page indexing)

What changes are included in this PR?

Move a bunch of code out of the parquet.rs module into row_groups.rs

Are these changes tested?

Covered by existing tests

Are there any user-facing changes?

No it just moves code around. No functional change is intended

@github-actions github-actions bot added the core Core DataFusion crate label Nov 9, 2022
@alamb alamb requested a review from Ted-Jiang November 9, 2022 22:19
@alamb alamb marked this pull request as ready for review November 9, 2022 22:19
Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @Ted-Jiang @tustvold and @thinkharderdev

(this is a end of day refactor PR that took time but not brain power)

@@ -17,43 +17,36 @@

//! Execution plan for reading Parquet files

use arrow::datatypes::SchemaRef;
use fmt::Debug;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this file is still pretty massive, but after this PR it is down to under 2000 lines so that is progress I think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! before you split. I open this file in Clion it always lagging 😂. Thanks for your work @alamb 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also hope that some day I can split more of this code (file_format) into its own crate -- but one thing at a time !

@@ -435,7 +423,7 @@ impl FileOpener for ParquetOpener {
// Row group pruning: attempt to skip entire row_groups
// using metadata on the row groups
let file_metadata = builder.metadata();
let row_groups = prune_row_groups(
let row_groups = row_groups::prune_row_groups(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole point of the PR is to move this function (and its tests) into its own module

predicate: Option<PruningPredicate>,
metrics: &ParquetFileMetrics,
) -> Vec<usize> {
// TODO: Columnar pruning
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what here Columnar pruning means 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it means the page index pruning you have implemented -- I 'll remove this comment in a follow on PR

Copy link
Member

@Ted-Jiang Ted-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this is really necessary ❤️

Copy link
Contributor

@thinkharderdev thinkharderdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@alamb alamb merged commit 4d621c5 into apache:master Nov 10, 2022
@alamb alamb deleted the alamb/extract_row_group_pruning branch November 10, 2022 12:33
@ursabot
Copy link

ursabot commented Nov 10, 2022

Benchmark runs are scheduled for baseline = 9c24a79 and contender = 4d621c5. 4d621c5 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants