parquet: Make page_index/pushdown metrics consistent with row_group metrics #12545

Open
wants to merge 3 commits into
base: main
@progval progval commented Sep 20, 2024

Which issue does this PR close?

Closes #12543.
Closes #12544.

What changes are included in this PR?

  1. Rename {pushdown,page_index}_filtered to {pushdown,page_index}_pruned
  2. Add {pushdown,page_index}_matched
  3. Add documentation for the existing pushdown-related metrics

Rationale for this change

Adding the _matched metrics makes it clearer in EXPLAIN ANALYZE when the Page Index was not checked because the corresponding row groups had already been eliminated (by a Bloom filter or row group statistics).
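
To illustrate the rationale, here is a toy model of the counting, not DataFusion's actual implementation. The `PageIndexCounts` and `RowGroup` types, their fields, and the choice to count rows are all assumptions made for this sketch.

```rust
/// Hypothetical pair of counters, named after the `_pruned` / `_matched`
/// pattern introduced in this PR. Counting rows is an assumption.
#[derive(Debug, Default, PartialEq)]
pub struct PageIndexCounts {
    pub pruned: usize,
    pub matched: usize,
}

/// Hypothetical row group: whether an earlier step (Bloom filter or
/// row group statistics) already eliminated it, plus one
/// `(row_count, page_may_match)` entry per page.
pub struct RowGroup {
    pub eliminated_earlier: bool,
    pub pages: Vec<(usize, bool)>,
}

/// Consults the page index only for row groups that survived earlier
/// pruning, and records both outcomes of each check.
pub fn count_page_index(groups: &[RowGroup]) -> PageIndexCounts {
    let mut counts = PageIndexCounts::default();
    for group in groups.iter().filter(|g| !g.eliminated_earlier) {
        for &(rows, may_match) in &group.pages {
            if may_match {
                counts.matched += rows;
            } else {
                counts.pruned += rows;
            }
        }
    }
    counts
}
```

With only a pruned counter, a value of 0 is ambiguous: either the Page Index was checked and pruned nothing, or it was never checked. In this model, `pruned == 0` together with `matched == 0` means no page was checked at all.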

Are these changes tested?

Yes.

Are there any user-facing changes?

New metrics in EXPLAIN ANALYZE, documented in docs/source/user-guide/explain-usage.md

@github-actions github-actions bot added documentation Improvements or additions to documentation core Core DataFusion crate labels Sep 20, 2024
@alamb alamb added the api change Changes the API exposed to users of the crate label Sep 20, 2024

@alamb alamb left a comment

Thank you @progval -- this looks like a very nice improvement to me. I left some small suggestions, but I don't think they are required to merge this PR.

@@ -276,6 +281,14 @@ fn rows_skipped(selection: &RowSelection) -> usize {
        .fold(0, |acc, x| if x.skip { acc + x.row_count } else { acc })
}

/// returns the number of rows not skipped in the selection
/// TODO should this be upstreamed to RowSelection?

This looks the same as https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowSelection.html#method.row_count

It would be great to upstream this and `rows_skipped` to parquet -- any chance you are willing to file a ticket to do so?
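
For reference, the two helpers under discussion can be sketched without depending on the parquet crate. The `RowSelector` struct below is a hypothetical stand-in that models only the `row_count` and `skip` fields used in the diff above, and `rows_selected` is an assumed name for the new helper (its actual name is not shown in this hunk).

```rust
/// Hypothetical stand-in for a row selector; only the two fields
/// referenced in the diff (`row_count`, `skip`) are modeled.
#[derive(Debug, Clone, Copy)]
pub struct RowSelector {
    pub row_count: usize,
    pub skip: bool,
}

/// Returns the number of rows the selection skips.
pub fn rows_skipped(selectors: &[RowSelector]) -> usize {
    selectors
        .iter()
        .filter(|s| s.skip)
        .map(|s| s.row_count)
        .sum()
}

/// Returns the number of rows the selection does not skip.
/// (Assumed name; the complement of `rows_skipped`.)
pub fn rows_selected(selectors: &[RowSelector]) -> usize {
    selectors
        .iter()
        .filter(|s| !s.skip)
        .map(|s| s.row_count)
        .sum()
}
```

For any selection, the two results add up to the total number of rows it covers, which is why both would fit naturally next to each other if upstreamed.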

@@ -223,6 +223,21 @@ Again, reading from bottom up:
- `SortPreservingMergeExec`
  - `output_rows=5`, `elapsed_compute=2.375µs`: Produced the final 5 rows in 2.375µs (microseconds)

When predicate pushdown is enabled, `ParquetExec` gains the following metrics:

❤️

progval and others added 2 commits September 20, 2024 16:54
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>