refactor: parquet pruning simplifications #5386

crepererum · 2023-02-24T10:26:10Z

Which issue does this PR close?

Prep work for #4695.

Rationale for this change

Makes the actual change in #4695 easier.

What changes are included in this PR?

Internal improvements.

Are these changes tested?

Existing tests pass; no functional changes.

Are there any user-facing changes?

None.

crepererum · 2023-02-24T10:54:19Z

I am pretty sure this test breakage is unrelated; it seems flaky to me:

[window.slt] Running query: "select 1 - lag(amount, 1) over (order by idx) as column1 from (values ('a', 1, 100), ('a', 2, 150)) as t (col1, idx, amount)
---"
Error: query result mismatch:
[SQL] select 1 - lag(amount, 1) over (order by idx) as column1 from (values ('a', 1, 100), ('a', 2, 150)) as t (col1, idx, amount)
---
[Diff] (-expected|+actual)
-   NULL
-   -99
+   -99
+   NULL
at tests/sqllogictests/test_files/window.slt:414

error: test failed, to rerun pass `-p datafusion --test sqllogictests`

waynexia

Nice simplification, LGTM 👍

avantgardnerio · 2023-02-24T20:42:22Z

datafusion/core/src/physical_optimizer/pruning.rs

+    /// Returns number of unique columns.
+    pub(crate) fn n_columns(&self) -> usize {
+        self.iter()
+            .map(|(c, _s, _f)| c)


More descriptive variable names would help readability here.
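For illustration, here is a minimal sketch of what the suggestion could look like. The `RequiredColumns` type below is a simplified, hypothetical stand-in (plain strings instead of the real column and field types), and the closure parameter names are only one possible choice; the diff excerpt above does not show the rest of the method body, so the `HashSet` step is an assumption about how uniqueness is counted.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the kind of statistic a pruning column refers to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StatisticsType {
    Min,
    Max,
    NullCount,
}

/// Simplified stand-in for `RequiredColumns`: each entry is
/// (source column name, requested statistic, name of the statistics field).
pub struct RequiredColumns {
    columns: Vec<(String, StatisticsType, String)>,
}

impl RequiredColumns {
    pub fn new(columns: Vec<(String, StatisticsType, String)>) -> Self {
        Self { columns }
    }

    /// Iterates over all (column, statistic, field) entries.
    pub fn iter(&self) -> impl Iterator<Item = &(String, StatisticsType, String)> {
        self.columns.iter()
    }

    /// Returns the number of unique source columns.
    pub fn n_columns(&self) -> usize {
        self.iter()
            // Named bindings instead of `c`, `_s`, `_f`.
            .map(|(column, _statistics_type, _stat_field)| column)
            // Assumption: duplicates are removed by collecting into a set.
            .collect::<HashSet<_>>()
            .len()
    }
}
```

With entries for `a` (min), `a` (max) and `b` (null count), `n_columns()` in this sketch returns 2, and an empty list returns 0.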

avantgardnerio · 2023-02-24T20:43:40Z

datafusion/core/src/physical_optimizer/pruning.rs

                .unwrap_or(unhandled);
-            return Ok(expr);


Oh, so this was always returning Ok?
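The pattern in question can be sketched as follows. This is not the DataFusion code: the function names, the `String` expression type, and the error type are hypothetical placeholders, chosen only to show a function whose signature advertises failure while every path returns `Ok`, next to the simplified infallible form that the commit title "build_predicate_expression cannot fail" suggests.

```rust
/// Before (hypothetical): the signature is fallible, but the only exit is `Ok`.
pub fn build_before(rewritten: Option<String>, unhandled: String) -> Result<String, String> {
    let expr = rewritten.unwrap_or(unhandled);
    return Ok(expr);
}

/// After (hypothetical): the same logic with the `Result` wrapper dropped,
/// so callers no longer need to handle an error that cannot occur.
pub fn build_after(rewritten: Option<String>, unhandled: String) -> String {
    rewritten.unwrap_or(unhandled)
}
```

Both versions fall back to the `unhandled` expression when no rewrite is available; only the return type differs.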

avantgardnerio · 2023-02-24T21:00:41Z

datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs

+                match PruningPredicate::try_new(predicate.clone(), schema.clone()) {
+                    Ok(p)
+                        if (!p.allways_true())
+                            && (p.required_columns().n_columns() < 2) =>


This is a behavior change for n_columns() == 0. Based on:

    pub fn allways_true(&self) -> bool {
        self.predicate_expr
            .as_any()
            .downcast_ref::<Literal>()
            .map(|l| matches!(l.value(), ScalarValue::Boolean(Some(true))))
            .unwrap_or_default()
    }

I ran the test suite with a panic on n_columns() == 0 and couldn't trigger it, so I guess it LGTM.
I assume that would default to false, in which case I think we'd want to return a None here?

crepererum · 2023-02-25

Yeah, we skip the predicate if it doesn't refer to any column. However, you might be right (at least this is how I read your comment) that we need additional test coverage for a "constant" predicate (i.e. one that doesn't reference any column). I'll check next week whether such a test exists and, if not, add one.
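To make the cases in this thread concrete, here is a small self-contained sketch of the two checks being discussed. The `Expr` enum and `PredicateInfo` struct are hypothetical simplifications, not DataFusion types; `always_true` mimics the "downcast, then `unwrap_or_default`" shape quoted above, and `passes_guard` mirrors the guard `!always_true && n_columns < 2` from the diff.

```rust
/// Hypothetical, simplified predicate expression.
pub enum Expr {
    /// A boolean literal; `None` models a NULL boolean.
    BoolLiteral(Option<bool>),
    /// Anything that is not a literal.
    Other,
}

/// Mimics the quoted method: true only for the literal `true`.
/// A non-literal expression yields `None`, which defaults to `false`.
pub fn always_true(expr: &Expr) -> bool {
    let as_literal: Option<&Option<bool>> = match expr {
        Expr::BoolLiteral(value) => Some(value),
        Expr::Other => None,
    };
    as_literal
        .map(|value| matches!(value, Some(true)))
        .unwrap_or_default()
}

/// Hypothetical summary of what the guard inspects.
#[derive(Debug, Clone, Copy)]
pub struct PredicateInfo {
    pub always_true: bool,
    /// Number of unique columns the predicate refers to.
    pub n_columns: usize,
}

/// Mirrors the guard from the diff: `!always_true && n_columns < 2`.
pub fn passes_guard(p: PredicateInfo) -> bool {
    !p.always_true && p.n_columns < 2
}
```

In this sketch, a predicate that is not the literal `true` and refers to zero columns passes the guard, which is the `n_columns() == 0` case raised above; one column passes, and two or more columns do not. Whether a zero-column, not-always-true predicate can actually be produced is exactly what the proposed extra test would need to establish.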

ursabot · 2023-02-24T21:02:08Z

Benchmark runs are scheduled for baseline = 7224901 and contender = 1841736. 1841736 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

crepererum added 2 commits February 24, 2023 11:24

refactor: build_predicate_expression cannot fail

36102c1

refactor: simplify PruningPredicate creation

966e0ac

github-actions bot added the core Core DataFusion crate label Feb 24, 2023

waynexia approved these changes Feb 24, 2023

avantgardnerio approved these changes Feb 24, 2023

avantgardnerio merged commit 1841736 into apache:main Feb 24, 2023

crepererum mentioned this pull request Feb 27, 2023

refactor: ParquetExec logical expr. => phys. expr. #5419

Merged
