
Conversation

@thinkharderdev (Contributor) commented Apr 6, 2022

Which issue does this PR close?

Closes #2161

Rationale for this change

Adding schema merging to ParquetFormat broke pruning since the existing implementation assumes that each file in the ListingTable has the merged schema. In the best case this just prevents pruning row groups, but in certain cases (such as #2161) it can cause runtime errors and possibly incorrect query results.

What changes are included in this PR?

Two changes:

  1. When gathering column statistics during pruning, find row group columns by name instead of by their index in the merged schema.
  2. When collecting statistics during a ListingTable scan, map each file's statistics onto the table's merged schema (sketched below).
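
As a rough illustration of change 2, here is a minimal sketch using simplified stand-ins for DataFusion's `Statistics` / `ColumnStatistics` types (the helper name `map_stats_to_merged_schema` is hypothetical, not code from this PR):

```rust
use std::collections::HashMap;

/// Simplified stand-ins for DataFusion's statistics types (illustration only).
#[derive(Clone)]
struct ColumnStatistics {
    null_count: Option<usize>,
}

struct Statistics {
    num_rows: Option<usize>,
    column_statistics: Option<Vec<ColumnStatistics>>,
}

/// Project one file's column statistics onto the table's merged schema.
/// A column missing from the file is all-null there, so its null count is
/// simply the file's row count.
fn map_stats_to_merged_schema(
    file_columns: &[&str],
    file_stats: &Statistics,
    merged_columns: &[&str],
) -> Statistics {
    let by_name: HashMap<&str, &ColumnStatistics> = file_columns
        .iter()
        .copied()
        .zip(file_stats.column_statistics.iter().flatten())
        .collect();

    let mapped = merged_columns
        .iter()
        .map(|name| match by_name.get(name) {
            // The file has this column: carry its statistics over unchanged.
            Some(stats) => (*stats).clone(),
            // The file lacks this column: every value in it is null.
            None => ColumnStatistics {
                null_count: file_stats.num_rows,
            },
        })
        .collect();

    Statistics {
        num_rows: file_stats.num_rows,
        column_statistics: Some(mapped),
    }
}
```

The key observation is that a column absent from a file is all-null in that file, so its null count equals the file's row count -- which is what the test excerpt below asserts for c2.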

Are there any user-facing changes?

Yes: in order to map the statistics to the merged schema, we need to pass the merged schema to FileFormat::infer_stats.

@alamb (Contributor) left a comment

Thank you @thinkharderdev - I think this looks quite good.

The only thing I am not sure about is that some rows with null c2 values appear to pass a c2 = 0 filter. Otherwise this is good to go from my perspective.

 use crate::logical_plan::Expr;
 use crate::physical_plan::expressions::{MaxAccumulator, MinAccumulator};
-use crate::physical_plan::file_format::ParquetExec;
+use crate::physical_plan::file_format::{ParquetExec, SchemaAdapter};

👍

RecordBatch::try_new(schema, columns).expect("error; creating record batch")
}

fn create_batch(columns: Vec<(&str, ArrayRef)>) -> RecordBatch {

This looks very similar / the same as RecordBatch::try_from_iter: https://docs.rs/arrow/11.1.0/arrow/record_batch/struct.RecordBatch.html#method.try_from_iter
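
For illustration, a sketch of building the same kind of batch with `RecordBatch::try_from_iter` (array contents mirror the test data further down; this is not code from the PR):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    let c1: ArrayRef = Arc::new(StringArray::from(vec![Some("Foo"), None, Some("bar")]));
    let c2: ArrayRef = Arc::new(Int64Array::from(vec![Some(1), Some(2), None]));

    // The schema, including per-field nullability, is inferred from the
    // arrays themselves, so no hand-rolled schema construction is needed.
    let batch = RecordBatch::try_from_iter(vec![("c1", c1), ("c2", c2)])?;
    assert_eq!(batch.num_columns(), 2);
    Ok(())
}
```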

let c1_stats = &stats.column_statistics.as_ref().expect("missing c1 stats")[0];
let c2_stats = &stats.column_statistics.as_ref().expect("missing c2 stats")[1];
assert_eq!(c1_stats.null_count, Some(1));
assert_eq!(c2_stats.null_count, Some(3));

this is cool to fill in the null stats for the missing column 👍

.statistics()
.columns()
.iter()
.find(|c| c.column_descr().name() == &$column.name)

👍
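
To see why looking up by name matters, a small self-contained illustration (hypothetical column names; not code from this PR): under schema merging, a file may contain only a subset of the merged schema's columns, so a merged-schema index can run past the end of the file's column list (the "index out of bounds" error from #2161) or, worse, silently point at the wrong column.

```rust
fn main() {
    // Merged table schema has three columns; this particular file has two.
    let merged = ["c1", "c2", "c3"];
    let file_columns = ["c1", "c3"];

    // Index-based lookup: "c3" sits at index 2 of the merged schema, but the
    // file only has 2 columns, so the lookup falls off the end.
    let idx = merged.iter().position(|c| *c == "c3").unwrap();
    assert!(file_columns.get(idx).is_none());

    // Name-based lookup, as in the `find` above: correct or absent, never wrong.
    assert_eq!(file_columns.iter().find(|c| **c == "c3"), Some(&"c3"));
}
```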

.unwrap();
let expected = vec![
"+-----+----+----+",
"| c1 | c3 | c2 |",

🤔 if the filter is c2 = 0 then none of these rows should pass.... so something doesn't look quite right

@thinkharderdev (Author) replied:

Yeah, this looked wrong to me as well. What I think is happening is that since the min/max statistics aren't set, the pruning predicates aren't applied. In a "real" query where this predicate was pushed down from a filter stage, the scan output would still get piped into a FilterExec. I think we would have to special-case the scenario where we fill in a null column to conform to a merged schema, which may be worth doing. I can double check though and make sure there's not a bug here.
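
For context, the conservative rule being described might look like this (a minimal sketch specialized to i64; the names are illustrative, not DataFusion's actual PruningPredicate code):

```rust
/// Decide whether a row group can be skipped for a predicate `col = value`.
/// Pruning must be conservative: it may only skip a row group when the
/// statistics *prove* no row can match.
fn can_prune_row_group(min: Option<i64>, max: Option<i64>, value: i64) -> bool {
    match (min, max) {
        // Statistics present: skip only if `value` lies outside [min, max].
        (Some(min), Some(max)) => value < min || value > max,
        // Statistics absent (e.g. a column filled in as all-null to conform
        // to the merged schema): we cannot prove anything, so keep the group.
        _ => false,
    }
}
```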


Makes sense -- a comment in the test to explain why it is ok would be helpful for future readers

.unwrap();

let expected = vec![
"+-----+----+",

same thing here -- I wouldn't expect null values in c2 to be returned...

@alamb (Contributor) commented Apr 6, 2022

FYI @Cheappie

thinkharderdev and others added 3 commits April 6, 2022 20:17
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@alamb (Contributor) commented Apr 6, 2022

Thanks again @thinkharderdev

thinkharderdev and others added 3 commits April 6, 2022 20:22
.await
.unwrap();

// This does not look correct since the "c2" values in the result do not in fact match the predicate `c2 == 0`

👍

@alamb alamb merged commit 9815ac6 into apache:master Apr 7, 2022
let c1: ArrayRef =
Arc::new(StringArray::from(vec![Some("Foo"), None, Some("bar")]));

let c2: ArrayRef = Arc::new(Int64Array::from(vec![Some(1), Some(2), None]));

I might be missing something, but why are the values in c2 not materialized if we weren't able to prune them? I wonder how a filter like c2 eq 1_i64 can be satisfied against a null array?

            "+-----+----+",
            "| c1  | c2 |",
            "+-----+----+",
            "| Foo |    |",
            "|     |    |",
            "| bar |    |",
            "+-----+----+",


I think the key point is that filtering is also applied after the initial parquet scan / pruning -- the pruning is just a first pass to try and reduce additional work.

So subsequent Filter operations will actually handle filtering out the rows with c2 = null
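
A small illustration of that point with arrow's compute kernels (not code from this PR): comparing a null value yields null, and the filter kernel treats a null mask slot as "drop", so a downstream FilterExec removes every such row.

```rust
use arrow::array::{Array, Int64Array};
use arrow::compute::kernels::comparison::eq_scalar;
use arrow::compute::kernels::filter::filter;

fn main() -> arrow::error::Result<()> {
    // An all-null c2 column, like the one filled in for files missing c2.
    let c2 = Int64Array::from(vec![None::<i64>, None, None]);

    // `c2 = 0` evaluates to null wherever the input is null...
    let mask = eq_scalar(&c2, 0)?;
    assert_eq!(mask.null_count(), 3);

    // ...and `filter` keeps only rows whose mask slot is `true`, so every
    // row the scan let through is removed here.
    let kept = filter(&c2, &mask)?;
    assert_eq!(kept.len(), 0);
    Ok(())
}
```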


Development

Successfully merging this pull request may close these issues.

Query execution fails with index out of bounds err (#2161)
