
Conversation

@thinkharderdev (Contributor) commented Apr 6, 2022

Which issue does this PR close?

Closes #2161

Rationale for this change

Adding schema merging to ParquetFormat broke pruning since the existing implementation assumes that each file in the ListingTable has the merged schema. In the best case this just prevents pruning row groups, but in certain cases (such as #2161) it can cause runtime errors and possibly incorrect query results.

What changes are included in this PR?

Two changes:

  1. When gathering column statistics during pruning, find row group columns by name instead of by their index in the merged schema.
  2. When collecting statistics during a ListingTable scan, map each file's statistics onto the table's merged schema (sketched below).
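
As a rough illustration of change 2, here is a minimal sketch using simplified stand-ins for DataFusion's `Statistics` / `ColumnStatistics` types (the helper name `map_stats_to_merged_schema` is hypothetical, not code from this PR):

```rust
use std::collections::HashMap;

/// Simplified stand-ins for DataFusion's statistics types (illustration only).
#[derive(Clone)]
struct ColumnStatistics {
    null_count: Option<usize>,
}

struct Statistics {
    num_rows: Option<usize>,
    column_statistics: Option<Vec<ColumnStatistics>>,
}

/// Project one file's column statistics onto the table's merged schema.
/// A column missing from the file is all-null there, so its null count is
/// simply the file's row count.
fn map_stats_to_merged_schema(
    file_columns: &[&str],
    file_stats: &Statistics,
    merged_columns: &[&str],
) -> Statistics {
    let by_name: HashMap<&str, &ColumnStatistics> = file_columns
        .iter()
        .copied()
        .zip(file_stats.column_statistics.iter().flatten())
        .collect();

    let mapped = merged_columns
        .iter()
        .map(|name| match by_name.get(name) {
            // The file has this column: carry its statistics over unchanged.
            Some(stats) => (*stats).clone(),
            // The file lacks this column: every value in it is null.
            None => ColumnStatistics {
                null_count: file_stats.num_rows,
            },
        })
        .collect();

    Statistics {
        num_rows: file_stats.num_rows,
        column_statistics: Some(mapped),
    }
}
```

The key observation is that a column absent from a file is all-null in that file, so its null count equals the file's row count -- which is what the test excerpt below asserts for c2.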

Are there any user-facing changes?

Yes: in order to map the statistics to the merged schema, we need to pass the merged schema to FileFormat::infer_stats.

@alamb (Contributor) left a comment

Thank you @thinkharderdev - I think this looks quite good.

The only thing I am not sure about is that some rows with null c2 values appear to pass a c2 = 0 filter. Otherwise this is good to go from my perspective.

 use crate::logical_plan::Expr;
 use crate::physical_plan::expressions::{MaxAccumulator, MinAccumulator};
-use crate::physical_plan::file_format::ParquetExec;
+use crate::physical_plan::file_format::{ParquetExec, SchemaAdapter};

👍

RecordBatch::try_new(schema, columns).expect("error; creating record batch")
}

fn create_batch(columns: Vec<(&str, ArrayRef)>) -> RecordBatch {

This looks very similar / the same as RecordBatch::try_from_iter: https://docs.rs/arrow/11.1.0/arrow/record_batch/struct.RecordBatch.html#method.try_from_iter
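
For illustration, a sketch of building the same kind of batch with `RecordBatch::try_from_iter` (array contents mirror the test data further down; this is not code from the PR):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    let c1: ArrayRef = Arc::new(StringArray::from(vec![Some("Foo"), None, Some("bar")]));
    let c2: ArrayRef = Arc::new(Int64Array::from(vec![Some(1), Some(2), None]));

    // The schema, including per-field nullability, is inferred from the
    // arrays themselves, so no hand-rolled schema construction is needed.
    let batch = RecordBatch::try_from_iter(vec![("c1", c1), ("c2", c2)])?;
    assert_eq!(batch.num_columns(), 2);
    Ok(())
}
```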

let c1_stats = &stats.column_statistics.as_ref().expect("missing c1 stats")[0];
let c2_stats = &stats.column_statistics.as_ref().expect("missing c2 stats")[1];
assert_eq!(c1_stats.null_count, Some(1));
assert_eq!(c2_stats.null_count, Some(3));

this is cool to fill in the null stats for the missing column 👍

.statistics()
.columns()
.iter()
.find(|c| c.column_descr().name() == &$column.name)

👍
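
To see why looking up by name matters, a small self-contained illustration (hypothetical column names; not code from this PR): under schema merging, a file may contain only a subset of the merged schema's columns, so a merged-schema index can run past the end of the file's column list (the "index out of bounds" error from #2161) or, worse, silently point at the wrong column.

```rust
fn main() {
    // Merged table schema has three columns; this particular file has two.
    let merged = ["c1", "c2", "c3"];
    let file_columns = ["c1", "c3"];

    // Index-based lookup: "c3" sits at index 2 of the merged schema, but the
    // file only has 2 columns, so the lookup falls off the end.
    let idx = merged.iter().position(|c| *c == "c3").unwrap();
    assert!(file_columns.get(idx).is_none());

    // Name-based lookup, as in the `find` above: correct or absent, never wrong.
    assert_eq!(file_columns.iter().find(|c| **c == "c3"), Some(&"c3"));
}
```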

.unwrap();
let expected = vec![
"+-----+----+----+",
"| c1 | c3 | c2 |",

🤔 if the filter is c2 = 0 then none of these rows should pass.... so something doesn't look quite right

@thinkharderdev (Author) replied:

Yeah, this looked wrong to me as well. What I think is happening is that since the min/max statistics aren't set, the pruning predicates aren't applied. In a "real" query where this predicate was pushed down from a filter stage, the scan output would still get piped into a FilterExec. I think we would have to special-case the scenario where we fill in a null column to conform to a merged schema, which may be worth doing. I can double check though and make sure there's not a bug here.
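
For context, the conservative rule being described might look like this (a minimal sketch specialized to i64; the names are illustrative, not DataFusion's actual PruningPredicate code):

```rust
/// Decide whether a row group can be skipped for a predicate `col = value`.
/// Pruning must be conservative: it may only skip a row group when the
/// statistics *prove* no row can match.
fn can_prune_row_group(min: Option<i64>, max: Option<i64>, value: i64) -> bool {
    match (min, max) {
        // Statistics present: skip only if `value` lies outside [min, max].
        (Some(min), Some(max)) => value < min || value > max,
        // Statistics absent (e.g. a column filled in as all-null to conform
        // to the merged schema): we cannot prove anything, so keep the group.
        _ => false,
    }
}
```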


Makes sense -- a comment in the test to explain why it is ok would be helpful for future readers

.unwrap();

let expected = vec![
"+-----+----+",

same thing here -- I wouldn't expect null values in c2 to be returned...

@alamb (Contributor) commented Apr 6, 2022

FYI @Cheappie

thinkharderdev and others added 3 commits April 6, 2022 20:17
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@alamb (Contributor) commented Apr 6, 2022

Thanks again @thinkharderdev

thinkharderdev and others added 3 commits April 6, 2022 20:22
.await
.unwrap();

// This does not look correct since the "c2" values in the result do not in fact match the predicate `c2 == 0`

👍

@alamb alamb merged commit 9815ac6 into apache:master Apr 7, 2022
let c1: ArrayRef =
Arc::new(StringArray::from(vec![Some("Foo"), None, Some("bar")]));

let c2: ArrayRef = Arc::new(Int64Array::from(vec![Some(1), Some(2), None]));

I might be missing something, but why are the values in c2 not materialized if we weren't able to prune them? I wonder how a filter like c2 eq 1_i64 can be satisfied against a null array?

            "+-----+----+",
            "| c1  | c2 |",
            "+-----+----+",
            "| Foo |    |",
            "|     |    |",
            "| bar |    |",
            "+-----+----+",


I think the key point is that filtering is also applied after the initial parquet scan / pruning -- the pruning is just a first pass to try and reduce additional work.

So subsequent Filter operations will actually handle filtering out the rows with c2 = null
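
A small illustration of that point with arrow's compute kernels (not code from this PR): comparing a null value yields null, and the filter kernel treats a null mask slot as "drop", so a downstream FilterExec removes every such row.

```rust
use arrow::array::{Array, Int64Array};
use arrow::compute::kernels::comparison::eq_scalar;
use arrow::compute::kernels::filter::filter;

fn main() -> arrow::error::Result<()> {
    // An all-null c2 column, like the one filled in for files missing c2.
    let c2 = Int64Array::from(vec![None::<i64>, None, None]);

    // `c2 = 0` evaluates to null wherever the input is null...
    let mask = eq_scalar(&c2, 0)?;
    assert_eq!(mask.null_count(), 3);

    // ...and `filter` keeps only rows whose mask slot is `true`, so every
    // row the scan let through is removed here.
    let kept = filter(&c2, &mask)?;
    assert_eq!(kept.len(), 0);
    Ok(())
}
```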


Development

Successfully merging this pull request may close these issues.

Query execution fails with index out of bounds err (#2161)
