Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support parquet statistics for struct columns #8334

Open
alamb opened this issue Nov 27, 2023 · 7 comments
Open

Support parquet statistics for struct columns #8334

alamb opened this issue Nov 27, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Nov 27, 2023

Is your feature request related to a problem or challenge?

While working on #8294 @tustvold noted that the statistics extraction code does not do the right thing with Structs that were written to parquet

I think the usecase is something like the following query:

SELECT * 
FROM my_table
WHERE struct_column = struct('foo', 'bar')

Describe the solution you'd like

I would like the parquet statistics reading code to handle structs (I think).

Describe alternatives you've considered

No response

Additional context

No response

@alamb
Copy link
Contributor Author

alamb commented Nov 27, 2023

It appears that statistics for such types is ready as all nulls

@tustvold
Copy link
Contributor

Yes, this is what I would expect in the absence of a column name collision. My greater concern is that the presence of the struct column will mess up the ordinals of the other columns

@alamb
Copy link
Contributor Author

alamb commented Nov 27, 2023

Yes, this is what I would expect in the absence of a column name collision. My greater concern is that the presence of the struct column will mess up the ordinals of the other columns

I added tests for this case here: #8294 (comment) (TLDR is it seems to do the right thing, though it would be good to get a second set of eyes)

@edmondop
Copy link
Contributor

edmondop commented Apr 6, 2024

I read in statistics

/// Lookups up the parquet column by name
///
/// Returns the parquet column index and the corresponding arrow field
pub(crate) fn parquet_column<'a>(
    parquet_schema: &SchemaDescriptor,
    arrow_schema: &'a Schema,
    name: &str,
) -> Option<(usize, &'a FieldRef)> {
    let (root_idx, field) = arrow_schema.fields.find(name)?;
    if field.data_type().is_nested() {
        // Nested fields are not supported and require non-trivial logic
        // to correctly walk the parquet schema accounting for the
        // logical type rules - <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md>
        //
        // For example a ListArray could correspond to anything from 1 to 3 levels
        // in the parquet schema
        return None;
    }

    // This could be made more efficient (#TBD)
    let parquet_idx = (0..parquet_schema.columns().len())
        .find(|x| parquet_schema.get_column_root_idx(*x) == root_idx)?;
    Some((parquet_idx, field))
}```

`git blame` shows @alamb as an author of those lines... I'll look into the rules.  I suppose aggregation functions will need to be updated to walk the schema correctly? 

@edmondop
Copy link
Contributor

@alamb I realized the last part of the code is actually the question I have for you, apologies

I'll look into the rules. I suppose aggregation functions will need to be updated to walk the schema correctly?

@alamb
Copy link
Contributor Author

alamb commented Apr 17, 2024

I'll look into the rules. I suppose aggregation functions will need to be updated to walk the schema correctly?

That sounds likely.

I think the first thing to do would be to write a test case showing the incorrect / lack of behavior. Then we can work out how to solve the problem

@edmondop
Copy link
Contributor

@alamb I have done some progress in getting a test, but I need guidance here:

        let (idx, _) = parquet_column(parquet_schema, &schema, "struct_col").unwrap();
        assert_eq!(idx, 0);
 
        let row_groups = metadata.row_groups();
        let iter = row_groups.iter().map(|x| x.column(idx).statistics());

The test now fails because row_groups.iter().map(|x| x.column(idx).statistics()) doesn't return the statistics of the first column of the RecordBatch, the struct, but of the first column of the struct. I can definitely create myself the statistics doing something like that:

        let datatype = DataType::Struct(
            Fields::from(vec![
                Field::new("bool_col", DataType::Boolean, false),
                Field::new("int_col", DataType::Boolean, false),
            ]),
        );
       // Comment next line, change iter above for the two fields of the struct, and build the min as a struxt
        let min_field1 = min_statistics(&datatype, iter1.clone()).unwrap();
        let min_field2 = min_statistics(&datatype, iter2.clone()).unwrap();
       let min = find out how to do this// 
        assert_eq!(
            &min,
            &expected_min,
            "Min. Statistics\n\n{}\n\n",
            DisplayStats(row_groups)
        );

but I am not sure that's the right way to proceed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants