Incorrect handling of missing values in Statistics #2134

@asfimport

Description

According to the parquet-format specification, the min/max values in statistics are optional, so a Statistics object can carry numNulls without having min/max values. StatisticsFilter currently relies on StatisticsFilter.isAllNulls(ColumnChunkMetaData) to handle missing min/max values. That is incorrect in this scenario, because missing min/max values do not imply that every value in the column chunk is null.
We should check Statistics.hasNonNullValue() every time before using the actual min/max values.
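
A minimal sketch of the proposed guard, for illustration only. `ChunkStats` and `canDropForEq` are hypothetical stand-ins and not the parquet-mr classes; the sketch only assumes that a flag equivalent to Statistics.hasNonNullValue() tells us whether min/max were actually written.

```java
public class MinMaxGuardSketch {

    /** Hypothetical stand-in for column chunk statistics; not the parquet-mr class. */
    static final class ChunkStats {
        final boolean hasNonNullValue; // true only when min/max were actually written
        final long min;
        final long max;

        private ChunkStats(boolean hasNonNullValue, long min, long max) {
            this.hasNonNullValue = hasNonNullValue;
            this.min = min;
            this.max = max;
        }

        static ChunkStats of(long min, long max) {
            return new ChunkStats(true, min, max);
        }

        static ChunkStats withoutMinMax() {
            // min/max are placeholders here and must never be read
            return new ChunkStats(false, 0L, 0L);
        }
    }

    /** Returns true when the chunk can be skipped for the predicate "column == value". */
    static boolean canDropForEq(ChunkStats stats, long value) {
        if (!stats.hasNonNullValue) {
            // min/max are absent: nothing can be inferred, so keep the chunk
            return false;
        }
        // min/max are present: drop only when the value lies outside [min, max]
        return value < stats.min || value > stats.max;
    }

    public static void main(String[] args) {
        System.out.println(canDropForEq(ChunkStats.withoutMinMax(), 5L));
        System.out.println(canDropForEq(ChunkStats.of(10L, 20L), 5L));
    }
}
```

The point of the design is that absent min/max always leads to the conservative answer (keep the chunk), regardless of what the null count says.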

In addition, we do not check whether null_count is set when reading statistics from the Parquet file; we simply use the value, which is 0 when the field is unset. On the parquet-mr side, the Statistics object uses the value 0 to signal that num_nulls is unset, so an unset count cannot be distinguished from a real count of zero. This is incorrect when we search for null values: we may falsely drop a column chunk, believing it contains no nulls, when the field in the statistics was simply unset.
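
One possible way to keep "unset" and "zero" apart is sketched below. The names `readNullCount` and `canDropForIsNull` are hypothetical, and using `OptionalLong` is an assumption made for this illustration, not necessarily how parquet-mr represents the value.

```java
import java.util.OptionalLong;

public class NullCountSketch {

    /**
     * Converts the raw field into an explicit optional.
     * isSet is assumed to come from the file metadata (whether null_count was written).
     */
    static OptionalLong readNullCount(boolean isSet, long rawValue) {
        return isSet ? OptionalLong.of(rawValue) : OptionalLong.empty();
    }

    /** Returns true when the chunk can be skipped for the predicate "column is null". */
    static boolean canDropForIsNull(OptionalLong numNulls) {
        // Drop only when the count is known AND equal to zero.
        // An unset count means the chunk may still contain nulls.
        return numNulls.isPresent() && numNulls.getAsLong() == 0L;
    }

    public static void main(String[] args) {
        System.out.println(canDropForIsNull(readNullCount(false, 0L)));
        System.out.println(canDropForIsNull(readNullCount(true, 0L)));
    }
}
```

With this representation, a raw value of 0 read from an unset field never reaches the drop decision as a real count.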

Reporter: Gabor Szadovszky / @gszadovszky
Assignee: Gabor Szadovszky / @gszadovszky

Related issues:

PRs and other links:

Note: This issue was originally created as PARQUET-1217. Please see the migration documentation for further details.
