NaNs can break parquet statistics #255

@crepererum

Describe the bug
A NaN in the column data can end up as the min or max in parquet statistics, overriding all other values. This is very similar to PARQUET-1225, which was filed for the C++ implementation.
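
To illustrate why a single NaN can take over min/max, here is a minimal standalone sketch. It is not taken from the parquet crate: `naive_min_max` is a hypothetical helper that seeds min/max with the first value and updates them with `<` and `>`. Every ordered comparison against NaN is false, so a leading NaN is never replaced.

```rust
/// Hypothetical naive min/max (illustration only, not the crate's code).
/// Seeds both bounds with the first value, then updates via `<` / `>`.
fn naive_min_max(values: &[f32]) -> Option<(f32, f32)> {
    let (&first, rest) = values.split_first()?;
    let mut min = first;
    let mut max = first;
    for &v in rest {
        // Both comparisons are false whenever `v`, `min`, or `max` is NaN.
        if v < min {
            min = v;
        }
        if v > max {
            max = v;
        }
    }
    Some((min, max))
}
```

With this sketch, `[f32::NAN, 1.0, 2.0]` yields NaN for both bounds, while `[1.0, f32::NAN, 2.0]` yields `(1.0, 2.0)`, so the outcome depends on where the NaN sits in the input.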

To Reproduce
Add the following tests:

#[test]
fn test_float_statistics_nan_middle() {
    let stats = statistics_roundtrip::<FloatType>(&[1.0, f32::NAN, 2.0]);
    assert!(stats.has_min_max_set());
    if let Statistics::Float(stats) = stats {
        assert_eq!(stats.min(), &1.0);
        assert_eq!(stats.max(), &2.0);
    } else {
        panic!("expecting Statistics::Float");
    }
}

#[test]
fn test_float_statistics_nan_start() {
    let stats = statistics_roundtrip::<FloatType>(&[f32::NAN, 1.0, 2.0]);
    assert!(stats.has_min_max_set());
    if let Statistics::Float(stats) = stats {
        assert_eq!(stats.min(), &1.0);
        assert_eq!(stats.max(), &2.0);
    } else {
        panic!("expecting Statistics::Float");
    }
}

#[test]
fn test_float_statistics_nan_only() {
    let stats = statistics_roundtrip::<FloatType>(&[f32::NAN, f32::NAN]);
    assert!(!stats.has_min_max_set());
    assert!(matches!(stats, Statistics::Float(_)));
}

fn statistics_roundtrip<T: DataType>(values: &[<T as DataType>::T]) -> Statistics {
    let page_writer = get_test_page_writer();
    let props = Arc::new(WriterProperties::builder().build());
    let mut writer = get_test_column_writer::<T>(page_writer, 0, 0, props);
    writer.write_batch(values, None, None).unwrap();

    let (_bytes_written, _rows_written, metadata) = writer.close().unwrap();
    if let Some(stats) = metadata.statistics() {
        stats.clone()
    } else {
        panic!("metadata missing statistics");
    }
}

Note that while the tests are written for f32/float, this also applies to f64/double.

Expected behavior
NaNs should be ignored during statistics calculation. If only NaNs are present, the min and max values should be left unset.
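
The expected behavior could be sketched roughly as follows. This is a standalone illustration under the assumption that statistics are computed over a plain `&[f32]` slice; `min_max_ignoring_nan` is a hypothetical name, not an API of the parquet crate. Returning `None` stands in for "min and max unset".

```rust
/// Hypothetical helper (illustration only): computes (min, max) while
/// skipping NaNs. Returns `None` if the slice is empty or contains only
/// NaNs, which models leaving min/max unset.
fn min_max_ignoring_nan(values: &[f32]) -> Option<(f32, f32)> {
    // Drop NaNs up front so they can never become a bound.
    let mut non_nan = values.iter().copied().filter(|v| !v.is_nan());
    // No non-NaN value at all: nothing to report.
    let first = non_nan.next()?;
    Some(non_nan.fold((first, first), |(min, max), v| {
        (min.min(v), max.max(v))
    }))
}
```

The three inputs from the tests above map to `Some((1.0, 2.0))` for NaN in the middle, `Some((1.0, 2.0))` for NaN at the start, and `None` for the NaN-only case. The same shape would apply to `f64`.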

Additional context
Tested commit was 8f030db53d9eda901c82db9daf94339fc447d0db.

Labels: bug, parquet (changes to the parquet crate)