Incorrect min/max statistics for strings in parquet files #641

alamb · 2021-07-30T19:22:05Z

Describe the bug
The statistics written by the arrow / parquet writer for String columns seem to be incorrect.

To Reproduce
Run this code:

fn main() {
    let input = vec![
        Some("andover"),
        Some("reading"),
        Some("bedford"),
        Some("tewsbury"),
        Some("lexington"),
        Some("lawrence"),
    ];

    let input: StringArray = input.into_iter().collect();
    println!("Staring to test with array {:?}", input);

    let record_batch = RecordBatch::try_from_iter(vec![
        ("city", Arc::new(input) as _)
    ]).unwrap();

    println!("Opening output file /tmp/test.parquet");
    let out_file = File::create("/tmp/test.parquet").unwrap();

    println!("Creating writer...");
    let mut writer = ArrowWriter::try_new(out_file, record_batch.schema(), None)
        .expect("creating writer");

    println!("writing...");
    writer.write(&record_batch).expect("writing");

    println!("closing...");
    writer.close().expect("closing");

    println!("done...");
}

Then examine the resulting parquet file and note the min/max values for the "city" column are:

min: "andover"
max: "lexington"

alamb@MacBook-Pro rust_parquet % parquet-tools dump  /tmp/test.parquet 
parquet-tools dump  /tmp/test.parquet 
row group 0 
------------------------------------------------------------------------------------------------------------------------------
city:  BINARY UNCOMPRESSED DO:4 FPO:90 SZ:130/130/1.00 VC:6 ENC:RLE_DICTIONARY,PLAIN,RLE ST:[min: andover, max: lexi [more]...

    city TV=6 RL=0 DL=0 DS: 6 DE:PLAIN
    --------------------------------------------------------------------------------------------------------------------------
    page 0:                  DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: andover, max: lexington, num_nulls not defined] [more]... VC:6

BINARY city 
------------------------------------------------------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 6 *** 
value 1: R:0 D:0 V:andover
value 2: R:0 D:0 V:reading
value 3: R:0 D:0 V:bedford
value 4: R:0 D:0 V:tewsbury
value 5: R:0 D:0 V:lexington
value 6: R:0 D:0 V:lawrence

Expected behavior
The parquet file produced has min/max statistics for the city column:

min: "andover"
max: "tewsbury"

As 't' follows 'l'

Additional context

Since DataFusion now uses these statistics for pruning out row groups, this leads to incorrect results in DataFusion. I found this when investigating https://github.com/influxdata/influxdb_iox/issues/2153

The text was updated successfully, but these errors were encountered:

alamb · 2021-07-30T22:25:40Z

I believe the problem is that the comparison for byte arrays at https://github.com/apache/arrow-rs/blob/master/parquet/src/data_type.rs#L117-L138 is comparing the lengths first rather than lexicographically comparing the entries

I need to verify the correct behavior and polish up the PR, but now it should just be a mechanical exercise.

alamb · 2021-07-31T10:19:39Z

I have confirmed that the python parquet writer correctly stores "tewksbury" as the max in statistics

Using this python script:

import pyarrow
import pandas as pd

data = [
    "andover",
    "reading",
    "bedford",
    "tewsbury",
    "lexington",
    "lawrence",
];

df = pd.DataFrame(data, columns = ['city'])
df.to_parquet('/tmp/test_python.parquet')

parquet-tools then confirm the min/max are "andover"/"tewksbury" as expected:

alamb@ip-192-168-0-133 /tmp % parquet-tools dump /tmp/test_python.parquet 
parquet-tools dump /tmp/test_python.parquet 
row group 0 
--------------------------------------------------------------------------------------------------------------------------------------------------------
city:  BINARY SNAPPY DO:4 FPO:90 SZ:139/137/0.99 VC:6 ENC:RLE,PLAIN,PLAIN_DICTIONARY ST:[min: andover, max: tewsbury, num_nulls: 0]

    city TV=6 RL=0 DL=1 DS: 6 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------------------------------------------------------------------------------
    page 0:                  DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY ST:[min: andover, max: tewsbury, num_nulls: 0] CRC:[none] SZ:11 VC:6

BINARY city 
--------------------------------------------------------------------------------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 6 *** 
value 1: R:0 D:1 V:andover
value 2: R:0 D:1 V:reading
value 3: R:0 D:1 V:bedford
value 4: R:0 D:1 V:tewsbury
value 5: R:0 D:1 V:lexington
value 6: R:0 D:1 V:lawrence

alamb added bug parquet Changes to the parquet crate labels Jul 30, 2021

alamb mentioned this issue Jul 30, 2021

Produce correct answers for Group BY NULL (Option 1) apache/datafusion#793

Merged

alamb self-assigned this Jul 30, 2021

This was referenced Jul 31, 2021

Add human readable Format for parquet ByteArray #642

Merged

Fix parquet string statistics generation #643

Merged

xianwill mentioned this issue Aug 2, 2021

Add failing test to track arrow-rs PR#643 delta-io/kafka-delta-ingest#45

Closed

nevi-me closed this as completed in #643 Aug 8, 2021

alamb mentioned this issue Dec 15, 2021

Query failing to return any results when filter is an equality check on strings (bad statistics in parquet) apache/datafusion#1433

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect min/max statistics for strings in parquet files #641

Incorrect min/max statistics for strings in parquet files #641

alamb commented Jul 30, 2021

alamb commented Jul 30, 2021

alamb commented Jul 31, 2021 •

edited

Loading

Incorrect min/max statistics for strings in parquet files #641

Incorrect min/max statistics for strings in parquet files #641

Comments

alamb commented Jul 30, 2021

alamb commented Jul 30, 2021

alamb commented Jul 31, 2021 • edited Loading

alamb commented Jul 31, 2021 •

edited

Loading