Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect min/max statistics for strings in parquet files #641

Closed
alamb opened this issue Jul 30, 2021 · 2 comments · Fixed by #643
Closed

Incorrect min/max statistics for strings in parquet files #641

alamb opened this issue Jul 30, 2021 · 2 comments · Fixed by #643
Assignees
Labels
bug parquet Changes to the parquet crate

Comments

@alamb
Copy link
Contributor

alamb commented Jul 30, 2021

Describe the bug
The statistics written by the arrow / parquet writer for String columns seem to be incorrect.

To Reproduce
Run this code:

fn main() {
    let input = vec![
        Some("andover"),
        Some("reading"),
        Some("bedford"),
        Some("tewsbury"),
        Some("lexington"),
        Some("lawrence"),
    ];

    let input: StringArray = input.into_iter().collect();
    println!("Staring to test with array {:?}", input);

    let record_batch = RecordBatch::try_from_iter(vec![
        ("city", Arc::new(input) as _)
    ]).unwrap();

    println!("Opening output file /tmp/test.parquet");
    let out_file = File::create("/tmp/test.parquet").unwrap();

    println!("Creating writer...");
    let mut writer = ArrowWriter::try_new(out_file, record_batch.schema(), None)
        .expect("creating writer");

    println!("writing...");
    writer.write(&record_batch).expect("writing");

    println!("closing...");
    writer.close().expect("closing");

    println!("done...");
}

Then examine the resulting parquet file and note the min/max values for the "city" column are:

min: "andover"
max: "lexington"
alamb@MacBook-Pro rust_parquet % parquet-tools dump  /tmp/test.parquet 
parquet-tools dump  /tmp/test.parquet 
row group 0 
------------------------------------------------------------------------------------------------------------------------------
city:  BINARY UNCOMPRESSED DO:4 FPO:90 SZ:130/130/1.00 VC:6 ENC:RLE_DICTIONARY,PLAIN,RLE ST:[min: andover, max: lexi [more]...

    city TV=6 RL=0 DL=0 DS: 6 DE:PLAIN
    --------------------------------------------------------------------------------------------------------------------------
    page 0:                  DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: andover, max: lexington, num_nulls not defined] [more]... VC:6

BINARY city 
------------------------------------------------------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 6 *** 
value 1: R:0 D:0 V:andover
value 2: R:0 D:0 V:reading
value 3: R:0 D:0 V:bedford
value 4: R:0 D:0 V:tewsbury
value 5: R:0 D:0 V:lexington
value 6: R:0 D:0 V:lawrence

Expected behavior
The parquet file produced has min/max statistics for the city column:

min: "andover"
max: "tewsbury"

As 't' follows 'l'

Additional context

Since DataFusion now uses these statistics for pruning out row groups, this leads to incorrect results in DataFusion. I found this when investigating https://github.com/influxdata/influxdb_iox/issues/2153

@alamb
Copy link
Contributor Author

alamb commented Jul 30, 2021

I believe the problem is that the comparison for byte arrays at https://github.com/apache/arrow-rs/blob/master/parquet/src/data_type.rs#L117-L138 is comparing the lengths first rather than lexicographically comparing the entries

I need to verify the correct behavior and polish up the PR, but now it should just be a mechanical exercise.

@alamb alamb self-assigned this Jul 30, 2021
@alamb
Copy link
Contributor Author

alamb commented Jul 31, 2021

I have confirmed that the python parquet writer correctly stores "tewksbury" as the max in statistics

Using this python script:

import pyarrow
import pandas as pd

data = [
    "andover",
    "reading",
    "bedford",
    "tewsbury",
    "lexington",
    "lawrence",
];

df = pd.DataFrame(data, columns = ['city'])
df.to_parquet('/tmp/test_python.parquet')

parquet-tools then confirm the min/max are "andover"/"tewksbury" as expected:

alamb@ip-192-168-0-133 /tmp % parquet-tools dump /tmp/test_python.parquet 
parquet-tools dump /tmp/test_python.parquet 
row group 0 
--------------------------------------------------------------------------------------------------------------------------------------------------------
city:  BINARY SNAPPY DO:4 FPO:90 SZ:139/137/0.99 VC:6 ENC:RLE,PLAIN,PLAIN_DICTIONARY ST:[min: andover, max: tewsbury, num_nulls: 0]

    city TV=6 RL=0 DL=1 DS: 6 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------------------------------------------------------------------------------
    page 0:                  DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY ST:[min: andover, max: tewsbury, num_nulls: 0] CRC:[none] SZ:11 VC:6

BINARY city 
--------------------------------------------------------------------------------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 6 *** 
value 1: R:0 D:1 V:andover
value 2: R:0 D:1 V:reading
value 3: R:0 D:1 V:bedford
value 4: R:0 D:1 V:tewsbury
value 5: R:0 D:1 V:lexington
value 6: R:0 D:1 V:lawrence

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant