Incorrect min/max statistics for strings in parquet files #641
Comments
I believe the problem is that the comparison for byte arrays at https://github.com/apache/arrow-rs/blob/master/parquet/src/data_type.rs#L117-L138 compares the lengths first rather than comparing the entries lexicographically. I need to verify the correct behavior and polish up the PR, but it should now be just a mechanical exercise.
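To illustrate the difference described above, here is a minimal Python sketch (not the arrow-rs code) that contrasts a length-first ordering with a plain lexicographic ordering of byte strings. The helper name `length_first_key` is hypothetical, and the city values are the ones used in the Python script in this thread.

```python
# Sketch only: models the two orderings on Python bytes objects.
# `length_first_key` is a hypothetical helper, not part of any library.

def length_first_key(value: bytes):
    # Orders by length first, then by content: the behavior suspected
    # of producing the wrong statistics.
    return (len(value), value)

cities = [
    b"andover",
    b"reading",
    b"bedford",
    b"tewsbury",
    b"lexington",
    b"lawrence",
]

# Length-first ordering: the longest value wins the max.
length_first = (
    min(cities, key=length_first_key),
    max(cities, key=length_first_key),
)

# Lexicographic ordering: Python compares bytes byte by byte.
lexicographic = (min(cities), max(cities))

print("length-first :", length_first)
print("lexicographic:", lexicographic)
```

Under the length-first ordering the maximum is the longest string (`lexington`), whereas the lexicographic maximum is `tewsbury`, because 't' sorts after 'l'.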
I have confirmed that the Python parquet writer correctly stores the min/max statistics, using this Python script:
import pyarrow
import pandas as pd
data = [
"andover",
"reading",
"bedford",
"tewsbury",
"lexington",
"lawrence",
]
df = pd.DataFrame(data, columns = ['city'])
df.to_parquet('/tmp/test_python.parquet')
alamb@ip-192-168-0-133 /tmp % parquet-tools dump /tmp/test_python.parquet
row group 0
--------------------------------------------------------------------------------------------------------------------------------------------------------
city: BINARY SNAPPY DO:4 FPO:90 SZ:139/137/0.99 VC:6 ENC:RLE,PLAIN,PLAIN_DICTIONARY ST:[min: andover, max: tewsbury, num_nulls: 0]
city TV=6 RL=0 DL=1 DS: 6 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY ST:[min: andover, max: tewsbury, num_nulls: 0] CRC:[none] SZ:11 VC:6
BINARY city
--------------------------------------------------------------------------------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 6 ***
value 1: R:0 D:1 V:andover
value 2: R:0 D:1 V:reading
value 3: R:0 D:1 V:bedford
value 4: R:0 D:1 V:tewsbury
value 5: R:0 D:1 V:lexington
value 6: R:0 D:1 V:lawrence
Describe the bug
The statistics written by the arrow / parquet writer for String columns seem to be incorrect.
To Reproduce
Run this code:
Then examine the resulting parquet file and note the min/max values for the "city" column are:
Expected behavior
The parquet file produced has min/max statistics for the city column:
As 't' follows 'l'
Additional context
Since DataFusion now uses these statistics for pruning out row groups, this leads to incorrect results in DataFusion. I found this when investigating https://github.com/influxdata/influxdb_iox/issues/2153
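As a rough illustration of why bad statistics produce wrong query results, the sketch below models min/max pruning for an equality predicate. It is not DataFusion's implementation: `ColumnStats` and `may_contain` are hypothetical names, and the "wrong" statistics are an assumption, namely what a length-first comparison would yield for the city values in this thread.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    # Hypothetical stand-in for the min/max statistics of one row group.
    min: bytes
    max: bytes


def may_contain(stats: ColumnStats, value: bytes) -> bool:
    # A row group can be skipped for `column = value` when the value
    # lies outside [min, max] under lexicographic byte ordering.
    return stats.min <= value <= stats.max


# Statistics reported by the Python writer in this thread.
correct = ColumnStats(min=b"andover", max=b"tewsbury")

# Assumed result of a length-first comparison on the same data.
wrong = ColumnStats(min=b"andover", max=b"lexington")

for label, stats in [("correct", correct), ("wrong", wrong)]:
    keep = may_contain(stats, b"tewsbury")
    print(f"{label}: scan row group for city = 'tewsbury'? {keep}")
```

With the assumed wrong statistics, the row group is skipped for `city = 'tewsbury'` even though that value is present, so the query silently drops a matching row.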