Empty collection in array or map lead to incorrect results in parquet reader #7776
Comments
I am taking a look, thanks! Initial inspection suggests there is a bug in the offsets and sizes vectors of type Array, or in the nulls of type Row.
@jaystarshot I'm working on the refactor of null handling. It should resolve this issue.
We encounter the same issue when reading maps and will keep tracking this issue.
Hi, I realized this has already been caught by an existing test. I did an analysis and found that the problem is related to empty collections: whenever there is an empty array or map, the final result is incorrect. There is already a unit test that captures the issue. I used the following data: an array of type ARRAY<VARCHAR> containing 3 elements:
0: 2 elements starting at 0 {a, null}
1: <empty>
2: 2 elements starting at 2 {null, b}
array vector:
offsets: 0, 2, 2
sizes: 2, 0, 2
values: a, null, null, b
the actual output: size = 0
0: 2 elements starting at 0 {a, null}
1: <empty>
2: 2 elements starting at 2 {null, null}
array vector:
offsets: 0, 2, 2
sizes: 2, 0, 2
values: a, null, null, null
parquet meta data:
lniu@lniu-FXGFKFV Downloads % parquet meta array_2.parquet
File path: array_2.parquet
Created by: parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
Properties:
writer.model.name: example
Schema:
message spark_schema {
  required group _1 (LIST) {
    repeated group list {
      optional binary element (STRING);
    }
  }
}
Row group 0: count: 3 20.00 B records start: 4 total(compressed): 60 B total(uncompressed):46 B
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
_1.list.element BINARY G _ 5 12.00 B 3 "a" / "b"
the data in parquet format is:
def rep data
2 0 a
1 1 null
0 0 <empty>
1 0 null
2 1 b
Current findings:
Possible Solutions:
I also realized this is so far the only issue that leads to incorrect data when [we use the Presto unit tests to test the Velox parquet reader](https://gist.github.com/qqibrow/689ed97b91cc0b58337be96a86291301). If [I remove all generation of empty lists in the Presto unit tests](https://github.com/qqibrow/presto/pull/1/files), all data incorrectness issues go away.
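To make the def/rep listing above concrete, here is a minimal Python sketch (not Velox code) of how definition and repetition levels for this schema map back to offsets, sizes, and values. It assumes the schema shown in the metadata (required list, optional elements), so a definition level of 0 means an empty list, 1 means a null element, and 2 means a present value. The function name `decode_list_levels` is hypothetical.

```python
from typing import List, Optional, Tuple


def decode_list_levels(
    def_levels: List[int],
    rep_levels: List[int],
    values: List[str],
    max_def: int = 2,
) -> Tuple[List[int], List[int], List[Optional[str]]]:
    """Rebuild (offsets, sizes, elements) for a required LIST of optional values.

    Assumption: def 0 = empty list, def 1 = null element, def max_def = value.
    """
    offsets: List[int] = []
    sizes: List[int] = []
    elements: List[Optional[str]] = []
    non_null = iter(values)  # only non-null values are stored in the data pages

    for d, r in zip(def_levels, rep_levels):
        if r == 0:
            # Repetition level 0 starts a new top-level row.
            offsets.append(len(elements))
            sizes.append(0)
        if d == 0:
            # Empty list: the row exists but contributes no element slot.
            continue
        sizes[-1] += 1
        elements.append(next(non_null) if d == max_def else None)

    return offsets, sizes, elements


offsets, sizes, elements = decode_list_levels(
    [2, 1, 0, 1, 2], [0, 1, 0, 0, 1], ["a", "b"]
)
print(offsets, sizes, elements)
```

With the levels from the file above, this yields the expected vector (the last element is `b`, not null). The key point is that the entry with def 0 must not consume an element slot.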
@qqibrow Thanks for the findings. Yes, NestedStructureDecoder::readOffsetsAndNulls() will replace Arrow's rep/def decoder and should process this case correctly. However, it has a known bug, and it requires more changes than just replacing the rep/def decoder. I have the fix for the known bug, but haven't had time to send PRs. Once the first Iceberg equality delete PR is out I'll start working on this, hopefully next week.
@yingsu00 Is this a known issue for maps only, or does it also impact arrays/structs? We noticed a parquet parsing issue on arrays.
Does it impact StringView? We hit a bug where a string should be null but is returned as "".
I tried various parquet files with different combinations of empty and null. It seems to be an issue only with empty arrays, as @qqibrow mentioned, and not related to nulls at the moment.
@hitarth, @qqibrow, and I debugged this issue.
Forgot about it. It's fixed by #9129
PR #9187 is ready for review
Summary: Fixes facebookincubator#7776

Parquet has a notion of optional and repeated layers, which is needed in Arrow calls like [DefLevelsToBitmap](https://github.com/facebookincubator/velox/blob/7fc09667d5e22c684fdeff81da529b79cc974fee/velox/dwio/parquet/reader/PageReader.cpp#L573). This info is passed using arrow:LevelInfo. We were incorrectly computing **repeatedAncestor** by ignoring optional fields, which is fixed in this PR.

Parquet has a 3-level structure for nested types: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists

```
// List<String> (list non-null, elements nullable)
1. required group my_list (LIST) {
2.   repeated group list {
3.     optional binary element (UTF8);
     }
   }
```

However, when we read this and convert it to **ParquetTypeWithId** in the current Velox parquet reader, we ignore the intermediate layer 2. **repeated group list** (grandfather logic) in https://github.com/facebookincubator/velox/pull/9187/files#diff-64787e76c1b0ad12b5764770a94acd62054896a762ccead8f083a71a060f2f44R325.

Pull Request resolved: facebookincubator#9187
Reviewed By: mbasmanova
Differential Revision: D55975472
Pulled By: Yuhta
fbshipit-source-id: d0972b3134cc710645a9f50cd74a23efac830751
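As a rough illustration of the level bookkeeping the summary refers to, the sketch below walks a schema path from the root to a leaf and tracks three numbers: the maximum definition level, the maximum repetition level, and the definition level of the nearest repeated ancestor. This is a simplified Python model with hypothetical names (`Levels`, `compute_levels`), not the Velox or Arrow implementation. It only shows why optional layers have to be counted when locating the repeated ancestor.

```python
from typing import List, NamedTuple


class Levels(NamedTuple):
    def_level: int
    rep_level: int
    repeated_ancestor_def_level: int


def compute_levels(path: List[str]) -> Levels:
    """Accumulate levels along a root-to-leaf path of repetition types.

    Each entry is "required", "optional", or "repeated" (hypothetical encoding).
    """
    def_level = rep_level = ancestor = 0
    for repetition in path:
        if repetition == "optional":
            # An optional layer adds one definition level.
            def_level += 1
        elif repetition == "repeated":
            # A repeated layer adds one definition and one repetition level,
            # and becomes the nearest repeated ancestor for everything below.
            def_level += 1
            rep_level += 1
            ancestor = def_level
        elif repetition != "required":
            raise ValueError(f"unknown repetition type: {repetition}")
    return Levels(def_level, rep_level, ancestor)


# Schema from this issue: required group -> repeated group -> optional element.
print(compute_levels(["required", "repeated", "optional"]))
# A nullable list: optional group -> repeated group -> optional element.
print(compute_levels(["optional", "repeated", "optional"]))
```

In this model, a value whose definition level is below `repeated_ancestor_def_level` belongs to an empty or null list and occupies no element slot. For the nullable list, the optional layer above the repeated group raises that threshold from 1 to 2, so a computation that skipped optional layers would place it too low.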
Bug description
parquet-tools output is different than velox parquet reader output:
System information
Velox System Info v0.0.2
Commit: 1e186e548833750cdee4b95d829711ddad78aba1
CMake Version: 3.16.3
System: Linux-5.4.0-1063-aws
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 9.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 9.4.0
CMake Prefix Path: /usr/local;/usr;/;/usr;/usr/local;/usr/X11R6;/usr/pkg;/opt
Relevant logs