Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empty collection in array or map lead to incorrect results in parquet reader #7776

Closed
qqibrow opened this issue Nov 28, 2023 · 14 comments
Closed
Labels
bug Something isn't working triage Newly created issue that needs attention.

Comments

@qqibrow
Copy link
Collaborator

qqibrow commented Nov 28, 2023

Bug description

parquet-tools output is different than velox parquet reader output:

lniu@devrestricted-lniu:~/velox_parquet_test_triage/fail_parquet_files/testArray$ parquet-tools show native_parquet_reader_test2148048251466121645parquet
+-------------------+
| test              |
|-------------------|
| [ 1. nan  3.]     |
| [5.]              |
| [nan nan nan]     |
| []                |
| [ 7. 11.]         |      ---> difference starting here 
| [nan]             |
| [13.]             |
| [17.]             |
| [ 1. nan  3.]     |
| [5.]              |
| [nan nan nan  7.] |
| [11. nan 13.]     |
+-------------------+
lniu@devrestricted-lniu:~/velox_parquet_test_triage/fail_parquet_files/testArray$ ~/velox_scan_parquet native_parquet_reader_test2148048251466121645parquet
number of rows: 12
velox type: ROW<test:ARRAY<INTEGER>>
{3 elements starting at 0 {1, null, 3}}
{1 elements starting at 3 {5}}
{3 elements starting at 4 {null, null, null}}
{<empty>}
{2 elements starting at 7 {null, 7}} ---> difference starting here 
{1 elements starting at 9 {11}}
{1 elements starting at 10 {null}}
{1 elements starting at 11 {13}}
{3 elements starting at 12 {17, 1, null}}
{1 elements starting at 15 {3}}
{4 elements starting at 16 {5, null, null, null}}
{3 elements starting at 20 {7, 11, null}}

System information

Velox System Info v0.0.2
Commit: 1e186e548833750cdee4b95d829711ddad78aba1
CMake Version: 3.16.3
System: Linux-5.4.0-1063-aws
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 9.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 9.4.0
CMake Prefix Path: /usr/local;/usr;/;/usr;/usr/local;/usr/X11R6;/usr/pkg;/opt

Relevant logs

file to reproduce: https://www.dropbox.com/scl/fi/ln4d2tnoz497w309dr2qr/native_parquet_reader_test2148048251466121645parquet?rlkey=3ke9bkml3w35ij3g6gdymmz26&dl=0
@qqibrow qqibrow added bug Something isn't working triage Newly created issue that needs attention. labels Nov 28, 2023
@aditi-pandit
Copy link
Collaborator

@yingsu00 @nmahadevuni

@jaystarshot
Copy link
Contributor

jaystarshot commented Nov 30, 2023

I am taking a look thanks! Initial inspection seems like there is some bug in the offsets and size vectors of type Array or null of type Row

@yingsu00
Copy link
Collaborator

yingsu00 commented Dec 4, 2023

@jaystarshot I'm working on the refactor of null handling. It should resolve this issue.

@Yohahaha
Copy link
Contributor

We encounter same issue when reading map, will keep tracking this issue.

@qqibrow
Copy link
Collaborator Author

qqibrow commented Jan 13, 2024

Hi, I realized this has already been caught in existing test. I did an analysis and found the problem is empty collection related. whenever there is a empty array or map, the final result will be incorrect.

There is already a unit test capture the issue. I use the data https://github.com/facebookincubator/velox/blob/main/velox/dwio/parquet/tests/examples/array_2.parquet . the expect output is:

A array with type ARRAY<VARCHAR> containing 3 elements:

0: 2 elements starting at 0 {a, null}
1: <empty>
2: 2 elements starting at 2 {null, b}

array vector:

offset size  value
0      2     a
2      0     null 
2      2     null
             b

the actual output:

size = 0
0: 2 elements starting at 0 {a, null}
1: <empty>
2: 2 elements starting at 2 {null, null}

array vector:

offset size  value
0      2     a
2      0     null 
2      2     null
             null

parquet meta data:

lniu@lniu-FXGFKFV Downloads % parquet meta array_2.parquet

File path:  array_2.parquet
Created by: parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
Properties:
  writer.model.name: example
Schema:
message spark_schema {
  required group _1 (LIST) {
    repeated group list {
      optional binary element (STRING);
    }
  }
}

Row group 0:  count: 3  20.00 B records  start: 4  total(compressed): 60 B total(uncompressed):46 B
--------------------------------------------------------------------------------
                 type      encodings count     avg size   nulls   min / max
_1.list.element  BINARY    G   _     5         12.00 B    3       "a" / "b"

the data in parquet format is:

def rep data
2   0   a 
1   1   b 
0   0
1   0
2   1

Current findings:

  1. offset and size are right. RepeatedColumnReader::setLengthsFromRepDefs is correct.
  2. on leaf level. pageReader.readPageDefLevels() which calls arrow::DefLevelsToBitmap() that makes leafNulls_ is “0b10001” . this is incorrect because 3rd def is 0, which is less than max def - 1, means the value doesn’t exist at all. (value has 3 status: exist_value, exist_null, not_exist)
  3. Given size info, ListColumnReader decided to read 4 elements from leafReader that becomes [’a’, null, null, null]

Possible Solutions:

  1. Given we are using arrow library to get nulls and value_to_read information, is it possible we are not using it correct? if so, fix the usage can fix the problem.
  2. don’t reply on arrow::DefLevelsToBitmap() when deciding leaf level data.
  3. Depending on whether 1 or 2 alone can cover all cases (nested array containing empty, array over map containing empty), might need to rebuild whole readPageDefLevels and not use arrow library here.

I also realized this is so far the only issue that can lead to data incorrectness issue when (we use presto unit test testing velox parquet)[https://gist.github.com/qqibrow/689ed97b91cc0b58337be96a86291301]. If [ I remove all generating empty list in presto unit test] (https://github.com/qqibrow/presto/pull/1/files), all data incorrectness issues will go away.

@yingsu00
Copy link
Collaborator

@qqibrow Thanks for the findings. Yes NestedStructureDecoder::readOffsetsAndNulls() will replace Arrow's rep/def decoder and should correctly process this case. However, it has a known bug, and it requires more changes than just replacing the rep/def decoder. I have the fix to the known bug, but haven't gotten time to send PRs. Once the first Iceberg equality delete PR is out I'll start working on this, hopefully next week.

@FelixYBW
Copy link

@yingsu00 is this known issue for map only? or also impact array/struct? We noted parquet parse issue on array

@nmahadevuni
Copy link
Collaborator

@yingsu00 is this known issue for map only? or also impact array/struct? We noted parquet parse issue on array

@FelixYBW This issue occurs with Array too

@FelixYBW
Copy link

does it impact stringview? We hit a bug that a string should be null but it return as ""

@jaystarshot
Copy link
Contributor

jaystarshot commented Mar 18, 2024

I tried various parquet files with different iterations of empty and null.
Through various iterations, confirmed that the result mismatch happened in these cases.
image
and
image
and
image

Seems to be just an issue with the empty array as @qqibrow mentioned and not related to null atm.
Maybe we are wrongly reading the define and repetition level of empty ?

@jaystarshot
Copy link
Contributor

jaystarshot commented Mar 20, 2024

@hitarth @qqibrow and me debugged this issue.
Seems like the current root cause is us not properly using arrows' arrow::DefLevelsToBitmap() function properly as we were not updating arrow::LevelInfo's repeated_ancestor_def_level properly.
We found out how arrow populated repeated_ancestor_def_level here which it apparently uses for empty checks.
So now we are trying to give arrow the correct repeated_ancestor_def_level. I have a wip here but this still needs some work.

@FelixYBW
Copy link

does it impact stringview? We hit a bug that a string should be null but it return as ""

forgot about it. It's fixed by #9129

@qqibrow qqibrow changed the title Null and empty array handling in parquet reader Empty collection in array or map lead to incorrect results in parquet reader Mar 26, 2024
@jaystarshot
Copy link
Contributor

jaystarshot commented Mar 26, 2024

Pr #9187 is ready for review

Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this issue Jun 7, 2024
Summary:
Fixes facebookincubator#7776

Parquet has notion of optional and repeated layers which is needed in arrow calls like [DefLevelsToBitmap](https://github.com/facebookincubator/velox/blob/7fc09667d5e22c684fdeff81da529b79cc974fee/velox/dwio/parquet/reader/PageReader.cpp#L573).

This info is passed using arrow:LevelInfo. We were incorrectly computing **repeatedAncestor** by ignoring optional fields which is fixed in this PR.

Parquet has 3 level structure for nested types
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists
```
// List<String> (list non-null, elements nullable)
1. required group my_list (LIST) {
2.   repeated group list {
3.    optional binary element (UTF8);
  }
}
```

However when we read this and convert to **ParquetTypeWithId** in current velox parquet reader, we ignore the intermediated layer 2.  **repeated group list** (grandfather logic) in https://github.com/facebookincubator/velox/pull/9187/files#diff-64787e76c1b0ad12b5764770a94acd62054896a762ccead8f083a71a060f2f44R325.

Pull Request resolved: facebookincubator#9187

Reviewed By: mbasmanova

Differential Revision: D55975472

Pulled By: Yuhta

fbshipit-source-id: d0972b3134cc710645a9f50cd74a23efac830751
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage Newly created issue that needs attention.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants