Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Summary of Parquet reader Issues #9560

Open
8dukongjian opened this issue Apr 22, 2024 · 31 comments
Open

Summary of Parquet reader Issues #9560

8dukongjian opened this issue Apr 22, 2024 · 31 comments
Labels
bug Something isn't working parquet triage Newly created issue that needs attention.

Comments

@8dukongjian
Copy link
Contributor

8dukongjian commented Apr 22, 2024

Bug description

Using the test method provided by @qqibrow #7478, four compression formats(GZIP, SNAPPY, LZO and UNCOMPRESSED) and two parquet versions(V1 and V2) were tested, with a total of eight test scenarios. Summarize the problems discovered by the test into the table below.

Bugs

Seq Error Compression formats Parquet Version issue PR File to Reproduce State Owner
1 Empty collection in array or map lead to incorrect results in parquet reader GZIP, SNAPPY, LZO and UNCOMPRESSED V1 and V2 #7776 #9187 Merged jaystarshot
2 Parsing complex type errors, eg, ARRAY<STRUCT<test:string>> is parsed into ARRAY<string> GZIP, SNAPPY, LZO and UNCOMPRESSED V1 and V2 #9242 #9533 array_struct.zip Merged chliang71 qqibrow
3 children size should not be larger than 2 GZIP, SNAPPY, LZO and UNCOMPRESSED V1 and V2 #9533 children.zip Merged chliang71 qqibrow
4 presetNullsConsumed_ == presetNullsSize_ GZIP, SNAPPY, LZO and UNCOMPRESSED V1 and V2 #9238 #9728 presetNulls.zip Under repair hitarth
5 ColumnMetaData does not exist for schema Id GZIP, SNAPPY, LZO and UNCOMPRESSED V1 and V2 #9239 #9223 ColumnMetaData.zip Co makagonov
6 decompression failed, decompressedSize is not equal to remainingOutputSize LZO V1 and V2 #9618 #10123 lzo.zip Merged majetideepak
7 For raw decompression, compressedLength should be greater than zero SNAPPY V2 None #10121 snappy.zip Merged majetideepak
8 Null pointer (missing decoder for TINYINT) GZIP/SNAPPY/LZO/UNCOMPRESSED V2 None None array.zip Not repaired hitarth
10 core dump in StringColumnReader::processFilter (missing decoder for VARBINARY on FLBA) GZIP/SNAPPY/LZO/UNCOMPRESSED V1 #9757 #9887 str_gzip.zip Merged majetideepak
11 Parquet PageReader incorrectly skips rep/def levels when the max values are 0 all V2 #9924 #9939 Merged yingsu00
12 Parquet reader: can't read parquet file with no column indexes (Map backward compatibility) all V1,V2 #9463 In Progress majetideepak
13 Suspected bug related to read with Mutation (possibly Iceberg) all V1,V2 #8973 In triage yingsu00
14 Velox parquet scan fail when select row index column before data column all V1,V2 #9867 Fix makeScanSpec wrong index issue if skip rowindex column#9866 Merged gaoyangxiaozhu
15 Tokenizer cannot parse column name with space in it(special char in column names) all V1,V2 #10348 Under repair yingsu00
16 Support "parquet_use_column_names" = false in Velox (special char in column names) all V1,V2 #10388 #10085 Under repair agrawalreetika yingsu00
17 Ubuntu build failing when VELOX_ENABLE_PARQUET flag is enabled(Build issue) all V1,V2 #10323 Under triage
18 Parquet reader: can't read null map row in a single line file all V1,V2 #10510 Remove fast path for all null in filter Under repair yma11

File to Reproduce:

int96.zip

timestamp_mills.zip

timestamp_micros.zip

array_struct.zip

children.zip

presetNulls.zip

ColumnMetaData.zip

lzo.zip

delta_byte_array.zip

rle.zip

array.zip
snappy.zip

System information

None

Relevant logs

No response

The more complete feature request is in #9767
Feature requests (back up):

Seq Error Compression formats Parquet Version issue PR File to Reproduce State Owner
1 Reading INT 96 is not supported GZIP, SNAPPY, LZO and UNCOMPRESSED V1 and V2 dictionary encoding: #4680
plain encoding: oap-project#456
int96.zip Development ?
2 Reading INT64(TIMESTAMP_MICROS) is not supported GZIP, SNAPPY, LZO and UNCOMPRESSED V1 and V2 #8325 timestamp_micros.zip Development mskapilks
3 Reading INT64(TIMESTAMP_MILLIS) is not supported GZIP, SNAPPY, LZO and UNCOMPRESSED V1 and V2 #8325 timestamp_mills.zip Development mskapilks
4 Encoding not supported yet: DELTA_BYTE_ARRAY GZIP/SNAPPY/LZO/UNCOMPRESSED V2 #10938 None delta_byte_array.zip Development yingsu00
@8dukongjian 8dukongjian added bug Something isn't working triage Newly created issue that needs attention. labels Apr 22, 2024
@xumingming
Copy link
Contributor

@8dukongjian Very informative, thanks! I'd suggest submit issues for each problem you found with detailed description and reproducible case, and use this issue as an umbrella issue to track all of them.

@8dukongjian
Copy link
Contributor Author

@8dukongjian Very informative, thanks! I'd suggest submit issues for each problem you found with detailed description and reproducible case, and use this issue as an umbrella issue to track all of them.

Thanks, good suggestions,I will work it later.

@8dukongjian
Copy link
Contributor Author

@mbasmanova mbasmanova changed the title Summary of Bugs in Velox Parquet Reader Summary of bugs in Parquet reader Apr 22, 2024
@mbasmanova
Copy link
Contributor

@8dukongjian Thank you for the summary. Do you plan to work on fixing these?

CC: @FelixYBW @majetideepak

@FelixYBW
Copy link

Thank you! It's what we need. @yma11 can add more tests result from parquet-mr later.

@yma11
Copy link
Contributor

yma11 commented Apr 23, 2024

Thank you! It's what we need. @yma11 can add more tests result from parquet-mr later.

Currently issue 9463 is opened for the failure found by parquet-mr. I can add an additional table here when new more failures found.

@qqibrow
Copy link
Collaborator

qqibrow commented Apr 23, 2024

Thanks! looks there are more issues related to encoding support which are not covered before.

@aditi-pandit
Copy link
Collaborator

@yzhang1991

@yingsu00
Copy link
Collaborator

@8dukongjian Thanks Sitao. I'll triage them tomorrow myself, then we will go over the list in this week's Parquet sync meeting on Friday 1pm. Please feel free to join.

@yingsu00
Copy link
Collaborator

Thanks! looks there are more issues related to encoding support which are not covered before.

Hi @qqibrow are you talking about the V2 encodings or existing ones?

@liujiayi771
Copy link
Contributor

@yingsu00 Are there any plans to support Parquet v2 encoding?

@yingsu00
Copy link
Collaborator

@yingsu00 Are there any plans to support Parquet v2 encoding?

Yes I do have the plan to support v2 encodings, but it will be second half. Do you need it urgently?

@ethanyzhang
Copy link

@yingsu00 So I have a silly question: Reetika tried to write hive tables in Prestissimo and the parquet file has format version 2.6 when I inspected it with parquet-tools. So that’s not Parquet v2? Where should I look at if I want to tell the version of a Parquet file?

@ethanyzhang
Copy link

Or maybe that file didn’t use any Parquet v2 specific features so Prestissimo can still read it…

@yingsu00
Copy link
Collaborator

@yzhang1991 I think the 2.x version refers to the DataPage version, and does not necessarily mean the encodings of the data are all V2. The writer could just be encoding the data in V1 encodings.

We usually say the following encodings are V2 encodings, see https://parquet.apache.org/docs/file-format/data-pages/encodings/

  • 5 DELTA_BINARY_PACKED,
  • 6 DELTA_LENGTH_BYTE_ARRAY
  • 7 DELTA_BYTE_ARRAY
  • 9 BYTE_STREAM_SPLIT

Currently the Velox Parquet reader can read both V1 and V2 DataPage headers, but only support 9 in from this PR. Support BYTE_STREAM_SPLIT encoding in native Parquet reader 5,6,7 need to be added.

@liujiayi771
Copy link
Contributor

Yes I do have the plan to support v2 encodings, but it will be second half. Do you need it urgently?

@yingsu00 Yes, we need to use DELTA_BYTE_ARRAY encoding.

@yingsu00
Copy link
Collaborator

@8dukongjian I took the liberty to edit the table and added a "seq" column to help us quickly identify the issues

Also created Parquet reading failed to decompress LZO files and @nmahadevuni will take a look.

@yingsu00
Copy link
Collaborator

Yes I do have the plan to support v2 encodings, but it will be second half. Do you need it urgently?

@yingsu00 Yes, we need to use DELTA_BYTE_ARRAY encoding.

Ok, we'll prioritize this encoding then.

@8dukongjian
Copy link
Contributor Author

@8dukongjian I took the liberty to edit the table and added a "seq" column to help us quickly identify the issues

Also created Parquet reading failed to decompress LZO files and @nmahadevuni will take a look.

Thanks, now the table is clearer.

@qqibrow
Copy link
Collaborator

qqibrow commented Apr 30, 2024

@8dukongjian does your team have bandwidth to take No.9 in bugs? currently no one is working on that.

@yingsu00 yingsu00 changed the title Summary of bugs in Parquet reader Summary of Parquet reader Issues May 2, 2024
@yingsu00
Copy link
Collaborator

yingsu00 commented May 2, 2024

@qqibrow @8dukongjian I will take a look at bug no. 9

@qqibrow
Copy link
Collaborator

qqibrow commented May 2, 2024

@yingsu00 thanks. I haven't got the time for a detail check. I am wondering whether

dictionaryIdDecoder_ = std::make_unique<RleBpDataDecoder>(
or some of it can be reused.

@yingsu00
Copy link
Collaborator

yingsu00 commented May 6, 2024

@qqibrow We need a BooleanRleBpDataDecoder. There are a number of other types/decoders need to be added. I'll create a separate issue for it.

@hitarth
Copy link
Collaborator

hitarth commented May 6, 2024

Took a quick looks at bug 8 mentioned above, named Null pointer , It seems to be due to unsupported DELTA_BINARY_PACKED encoding as mentioned in the above comment here.

@yma11
Copy link
Contributor

yma11 commented May 7, 2024

@qqibrow @yingsu00 one core dump issue found in parquet read fuzzer test. I added it in the bug list with file provided, please also help take a look. Thanks! cc @FelixYBW .

@qqibrow
Copy link
Collaborator

qqibrow commented May 8, 2024

@yma11 Thanks! could you create a issue and share the stacktrace and file to reproduce there? also, are you going to work on that?

@yma11
Copy link
Contributor

yma11 commented May 9, 2024

@yma11 Thanks! could you create a issue and share the stacktrace and file to reproduce there? also, are you going to work on that?

9757 is created for track. So we need create issue for each failure? I thought we would like to keep in this consolidated place. I don't have bandwidth for fix for now but firstly focus on finding more issues leverage parquet fuzzer test. Will you can help on it?

@FelixYBW
Copy link

FelixYBW commented May 9, 2024

9757 is created for track. So we need create issue for each failure? I thought we would like to keep in this consolidated place. I don't have bandwidth for fix for now but firstly focus on finding more issues leverage parquet fuzzer test. Will you can help on it?

Let's create one issue for each failure reason, and document clearly the failure message. It's at least easy to know the issue we meet during Gluten run is already known in the list or not.

@8dukongjian
Copy link
Contributor Author

@yma11 Thanks! could you create a issue and share the stacktrace and file to reproduce there? also, are you going to work on that?

9757 is created for track. So we need create issue for each failure? I thought we would like to keep in this consolidated place. I don't have bandwidth for fix for now but firstly focus on finding more issues leverage parquet fuzzer test. Will you can help on it?

Thanks, I updated #9757 in the table

facebook-github-bot pushed a commit that referenced this issue Jun 25, 2024
Summary:
Resolves the compression issue 7 here #9560

Pull Request resolved: #10121

Reviewed By: Yuhta

Differential Revision: D59012499

Pulled By: bikramSingh91

fbshipit-source-id: 804464fa0f22e7311650c23d48aa0a75fcd9f601
@weixiuli
Copy link

weixiuli commented Jul 5, 2024

@8dukongjian #10395

@yma11
Copy link
Contributor

yma11 commented Aug 14, 2024

@8dukongjian 10752 and fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working parquet triage Newly created issue that needs attention.
Projects
None yet
Development

No branches or pull requests