Description
I think it is a common practice, when inserting table data as a Parquet file, to reuse the same row object across rows; if a column is a byte[] of fixed length, the byte[] is typically reused as well.
If I use ByteArrayBackedBinary to wrap my byte[], the bug occurs: all of the row groups created by a single task end up with the same max and min binary value, namely the last row's binary content.
The reason is that BinaryStatistics keeps max and min as parquet.io.api.Binary references rather than copies; because ByteArrayBackedBinary wraps the byte[] without copying it, the contents of max and min always point at the reused byte[] and therefore reflect the latest row.
Does Parquet declare anywhere that users must not reuse a byte[] backing a Binary value? If it doesn't, I think it's a bug, and it can be reproduced with Spark SQL's RowWriteSupport.
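A minimal sketch of the aliasing problem (assuming the pre-1.8 `parquet.*` package names and the public `BinaryStatistics`/`Binary` APIs in use when this was reported; the exact printed values are my reading of the code path, not a verified run):

```java
import parquet.column.statistics.BinaryStatistics;
import parquet.io.api.Binary;

public class ReusedBinaryStatsRepro {
  public static void main(String[] args) {
    BinaryStatistics stats = new BinaryStatistics();
    byte[] buf = new byte[] { 'z' };  // buffer reused across rows

    // Row 1: value "z". Binary.fromByteArray wraps buf without copying,
    // so min and max become references into buf.
    stats.updateStats(Binary.fromByteArray(buf));

    // Row 2: the writer overwrites the same buffer in place with "a".
    buf[0] = 'a';
    stats.updateStats(Binary.fromByteArray(buf));

    // Expected: min = "a", max = "z".
    // Observed: both print "a", because the stored max Binary still
    // points at buf, whose contents were overwritten by the last row.
    System.out.println("min = " + stats.getMin().toStringUsingUTF8());
    System.out.println("max = " + stats.getMax().toStringUsingUTF8());
  }
}
```

Copying the bytes when the statistics are updated, or documenting that a byte[] handed to Binary must not be mutated afterwards, would presumably avoid this.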
The related Spark JIRA ticket: SPARK-6859
Reporter: Yijie Shen / @yjshen
Assignee: Ashish Singh / @SinghAsDev
Related issues:
- Release Parquet 1.8.0 (blocks)
- Improvements in ByteBuffer read path (is contained by)
- Binary statistics are invalid if buffers are reused (is duplicated by)
- Binary statistics is not updated correctly if an underlying Binary array is modified in place (is duplicated by)
PRs and other links:
Note: This issue was originally created as PARQUET-251. Please see the migration documentation for further details.