Description
I think it is a common practice, when inserting table data as a Parquet file, to reuse the same row object across rows; if a column is a byte[] of fixed length, the byte[] is typically reused as well.
If I use ByteArrayBackedBinary to wrap my byte[], the bug occurs: all of the row groups created by a single task end up with the same max and min binary value, namely the last row's binary content.
The reason is that BinaryStatistics keeps max and min as parquet.io.api.Binary references rather than copies; because ByteArrayBackedBinary wraps the byte[] without copying it, the contents of max and min always point at the reused byte[] and therefore reflect the latest row.
Does Parquet declare anywhere that users must not reuse a byte[] backing a Binary value? If it doesn't, I think it's a bug, and it can be reproduced with Spark SQL's RowWriteSupport.
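A minimal sketch of the aliasing problem (assuming the pre-1.8 `parquet.*` package names and the public `BinaryStatistics`/`Binary` APIs in use when this was reported; the exact printed values are my reading of the code path, not a verified run):

```java
import parquet.column.statistics.BinaryStatistics;
import parquet.io.api.Binary;

public class ReusedBinaryStatsRepro {
  public static void main(String[] args) {
    BinaryStatistics stats = new BinaryStatistics();
    byte[] buf = new byte[] { 'z' };  // buffer reused across rows

    // Row 1: value "z". Binary.fromByteArray wraps buf without copying,
    // so min and max become references into buf.
    stats.updateStats(Binary.fromByteArray(buf));

    // Row 2: the writer overwrites the same buffer in place with "a".
    buf[0] = 'a';
    stats.updateStats(Binary.fromByteArray(buf));

    // Expected: min = "a", max = "z".
    // Observed: both print "a", because the stored max Binary still
    // points at buf, whose contents were overwritten by the last row.
    System.out.println("min = " + stats.getMin().toStringUsingUTF8());
    System.out.println("max = " + stats.getMax().toStringUsingUTF8());
  }
}
```

Copying the bytes when the statistics are updated, or documenting that a byte[] handed to Binary must not be mutated afterwards, would presumably avoid this.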
The related Spark JIRA ticket: SPARK-6859
Reporter: Yijie Shen / @yjshen
Assignee: Ashish Singh / @SinghAsDev
Related issues:
- Release Parquet 1.8.0 (blocks)
- Improvements in ByteBuffer read path (is contained by)
- Binary statistics are invalid if buffers are reused (is duplicated by)
- Binary statistics is not updated correctly if an underlying Binary array is modified in place (is duplicated by)
PRs and other links:
Note: This issue was originally created as PARQUET-251. Please see the migration documentation for further details.