
Binary column statistics error when reusing byte[] among rows #1433

@asfimport

Description


I think it is common practice, when writing table data to a Parquet file, to reuse the same object across rows; if a column is a byte[] of fixed length, the byte[] is reused as well.

If I use ByteArrayBackedBinary for my byte[], the bug occurs: all of the row groups created by a single task end up with the same max & min binary value, namely the last row's binary content.

The reason is that BinaryStatistics keeps max & min as parquet.io.api.Binary references; since ByteArrayBackedBinary wraps the byte[] without copying it, the real content of max & min always points to the reused byte[], and therefore to the latest row's content.
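To make the aliasing concrete, here is a minimal, self-contained Java sketch. NaiveBinaryStats is a hypothetical stand-in for BinaryStatistics (not Parquet's actual code); it tracks min/max by keeping references to the values it is given, which is the essence of the bug described above:

```java
import java.util.Arrays;

// Simplified stand-in for BinaryStatistics: min/max are kept as
// references to the values passed in, never copied.
class NaiveBinaryStats {
    byte[] min, max;

    void update(byte[] value) {
        // Keeps a reference, not a copy -- the root of the bug.
        if (min == null || compare(value, min) < 0) min = value;
        if (max == null || compare(value, max) > 0) max = value;
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }
}

public class ReusedBufferDemo {
    public static void main(String[] args) {
        NaiveBinaryStats stats = new NaiveBinaryStats();
        byte[] buffer = new byte[3];  // one buffer reused for every row
        byte[][] rows = {{'a','b','c'}, {'z','z','z'}, {'m','m','m'}};
        for (byte[] row : rows) {
            System.arraycopy(row, 0, buffer, 0, 3); // overwrite the shared buffer
            stats.update(buffer);                   // stats now alias the buffer
        }
        // Both lines print "mmm" -- the last row -- instead of "abc"/"zzz".
        System.out.println("min = " + new String(stats.min));
        System.out.println("max = " + new String(stats.max));
    }
}
```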

Does Parquet declare anywhere that the user must not reuse a byte[] for the Binary type? If it does not, I think this is a bug; it can be reproduced with Spark SQL's RowWriteSupport.
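For comparison, a defensive-copy variant of the update method in the sketch above avoids the aliasing. This is only a sketch of one possible remedy, not necessarily the fix that was eventually applied in parquet-mr:

```java
// Hypothetical fix: copy the bytes whenever a new min or max is
// recorded, so the statistics never alias the caller's reused buffer.
void update(byte[] value) {
    if (min == null || compare(value, min) < 0) min = value.clone();
    if (max == null || compare(value, max) > 0) max = value.clone();
}
```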

The related Spark JIRA ticket: SPARK-6859

Reporter: Yijie Shen / @yjshen
Assignee: Ashish Singh / @SinghAsDev

Note: This issue was originally created as PARQUET-251. Please see the migration documentation for further details.
