Binary statistics is not updated correctly if an underlying Binary array is modified in place #1792

@asfimport

Description

The following test case shows the problem:

    byte[] bytes = new byte[] { 49 };
    BinaryStatistics reusableStats = new BinaryStatistics();
    reusableStats.updateStats(Binary.fromByteArray(bytes));
    // Modify the same array in place, as a reader that reuses its buffer would.
    bytes[0] = 50;
    reusableStats.updateStats(Binary.fromByteArray(bytes, 0, 1));

    // Expected statistics: min is the first value (49), max is the second (50).
    assertArrayEquals(new byte[] { 49 }, reusableStats.getMinBytes());
    assertArrayEquals(new byte[] { 50 }, reusableStats.getMaxBytes());

I discovered the bug when converting an Avro file to a Parquet file by reading GenericRecords with the DataFileStream.next(D reuse) method. The underlying byte array of the Avro Utf8 object is passed to Parquet, which keeps it as part of BinaryStatistics, and the same array is then modified in place on the next read.
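To make the aliasing easier to see without Parquet or Avro on the classpath, here is a minimal sketch. `AliasingSketch` is a hypothetical stand-in that simply keeps references to the arrays it receives; it is not the actual BinaryStatistics code, and the unsigned byte comparison is only an assumption made for the illustration.

```java
import java.util.Arrays;

// Hypothetical stand-in for a statistics holder that keeps references
// to the arrays it is given. This is NOT Parquet's BinaryStatistics.
final class AliasingSketch {
    private byte[] min;
    private byte[] max;

    // Stores the caller's array itself, without copying it.
    void update(byte[] value) {
        if (min == null || Arrays.compareUnsigned(value, min) < 0) min = value;
        if (max == null || Arrays.compareUnsigned(value, max) > 0) max = value;
    }

    byte[] getMin() { return min; }
    byte[] getMax() { return max; }

    public static void main(String[] args) {
        byte[] reused = new byte[] { 49 };
        AliasingSketch stats = new AliasingSketch();
        stats.update(reused);   // min and max both point at `reused`
        reused[0] = 50;         // simulates the reader overwriting its buffer
        stats.update(reused);   // the array is now compared with itself
        // The recorded minimum has silently changed from 49 to 50.
        System.out.println(Arrays.toString(stats.getMin())); // prints [50]
    }
}
```

In this sketch the second update never replaces anything, because the stored minimum and the new value are the same array object.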

I am not sure where the problem should be fixed (in BinaryStatistics or in AvroWriteSupport).

If the BinaryStatistics implementation is intentionally written this way (for performance reasons), then this behavior should be documented and AvroWriteSupport.fromAvroString should be fixed to copy the underlying Utf8 array.
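A rough sketch of the "copy on the caller side" option, again using plain byte arrays rather than the real classes. `DefensiveCopySketch` and its `snapshot` helper are hypothetical names for illustration only; an actual change to AvroWriteSupport.fromAvroString might look different.

```java
import java.util.Arrays;

// Hypothetical sketch of copying the buffer before handing it over.
// None of these names come from Parquet or Avro.
final class DefensiveCopySketch {
    private byte[] min;
    private byte[] max;

    // Keeps references to whatever it is given, as in the reported behavior.
    void update(byte[] value) {
        if (min == null || Arrays.compareUnsigned(value, min) < 0) min = value;
        if (max == null || Arrays.compareUnsigned(value, max) > 0) max = value;
    }

    byte[] getMin() { return min; }
    byte[] getMax() { return max; }

    // Copies the first `length` bytes so that later writes to `buffer`
    // cannot affect the returned array.
    static byte[] snapshot(byte[] buffer, int length) {
        return Arrays.copyOf(buffer, length);
    }

    public static void main(String[] args) {
        byte[] reused = new byte[] { 49 };
        DefensiveCopySketch stats = new DefensiveCopySketch();
        stats.update(snapshot(reused, 1));
        reused[0] = 50;                    // buffer is overwritten by the next read
        stats.update(snapshot(reused, 1));
        System.out.println(Arrays.toString(stats.getMin())); // prints [49]
        System.out.println(Arrays.toString(stats.getMax())); // prints [50]
    }
}
```

The cost is one extra allocation and copy per value, which is the performance trade-off mentioned above.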

I am happy to create a pull request once the preferred fix has been agreed on.

Reporter: Konstantin Shaposhnikov / @kostya-sh

Note: This issue was originally created as PARQUET-258. Please see the migration documentation for further details.
