
Encoding issue with fixed length byte arrays #1706


Description


While running some tests against the master branch I hit an encoding issue that seemed like a bug to me.

I noticed that when writing a fixed-length byte array whose size is greater than dictionaryPageSize (512 in my test), the writer falls back to DELTA_BYTE_ARRAY encoding, as seen below:

Dec 17, 2014 3:41:10 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 12,125B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 1,710B raw, 1,710B comp, 5 pages, encodings: [DELTA_BYTE_ARRAY]
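
For context, here is a minimal sketch of the kind of write that triggers the fallback, assuming the 2014-era parquet-mr example Group API. The 1024-byte fixed length, the output path, and the class name are illustrative; only the flba_field column name and the 512-byte dictionaryPageSize come from the report:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import parquet.column.ParquetProperties.WriterVersion;
    import parquet.example.data.Group;
    import parquet.example.data.simple.SimpleGroupFactory;
    import parquet.hadoop.ParquetWriter;
    import parquet.hadoop.example.GroupWriteSupport;
    import parquet.hadoop.metadata.CompressionCodecName;
    import parquet.io.api.Binary;
    import parquet.schema.MessageType;
    import parquet.schema.MessageTypeParser;

    public class FlbaWriteRepro {
      public static void main(String[] args) throws Exception {
        // Fixed length (1024, assumed) is deliberately larger than the 512-byte
        // dictionaryPageSize, so the dictionary page overflows and the writer
        // falls back to DELTA_BYTE_ARRAY for the FIXED_LEN_BYTE_ARRAY column.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message test { required fixed_len_byte_array(1024) flba_field; }");

        Configuration conf = new Configuration();
        GroupWriteSupport.setSchema(schema, conf);

        ParquetWriter<Group> writer = new ParquetWriter<Group>(
            new Path("/tmp/flba.parquet"),           // illustrative path
            new GroupWriteSupport(),
            CompressionCodecName.UNCOMPRESSED,
            ParquetWriter.DEFAULT_BLOCK_SIZE,
            ParquetWriter.DEFAULT_PAGE_SIZE,
            512,                                     // dictionaryPageSize, as in the test
            true,                                    // enable dictionary encoding
            false,                                   // no validation
            WriterVersion.PARQUET_2_0,               // DataPageV2, as in the stack trace
            conf);

        SimpleGroupFactory factory = new SimpleGroupFactory(schema);
        byte[] value = new byte[1024];
        for (int i = 0; i < 5000; i++) {
          Group group = factory.newGroup();
          group.add("flba_field", Binary.fromByteArray(value));
          writer.write(group);
        }
        writer.close();
      }
    }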

But the subsequent read fails with the following exception:

Caused by: parquet.io.ParquetDecodingException: Encoding DELTA_BYTE_ARRAY is only supported for type BINARY
	at parquet.column.Encoding$7.getValuesReader(Encoding.java:193)
	at parquet.column.impl.ColumnReaderImpl.initDataReader(ColumnReaderImpl.java:534)
	at parquet.column.impl.ColumnReaderImpl.readPageV2(ColumnReaderImpl.java:574)
	at parquet.column.impl.ColumnReaderImpl.access$400(ColumnReaderImpl.java:54)
	at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:518)
	at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:510)
	at parquet.column.page.DataPageV2.accept(DataPageV2.java:123)
	at parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:510)
	at parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:502)
	at parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:604)
	at parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:348)
	at parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
	at parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
	at parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:267)
	at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
	at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
	at parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
	at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
	at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:129)
	at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:198)
	... 16 more

When the array's size is less than dictionaryPageSize, RLE_DICTIONARY encoding is used and reading works fine:

Dec 17, 2014 3:39:50 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 50B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 3B raw, 3B comp, 1 pages, encodings: [RLE_DICTIONARY, PLAIN], dic { 1 entries, 8B raw, 1B comp}
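
And a sketch of reading the file back with the example GroupReadSupport, assuming the path from the writer sketch above. In the failing case the first read() throws the ParquetDecodingException shown earlier; in the passing case it iterates normally:

    import org.apache.hadoop.fs.Path;
    import parquet.example.data.Group;
    import parquet.hadoop.ParquetReader;
    import parquet.hadoop.example.GroupReadSupport;

    public class FlbaReadRepro {
      public static void main(String[] args) throws Exception {
        ParquetReader<Group> reader =
            new ParquetReader<Group>(new Path("/tmp/flba.parquet"), new GroupReadSupport());
        Group group;
        while ((group = reader.read()) != null) {
          // With >512B fixed-length values, read() fails with
          // "Encoding DELTA_BYTE_ARRAY is only supported for type BINARY".
          group.getBinary("flba_field", 0);
        }
        reader.close();
      }
    }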

Reporter: Nezih Yigitbasi / @nezihyigitbasi
Assignee: Sergio Peña / @spena


Note: This issue was originally created as PARQUET-152. Please see the migration documentation for further details.
