Parquet-avro cannot decode Avro/Thrift array of primitive array (e.g. array<array<int>>) #1880

@asfimport

Description

The problematic Avro and Thrift schemas are:

record AvroArrayOfArray {
  array<array<int>> int_arrays_column;
}

and

struct ThriftListOfList {
  1: list<list<i32>> intArraysColumn;
}

They are converted to the following structurally equivalent Parquet schemas by parquet-avro 1.7.0 and parquet-thrift 1.7.0 respectively:

message AvroArrayOfArray {
  required group int_arrays_column (LIST) {
    repeated group array (LIST) {
      repeated int32 array;
    }
  }
}

and

message ParquetSchema {
  required group intListsColumn (LIST) {
    repeated group intListsColumn_tuple (LIST) {
      repeated int32 intListsColumn_tuple_tuple;
    }
  }
}

AvroIndexedRecordConverter cannot decode such records correctly. The reason is that the second-level repeated group (array in the Avro-derived schema, intListsColumn_tuple in the Thrift-derived one) doesn't pass the AvroIndexedRecordConverter.isElementType() check, so it is not recognized as the list element itself. To fix this issue, isElementType() should also check for the field name "array" and the field name suffix "_tuple".
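The proposed name checks could look roughly like the sketch below. It is not the parquet-avro implementation: Node is a hypothetical stand-in for a Parquet schema type, and the first two structural rules (repeated primitive, multi-field group) are assumptions about what the existing check already covers. Only the two name-based rules correspond to the fix suggested above.

```java
import java.util.Arrays;
import java.util.List;

class ElementTypeSketch {
    // Hypothetical, minimal model of a Parquet schema node (not the parquet-mr Type API).
    static final class Node {
        final String name;
        final boolean isGroup;
        final List<Node> fields;

        private Node(String name, boolean isGroup, List<Node> fields) {
            this.name = name;
            this.isGroup = isGroup;
            this.fields = fields;
        }

        static Node primitive(String name) {
            return new Node(name, false, Arrays.<Node>asList());
        }

        static Node group(String name, Node... fields) {
            return new Node(name, true, Arrays.asList(fields));
        }
    }

    // Returns true when the repeated node is the list element itself,
    // false when it is only a synthetic wrapper around the element.
    static boolean isElementType(Node repeated, String parentName) {
        if (!repeated.isGroup) {
            return true; // assumed existing rule: repeated primitive is the element
        }
        if (repeated.fields.size() > 1) {
            return true; // assumed existing rule: multi-field group is a record element
        }
        if (repeated.name.equals("array")) {
            return true; // proposed: name used by parquet-avro for the repeated field
        }
        if (repeated.name.equals(parentName + "_tuple")) {
            return true; // proposed: suffix used by parquet-thrift for the repeated field
        }
        return false; // single-field group with any other name: treat as a wrapper
    }
}
```

With these rules, the inner repeated group in both schemas above would be treated as an element (a nested list) rather than as a wrapper around a single int32 field.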

Reporter: Cheng Lian / @liancheng
Assignee: Ryan Blue / @rdblue

Note: This issue was originally created as PARQUET-364. Please see the migration documentation for further details.
