Error Reading Decimal Lists: ComplexObjectArrayReader Handles Repetition Levels Incorrectly #2253

Closed
tustvold opened this issue Aug 1, 2022 · 0 comments · Fixed by #2528
Labels
bug parquet Changes to the parquet crate

tustvold commented Aug 1, 2022

Describe the bug

ComplexObjectArrayReader does not use RecordReader and consequently does not correctly delimit semantic records when reading. In particular, it may yield a batch of values that truncates a row partway through. This in turn causes the parent ListArrayReader to return an error, because the repetition levels it receives are not consistent.
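To illustrate what "delimiting semantic records" means here, the following is a minimal sketch that assumes only the Parquet convention that a repetition level of 0 marks the start of a new record. The function name `levels_for_records` is hypothetical and is not part of the parquet crate; it is not how RecordReader is implemented, only an instance of the idea.

```rust
/// Returns how many leading levels are needed to cover `num_records`
/// complete records, or the full length if fewer records are present.
///
/// Hypothetical helper for illustration only; not a parquet crate API.
fn levels_for_records(rep_levels: &[i16], num_records: usize) -> usize {
    let mut records_started = 0;
    for (idx, &level) in rep_levels.iter().enumerate() {
        // A repetition level of 0 begins a new record.
        if level == 0 {
            if records_started == num_records {
                // The record starting here belongs to the next batch.
                return idx;
            }
            records_started += 1;
        }
    }
    rep_levels.len()
}
```

Cutting a batch at the index this returns keeps every row whole, whereas cutting after a fixed number of levels or values can land in the middle of a row.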

To Reproduce

#[test]
fn test_decimal_list() {
    let decimals = Decimal128Array::from_iter_values([1, 2, 3, 4, 5, 6, 7, 8]);

    // [[], [1], [2, 3], null, [4], null, [6, 7, 8]]
    let data = ArrayDataBuilder::new(ArrowDataType::List(Box::new(Field::new(
        "item",
        decimals.data_type().clone(),
        false,
    ))))
    .len(7)
    .add_buffer(Buffer::from_iter([0_i32, 0, 1, 3, 3, 4, 5, 8]))
    .null_bit_buffer(Some(Buffer::from(&[0b01010111])))
    .child_data(vec![decimals.into_data()])
    .build()
    .unwrap();

    let written = RecordBatch::try_from_iter([(
        "list",
        Arc::new(ListArray::from(data)) as ArrayRef,
    )])
    .unwrap();

    let mut buffer = Vec::with_capacity(1024);
    let mut writer =
        ArrowWriter::try_new(&mut buffer, written.schema(), None).unwrap();
    writer.write(&written).unwrap();
    writer.close().unwrap();

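    // Read back with a batch size of 3 rows, so the 7 written rows
    // should come back as batches of 3, 3, and 1.
    let read = ParquetFileArrowReader::try_new(Bytes::from(buffer))
        .unwrap()
        .get_record_reader(3)
        .unwrap()
        .collect::<ArrowResult<Vec<_>>>()
        .unwrap();

    // Each batch should equal the matching slice of the written data.
    assert_eq!(&written.slice(0, 3), &read[0]);
    assert_eq!(&written.slice(3, 3), &read[1]);
    assert_eq!(&written.slice(6, 1), &read[2]);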
}

Results in

ParquetError("Parquet error: first repetition level of batch must be 0")
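The sketch below shows one way such an error can arise for the list in the reproduction. It makes two assumptions that the issue does not state: that the column is a single-level list, so repetition levels are only 0 or 1, and that a reader cuts batches after a fixed number of levels rather than at record boundaries. Both function names are hypothetical and are not parquet crate APIs.

```rust
/// Derives repetition levels for a single-level list column from
/// Arrow-style offsets and a per-row validity flag.
/// Hypothetical helper for illustration only.
fn list_rep_levels(offsets: &[i32], valid: &[bool]) -> Vec<i16> {
    let mut levels = Vec::new();
    for (row, window) in offsets.windows(2).enumerate() {
        // A null row contributes no elements, whatever its offsets say.
        let len = if valid[row] {
            (window[1] - window[0]) as usize
        } else {
            0
        };
        // Every row, including null and empty lists, starts with level 0.
        levels.push(0);
        // Each further element continues the same row at level 1.
        levels.extend(std::iter::repeat(1_i16).take(len.saturating_sub(1)));
    }
    levels
}

/// Splits the levels into chunks of `batch_size` levels (must be non-zero)
/// and returns the index of the first chunk that starts mid-record.
fn first_misaligned_batch(levels: &[i16], batch_size: usize) -> Option<usize> {
    levels.chunks(batch_size).position(|chunk| chunk[0] != 0)
}
```

For the offsets `[0, 0, 1, 3, 3, 4, 5, 8]` with rows 3 and 5 null, this yields the levels `0, 0, 0, 1, 0, 0, 0, 0, 1, 1`. Cutting after every 3 levels ends the first chunk between the elements 2 and 3 of the row `[2, 3]`, so the second chunk begins with a repetition level of 1, which is the condition the error message describes.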

Expected behavior

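Reading these nested types should succeed: the reader should return the written rows in batches of the requested size instead of returning an error.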

Additional context

#1661 tracks removing this ArrayReader, as it is buggy, complex, and no longer really needed.

@tustvold tustvold added the bug label Aug 1, 2022
@alamb alamb changed the title ComplexObjectArrayReader Handles Repetition Levels Incorrectly Reading Structs / Lists may be truncated sometimes: ComplexObjectArrayReader Handles Repetition Levels Incorrectly Aug 1, 2022
@alamb alamb added the parquet Changes to the parquet crate label Aug 1, 2022
@tustvold tustvold changed the title Reading Structs / Lists may be truncated sometimes: ComplexObjectArrayReader Handles Repetition Levels Incorrectly Error Reading Decimal Lists: ComplexObjectArrayReader Handles Repetition Levels Incorrectly Aug 1, 2022