Unable to write non-null Arrow structs to Parquet #244

Closed
nevi-me opened this issue May 2, 2021 · 0 comments · Fixed by #246
Labels
bug parquet Changes to the parquet crate

Comments

@nevi-me
Contributor

nevi-me commented May 2, 2021

Describe the bug

Unable to correctly write nested structs where a struct is non-nullable.
I've noticed this behaviour before, but couldn't quite reproduce it easily.

To Reproduce

If we have the below test case (in parquet/src/arrow/arrow_writer.rs):

#[test]
fn arrow_writer_complex_mixed() {
    // define schema
    let offset_field = Field::new("offset", DataType::Int32, true);
    let partition_field = Field::new("partition", DataType::Int64, true);
    let topic_field = Field::new("topic", DataType::Utf8, true);
    let schema = Schema::new(vec![
        Field::new("some_nested_object", DataType::Struct(
            vec![
                offset_field.clone(),
                partition_field.clone(),
                topic_field.clone()
            ]
        ), false), // NOTE: this being false results in the array not being written correctly
    ]);

    // create some data
    let offset = Int32Array::from(vec![1, 2, 3, 4, 5]);
    let partition = Int64Array::from(vec![Some(1), None, None, Some(4), Some(5)]);
    let topic = StringArray::from(vec![Some("A"), None, Some("A"), Some(""), None]);

    let some_nested_object = StructArray::from(vec![
        (offset_field, Arc::new(offset) as ArrayRef),
        (partition_field, Arc::new(partition) as ArrayRef),
        (topic_field, Arc::new(topic) as ArrayRef),
    ]);

    // build a record batch
    let batch = RecordBatch::try_new(
        Arc::new(schema),
        vec![Arc::new(some_nested_object)],
    )
    .unwrap();

    roundtrip("test_arrow_writer_complex_mixed.parquet", batch);
}

We get a failure:

thread 'arrow::arrow_writer::tests::arrow_writer_complex_mixed' panicked at 'assertion failed: `(left == right)`
  left: `1`,
 right: `0`', parquet/src/util/bit_util.rs:332:9
test arrow::arrow_writer::tests::arrow_writer_complex_mixed ... FAILED

When the struct is nullable, the file is written correctly.

Expected behavior

The batch should be written without errors.

Additional context

From inspecting the levels that are generated for the passing and failing scenarios, they look identical (https://www.diffchecker.com/89qWByeI). It looks like the bug is with how levels of non-null structs are generated.
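To illustrate why identical levels for the two scenarios look suspicious, here is a minimal standalone sketch of how I'd expect definition levels to differ. It is not the parquet crate's code: `max_definition_level` and `leaf_definition_levels` are hypothetical helpers, and it assumes the usual Parquet rule that each nullable field on the path to a leaf adds one to the maximum definition level, with every struct slot itself valid (as in the test above).

```rust
/// Maximum definition level of a leaf column: one level per nullable
/// (optional) field on the path from the root to the leaf.
/// Hypothetical helper, not part of the parquet crate.
fn max_definition_level(path_nullability: &[bool]) -> i16 {
    path_nullability.iter().filter(|&&nullable| nullable).count() as i16
}

/// Definition levels for a nullable leaf nested in a struct whose slots are
/// all valid. A present value is defined at the maximum level; a null leaf
/// value is defined one level short of it.
fn leaf_definition_levels(struct_nullable: bool, leaf: &[Option<i64>]) -> Vec<i16> {
    // Path is [struct, leaf]; the leaf is nullable in the test case above.
    let max = max_definition_level(&[struct_nullable, true]);
    leaf.iter()
        .map(|value| if value.is_some() { max } else { max - 1 })
        .collect()
}

fn main() {
    // Same values as the `partition` column in the test case.
    let partition: [Option<i64>; 5] = [Some(1), None, None, Some(4), Some(5)];

    // Nullable struct: max definition level is 2.
    println!("nullable struct: {:?}", leaf_definition_levels(true, &partition));
    // prints "nullable struct: [2, 1, 1, 2, 2]"

    // Non-nullable struct: max definition level is 1.
    println!("required struct: {:?}", leaf_definition_levels(false, &partition));
    // prints "required struct: [1, 0, 0, 1, 1]"
}
```

Under these assumptions, the levels for the non-nullable struct should be one lower than for the nullable one, so seeing the same levels in both cases would be consistent with the writer not accounting for the parent's nullability.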
