Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ipc schema custom_metadata serialization #3282

Merged
merged 3 commits into from
Dec 8, 2022

Conversation

Jefffrey
Copy link
Contributor

@Jefffrey Jefffrey commented Dec 6, 2022

Which issue does this PR close?

Closes #250.

Rationale for this change

Thought meant to address #250, it is outdated in that the error doesn't seem to be due to empty children array anymore, but persists with empty custom_metadata (at schema level, not field level).

Have updated the bytes generated from pyarrow as those were from a very old version, and also made adjustments to how bytes are generated from Rust too, to make it more dynamic.

What changes are included in this PR?

Fix test

Fix ipc serialization of schema to fb for empty custom_metadata field to ensure it serializes as a None/null instead of an empty array (since pyarrow appears to do the same)

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 6, 2022
@Jefffrey
Copy link
Contributor Author

Jefffrey commented Dec 6, 2022

I'm kinda hesitant on the test changes honestly, since having the pyarrow bytes statically hardcoded doesn't seem ideal

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this

let ipc2 = crate::root_as_message(&bytes).unwrap();
let schema2 = ipc2.header_as_schema().unwrap();

// can't compare schema directly as though is same message, the underlying bytes seem to differ
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is likely because flatbuffers aren't a canonical representation, that is there are multiple legal ways to represent the same data

// the schema is: Field("field1", DataType::UInt32, false)
// Bytes of a schema generated via following python code, using pyarrow 10.0.1:
//
// import pyarrow as pa
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -63,7 +63,9 @@ pub fn schema_to_fb_offset<'a>(

let mut builder = crate::SchemaBuilder::new(fbb);
builder.add_fields(fb_field_list);
builder.add_custom_metadata(fb_metadata_list);
if !custom_metadata.is_empty() {
builder.add_custom_metadata(fb_metadata_list);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW ideally we would not create the fb_metadata_list if not needed, as effectively this encodes an empty list, but then doesn't reference it from a flatbuffer Table. In practice this just means the generated flatbuffer has an orphaned 4 bytes of "padding"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll change it to do that

@tustvold tustvold merged commit 7b71713 into apache:master Dec 8, 2022
@Jefffrey Jefffrey deleted the 250_pyarrow_ipc_compare branch December 8, 2022 20:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Arrow schema from bytes doesn't handle empty children arrays consistently with python
2 participants