Skip to content

Conversation

@phil-opp
Copy link
Contributor

@phil-opp phil-opp commented Oct 29, 2025

Which issue does this PR close?

Rationale for this change

See linked issue.

What changes are included in this PR?

Add JSON decoders for binary array variants that act as counterparts to #5622. This way, it becomes possible to do a full round-trip encoding/decoding of binary array.

Are these changes tested?

I added a roundtrip test based on the test_writer_binary. It verifies that encoding and then decoding leads to the original input again. It covers Binary, LargeBinary, FixedSizeBinary, and BinaryView arrays, all with and without explicit nulls.

Are there any user-facing changes?

Yes, encoding and decoding binary arrays to/from JSON is now fully supported, given the right schema.

One limitation is that schema inference is not able to detect binary arrays as they look like normal JSON strings after encoding. However, this is already true when encoding other Arrow types, for example it's not possible to differentiate integer bit widths.

I updated the docs accordingly.

The `writer::encoder::BinaryEncoder` encodes binary arrays as hex-encoded JSON strings. This commit adds support for decoding these strings again.
@github-actions github-actions bot added the arrow Changes to the arrow crate label Oct 29, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @phil-opp -- this is a very nice contribution

This PR was a pleasure to review as it is well documented, and tested and coded

I left a few suggestions to potentially improve the code, but I don't think any of them are required.

cc @hiltontj (who added #5622)

#[test]
fn test_json_roundtrip_binary() {
let values: [Option<&[u8]>; 3] = [
Some(b"Ned Flanders" as &[u8]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please also add a value that is not a valid utf8 value?

Perhaps something like:

/// Invalid Utf-8 sequence in the first character
/// <https://stackoverflow.com/questions/1301402/example-invalid-utf8-string>
const INVALID_UTF8_FIRST_CHAR: &[u8] = &[0xa0, 0xa1, 0x20, 0x20];

Comment on lines +316 to +331
fn build_array_binary<O: OffsetSizeTrait>(values: &[Option<&[u8]>]) -> RecordBatch {
let schema = SchemaRef::new(Schema::new(vec![Field::new(
"bytes",
GenericBinaryType::<O>::DATA_TYPE,
true,
)]));
let mut builder = GenericByteBuilder::<GenericBinaryType<O>>::new();
for value in values {
match value {
Some(v) => builder.append_value(v),
None => builder.append_null(),
}
}
let array = Arc::new(builder.finish()) as ArrayRef;
RecordBatch::try_new(schema, vec![array]).unwrap()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can build arrays much more succinctly using this:

Suggested change
fn build_array_binary<O: OffsetSizeTrait>(values: &[Option<&[u8]>]) -> RecordBatch {
let schema = SchemaRef::new(Schema::new(vec![Field::new(
"bytes",
GenericBinaryType::<O>::DATA_TYPE,
true,
)]));
let mut builder = GenericByteBuilder::<GenericBinaryType<O>>::new();
for value in values {
match value {
Some(v) => builder.append_value(v),
None => builder.append_null(),
}
}
let array = Arc::new(builder.finish()) as ArrayRef;
RecordBatch::try_new(schema, vec![array]).unwrap()
}
fn build_array_binary<O: OffsetSizeTrait>(values: &[Option<&[u8]>]) -> RecordBatch {
let array = GenericBinaryArray::<O>::from_iter(values);
RecordBatch::try_from_iter(vec![(
"bytes",
Arc::new(array) as ArrayRef,
)]).unwrap()
}

You could probably avoid the duplication of RecordBatch creation as well by changing assert_binary_json to take an ArrayRef and creating the batch within it

The same comment / style applies to the functions for the other array types below too

fn assert_binary_json(batch: &RecordBatch) {
// encode and check JSON with explicit nulls:
{
let mut buf = Vec::new();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think this would be eaiser to understand what was happening if you extracted out the round trip conversion test logic into a function that took a WriterBuilder and then setup the writer builders twice

{
let mut buf = Vec::new();
let json_value: Value = {
// explicit nulls are off by default, so we don't need
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest setting it to false explicitly to make it clear from just the tests that both paths are covered

fn decode_hex_string(hex_string: &str) -> Result<Vec<u8>, ArrowError> {
let mut decoded = Vec::with_capacity(hex_string.len() / 2);
for substr in hex_string.as_bytes().chunks(2) {
let str = std::str::from_utf8(substr).map_err(|e| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pretty sure this code could be made much faster with a custom lookup table rather than using u8::from_str_radix etc

That being said, that would be a nice thing to improve in a future PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

arrow-json supports encoding binary arrays, but not decoding

2 participants