Implement hex decoding of JSON strings to binary arrays #8737
Conversation
The `writer::encoder::BinaryEncoder` encodes binary arrays as hex-encoded JSON strings. This commit adds support for decoding those strings back into binary arrays.
```rust
#[test]
fn test_json_roundtrip_binary() {
    let values: [Option<&[u8]>; 3] = [
        Some(b"Ned Flanders" as &[u8]),
```
Can you please also add a value that is not valid UTF-8? Perhaps something like:
arrow-rs/arrow-ipc/src/reader.rs (lines 3081 to 3083 in b8ae8e0):

```rust
/// Invalid Utf-8 sequence in the first character
/// <https://stackoverflow.com/questions/1301402/example-invalid-utf8-string>
const INVALID_UTF8_FIRST_CHAR: &[u8] = &[0xa0, 0xa1, 0x20, 0x20];
```
```rust
fn build_array_binary<O: OffsetSizeTrait>(values: &[Option<&[u8]>]) -> RecordBatch {
    let schema = SchemaRef::new(Schema::new(vec![Field::new(
        "bytes",
        GenericBinaryType::<O>::DATA_TYPE,
        true,
    )]));
    let mut builder = GenericByteBuilder::<GenericBinaryType<O>>::new();
    for value in values {
        match value {
            Some(v) => builder.append_value(v),
            None => builder.append_null(),
        }
    }
    let array = Arc::new(builder.finish()) as ArrayRef;
    RecordBatch::try_new(schema, vec![array]).unwrap()
}
```
I think you can build arrays much more succinctly using this:
Suggested change (replacing the function above):

```rust
fn build_array_binary<O: OffsetSizeTrait>(values: &[Option<&[u8]>]) -> RecordBatch {
    let array = GenericBinaryArray::<O>::from_iter(values);
    RecordBatch::try_from_iter(vec![(
        "bytes",
        Arc::new(array) as ArrayRef,
    )]).unwrap()
}
```
You could probably avoid the duplication of `RecordBatch` creation as well by changing `assert_binary_json` to take an `ArrayRef` and creating the batch within it.

The same comment / style applies to the functions for the other array types below too.
```rust
fn assert_binary_json(batch: &RecordBatch) {
    // encode and check JSON with explicit nulls:
    {
        let mut buf = Vec::new();
```
Nit: I think it would be easier to understand what is happening if you extracted the round-trip conversion test logic into a function that takes a `WriterBuilder`, and then set up the writer builder twice.
```rust
    {
        let mut buf = Vec::new();
        let json_value: Value = {
            // explicit nulls are off by default, so we don't need
```
I suggest setting it to `false` explicitly, to make it clear from the tests alone that both paths are covered.
```rust
fn decode_hex_string(hex_string: &str) -> Result<Vec<u8>, ArrowError> {
    let mut decoded = Vec::with_capacity(hex_string.len() / 2);
    for substr in hex_string.as_bytes().chunks(2) {
        let str = std::str::from_utf8(substr).map_err(|e| {
```
I am pretty sure this code could be made much faster with a custom lookup table rather than using `u8::from_str_radix` etc.

That being said, that would be a nice thing to improve in a future PR.
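A lookup-table variant could look roughly like this. This is a self-contained sketch, not the PR's code: `HEX_LUT` and `decode_hex_lut` are hypothetical names, and it returns a plain `String` error instead of `ArrowError` to avoid the arrow dependency.

```rust
// Hypothetical lookup table: maps each ASCII byte to its nibble value,
// with 0xff marking bytes that are not hex digits. Built at compile time.
const HEX_LUT: [u8; 256] = {
    let mut lut = [0xffu8; 256];
    let mut i = 0u8;
    while i < 10 {
        lut[(b'0' + i) as usize] = i;
        i += 1;
    }
    let mut i = 0u8;
    while i < 6 {
        lut[(b'a' + i) as usize] = 10 + i;
        lut[(b'A' + i) as usize] = 10 + i;
        i += 1;
    }
    lut
};

// Decode a hex string to bytes using the table, avoiding the per-pair
// UTF-8 validation and `u8::from_str_radix` parsing of the original.
fn decode_hex_lut(hex: &str) -> Result<Vec<u8>, String> {
    let bytes = hex.as_bytes();
    if bytes.len() % 2 != 0 {
        return Err("hex string has an odd number of digits".to_string());
    }
    let mut decoded = Vec::with_capacity(bytes.len() / 2);
    for pair in bytes.chunks_exact(2) {
        let hi = HEX_LUT[pair[0] as usize];
        let lo = HEX_LUT[pair[1] as usize];
        if hi == 0xff || lo == 0xff {
            return Err(format!("invalid hex digits {:?}", pair));
        }
        decoded.push((hi << 4) | lo);
    }
    Ok(decoded)
}
```

Each byte pair becomes two array reads, one branch, and a shift-or, which is the kind of win the comment is hinting at.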
Which issue does this PR close?

`arrow-json` supports encoding binary arrays, but not decoding #8736

Rationale for this change
See linked issue.
What changes are included in this PR?

Add JSON decoders for the binary array variants as counterparts to the encoders from #5622. This makes a full round trip of encoding and decoding binary arrays possible.
Are these changes tested?

I added a roundtrip test based on `test_writer_binary`. It verifies that encoding and then decoding yields the original input again. It covers `Binary`, `LargeBinary`, `FixedSizeBinary`, and `BinaryView` arrays, all with and without explicit nulls.

Are there any user-facing changes?
Yes, encoding and decoding binary arrays to/from JSON is now fully supported, given the right schema.
One limitation is that schema inference cannot detect binary arrays, since they look like normal JSON strings after encoding. However, this already applies when encoding other Arrow types; for example, it is not possible to differentiate integer bit widths.
I updated the docs accordingly.