Skip to content

Conversation

@hiltontj
Copy link
Contributor

Which issue does this PR close?

Closes #5620.

Rationale for this change

See #5620.

What changes are included in this PR?

Adds encoding support to the JSON writer for the FixedSizeBinary data type.

A simple test was added as well. I can add more extensive testing if needed.

Are there any user-facing changes?

Potentially some documentation to state that FixedSizeBinary is encoded as hex strings.

Adds encoding support to the JSON writer for the FixedSizeBinary DataType

A test was added as well
@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 10, 2024
bytes = "1.4"
criterion = { version = "0.5", default-features = false }
rand = { version = "0.8", default-features = false, features = ["std", "std_rng"] }
hex = "0.4.3"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it is not required. The reason I added it was so that in the test I added, it was apparent that the bytes in the column resulted in the correct HEX string. I will see if I can write the test in a way that gets by without it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the test so that it no longer relies on the hex crate, and removed the dep. I think it would be nice to have a test that fully round-trips data, first by decoding from HEX to the FixedSizeBinary, then encoding back. But, I am not sure if we want to support this behaviour on the JSON reader, and this issue is only about the write side.

@hiltontj
Copy link
Contributor Author

I realize now, that the test I wrote is complete bogus (not valid JSON). So I am reworking that.

The fixed size binary values were not being encoded with surrounding
double quotes. This fixes that, and updates the added test to actually
parse the written JSON as JSON, using serde_json, and make assertions
against that.
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hiltontj ! This is looking close -- I am not sure it handles nulls properly yet, but I think that should be straightforward to add


impl Encoder for FixedSizeBinaryEncoder {
fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
let v = self.0.value(idx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to check is_valid before serializing (I think you'll see the need when testing for null)

Have the FixedSizeBinaryEncoder for the JSON writer handle explicit null
values, based on the Writer's EncoderOptions.
@hiltontj
Copy link
Contributor Author

@alamb - I think I addressed your comments. With 35371d3, explicit nulls were actually being handled correctly in the test, but I believe this was because of the wrapping list/map encoders. So, I extended the explicit null support down to the FixedSizeBinaryEncoder in 9768a43 directly, to be sure.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me -- thank you @hiltontj for adding the new tests

@alamb
Copy link
Contributor

alamb commented Apr 10, 2024

I think this is good to go -- any concerns @tustvold ?

Changed the FixedSizeBinaryEncoder for the JSON writer to use a borrow
of the FixedSizeBinaryArray being encoded, to follow other Encoder
implementations, and to remove the use of clone.
@hiltontj
Copy link
Contributor Author

@alamb - I changed the FixedSizeBinaryEncoder to use a borrow of the FixedSizeBinaryArray, instead of cloning it.

BooleanEncoder and StringEncoder were changed to use borrows of their
respective Array types, to avoid cloning.
@hiltontj
Copy link
Contributor Author

I also noticed that the StringEncoder and BooleanEncoder were needlessly cloning, so I updated them to use borrows, and removed the .clone()s. Hope you don't mind.

@hiltontj hiltontj requested a review from alamb April 12, 2024 00:51

impl<'a> Encoder for FixedSizeBinaryEncoder<'a> {
fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
if self.array.is_valid(idx) {
Copy link
Contributor

@tustvold tustvold Apr 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC nullability is handled by the parent and so this shouldn't be necessary for a non nested type

Copy link
Contributor

@alamb alamb Apr 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked by making the following change locally and the tests all still pass. Thus I agree the valid check is not needed

diff --git a/arrow-json/src/writer/encoder.rs b/arrow-json/src/writer/encoder.rs
index 27fa5747288..470d2fadd06 100644
--- a/arrow-json/src/writer/encoder.rs
+++ b/arrow-json/src/writer/encoder.rs
@@ -101,7 +101,7 @@ fn make_encoder_impl<'a>(

         DataType::FixedSizeBinary(_) => {
             let array = array.as_any().downcast_ref::<FixedSizeBinaryArray>().unwrap();
-            (Box::new(FixedSizeBinaryEncoder::new(array, options)) as _, array.nulls().cloned())
+            (Box::new(FixedSizeBinaryEncoder::new(array)) as _, array.nulls().cloned())
         }

         DataType::Struct(fields) => {
@@ -451,14 +451,12 @@ impl<'a> Encoder for MapEncoder<'a> {

 struct FixedSizeBinaryEncoder<'a> {
     array: &'a FixedSizeBinaryArray,
-    explicit_nulls: bool,
 }

 impl<'a> FixedSizeBinaryEncoder<'a> {
-    fn new(array: &'a FixedSizeBinaryArray, options: &EncoderOptions) -> Self {
+    fn new(array: &'a FixedSizeBinaryArray) -> Self {
         Self {
             array,
-            explicit_nulls: options.explicit_nulls,
         }
     }
 }
@@ -473,8 +471,6 @@ impl<'a> Encoder for FixedSizeBinaryEncoder<'a> {
                 write!(out, "{byte:02x}").unwrap();
             }
             out.push(b'"');
-        } else if self.explicit_nulls {
-            out.extend_from_slice(b"null");
         }
     }
 }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, as per #5622 (comment), when I first added the test, it was passing without adding the explicit_nulls option to the FixedSizeBinaryEncoder, but I added it anyways. I am totally fine with taking it out.

@tustvold
Copy link
Contributor

were needlessly cloning

A couple of reference counts are irrelevant compared to encoding overheads, but I suppose it can't hurt to avoid them

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hiltontj -- I think this PR could be merged as is.

@tustvold has a good suggestion for simplificiation https://github.com/apache/arrow-rs/pull/5622/files#r1562191540 but I also think we could do that as a follow on PR if you prefer

Just let me know what you would prefer


impl<'a> Encoder for FixedSizeBinaryEncoder<'a> {
fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
if self.array.is_valid(idx) {
Copy link
Contributor

@alamb alamb Apr 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked by making the following change locally and the tests all still pass. Thus I agree the valid check is not needed

diff --git a/arrow-json/src/writer/encoder.rs b/arrow-json/src/writer/encoder.rs
index 27fa5747288..470d2fadd06 100644
--- a/arrow-json/src/writer/encoder.rs
+++ b/arrow-json/src/writer/encoder.rs
@@ -101,7 +101,7 @@ fn make_encoder_impl<'a>(

         DataType::FixedSizeBinary(_) => {
             let array = array.as_any().downcast_ref::<FixedSizeBinaryArray>().unwrap();
-            (Box::new(FixedSizeBinaryEncoder::new(array, options)) as _, array.nulls().cloned())
+            (Box::new(FixedSizeBinaryEncoder::new(array)) as _, array.nulls().cloned())
         }

         DataType::Struct(fields) => {
@@ -451,14 +451,12 @@ impl<'a> Encoder for MapEncoder<'a> {

 struct FixedSizeBinaryEncoder<'a> {
     array: &'a FixedSizeBinaryArray,
-    explicit_nulls: bool,
 }

 impl<'a> FixedSizeBinaryEncoder<'a> {
-    fn new(array: &'a FixedSizeBinaryArray, options: &EncoderOptions) -> Self {
+    fn new(array: &'a FixedSizeBinaryArray) -> Self {
         Self {
             array,
-            explicit_nulls: options.explicit_nulls,
         }
     }
 }
@@ -473,8 +471,6 @@ impl<'a> Encoder for FixedSizeBinaryEncoder<'a> {
                 write!(out, "{byte:02x}").unwrap();
             }
             out.push(b'"');
-        } else if self.explicit_nulls {
-            out.extend_from_slice(b"null");
         }
     }
 }

@hiltontj
Copy link
Contributor Author

@alamb - I will update as suggested on this PR and re-request review, no problem!

The FixedSizeBinaryEncoder does not need to handle nulls, as that will
be handled by a parent encoder, i.e., list/map.
@hiltontj hiltontj requested a review from alamb April 12, 2024 12:46
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hiltontj for sticking with this. It looks very nice at the end 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Serialize FixedSizeBinary as HEX with JSON Writer

3 participants