Raw JSON writer (~10x faster) (#5314) #5318

Merged: 4 commits merged on Jan 24, 2024

@tustvold tustvold commented Jan 20, 2024

Which issue does this PR close?

Closes #5314

Rationale for this change

bench_primitive         time:   [2.9493 ms 2.9504 ms 2.9515 ms]
                        change: [-86.148% -86.054% -85.962%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

bench_mixed             time:   [6.0997 ms 6.1016 ms 6.1038 ms]
                        change: [-86.966% -86.905% -86.856%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  1 (1.00%) high severe

bench_struct            time:   [7.4146 ms 7.4169 ms 7.4193 ms]
                        change: [-89.586% -89.536% -89.485%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

bench_nullable_struct   time:   [2.3035 ms 2.3052 ms 2.3070 ms]
                        change: [-91.153% -91.131% -91.109%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

bench_list              time:   [2.4907 ms 2.4960 ms 2.5011 ms]
                        change: [-91.988% -91.965% -91.942%] (p = 0.00 < 0.05)
                        Performance has improved.

bench_nullable_list     time:   [982.97 µs 983.21 µs 983.46 µs]
                        change: [-85.818% -85.807% -85.795%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe

Benchmarking bench_struct_list: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.9s, enable flat sampling, or reduce sample count to 40.
bench_struct_list       time:   [1.9682 ms 1.9691 ms 1.9699 ms]
                        change: [-89.756% -89.726% -89.700%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
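
For reference, Criterion's `change` figure is a relative change in run time, so it can be turned into a speedup factor as 1 / (1 + change / 100). A minimal sketch of that arithmetic follows; the `speedup` helper is hypothetical and not part of the benchmark suite.

```rust
/// Converts a Criterion-style relative change in time (in percent, negative
/// when the new code is faster) into a speedup factor.
///
/// Hypothetical helper for illustration only.
fn speedup(change_percent: f64) -> f64 {
    // new_time = old_time * (1 + change / 100), so old / new is the inverse
    1.0 / (1.0 + change_percent / 100.0)
}
```

Applied to the median figures above, -86.054% (bench_primitive) works out to roughly 7.2x and -91.965% (bench_list) to roughly 12.4x, which brackets the ~10x in the title.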

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 20, 2024
@@ -20,28 +20,6 @@
//! This JSON writer converts Arrow [`RecordBatch`]es into arrays of
//! JSON objects or JSON formatted byte streams.
//!
//! ## Writing JSON Objects
Contributor Author

This functionality isn't removed, yet, but it is deprecated as I can't think of any reasonable use-cases for this. If you're wanting to embed arrow data in another JSON document, serde_json's raw value mechanism is an objectively better way to go about doing this.

Contributor

but it is deprecated as I can't think of any reasonable use-cases for this.

Looks like @houqp added it in d868cff many 🌔 's ago - perhaps he has some additional context.

I agree I can't really think of why this would be useful - it seems like it may be similar to wanting to convert RecordBatches into actual Rust structs via serde but I can't remember how far we got with that

Given I am not familiar with serde_json's raw value mechanism I suspect others may not be either

Perhaps you can add a note here about writing JSON objects using serde and leave a link for readers to follow

Hey @tustvold, could you clarify which serde_json raw value mechanism you're thinking of?

Contributor

Yeah, this isn't clear to me either as I mentioned in the original review -- I made a PR to add an example showing how to use this: #5364

@@ -1564,9 +1575,9 @@ mod tests {
r#"{"a":{"list":[1,2]},"b":{"list":[1,2]}}
{"a":{"list":[null]},"b":{"list":[null]}}
{"a":{"list":[]},"b":{"list":[]}}
-{"a":null,"b":{"list":[3,null]}}
+{"b":{"list":[3,null]}}
Contributor Author

The prior behaviour feels like a bug to me, without explicit nulls set I would expect consistent use of implicit nulls. The fact that null objects happen to be treated differently to null primitives seems at best confusing.

Contributor Author

@Jefffrey I remember you working on something related in #5133 and wonder if you have any thoughts about this

Contributor

This does seem like it was a bug previously, I'm just racking my brain to remember if I was aware of this before or not, if there was a reason for this 🤔

Contributor

I think when I worked on #5133 I just forgot to consider my previous work for writing explicit nulls in #5065.

This fix makes sense; I believe the only case where we should write nulls when explicit_nulls is set to false (i.e. the default) is for list values, and nothing else. This falls in line with that 👍
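
A minimal sketch of the rule described in this thread, assuming a toy row representation rather than Arrow arrays (the `write_row` and `write_list` helpers are hypothetical and are not the writer's actual code): null object fields are omitted unless `explicit_nulls` is set, whereas null list elements are always written so that positions are preserved.

```rust
/// Writes one row as a JSON object. With `explicit_nulls == false`, keys
/// whose value is null are skipped entirely (implicit nulls).
fn write_row(fields: &[(&str, Option<i64>)], explicit_nulls: bool) -> String {
    let mut out = String::from("{");
    let mut first = true;
    for (name, value) in fields {
        if value.is_none() && !explicit_nulls {
            continue; // implicit null: omit the key
        }
        if !first {
            out.push(',');
        }
        first = false;
        // Assumes `name` needs no JSON escaping (illustration only)
        out.push('"');
        out.push_str(name);
        out.push_str("\":");
        match value {
            Some(v) => out.push_str(&v.to_string()),
            None => out.push_str("null"),
        }
    }
    out.push('}');
    out
}

/// Writes list values as a JSON array. Nulls are always written, regardless
/// of any explicit-nulls setting, because dropping them would shift elements.
fn write_list(values: &[Option<i64>]) -> String {
    let items: Vec<String> = values
        .iter()
        .map(|v| match v {
            Some(v) => v.to_string(),
            None => "null".to_string(),
        })
        .collect();
    format!("[{}]", items.join(","))
}
```

For example, `write_row(&[("a", None), ("b", Some(3))], false)` yields `{"b":3}`, while passing `true` yields `{"a":null,"b":3}`.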

@@ -1,5 +1,5 @@
{"a":1, "b":2.0, "c":false, "d":"4", "e":"1970-1-2", "f": "1.02", "g": "2012-04-23T18:25:43.511", "h": 1.1}
{"a":-10, "b":-3.5, "c":true, "d":"4", "e": "1969-12-31", "f": "-0.3", "g": "2016-04-23T18:25:43.511", "h": 3.141}
{"a":1, "b":2.0, "c":false, "d":"4", "e":"1970-1-2", "f": "1.02", "g": "2012-04-23T18:25:43.511", "h": 1.2802734375}
Contributor Author

@tustvold tustvold Jan 23, 2024

The previous writer had some questionable logic to truncate the precision of its output. We no longer do this, so the test data needs a float that can be exactly represented in an f16 in order for it to roundtrip precisely.
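
A rough way to check whether a value such as 1.2802734375 survives an f16 roundtrip, sketched on the bits of an `f64` so that it needs no f16 type. The `fits_f16_exactly` helper is hypothetical, and it deliberately ignores f16 subnormals: it only accepts values in the normal f16 exponent range whose mantissa fits in f16's 10 explicit bits.

```rust
/// Returns true if `x` is exactly representable as a *normal* IEEE 754
/// half-precision (f16) value. Subnormal f16 values are not considered.
fn fits_f16_exactly(x: f64) -> bool {
    if x == 0.0 {
        return true;
    }
    if !x.is_finite() {
        return false;
    }
    let bits = x.to_bits();
    // Unbiased f64 exponent (f64 subnormals end up far below the f16 range)
    let exponent = ((bits >> 52) & 0x7ff) as i32 - 1023;
    // f64 stores 52 mantissa bits and f16 stores 10, so the low 42 must be zero
    let low_bits = bits & ((1u64 << 42) - 1);
    (-14..=15).contains(&exponent) && low_bits == 0
}
```

Here 1.2802734375 is 1311/1024, so it passes, whereas 1.1 has no finite binary expansion and does not.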

@tustvold tustvold force-pushed the raw-json-writer branch 2 times, most recently from 259163b to 49a0357 Compare January 23, 2024 19:36
@tustvold tustvold added the api-change Changes to the arrow API label Jan 23, 2024
@tustvold
Contributor Author

I'm going to label this as an API change, as whilst it technically isn't a breaking change, there is a high risk of there being subtle behaviour changes, especially around the encoding of nulls

@tustvold tustvold marked this pull request as ready for review January 23, 2024 19:44
@tustvold
Contributor Author

I will re-run the benchmarks tomorrow

@@ -703,7 +682,7 @@ where
format: F,

/// Whether keys with null values should be written or skipped
Contributor

Suggested change
-/// Whether keys with null values should be written or skipped
+/// Controls how JSON should be encoded, e.g. whether to write explicit nulls or skip them

fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
out.push(b'"');
// Should be infallible
// Note: We are making an assumption that the formatter does not produce characters that require escaping
Contributor

Could you expand on this a little? I'm not sure I follow 🤔

Contributor Author

Updated, basically if users can provide format specifications containing " we need to escape them when serializing to JSON

Contributor

I saw some comments to this effect elsewhere. I wonder if it is possible to add a test that would fail if the invariant was broken in the future. I suspect the answer is no given it is not possible to specify format specifiers now 🤔

Contributor Author

Yes, it isn't currently possible to hit this, I am just documenting it here for future readers who may not realise this detail
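
To illustrate what breaking that assumption would require, here is a minimal sketch of escaping while writing a JSON string into a byte buffer. The `encode_json_string` helper is hypothetical and is not the code in this PR; it only shows the extra work needed if formatted output could ever contain `"` or other characters that JSON requires to be escaped.

```rust
/// Writes `s` to `out` as a quoted JSON string, escaping the characters
/// that JSON does not allow to appear raw inside a string.
fn encode_json_string(s: &str, out: &mut Vec<u8>) {
    out.push(b'"');
    for c in s.chars() {
        match c {
            '"' => out.extend_from_slice(b"\\\""),
            '\\' => out.extend_from_slice(b"\\\\"),
            '\n' => out.extend_from_slice(b"\\n"),
            '\r' => out.extend_from_slice(b"\\r"),
            '\t' => out.extend_from_slice(b"\\t"),
            // Remaining control characters use the \uXXXX form
            c if (c as u32) < 0x20 => {
                out.extend_from_slice(format!("\\u{:04x}", c as u32).as_bytes())
            }
            // Everything else is copied through as UTF-8
            c => {
                let mut tmp = [0u8; 4];
                out.extend_from_slice(c.encode_utf8(&mut tmp).as_bytes());
            }
        }
    }
    out.push(b'"');
}
```

Skipping this pass is only safe while the formatter's output is known never to contain such characters, which is the invariant the code comment records.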

@tustvold
Contributor Author

Most recent numbers

bench_integer           time:   [6.0469 ms 6.0590 ms 6.0711 ms]
                        change: [-87.862% -87.823% -87.783%] (p = 0.00 < 0.05)
                        Performance has improved.

bench_float             time:   [6.6686 ms 6.6789 ms 6.6894 ms]
                        change: [-84.425% -84.385% -84.346%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

bench_dict_array        time:   [5.9732 ms 5.9888 ms 6.0038 ms]
                        change: [-90.356% -90.288% -90.219%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild

bench_mixed             time:   [12.924 ms 12.948 ms 12.972 ms]
                        change: [-88.190% -88.149% -88.104%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild

bench_string            time:   [8.0122 ms 8.0304 ms 8.0484 ms]
                        change: [-88.919% -88.868% -88.817%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

bench_struct            time:   [17.143 ms 17.171 ms 17.199 ms]
                        change: [-88.296% -88.222% -88.149%] (p = 0.00 < 0.05)
                        Performance has improved.

bench_nullable_struct   time:   [5.1811 ms 5.1919 ms 5.2030 ms]
                        change: [-91.645% -91.608% -91.574%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

bench_list              time:   [6.1378 ms 6.1479 ms 6.1583 ms]
                        change: [-89.880% -89.856% -89.831%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Benchmarking bench_nullable_list: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.9s, enable flat sampling, or reduce sample count to 50.
bench_nullable_list     time:   [1.7440 ms 1.7451 ms 1.7464 ms]
                        change: [-87.630% -87.580% -87.532%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

bench_struct_list       time:   [4.5331 ms 4.5946 ms 4.6583 ms]
                        change: [-88.224% -88.055% -87.899%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

Contributor

@alamb alamb left a comment

Looks really nice to me -- thank you @tustvold

I had some comment quibbles but nothing that is required from my perspective.

Basically I would summarize this PR as "converting to JSON and then writing Values to bytes is very slow"

@@ -481,6 +463,7 @@ fn set_column_for_json_rows(

/// Converts an arrow [`RecordBatch`] into a `Vec` of Serde JSON
/// [`JsonMap`]s (objects)
#[deprecated(note = "Use Writer")]
Contributor

I can't figure out if the deprecation is needed for the new json writer, or did you just include it in the same PR for convenience?

Contributor Author

I lumped the deprecation into this PR because moving the writer away from this functionality reduces our test coverage of it

float_encode!(f32, f64);

impl PrimitiveEncode for f16 {
type Buffer = <f64 as PrimitiveEncode>::Buffer;
Contributor

Why can't we just use PrimitiveEncode directly for f16? I doubt the performance of f16 encoding is particularly critical, but I am curious

Contributor Author

Because the formulation of PrimitiveEncoder expects fixed size buffers... Having peeked at f16's display impl, it converts to f32 in order to print and to parse, so I will update this to do likewise

// Workaround https://github.com/rust-lang/rust/issues/61415
fn init_buffer() -> Self::Buffer;

fn encode(self, buf: &mut Self::Buffer) -> &[u8];
Contributor

I think it would help to document what encode does here

Suggested change
-fn encode(self, buf: &mut Self::Buffer) -> &[u8];
+/// Encode the primitive value as bytes, returning a reference to that slice.
+/// `buf` is temporary space that may be used
+fn encode(self, buf: &mut Self::Buffer) -> &[u8];
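
A minimal sketch of how a trait with this shape can be implemented, assuming a hand-rolled `u32` formatter purely for illustration. The trait below is a simplified stand-in built from the two method signatures shown in the diff; the implementations in the PR itself may look different.

```rust
/// Simplified stand-in for the trait under review (illustration only).
trait PrimitiveEncode: Copy {
    /// Fixed-size scratch space, large enough for any value of this type
    type Buffer;

    fn init_buffer() -> Self::Buffer;

    /// Encode the value as bytes, returning a slice into `buf`
    fn encode(self, buf: &mut Self::Buffer) -> &[u8];
}

impl PrimitiveEncode for u32 {
    // u32::MAX (4294967295) has 10 decimal digits
    type Buffer = [u8; 10];

    fn init_buffer() -> Self::Buffer {
        [0; 10]
    }

    fn encode(self, buf: &mut Self::Buffer) -> &[u8] {
        let mut n = self;
        let mut pos = buf.len();
        // Fill digits from the end of the buffer towards the front
        loop {
            pos -= 1;
            buf[pos] = b'0' + (n % 10) as u8;
            n /= 10;
            if n == 0 {
                break;
            }
        }
        &buf[pos..]
    }
}
```

The caller creates the buffer once with `init_buffer()` and reuses it for every value, so encoding a column performs no per-value allocation; the returned slice is then appended to the output bytes.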

options: &EncoderOptions,
) -> Result<Box<dyn Encoder + 'a>, ArrowError> {
let (encoder, nulls) = make_encoder_impl(array, options)?;
assert!(nulls.is_none(), "root cannot be nullable");
Contributor

I don't understand this -- isn't it possible to try to encode a BooleanArray as the root with null values?

Contributor Author

The root is called with a StructArray derived from a RecordBatch, and therefore cannot be nullable

pub explicit_nulls: bool,
}

pub trait Encoder {
Contributor

Could you please document the expectations on nullability here? Specifically, it seems like this code assumes that this is invoked with idx for non-null entries, which was not clear to me on my first read of this code


impl Encoder for BooleanEncoder {
fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
match self.0.value(idx) {
Contributor

I was pretty confused at first trying to figure out why this doesn't check for null, but then I saw the null check is handled in the outer loop
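
The division of responsibility described here can be sketched as follows, assuming plain `Vec<bool>` and a `&[bool]` validity mask in place of Arrow's `BooleanArray` and null buffer (the `encode_list` function and this `BooleanEncoder` are simplified, hypothetical stand-ins): the encoder never looks at nulls, and the enclosing loop decides whether to call it at all.

```rust
trait Encoder {
    /// Encode the non-null value at `idx` to `out`.
    /// Callers are expected to have checked validity first.
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>);
}

/// Stand-in for an encoder over a boolean column
struct BooleanEncoder(Vec<bool>);

impl Encoder for BooleanEncoder {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
        // No null check here: that is the outer loop's job
        match self.0[idx] {
            true => out.extend_from_slice(b"true"),
            false => out.extend_from_slice(b"false"),
        }
    }
}

/// Outer loop: writes a JSON array, consulting `validity` before each call
fn encode_list(encoder: &mut dyn Encoder, validity: &[bool], out: &mut Vec<u8>) {
    out.push(b'[');
    for (idx, is_valid) in validity.iter().enumerate() {
        if idx != 0 {
            out.push(b',');
        }
        if *is_valid {
            encoder.encode(idx, out);
        } else {
            out.extend_from_slice(b"null");
        }
    }
    out.push(b']');
}
```

Keeping the check in one place means each leaf encoder stays branch-free with respect to nulls, at the cost of the contract being implicit unless it is documented on the trait.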

@tustvold tustvold merged commit 5146419 into apache:master Jan 24, 2024
22 checks passed
pub trait Encoder {
/// Encode the non-null value at index `idx` to `out`
///
/// The behaviour is unspecified if `idx` corresponds to a null index
Contributor

👍

Labels
api-change Changes to the arrow API arrow Changes to the arrow crate
Development

Successfully merging this pull request may close these issues.

Raw JSON Writer
4 participants