Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions arrow-json/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,8 @@ chrono = { workspace = true }
lexical-core = { version = "1.0", default-features = false}
memchr = "2.7.4"
simdutf8 = { workspace = true }
ryu = "1.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was concerned about adding these new dependencies, however, it seems serde_json already depends on ryu and ito so this is not a net-new dependency, it is just now explicit.

name = "serde_json"
version = "1.0.140"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "20068b6e96dc6c9bd23e01df8827e6c7e1f2fddd43c21810382803c136b99373"
dependencies = [
 "itoa",
 "memchr",
 "ryu",
 "serde",
]

itoa = "1.0"

[dev-dependencies]
flate2 = { version = "1", default-features = false, features = ["rust_backend"] }
Expand Down
14 changes: 10 additions & 4 deletions arrow-json/src/reader/string_array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,9 @@ use std::marker::PhantomData;
use crate::reader::tape::{Tape, TapeElement};
use crate::reader::ArrayDecoder;

use itoa;
use ryu;

const TRUE: &str = "true";
const FALSE: &str = "false";

Expand Down Expand Up @@ -85,6 +88,9 @@ impl<O: OffsetSizeTrait> ArrayDecoder for StringArrayDecoder<O> {

let mut builder = GenericStringBuilder::<O>::with_capacity(pos.len(), data_capacity);

let mut float_formatter = ryu::Buffer::new();
let mut int_formatter = itoa::Buffer::new();

for p in pos {
match tape.get(*p) {
TapeElement::String(idx) => {
Expand All @@ -103,20 +109,20 @@ impl<O: OffsetSizeTrait> ArrayDecoder for StringArrayDecoder<O> {
TapeElement::I64(high) if coerce_primitive => match tape.get(p + 1) {
TapeElement::I32(low) => {
let val = ((high as i64) << 32) | (low as u32) as i64;
builder.append_value(val.to_string());
builder.append_value(int_formatter.format(val));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this still copies the bytes twice -- once into the ryu buffer and then again into the StringBuilder's buffer . I suspect we could make it even faster by writing into the StringBuilder directly

Note StringBuffer implements Write so you can do stuff like
https://docs.rs/arrow/latest/arrow/array/type.GenericStringBuilder.html#example-incrementally-writing-strings-with-stdfmtwrite

Is there some way to get ryu to write directly to that buffer?

using ryu may still be faster

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so I tested simply adding write! instead of to_string:

write!(builder, "{n}").unwrap();
builder.append_value("");

vs

builder.append_value(n.to_string());

And the benches I wrote were 25% faster, so slower than itoa and ryu, but for some reason the previously written benches were all regressing by 15%:

bench_integer           time:   [6.6288 ms 6.6426 ms 6.6605 ms]
                        change: [+16.009% +16.295% +16.619%] (p = 0.00 < 0.05)
                        Performance has regressed. 

So at least that is a positive for using these libraries.

For ryu writing directly to the buffer, I was not able to find a way to do this since it seems like their internal buffer is coupled to the write implementation 😢

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW saving the string allocation also likely makes a non trivial difference

}
_ => unreachable!(),
},
TapeElement::I32(n) if coerce_primitive => {
builder.append_value(n.to_string());
builder.append_value(int_formatter.format(n));
}
TapeElement::F32(n) if coerce_primitive => {
builder.append_value(n.to_string());
builder.append_value(int_formatter.format(n));
}
TapeElement::F64(high) if coerce_primitive => match tape.get(p + 1) {
TapeElement::F32(low) => {
let val = f64::from_bits(((high as u64) << 32) | low as u64);
builder.append_value(val.to_string());
builder.append_value(float_formatter.format_finite(val));
}
_ => unreachable!(),
},
Expand Down
Loading