Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arrow_json: support decimal 128 and 256 types in json writer #5197

Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 18 additions & 8 deletions arrow-json/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -34,11 +34,17 @@ path = "src/lib.rs"
bench = false

[dependencies]
arrow-array = { workspace = true }
arrow-buffer = { workspace = true }
arrow-cast = { workspace = true }
arrow-data = { workspace = true }
arrow-schema = { workspace = true }
arrow-array = { version = "49" }
# arrow-buffer = { workspace = true }
# arrow-cast = { workspace = true }
# arrow-data = { workspace = true }
# arrow-schema = { workspace = true }

arrow-buffer = { version = "49" }
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

temporary change while testing in another system

arrow-cast = { version = "49" }
arrow-data = { version = "49" }
arrow-schema = { version = "49" }

half = { version = "2.1", default-features = false }
indexmap = { version = "2.0", default-features = false, features = ["std"] }
num = { version = "0.4", default-features = false, features = ["std"] }
Expand All @@ -49,15 +55,19 @@ lexical-core = { version = "0.8", default-features = false }

[dev-dependencies]
tempfile = "3.3"
flate2 = { version = "1", default-features = false, features = ["rust_backend"] }
flate2 = { version = "1", default-features = false, features = [
"rust_backend",
] }
serde = { version = "1.0", default-features = false, features = ["derive"] }
futures = "0.3"
tokio = { version = "1.27", default-features = false, features = ["io-util"] }
bytes = "1.4"
criterion = { version = "0.5", default-features = false }
rand = { version = "0.8", default-features = false, features = ["std", "std_rng"] }
rand = { version = "0.8", default-features = false, features = [
"std",
"std_rng",
] }

[[bench]]
name = "serde"
harness = false

60 changes: 58 additions & 2 deletions arrow-json/src/writer.rs
Original file line number Diff line number Diff line change
Expand Up @@ -469,11 +469,67 @@ fn set_column_for_json_rows(
row.insert(col_name.to_string(), serde_json::Value::Object(obj));
}
}
DataType::Decimal128(_precision, _scale) | DataType::Decimal256(_precision, _scale) => {
to_json_float(rows, array, col_name, explicit_nulls)?;
}
_ => {
return Err(ArrowError::JsonError(format!(
"data type {:?} not supported in nested map for json writer",
"data type {:?} not supported for json writer",
array.data_type()
)))
)));
}
}
Ok(())
}

fn to_json_float(
rows: &mut [Option<JsonMap<String, Value>>],
array: &ArrayRef,
col_name: &str,
explicit_nulls: bool,
) -> Result<(), ArrowError> {
let options = FormatOptions::default();
let formatter = ArrayFormatter::try_new(array.as_ref(), &options)?;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this formatter approach was lifted from the date-type writer logic above

Copy link
Contributor

@tustvold tustvold Dec 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will result in precision loss, decimals need to use the formatting logic in arrow-cast. I'm not entirely sure how to support this within the constraint of a serde_json value

Copy link
Author

@matthewgapp matthewgapp Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @tustvold, i'm confused - the ArrayFormatter is from arrow_cast and, from what I can tell, uses the format_decimal logic in this impl (from arrow_cast)

macro_rules! decimal_display {
.

Let me know what I'm missing! Thank you

Copy link
Author

@matthewgapp matthewgapp Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tustvold did you mean to reference this method in the arrow_array crate?

macro_rules! decimal_display {

Copy link
Contributor

@tustvold tustvold Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is you are then parsing to a f64 which will result in precision loss. As serde_json value lacks a decimal representation you may be forced to encode decimals as JSON strings to avoid this

Copy link
Contributor

@tustvold tustvold Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing non-native numeric quantities as string in JSON is a fairly standard approach to workaround the limited precision many readers have by default, serde_json included. Protobuf even serializes u64 to strings as part of its JSON specification. This is not just because the JSON specification only mandated double precision floating point, but because there are real performance disadvantages to doing otherwise.

I agree using a string is not a good thing, ideally we wouldn't be relying on serde_json to serialize, but aside from a fairly major rewrite of this code to not usw serde_json value, it seems to me like a pragmatic way to get something, which is better than not supporting anything?

Edit: FWIW I filed the ticket for actually de-serdifying the writer - #5314

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the context. 5314 which builds on top of the impressive #3479 looks like a great step forward. In the meantime, I appreciate that writing to string is the correct thing to do, but I recommend that we balance the ideal technical solution with how it will be used in the real world and that user experience.

For practical purposes, I strongly suggest that we provide the user an option to write decimals as json numbers. We could do this in a non-breaking way by adding a pub JsonWriter that provides a method to set the decimal decoding as number (the default being string). This would ensure that by default, the correct thing is performed but the user can opt in for the practical parsing to number, saving them a ton of headache downstream.

For example, our downstream use case would require this feature; otherwise we'd have to maintain a fork of the crate that lets us serialize decimals as number; otherwise we'd have to layer in additional parsing on each field to see if string fields could actually be parsed to numbers (which would be commonly the case since we're dealing with small decimal-type numbers at the source).

Copy link
Contributor

@tustvold tustvold Jan 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've started work on #5314 and have a POC in #5318, I will endeavour to finish it up over the next few days, and we can then implement decimal support without having to resort to serde_json hackery

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, will that implementation enable us to write decimal types to json as a number?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

let nulls = array.nulls();
let rows = rows
.iter_mut()
.enumerate()
.filter_map(|(idx, maybe_row)| maybe_row.as_mut().map(|row| (idx, row)));

for (idx, row) in rows {
let maybe_value = nulls
.map(|x| x.is_valid(idx))
.unwrap_or(true)
.then(|| {
let num = formatter
.value(idx)
.to_string()
.parse::<f64>()
.map_err(|e| {
ArrowError::ParseError(format!(
"Cannot convert {} to f64: {}",
formatter.value(idx),
e
))
});

num.and_then(|num| {
serde_json::Number::from_f64(num)
.ok_or_else(|| {
ArrowError::CastError(format!(
"Cannot convert {} to f64",
formatter.value(idx)
))
})
.map(Value::Number)
})
})
.map_or_else(|| Ok(None), |result| result.map(Some));

if let Some(j) = maybe_value? {
row.insert(col_name.to_string(), j);
} else if explicit_nulls {
row.insert(col_name.to_string(), Value::Null);
}
}
Ok(())
Expand Down