-
Notifications
You must be signed in to change notification settings - Fork 843
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
arrow_json: support decimal 128 and 256 types in json writer #5197
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -469,11 +469,67 @@ fn set_column_for_json_rows( | |||||
row.insert(col_name.to_string(), serde_json::Value::Object(obj)); | ||||||
} | ||||||
} | ||||||
DataType::Decimal128(_precision, _scale) | DataType::Decimal256(_precision, _scale) => { | ||||||
to_json_float(rows, array, col_name, explicit_nulls)?; | ||||||
} | ||||||
_ => { | ||||||
return Err(ArrowError::JsonError(format!( | ||||||
"data type {:?} not supported in nested map for json writer", | ||||||
"data type {:?} not supported for json writer", | ||||||
array.data_type() | ||||||
))) | ||||||
))); | ||||||
} | ||||||
} | ||||||
Ok(()) | ||||||
} | ||||||
|
||||||
fn to_json_float( | ||||||
rows: &mut [Option<JsonMap<String, Value>>], | ||||||
array: &ArrayRef, | ||||||
col_name: &str, | ||||||
explicit_nulls: bool, | ||||||
) -> Result<(), ArrowError> { | ||||||
let options = FormatOptions::default(); | ||||||
let formatter = ArrayFormatter::try_new(array.as_ref(), &options)?; | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this formatter approach was lifted from the date-type writer logic above There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This will result in precision loss, decimals need to use the formatting logic in arrow-cast. I'm not entirely sure how to support this within the constraint of a serde_json value There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. hey @tustvold, i'm confused - the arrow-rs/arrow-cast/src/display.rs Line 441 in bc96ca3
Let me know what I'm missing! Thank you There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @tustvold did you mean to reference this method in the arrow-rs/arrow-cast/src/display.rs Line 441 in bc96ca3
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The problem is you are then parsing to a f64 which will result in precision loss. As serde_json value lacks a decimal representation you may be forced to encode decimals as JSON strings to avoid this There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Writing non-native numeric quantities as string in JSON is a fairly standard approach to workaround the limited precision many readers have by default, serde_json included. Protobuf even serializes u64 to strings as part of its JSON specification. This is not just because the JSON specification only mandated double precision floating point, but because there are real performance disadvantages to doing otherwise. I agree using a string is not a good thing, ideally we wouldn't be relying on serde_json to serialize, but aside from a fairly major rewrite of this code to not usw serde_json value, it seems to me like a pragmatic way to get something, which is better than not supporting anything? Edit: FWIW I filed the ticket for actually de-serdifying the writer - #5314 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the context. 5314 which builds on top of the impressive #3479 looks like a great step forward. In the meantime, I appreciate that writing to string is the correct thing to do, but I recommend that we balance the ideal technical solution with how it will be used in the real world and that user experience. For practical purposes, I strongly suggest that we provide the user an option to write decimals as json numbers. We could do this in a non-breaking way by adding a pub For example, our downstream use case would require this feature; otherwise we'd have to maintain a fork of the crate that lets us serialize decimals as number; otherwise we'd have to layer in additional parsing on each field to see if string fields could actually be parsed to numbers (which would be commonly the case since we're dealing with small decimal-type numbers at the source). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Cool, will that implementation enable us to write decimal types to json as a number? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes |
||||||
let nulls = array.nulls(); | ||||||
let rows = rows | ||||||
.iter_mut() | ||||||
.enumerate() | ||||||
.filter_map(|(idx, maybe_row)| maybe_row.as_mut().map(|row| (idx, row))); | ||||||
|
||||||
for (idx, row) in rows { | ||||||
let maybe_value = nulls | ||||||
.map(|x| x.is_valid(idx)) | ||||||
.unwrap_or(true) | ||||||
.then(|| { | ||||||
let num = formatter | ||||||
.value(idx) | ||||||
.to_string() | ||||||
.parse::<f64>() | ||||||
.map_err(|e| { | ||||||
ArrowError::ParseError(format!( | ||||||
"Cannot convert {} to f64: {}", | ||||||
formatter.value(idx), | ||||||
e | ||||||
)) | ||||||
}); | ||||||
|
||||||
num.and_then(|num| { | ||||||
serde_json::Number::from_f64(num) | ||||||
.ok_or_else(|| { | ||||||
ArrowError::CastError(format!( | ||||||
"Cannot convert {} to f64", | ||||||
formatter.value(idx) | ||||||
)) | ||||||
}) | ||||||
.map(Value::Number) | ||||||
}) | ||||||
}) | ||||||
.map_or_else(|| Ok(None), |result| result.map(Some)); | ||||||
|
||||||
if let Some(j) = maybe_value? { | ||||||
row.insert(col_name.to_string(), j); | ||||||
} else if explicit_nulls { | ||||||
row.insert(col_name.to_string(), Value::Null); | ||||||
} | ||||||
} | ||||||
Ok(()) | ||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
temporary change while testing in another system