Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arrow_json: support decimal 128 and 256 types in json writer #5197

Conversation

matthewgapp
Copy link

@matthewgapp matthewgapp commented Dec 10, 2023

Which issue does this PR close?

Closes #5198

Rationale for this change

This PR will enable support for Decimal type columns to json as floats

What changes are included in this PR?

This PR uses similar logic that's used to write arrow date types to write the decimal types to json.

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 10, 2023
# arrow-data = { workspace = true }
# arrow-schema = { workspace = true }

arrow-buffer = { version = "49" }
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

temporary change while testing in another system

explicit_nulls: bool,
) -> Result<(), ArrowError> {
let options = FormatOptions::default();
let formatter = ArrayFormatter::try_new(array.as_ref(), &options)?;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this formatter approach was lifted from the date-type writer logic above

Copy link
Contributor

@tustvold tustvold Dec 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will result in precision loss, decimals need to use the formatting logic in arrow-cast. I'm not entirely sure how to support this within the constraint of a serde_json value

Copy link
Author

@matthewgapp matthewgapp Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @tustvold, i'm confused - the ArrayFormatter is from arrow_cast and, from what I can tell, uses the format_decimal logic in this impl (from arrow_cast)

macro_rules! decimal_display {
.

Let me know what I'm missing! Thank you

Copy link
Author

@matthewgapp matthewgapp Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tustvold did you mean to reference this method in the arrow_array crate?

macro_rules! decimal_display {

Copy link
Contributor

@tustvold tustvold Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is you are then parsing to a f64 which will result in precision loss. As serde_json value lacks a decimal representation you may be forced to encode decimals as JSON strings to avoid this

Copy link
Contributor

@tustvold tustvold Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing non-native numeric quantities as string in JSON is a fairly standard approach to workaround the limited precision many readers have by default, serde_json included. Protobuf even serializes u64 to strings as part of its JSON specification. This is not just because the JSON specification only mandated double precision floating point, but because there are real performance disadvantages to doing otherwise.

I agree using a string is not a good thing, ideally we wouldn't be relying on serde_json to serialize, but aside from a fairly major rewrite of this code to not usw serde_json value, it seems to me like a pragmatic way to get something, which is better than not supporting anything?

Edit: FWIW I filed the ticket for actually de-serdifying the writer - #5314

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the context. 5314 which builds on top of the impressive #3479 looks like a great step forward. In the meantime, I appreciate that writing to string is the correct thing to do, but I recommend that we balance the ideal technical solution with how it will be used in the real world and that user experience.

For practical purposes, I strongly suggest that we provide the user an option to write decimals as json numbers. We could do this in a non-breaking way by adding a pub JsonWriter that provides a method to set the decimal decoding as number (the default being string). This would ensure that by default, the correct thing is performed but the user can opt in for the practical parsing to number, saving them a ton of headache downstream.

For example, our downstream use case would require this feature; otherwise we'd have to maintain a fork of the crate that lets us serialize decimals as number; otherwise we'd have to layer in additional parsing on each field to see if string fields could actually be parsed to numbers (which would be commonly the case since we're dealing with small decimal-type numbers at the source).

Copy link
Contributor

@tustvold tustvold Jan 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've started work on #5314 and have a POC in #5318, I will endeavour to finish it up over the next few days, and we can then implement decimal support without having to resort to serde_json hackery

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, will that implementation enable us to write decimal types to json as a number?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@matthewgapp matthewgapp changed the title initial stab at supporting decimal types by formating then parsing arrow_json: support decimal 128 and 256 types in json writer Dec 11, 2023
@matthewgapp matthewgapp marked this pull request as ready for review December 11, 2023 07:32
@matthewgapp matthewgapp marked this pull request as draft December 11, 2023 07:32
@saeedzareian
Copy link

Hi @matthewgapp , I was wondering if you'd have any plans to continue working on this PR, please?

@matthewgapp
Copy link
Author

Hi @matthewgapp , I was wondering if you'd have any plans to continue working on this PR, please?

Hey @saeedzareian apologies about ghosting this. I'll pick this up again this week. It's blocking our downstream project.

@matthewgapp matthewgapp marked this pull request as ready for review January 17, 2024 04:35
@matthewgapp matthewgapp marked this pull request as draft January 17, 2024 04:38
@tustvold tustvold mentioned this pull request Jan 19, 2024
@tustvold
Copy link
Contributor

#5318 has now been merged and should give us flexibility to format decimals directly

@Blizzara
Copy link

Hey, I would also be interested in this feature, are the next steps here known? Can I help?

FWIW, I think I'd prefer decimals to be printed as strings, since otherwise it's easier for the reader of the json to screw things up when reading (if you print a decimal like "10.3" as "field": 10.3, how does the consumer know it should be read as decimal rather than as a double/float - which may lose precision (dunno if that happens for 10.3, but might happen for some numbers). But I can see the convenience of printing them as numbers as well, so maybe a config option for that would make sense really.

@tustvold
Copy link
Contributor

tustvold commented Oct 8, 2024

Closing as this PR has been inactive for a while

@tustvold tustvold closed this Oct 8, 2024
@tustvold
Copy link
Contributor

Implemented in #6606

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Decimal128 and Decimal256 types are not supported in arrow_json
4 participants