Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON parser fails on maps with bool keys #6525

Open
jagill opened this issue Oct 7, 2024 · 9 comments
Open

JSON parser fails on maps with bool keys #6525

jagill opened this issue Oct 7, 2024 · 9 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@jagill
Copy link

jagill commented Oct 7, 2024

Describe the bug
JSON objects must have string keys. So when parsing to maps with non-string key types, the string must be coerced into the type. For example, using a schema with a Map(Int32, Int32) type parses rows like {"map_i32_i32": {"1": 2}} into a RecordBatch with a single column with a Map(Int32, Int32) type.

However, this does not happen for bool key maps. If you try to use a schema with a Map(Boolean, Int32) type, parsing {"map_bool_i32": {"true": 1}} raises an error:

JsonError("whilst decoding field 'map_bool_i32': expected boolean got \"true\"")

To Reproduce
Gist https://gist.github.com/jagill/1dd71ff6413002a042089feca89c6eab has a minimal reproduction.

Expected behavior
map_bool_i32 should parse to a Map(Boolean, Int32) array.

Additional context
Similar bug report for serde_json, with a link to the fix PR: serde-rs/json#1054

@jagill jagill added the bug label Oct 7, 2024
@tustvold
Copy link
Contributor

tustvold commented Oct 7, 2024

I think this might actually be a feature request to perform string -> bool coercion which we don't currently? I don't think this is specific to MapArray, I'm also not entirely sure what the semantics should be.

In particular is "True" also valid or is it a parse error? What about "1" ? The semantics are not obvious to me

@tustvold tustvold added enhancement Any new improvement worthy of a entry in the changelog and removed bug labels Oct 7, 2024
@jagill
Copy link
Author

jagill commented Oct 8, 2024

First, I'm happy to classify this however is useful. However, when I read your response, I wonder if I didn't communicate my issue clearly.

I don't think this is specific to MapArray

This is actually specific to MapArray.

I am not talking about general string -> bool conversion. The issue is that in JSON, the only syntactically valid way to represent a Map<bool, T> is similar to {"true": T} or {"false": T}. The JSON-esque {true: T} or {false: T} is not syntactically valid; all JSON objects must have string keys. The only other place that the key type comes into play is StructArray, which requires its keys to be strings and thus doesn't suffer from this.

To be clear, I am not asking that a Map<string, bool> accept {"foo": "true"}. In the value position, you can have arbitrary JSON values, so {"foo": true} is valid and should be the only way to represent a Map<string, bool>.

If you check out my gist, you'll see that arrow_json already does the key-string conversion for ints. To be specific, the only valid way to represent a Map<i32, i32> is like {"1": 2} ({1: 2} is not syntactically valid). arrow_json does the string parsing for the key position for (non-bool) primitives. My issue is that this is not currently done for bool keys. This means that it is impossible to represent a Map<bool, T> in a way that is both valid JSON and is parsable by arrow_json.

I can certainly see that you might phrase this as "arrow_json does not currently support Maps with bool keys; adding that support is a feature request." In that case, it would be helpful to specify which types are not supported by arrow_json in the documentation, so people know whether it's a bug or a feature request :).

In particular is "True" also valid or is it a parse error? What about "1" ? The semantics are not obvious to me

IMO, the semantics should be that when parsing Map(K, V), that the JSON string in the key position is parsed with <K as FromStr>::from_str, although I don't have a strong opinion about which parser to use. This is already done for int/float keys.

Does that clarify things?

@tustvold
Copy link
Contributor

tustvold commented Oct 8, 2024

I understand what you're saying, however, arrow-json does not distinguish between the case of decoding a boolean from a key and decoding a boolean from a value, the decoders simply operate on the fields - https://github.com/apache/arrow-rs/blob/master/arrow-json/src/reader/map_array.rs#L131. I also personally think it would be rather odd for the decoders to treat these two cases differently.

I'm not really sure what the best solution here is, we certainly could support coercing strings to booleans, perhaps gated by a feature flag, but I'm not really sold on this. Perhaps you could expand upon the use-case for this? I can't help feeling a boolean keyed MapArray is a really inefficient way to represent what could simply be a StructArray with two nullable fields of true and false respectively.

As an aside you raise the point of integers parsing from strings, this is largely because many systems encode integers and floats as strings to get around precision ambiguities in JSON parsers - with some treating all numbers as double-precision floating point.

@jagill
Copy link
Author

jagill commented Oct 8, 2024

Thanks for the clarification, and apologies for a long-winded explanation of something you already understood.

As prior art, serde_json (the standard json deserializer) allows "true" in the key position but not the value position: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=dea421d1f8c2075ee9ed791155a516db
serde_json has a whole internal lexer/parser which may be more than arrow_json wants. As an aside, serde has an internal IR that we could imagine using for arrow_json; that is, serde_json makes the IR then arrow_json builds arrays from that IR. I suspect that refactor is way out of scope for this, so I'm not suggesting that, per se, but it'd be an interesting direction to consider if it reduced code/maintenance.

As for why I care about Map<bool, T>, it's because it's a valid SQL type and I maintain a Rust client for Presto. Presto returns its results in JSON (with a schema) (more details at https://prestodb.io/docs/current/develop/client-protocol.html). When users want the results in Arrow RecordBatches, I use arrow_json to read the JSON into Arrow. Thus I don't control whether the data contains Map<bool, T> (the users do), nor do I control the return format (syntactically valid JSON). Currently Map<bool, T> is supported for non-arrow, but not for arrow, due to the limitation in this issue. I was hoping to bring feature parity to the arrow return format.

@tustvold
Copy link
Contributor

tustvold commented Oct 8, 2024

serde_json makes the IR then arrow_json builds arrays from that IR.

We moved away from serde_json both for performance and to be able to better handle arbitrary precision values. We used to do something similar to this.

I think given that Map<bool> whilst valid is incredibly niche, serde_json only added support at your request last year, and using a StructArray with two fields will be orders of magnitude faster and more space efficient, I'd be inclined to hold off on this until we have a concrete user use-case to justify this

@jagill
Copy link
Author

jagill commented Oct 8, 2024

Thanks for the quick response!

Out of curiosity, do you have the benchmarks for the serde_json vs arrow_json performance? It would be interesting to me for my own client.

@tustvold
Copy link
Contributor

tustvold commented Oct 8, 2024

#3479 and #5318.

@jagill
Copy link
Author

jagill commented Dec 29, 2024

As an addendum, I've just had a concrete use case from a user. They were using Presto's HISTOGRAM function on a boolean column, which produces a map<boolean, bigint> column. I've worked around it in my code, but it would be nice to have it handled at a library level.

@tustvold
Copy link
Contributor

If someone is willing to implement this and is able to do this without regressing performance, I don't object to it being added

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

2 participants